The epigenomics tech stack | Varun Dwaraka | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you're not joining us live, you're missing out on the amazing chat that goes on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
I am very excited to introduce our featured leader today, Varun Dwaraka, Director of Bioinformatics and Principal Investigator at TruDiagnostic. Varun, thank you so much for joining us today. We would love to learn a little bit about you: if you could tell us about yourself, what you do, and what you do for fun.
Yeah, thank you, everyone. This is the first time that I'm part of this Hangout, and I'm looking at the numbers, and it just keeps getting higher and higher. So if I start going through puberty again, my bad.
So my name is Varun. I serve as the Director of Bioinformatics. I'm actually maybe a little bit of a sheep in a wolf's den: I was a self-taught coder and data analytics type of person, because my background was very much in molecular biology. I was trained mainly on the bench, doing a lot of DNA extractions, PCR, ELISAs. And so as a biologist starting out, a lot of where I come from is trying to understand the beauty of biology, but now taking it from the lens of data science. I'm a native of the San Francisco Bay Area, which I think probably had a huge impact on why I went the tech route. I did my undergrad at UC Santa Cruz in molecular, cell, and developmental biology, and fell in love with bioinformatics there.
If people don't know yet, UC Santa Cruz was actually the first institution to build an assembly of the human genome, and a lot of bioinformaticians there played a huge role in that development. I got the bug there, and then realized, oh, crap, I need to learn data science. So I had to switch gears to take on more of those roles, and then did my PhD in biology at the University of Kentucky, and I'm now based in Lexington, Kentucky. So with all that said, yeah, I'm a biologist, but I've spent a good deal of the past, I don't know, 12 to 15 years learning, failing, and somehow getting to the place I'm at in data science, using R and Python specifically.
What I do for fun is going to sound really nerdy: I just like learning. A large part of my time was coding, but now it's things like photography. I was a musician in the past, before I went the biology route, and spent a lot of time actually working in studios during my undergrad and even in grad school here in Lexington. So yeah, I love learning and love spending time outdoors. Lexington is beautiful for being able to just go hike the Red River Gorge. And lake life is a thing; coming from the coast, I didn't know that was a thing, so I've been living that a little bit.
What TruDiagnostic does
Yeah, absolutely. From a large-scale perspective, TruDiagnostic is a diagnostics company which is leveraging a specific biomarker called DNA methylation. Using this biomarker, we're capturing different measures of your health with different types of machine learning models that we have generated, not only within the aging community, but within our company as well. Our primary focus is looking at aging in general. So we actually have biomarkers developed from DNA methylation data to look at your biological age, and other measurements that are more based on your exposures.
One way to think about this is, for example, there was a company named 23andMe, RIP, which really started this whole revolution. And I will give them all the props, because they kind of got this whole personalized medicine thing down, looking at different types of genetic mutations, SNPs, and using that as a way to build a profile of your predispositions. It's a great model, and I give them all the roses they deserve, but the problem is that genetics don't change on a regular basis. So rather than looking at something that is static throughout your entire life course, we're focusing on a biomolecule, DNA methylation, which actually changes as time goes on, dependent on your lifestyle factors, the pollution, the environment that you live in, the stress that you're under. And by leveraging that, we were able to create models that can capture things like your biological age, your HbA1c for anybody that's in the medical field, and things like your smoking impact: how much does smoking actually affect you? TLDR, it affects you a lot; being "good at smoking" is not a thing. And also things like what type of workout actually has the most benefit based on your methylation patterns.
So this goes beyond whatever your general lab panel is when you go to see a clinician. Not saying that that's not important, but this is capturing something that is more exposure-based rather than reactive. So it really touches on the proactive healthcare aspect.
So let's say somebody comes in and is interested in this whole hype of biological aging, or just wants to better understand themselves. They can take our tests. What you receive is essentially a value of here-is-how-you-are-aging, just one single composite. Beyond the composite, you'll also get your age based on your liver, your brain, your blood, your kidneys, your hormones, things like that, and also certain measurements like your pace of aging. Now, these are great. So let's say you come in and, like me, you're chronologically 33 years old based on date of birth, but you report as 35. My first emotion is, why the heck am I 35 years old? Why am I older? And so in the report, using different types of associations that we've figured out through our R&D department, we actually go through ways to reduce that age value.
So you take that first test, and let's say you take some of those suggestions that are available, like maybe certain diet changes, maybe a different type of exercise routine, maybe being more mindful about the stress that you're under. You can then maybe modulate those changes and take another test after the fact. Now, I'm not trying to sell you a test. But that is what we think is the power of taking this type of test: looking at the trends. Even if I'm 35 now and 34 the next year, at least I came that one year back, and it's kind of gamifying this whole situation of how you take that healthcare back and focus on the things that are most important.
Breaking into bioinformatics from general data science
That's a fantastic question. It's something that we deal with all the time. The thing that really helps when building a team is having someone who is a subject-matter or subject-domain expert working with the data scientist. In my case, I'll be very honest, my biology trumps my data science, even though I am skilled and have applied a lot of these data science skills. But for example, one of the members on my team, Saif, is an excellent machine learning expert. While he has a background in, let's say, biology, he hasn't really worked much in the aging space. It's really about how you identify individuals within the team and figure out where you fit.
For example, whenever we're creating these models, a model is only as good as the data that you bring in, so we spend a lot of time on the cleaning side, identifying whether the samples really belong. Let's say we're creating a biological age predictor: do we use a disease cohort? Do we use a healthy cohort? What are the predispositions of the individuals that have those methylation profiles? That's where the subject-domain expertise can really come into play. And working with a person that is skilled on the machine learning side, you don't have to be two people at once. You can have two people in their zones of genius.
Learning to code as a biologist
I took a Python class and almost failed it in undergrad. I'll be honest.
So I will say this. After undergrad, I had a gap year, and out of a lot of the places, especially in Silicon Valley, I was somehow able to convince a professor at UC San Francisco to take a chance on me as a bioinformatics analyst. And one of the things that I did mention is, hey, I don't really have a lot of bioinformatics background. I know certain terms, I know flow charts, but I haven't really applied it. And so what helped for me was just three months of slaving away. This was before ChatGPT, so it was going through vignettes on vignettes that R had, and even Python had, Stack Overflow back in the day, a lot of Google searching, and working with an actual data set.
I think the thing that really clicked in my head is that whenever I was doing projects in a Python class, it was just so tailored and I didn't really care for it. But the moment that I was in a lab where the biology really was something I wanted to understand, and I felt like I was actually applying tools that would translate to other places, that really motivated me to fail a lot. And then the moment that the first piece of code actually worked, I remember I was using TopHat to do some RNA-seq aligning, and the moment that bash script worked at all, that high that you feel was really what drove me to keep learning.
When I went into a PhD program, I knew I had the bioinformatics bug, but I specifically told myself I didn't want to go into a bioinformatics program. I wanted to go into a biology lab because, to the last point, I wanted to understand what the actual question was, what domain I wanted to study. And a lot of professors at the time, and maybe even now, still don't have a lot of bioinformatics students or bioinformatics help; they're really outsourcing to bioinformaticians. And so I pretty much convinced another professor, like, hey, I have this background, I want to continue this, I just need data. And he was like, yo, hold my beer. I got all the data you need.
I will add to that, ChatGPT nowadays is insane when it comes to code building. What I've found in my experience is that a lot of people understand logic and can actually build out pseudocode; it's when it comes to syntax that you really get the drop-off. And so I think ChatGPT nowadays has been very helpful, even for me, to build out structure and, sorry, to build out syntax, as long as the pseudocode is in good shape. So you have the logic nailed down and you're clear about what you want. And, I mean, the statistical assumptions are very important regardless of whether you're doing transcriptional data or epigenomic data.
What the data looks like
No, absolutely. Directly from our data, our matrix looks like this: depending on whether you transpose or not, your columns are going to be your samples, so either an individual or the sample set that you get it from, and your rows are going to be the individual biomarkers. That depends on the array or platform that we use to quantify the biological tissue type; most of the time we run blood, but it can be whatever you want it to be. So when you get it out, it's a frame of however many samples by 850,000 probes; I guess now it's almost a million. And the values themselves are going to be from 0 to 1 in decimal format. It's actually bimodally distributed, because when you think about the data that comes out, DNA methylation really exists as a methyl mark that gets added to cytosines in the DNA. So typically it's either a 0 or 1 orientation: it's either there or not.
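To make that shape concrete, here is a minimal R sketch of a toy beta matrix with the layout Varun describes. The probe IDs and dimensions are illustrative only; a real EPIC array yields roughly 850,000 rows, and Beta(0.5, 0.5) draws are used here just to mimic the bimodal, either-there-or-not distribution of beta values.

```r
# Toy beta matrix: rows = CpG probes, columns = samples, values in [0, 1].
# Probe IDs and dimensions are made up; a real EPIC array has ~850,000 rows.
set.seed(1)
n_probes  <- 6
n_samples <- 3

betas <- matrix(
  rbeta(n_probes * n_samples, shape1 = 0.5, shape2 = 0.5),  # U-shaped, i.e. bimodal near 0 and 1
  nrow = n_probes,
  dimnames = list(
    paste0("cg", sprintf("%08d", seq_len(n_probes))),  # CpG probe IDs
    paste0("sample_", seq_len(n_samples))              # one column per sample
  )
)
betas
```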
Public data resources for genomic data
Yeah, shout out to Rebecca for pitching methylclock; that is actually one of the ones that we use internally. As for data sets, there are multiple consortiums that have contributed a lot of genetic and genomic data. Gene Expression Omnibus (GEO) is one that the U.S. government maintains through NCBI. Essentially, every time you publish a genomic or epigenetic type of study, you have to release your data, and so a lot of these university consortiums will release their data and deposit it into Gene Expression Omnibus, and you can download it. You have ArrayExpress as well. One major source, as Rohan said, is also the Human Microbiome Project, if the NIH still has it up, especially for microbiome. On the European side, EMBL (E-M-B-L) is another one; I think they have their own version of GEO, so you can get it from there.
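For anyone wanting to try this, here is a hedged sketch of pulling a public series from GEO with Bioconductor's GEOquery package. The accession "GSE00000" is a placeholder, not a real series; substitute a study of interest.

```r
# A sketch of downloading a public dataset from Gene Expression Omnibus.
# "GSE00000" is a placeholder accession.
library(GEOquery)

gse   <- getGEO("GSE00000")  # downloads the series matrix file(s)
eset  <- gse[[1]]            # an ExpressionSet object
vals  <- exprs(eset)         # probes x samples matrix (beta values, for methylation arrays)
pheno <- pData(eset)         # sample-level metadata (age, sex, tissue, ...)
```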
Reproducibility of analysis results
A lot of the time, in the context of our results, let's say we do an interventional trial, we publish it immediately. We use bioRxiv and then, if it gets accepted, whatever journal. For those, we are also able to release all of our source code. The only thing we cannot release as of yet is our models, because ultimately that is what's driving our business. In the past, in the academic sector, even the models and everything were released. So at least in terms of reproducibility on the interventional and research side, that's how we're handling it: just by releasing everything.
In terms of our internal work, again, the reproducibility rests on multiple things. The model will stay consistent; really, the source of variation comes from the platform. So that is very dependent on, for example, the lab side of things: making sure that they're hitting ICCs, intraclass correlations, between technical replicates, really, and also biological replicates, making sure those are higher than 0.9, at least 0.96, for these replicates; making sure that their quants are also highly repeatable; and then only using standardized pipelines. So, for example, we haven't recreated on the data side how to pre-process methylation data. We use minfi.
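As a rough illustration of that replicate check, here is a minimal R sketch computing an intraclass correlation with the icc() function from the CRAN package irr. The data are simulated; in practice the two columns would be technical replicates of the same samples run on the platform.

```r
# Simulated replicate-reliability check: ICC between two technical replicates.
library(irr)

set.seed(42)
truth <- rbeta(1000, 0.5, 0.5)  # "true" beta values for 1,000 CpGs
replicates <- cbind(rep1 = truth + rnorm(1000, sd = 0.01),
                    rep2 = truth + rnorm(1000, sd = 0.01))

# Two-way model, absolute agreement, single-measure ICC
result <- icc(replicates, model = "twoway", type = "agreement", unit = "single")
result$value  # should clear the ~0.9-0.96 bar described above
```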
The epigenomics tech stack
So, kind of going back to what Rohan was mentioning in terms of reproducibility, we're using a lot of Bioconductor packages. R is our native environment for the pre-processing and also the quantification steps. methylclock, which was mentioned, is actually from a collaborator of ours, Juan Gonzalez, who I think manages that repository. Pretty much R will handle all the pre-processing, just because minfi has been the one that seems to translate over to a lot of academic labs. But the moment it gets into, let's say, the beta matrix format, which is the format I was describing, I think it was Joey that asked that question, that's where we branch off.
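For context, here is a hedged sketch of what that standardized minfi pre-processing path typically looks like, from raw IDAT files to a beta matrix. The directory path is a placeholder, and Noob is shown as one common normalization choice, not necessarily the exact step used at TruDiagnostic.

```r
# From raw IDATs to a beta matrix with minfi (Bioconductor).
# "path/to/idats" is a placeholder directory of IDAT files.
library(minfi)

rgset <- read.metharray.exp(base = "path/to/idats")  # raw red/green channel intensities
mset  <- preprocessNoob(rgset)                       # background correction + dye-bias normalization
betas <- getBeta(mset)                               # CpGs x samples, values in [0, 1]
```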
So, a lot of what I handle are actually the academic collaborations. If we're running these academic research and commercial collaborations on a data set where people are trying to understand, say, how people that eat vegan food versus carnivore food differ in their epigenetic ages, then we'll stay in R, because a lot of the statistical packages are just perfect there.
And for this, RStudio is how we started our entire bioinformatics business, just using RStudio. And now we've gotten into Posit, I think the UI/UX via AWS, just for a more decentralized connection into S3, with S3 being the source; decentralization is huge there. For the modeling, we used to use a lot of caret (C-A-R-E-T), glmnet for elastic net regressors, and then for feature selection, limma to do that differential methylation identification, and also, what is the other one, mutual information, sorry, yeah. So using mutual-information-specific packages, and then gradient boosting and all that is typically handled there. tidymodels was one that we tried to use, but we just weren't able to include it; still, I've got to shout out tidymodels, because that really alleviates a lot of the time that it takes.
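As a concrete illustration of the elastic net piece, here is a minimal glmnet sketch. The data are simulated stand-ins: x plays the role of a samples-by-CpGs beta matrix and y a continuous trait such as chronological age, and alpha = 0.5 is just one possible L1/L2 mix, not necessarily the setting used in their clocks.

```r
# Cross-validated elastic net on a toy samples x CpGs matrix.
library(glmnet)

set.seed(7)
x <- matrix(runif(200 * 50), nrow = 200)       # 200 samples x 50 CpG features (toy scale)
colnames(x) <- paste0("cg", seq_len(50))
y <- as.numeric(x %*% rnorm(50) + rnorm(200))  # simulated continuous outcome (e.g. age)

fit <- cv.glmnet(x, y, alpha = 0.5)  # alpha = 0.5 mixes lasso (L1) and ridge (L2) penalties
coef(fit, s = "lambda.min")          # sparse vector of selected CpGs and their weights
```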
However, Saif, who is now the actual ML guy, has shifted everything over to PyTorch and Keras to do all of the neural net, elastic net, and random forest regressors, and LightGBM, everything like that. We still use Posit a little bit there, but I think his tack is SageMaker on AWS, just because we're not running one model at a time. For example, in one project that we're working on, we're trying to use DNA methylation as a way to predict metabolite levels, and the results are actually pretty striking, but we're doing this for thousands of metabolites. If we did it serially, I mean, we'd still be working on it. So he's been using SageMaker as a way to distribute a lot of those trainings so that we're not waiting, you know, decades for it to finish.
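To show the one-model-per-metabolite fan-out conceptually, here is a hedged local R sketch using parallel::mclapply in place of SageMaker's distributed training. All inputs are simulated, and the scale is toy-sized (5 metabolites rather than thousands).

```r
# One elastic-net model per metabolite, fit in parallel across cores.
# (mclapply forks processes on Unix-alikes; SageMaker distributes at cluster scale.)
library(glmnet)
library(parallel)

set.seed(1)
betas <- matrix(runif(100 * 40), nrow = 100,
                dimnames = list(NULL, paste0("cg", 1:40)))        # 100 samples x 40 CpGs (toy)
metabolites <- matrix(rnorm(100 * 5), nrow = 100,
                      dimnames = list(NULL, paste0("met", 1:5)))  # 5 metabolites; real case: thousands

fit_one <- function(m) cv.glmnet(betas, metabolites[, m], alpha = 0.5)
fits <- mclapply(colnames(metabolites), fit_one, mc.cores = 2)    # one fit per metabolite
```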
Handling imputation in precision medicine
No, it's a fantastic question, and imputation has been the bane of our existence within the bioinformatics department. So imputation is essentially for when you have missing data: in our case, if a probe is not reliable in quantifying that CpG site, or, let's say, a metabolite level or whatever, there is a certain detection threshold that has to pass. If it doesn't pass, then it just goes in as missing.
Initially, commercially, if a data set has more than 85% missingness, we will just rerun the sample. That's actually one of the best things to do, especially because we have a lab; that's one of the perks of working at TruDiagnostic, that we can just rerun the sample and get a new data set. In the cases where we can't, some of the approaches we've used in the past are impute.knn in R, just k-nearest-neighbor imputation. But we've also, and by we, I really have to shout out Natalia Carreras and Laura, who are on the bioinformatics team, who actually went through and, using our, I think it was, 17,000-patient data set, tracked average methylation levels across sex, age, and I think race, and just replaced the missing values with the median for that group.
Now, is that the best model? No, absolutely not. But as long as 85 to 90% of the data is there, what we're trying to do is just replace what's missing with something that is plausibly logical. Now, right now, especially in methylation, people are using transformer models to create these imputations. For example, I think one is called CpGPT; that's literally what it's called. And then MethylGPT. These are the two that have been coming out, and they've been doing a pretty good job. But the problem there, again, is array-to-array compatibility. So it's a very hand-wavy thing of, we haven't figured this out, but we try to use the best that we can, presumably.
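Here is a minimal R sketch of the two fallback strategies mentioned: k-nearest-neighbor imputation via impute.knn() from the Bioconductor impute package, and stratum-median replacement (shown by sex only; age or race strata would work the same way). All data here are simulated.

```r
# Two imputation fallbacks for a CpGs x samples beta matrix with missing values.
library(impute)

set.seed(3)
betas <- matrix(rbeta(50 * 10, 0.5, 0.5), nrow = 50)  # 50 CpGs x 10 samples
betas[sample(length(betas), 25)] <- NA                # sprinkle in missingness

# (1) knn imputation: rows are probes, columns are samples
knn_filled <- impute.knn(betas)$data

# (2) stratum-median replacement: fill each probe from same-stratum samples
sex <- rep(c("F", "M"), each = 5)                     # toy stratum labels, one per sample
median_filled <- betas
for (s in unique(sex)) {
  cols <- which(sex == s)
  meds <- apply(betas[, cols], 1, median, na.rm = TRUE)  # per-probe median within stratum
  for (j in cols) {
    miss <- is.na(median_filled[, j])
    median_filled[miss, j] <- meds[miss]
  }
}
```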
Career advice and lessons from failure
I mean, my GPA in undergrad was at like a 2.8, so it was pretty bad. But, I don't know, hindsight is 20/20. I would have loved to have taken more computer science classes, or actually statistics classes. It's not even the computer science; it's more the statistics, because that's really where I fell in love with numbers and understanding numbers. But honestly, no, I don't think I would have changed much, just because the failures taught a lot, and I think that was the reason why I was able to even succeed at a PhD, with succeed meaning I finished it.
And I see this a lot with new PhD students nowadays: the brightest, the ones that were valedictorians or had 4.0s or whatever, no knock on them. I wish I was one of them; my parents wish I was one of them. But there's something about failing a lot and being comfortable in that failure that really allows you to finish that degree. And I think the confidence then gets instilled of, what are the fundamentals? Where is the starting point, and how do you expand upon that starting point?
And I think that's helped a lot in my current position, because when I came to TruDiagnostic, the bioinformatics was not developed at all; we didn't actually have a division for bioinformatics. A lot of the time, I had to learn what the hell AWS was. I had to figure out what modeling was in the context of, how do I actually deploy glmnet? And then later, how do I make this more efficient, and then make it a bit more reproducible? So I think the failures allowed me to be more comfortable. I loved that background of, I believe it was Luke, where everything's on fire, and you're still like, okay, this is fine. It all just kind of comes down to that. So yeah, accept failure as a really good teaching tool.
