The epigenomics tech stack | Varun Dwaraka | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you're not joining us live, you're missing out on the amazing chat that goes on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
I am very excited to introduce our featured leader today, Varun Dwaraka, Director of Bioinformatics and Principal Investigator at TruDiagnostic. Varun, thank you so much for joining us today. We would love to learn a little bit about you: if you could tell us about yourself, what you do, and what you do for fun.
Yeah, thank you, everyone. This is the first time that I'm part of this Hangout, and I'm looking at the numbers, and it just keeps getting higher and higher. So if I start going through puberty again, my bad.
So my name is Varun. I serve as the Director of Bioinformatics. I'm actually maybe a little bit of a sheep in a wolf's den: I was a self-taught coder and data analytics type of person, because my background was very much in molecular biology. I was trained mainly on the bench, doing a lot of DNA extractions, PCR, ELISAs. And so as a biologist starting out, a lot of where I come from is trying to understand the beauty of biology, but now taking it from the lens of data science. I'm a native of the San Francisco Bay Area, which I think probably had a huge impact on why I went the tech route. I did my undergrad at UC Santa Cruz in molecular, cell, and developmental biology, and fell in love with bioinformatics there.
If people don't know yet, UC Santa Cruz was actually the first institution to build an assembly of the human genome, and a lot of bioinformaticians there played a huge role in that development. I got the bug there, and then realized, oh, crap, I need to learn data science. So I had to switch gears to take on more of those roles, and then did my PhD in biology at the University of Kentucky, and I'm now based in Lexington, Kentucky. So with all that said, yeah, I'm a biologist, but I've spent a good deal of the past, I don't know, 12 to 15 years learning, failing, and somehow getting to the place I'm at in data science, using R and Python specifically.
What I do for fun is going to sound really nerdy: I just like learning. A large part of my time was coding, but now it's things like photography. I was a musician in the past, before I went the biology route, and spent a lot of time actually working in studios during my undergrad and even in grad school here in Lexington. So yeah, I love learning and love spending time outdoors. Lexington is beautiful for being able to just go hike the Red River Gorge. And lake life is a thing; coming from the coast, I didn't know that was a thing, so I've been living that a little bit.
What TruDiagnostic does
Yeah, absolutely. From a large-scale perspective, TruDiagnostic is a diagnostics company which is leveraging a specific biomarker called DNA methylation. Using this biomarker, we're capturing different measures of your health with different types of machine learning models that we have generated, not only within the aging community, but within our company as well. Our primary focus is looking at aging in general. So we actually have biomarkers developed from DNA methylation data to look at your biological age, and other measurements that are more based on your exposures.
One way to think about this is, for example, there was a company named 23andMe, RIP, which really started this whole revolution. And I will give them all the props, because they kind of got this whole personalized medicine thing down, looking at different types of genetic mutations, SNPs, and using that as a way to build a profile of your predispositions. It's a great model, and I give them all the roses they deserve, but the problem is that genetics don't change on a regular basis. So rather than looking at something that is static throughout your entire life course, we're focusing on a biomolecule, DNA methylation, which actually changes as time goes on, dependent on your lifestyle factors, the pollution, the environment that you live in, the stress that you're under. And by leveraging that, we were able to create models that can capture things like your biological age, your HbA1c for anybody that's in the medical field, and things like your smoking impact: how much does smoking actually affect you? TLDR, it affects you a lot; being "good at smoking" is not a thing. And also things like what type of workout actually has the most benefit based on your methylation patterns.
So this goes beyond whatever your general lab panel is when you go to see a clinician. Not saying that that's not important, but this is capturing something that is more exposure-based rather than reactive. So it really touches on the proactive healthcare aspect.
So let's say somebody comes in and is interested in this whole hype of biological aging, or just wants to better understand themselves. They can take our tests. What you receive is essentially a value of here-is-how-you-are-aging, just one single composite. Beyond the composite, you'll also get your age based on your liver, your brain, your blood, your kidneys, your hormones, things like that, and also certain measurements like your pace of aging. Now, these are great. So let's say you come in and, like me, you're chronologically 33 years old based on date of birth, but you report as 35. My first emotion is, why the heck am I 35 years old? Why am I older? And so in the report, using different types of associations that we've figured out through our R&D department, we actually go through ways to reduce that age value.
So you take that first test, and let's say you take some of those suggestions that are available, like maybe certain diet changes, maybe a different type of exercise routine, maybe being more mindful about the stress that you're under. You can then maybe modulate those changes and take another test after the fact. Now, I'm not trying to sell you a test. But that is what we think is the power of taking this type of test: looking at the trends. Even if I'm 35 now and 34 the next year, at least I came that one year back, and it's kind of gamifying this whole situation of how you take that healthcare back and focus on the things that are most important.
Breaking into bioinformatics from general data science
That's a fantastic question. It's something that we deal with all the time. The thing that really helps when building a team is having someone who is a subject-matter or subject-domain expert working with the data scientist. In my case, I'll be very honest, my biology trumps my data science, even though I am skilled and have applied a lot of these data science skills. But for example, one of the members on my team, Saif, is an excellent machine learning expert. While he has a background in, let's say, biology, he hasn't really worked much in the aging space. It's really about how you identify individuals within the team and figure out where you fit.
For example, whenever we're creating these models, a model is only as good as the data that you bring in, so we spend a lot of time on the cleaning side, identifying whether the samples really belong. Let's say we're creating a biological age predictor: do we use a disease cohort? Do we use a healthy cohort? What are the predispositions of the individuals that have those methylation profiles? That's where the subject-domain expertise can really come into play. And working with a person that is skilled on the machine learning side, you don't have to be two people at once. You can have two people in their zones of genius.
Learning to code as a biologist
I took a Python class and almost failed it in undergrad. I'll be honest.
So I will say this. After undergrad, I had a gap year, and out of a lot of the places, especially in Silicon Valley, I was somehow able to convince a professor at UC San Francisco to take a chance on me as a bioinformatics analyst. And one of the things that I did mention is, hey, I don't really have a lot of bioinformatics background. I know certain terms, I know flow charts, but I haven't really applied it. And so what helped for me was just three months of slaving away. This was before ChatGPT, so it was going through vignettes on vignettes that R had, and even Python had, Stack Overflow back in the day, a lot of Google searching, and working with an actual data set.
I think the thing that really clicked in my head is that whenever I was doing projects in a Python class, it was just so tailored and I didn't really care for it. But the moment that I was in a lab where the biology really was something I wanted to understand, and I felt like I was actually applying tools that would translate to other places, that really motivated me to fail a lot. And then the moment that the first piece of code actually worked, I remember I was using TopHat to do some RNA-seq aligning, and the moment that bash script worked at all, that high that you feel was really what drove me to keep learning.
When I went into a PhD program, I knew I had the bioinformatics bug, but I specifically told myself I didn't want to go into a bioinformatics program. I wanted to go into a biology lab because, to the last point, I wanted to understand what the actual question was, what domain I wanted to study. And a lot of professors at the time, and maybe even now, still don't have a lot of bioinformatics students or bioinformatics help; they're really outsourcing to bioinformaticians. And so I pretty much convinced another professor, like, hey, I have this background, I want to continue this, I just need data. And he was like, yo, hold my beer. I got all the data you need.
I will add to that, ChatGPT nowadays is insane when it comes to code building. What I've found in my experience is that a lot of people understand logic and can actually build out pseudocode; it's when it comes to syntax that you really get the drop-off. And so I think ChatGPT nowadays has been very helpful, even for me, to build out structure and, sorry, to build out syntax, as long as the pseudocode is in good shape. So you have the logic nailed down and you're clear about what you want. And, I mean, the statistical assumptions are very important regardless of whether you're doing transcriptional data or epigenomic data.
What the data looks like
No, absolutely. Directly from our data, our matrix looks like this: depending on whether you transpose or not, your columns are going to be your samples, so either an individual or the sample set that you get it from, and your rows are going to be the individual biomarkers. That depends on the array or platform that we use to quantify the biological tissue type; most of the time we run blood, but it can be whatever you want it to be. So when you get it out, it's a frame of however many samples by 850,000 probes; I guess now it's almost a million. And the values themselves are going to be from 0 to 1 in decimal format. It's actually bimodally distributed, because when you think about the data that comes out, DNA methylation really exists as a methyl mark that gets added to cytosines in the DNA. So typically it's either a 0 or 1 orientation: it's either there or not.
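To make that shape concrete, here is a minimal R sketch of a toy beta matrix with the layout Varun describes. The probe IDs and dimensions are illustrative only; a real EPIC array yields roughly 850,000 rows, and Beta(0.5, 0.5) draws are used here just to mimic the bimodal, either-there-or-not distribution of beta values.

```r
# Toy beta matrix: rows = CpG probes, columns = samples, values in [0, 1].
# Probe IDs and dimensions are made up; a real EPIC array has ~850,000 rows.
set.seed(1)
n_probes  <- 6
n_samples <- 3

betas <- matrix(
  rbeta(n_probes * n_samples, shape1 = 0.5, shape2 = 0.5),  # U-shaped, i.e. bimodal near 0 and 1
  nrow = n_probes,
  dimnames = list(
    paste0("cg", sprintf("%08d", seq_len(n_probes))),  # CpG probe IDs
    paste0("sample_", seq_len(n_samples))              # one column per sample
  )
)
betas
```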
Public data resources for genomic data
Yeah, shout out to Rebecca for pitching methylclock; that is actually one of the ones that we use internally. As for data sets, there are multiple consortiums that have contributed a lot of genetic and genomic data. Gene Expression Omnibus (GEO) is one that the U.S. government maintains through NCBI. Essentially, every time you publish a genomic or epigenetic type of study, you have to release your data, and so a lot of these university consortiums will release their data and deposit it into Gene Expression Omnibus, and you can download it. You have ArrayExpress as well. One major source, as Rohan said, is also the Human Microbiome Project, if the NIH still has it up, especially for microbiome. On the European side, EMBL (E-M-B-L) is another one; I think they have their own version of GEO, so you can get it from there.
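For anyone wanting to try this, here is a hedged sketch of pulling a public series from GEO with Bioconductor's GEOquery package. The accession "GSE00000" is a placeholder, not a real series; substitute a study of interest.

```r
# A sketch of downloading a public dataset from Gene Expression Omnibus.
# "GSE00000" is a placeholder accession.
library(GEOquery)

gse   <- getGEO("GSE00000")  # downloads the series matrix file(s)
eset  <- gse[[1]]            # an ExpressionSet object
vals  <- exprs(eset)         # probes x samples matrix (beta values, for methylation arrays)
pheno <- pData(eset)         # sample-level metadata (age, sex, tissue, ...)
```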
Reproducibility of analysis results
A lot of the time, in the context of our results, let's say we do an interventional trial, we publish it immediately. We use bioRxiv and then, if it gets accepted, whatever journal. For those, we are also able to release all of our source code. The only thing we cannot release as of yet is our models, because ultimately that is what's driving our business. In the past, in the academic sector, even the models and everything were released. So at least in terms of reproducibility on the interventional and research side, that's how we're handling it: just by releasing everything.
In terms of our internal work, again, the reproducibility rests on multiple things. The model will stay consistent; really, the source of variation comes from the platform. So that is very dependent on, for example, the lab side of things: making sure that they're hitting ICCs, intraclass correlations, between technical replicates, really, and also biological replicates, making sure those are higher than 0.9, at least 0.96, for these replicates; making sure that their quants are also highly repeatable; and then only using standardized pipelines. So, for example, we haven't recreated on the data side how to pre-process methylation data. We use minfi.
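As a rough illustration of that replicate check, here is a minimal R sketch computing an intraclass correlation with the icc() function from the CRAN package irr. The data are simulated; in practice the two columns would be technical replicates of the same samples run on the platform.

```r
# Simulated replicate-reliability check: ICC between two technical replicates.
library(irr)

set.seed(42)
truth <- rbeta(1000, 0.5, 0.5)  # "true" beta values for 1,000 CpGs
replicates <- cbind(rep1 = truth + rnorm(1000, sd = 0.01),
                    rep2 = truth + rnorm(1000, sd = 0.01))

# Two-way model, absolute agreement, single-measure ICC
result <- icc(replicates, model = "twoway", type = "agreement", unit = "single")
result$value  # should clear the ~0.9-0.96 bar described above
```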
The epigenomics tech stack
So, kind of going back to what Rohan was mentioning in terms of reproducibility, we're using a lot of Bioconductor packages. R is our native environment for the pre-processing and also the quantification steps. methylclock, which was mentioned, is actually from a collaborator of ours, Juan Gonzalez, who I think manages that repository. Pretty much R will handle all the pre-processing, just because minfi has been the one that seems to translate over to a lot of academic labs. But the moment it gets into, let's say, the beta matrix format, which is the format I was describing, I think it was Joey that asked that question, that's where we branch off.
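For context, here is a hedged sketch of what that standardized minfi pre-processing path typically looks like, from raw IDAT files to a beta matrix. The directory path is a placeholder, and Noob is shown as one common normalization choice, not necessarily the exact step used at TruDiagnostic.

```r
# From raw IDATs to a beta matrix with minfi (Bioconductor).
# "path/to/idats" is a placeholder directory of IDAT files.
library(minfi)

rgset <- read.metharray.exp(base = "path/to/idats")  # raw red/green channel intensities
mset  <- preprocessNoob(rgset)                       # background correction + dye-bias normalization
betas <- getBeta(mset)                               # CpGs x samples, values in [0, 1]
```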
So, a lot of what I handle are actually the academic collaborations. If we're running these academic research and commercial collaborations on a data set where people are trying to understand, say, how people that eat vegan food versus carnivore food differ in their epigenetic ages, then we'll stay in R, because a lot of the statistical packages are just perfect there.
And for this, RStudio is how we started our entire bioinformatics business, just using RStudio. And now we've gotten into Posit, I think the UI/UX via AWS, just for a more decentralized connection into S3, with S3 being the source; decentralization is huge there. For the modeling, we used to use a lot of caret (C-A-R-E-T), glmnet for elastic net regressors, and then for feature selection, limma to do that differential methylation identification, and also, what is the other one, mutual information, sorry, yeah. So using mutual-information-specific packages, and then gradient boosting and all that is typically handled there. tidymodels was one that we tried to use, but we just weren't able to include it; still, I've got to shout out tidymodels, because that really alleviates a lot of the time that it takes.
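As a concrete illustration of the elastic net piece, here is a minimal glmnet sketch. The data are simulated stand-ins: x plays the role of a samples-by-CpGs beta matrix and y a continuous trait such as chronological age, and alpha = 0.5 is just one possible L1/L2 mix, not necessarily the setting used in their clocks.

```r
# Cross-validated elastic net on a toy samples x CpGs matrix.
library(glmnet)

set.seed(7)
x <- matrix(runif(200 * 50), nrow = 200)       # 200 samples x 50 CpG features (toy scale)
colnames(x) <- paste0("cg", seq_len(50))
y <- as.numeric(x %*% rnorm(50) + rnorm(200))  # simulated continuous outcome (e.g. age)

fit <- cv.glmnet(x, y, alpha = 0.5)  # alpha = 0.5 mixes lasso (L1) and ridge (L2) penalties
coef(fit, s = "lambda.min")          # sparse vector of selected CpGs and their weights
```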
However, Saif, who is now the actual ML guy, has shifted everything over to PyTorch and Keras to do all of the neural net, elastic net, and random forest regressors, and LightGBM, everything like that. We still use Posit a little bit there, but I think his tack is SageMaker on AWS, just because we're not running one model at a time. For example, in one project that we're working on, we're trying to use DNA methylation as a way to predict metabolite levels, and the results are actually pretty striking, but we're doing this for thousands of metabolites. If we did it serially, I mean, we'd still be working on it. So he's been using SageMaker as a way to distribute a lot of those trainings so that we're not waiting, you know, decades for it to finish.
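To show the one-model-per-metabolite fan-out conceptually, here is a hedged local R sketch using parallel::mclapply in place of SageMaker's distributed training. All inputs are simulated, and the scale is toy-sized (5 metabolites rather than thousands).

```r
# One elastic-net model per metabolite, fit in parallel across cores.
# (mclapply forks processes on Unix-alikes; SageMaker distributes at cluster scale.)
library(glmnet)
library(parallel)

set.seed(1)
betas <- matrix(runif(100 * 40), nrow = 100,
                dimnames = list(NULL, paste0("cg", 1:40)))        # 100 samples x 40 CpGs (toy)
metabolites <- matrix(rnorm(100 * 5), nrow = 100,
                      dimnames = list(NULL, paste0("met", 1:5)))  # 5 metabolites; real case: thousands

fit_one <- function(m) cv.glmnet(betas, metabolites[, m], alpha = 0.5)
fits <- mclapply(colnames(metabolites), fit_one, mc.cores = 2)    # one fit per metabolite
```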
Handling imputation in precision medicine
No, it's a fantastic question, and imputation has been the bane of our existence within the bioinformatics department. So imputation is essentially for when you have missing data: in our case, if a probe is not reliable in quantifying that CpG site, or, let's say, a metabolite level or whatever, there is a certain detection threshold that has to pass. If it doesn't pass, then it just goes in as missing.
Initially, commercially, if a data set has more than 85% missingness, we will just rerun the sample. That's actually one of the best things to do, especially because we have a lab; that's one of the perks of working at TruDiagnostic, that we can just rerun the sample and get a new data set. In the cases where we can't, some of the approaches we've used in the past are impute.knn in R, just k-nearest-neighbor imputation. But we've also, and by we, I really have to shout out Natalia Carreras and Laura, who are on the bioinformatics team, who actually went through and, using our, I think it was, 17,000-patient data set, tracked average methylation levels across sex, age, and I think race, and just replaced the missing values with the median for that group.
Now, is that the best model? No, absolutely not. But as long as 85 to 90% of the data is there, what we're trying to do is just replace what's missing with something that is plausibly logical. Now, right now, especially in methylation, people are using transformer models to create these imputations. For example, I think one is called CpGPT; that's literally what it's called. And then MethylGPT. These are the two that have been coming out, and they've been doing a pretty good job. But the problem there, again, is array-to-array compatibility. So it's a very hand-wavy thing of, we haven't figured this out, but we try to use the best that we can, presumably.
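Here is a minimal R sketch of the two fallback strategies mentioned: k-nearest-neighbor imputation via impute.knn() from the Bioconductor impute package, and stratum-median replacement (shown by sex only; age or race strata would work the same way). All data here are simulated.

```r
# Two imputation fallbacks for a CpGs x samples beta matrix with missing values.
library(impute)

set.seed(3)
betas <- matrix(rbeta(50 * 10, 0.5, 0.5), nrow = 50)  # 50 CpGs x 10 samples
betas[sample(length(betas), 25)] <- NA                # sprinkle in missingness

# (1) knn imputation: rows are probes, columns are samples
knn_filled <- impute.knn(betas)$data

# (2) stratum-median replacement: fill each probe from same-stratum samples
sex <- rep(c("F", "M"), each = 5)                     # toy stratum labels, one per sample
median_filled <- betas
for (s in unique(sex)) {
  cols <- which(sex == s)
  meds <- apply(betas[, cols], 1, median, na.rm = TRUE)  # per-probe median within stratum
  for (j in cols) {
    miss <- is.na(median_filled[, j])
    median_filled[miss, j] <- meds[miss]
  }
}
```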
Career advice and lessons from failure
I mean, my GPA in undergrad was at like a 2.8, so it was pretty bad. But, I don't know, hindsight is 20/20. I would have loved to have taken more computer science classes, or actually statistics classes. It's not even the computer science; it's more the statistics, because that's really where I fell in love with numbers and understanding numbers. But honestly, no, I don't think I would have changed much, just because the failures taught a lot, and I think that was the reason why I was able to even succeed at a PhD, with succeed meaning I finished it.
And I see this a lot with new PhD students nowadays: the brightest, the ones that were valedictorians or had 4.0s or whatever, no knock on them. I wish I was one of them; my parents wish I was one of them. But there's something about failing a lot and being comfortable in that failure that really allows you to finish that degree. And I think the confidence then gets instilled of, what are the fundamentals? Where is the starting point, and how do you expand upon that starting point?
And I think that's helped a lot in my current position, because when I came to TruDiagnostic, the bioinformatics was not developed at all; we didn't actually have a division for bioinformatics. A lot of the time, I had to learn what the hell AWS was. I had to figure out what modeling was in the context of, how do I actually deploy glmnet? And then later, how do I make this more efficient, and then make it a bit more reproducible? So I think the failures allowed me to be more comfortable. I loved that background of, I believe it was Luke, where everything's on fire, and you're still like, okay, this is fine. It all just kind of comes down to that. So yeah, accept failure as a really good teaching tool.
