Vasant Marur - Quality Control to avoid GIGO in Deep Learning Models
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Vasant Marur. I'm a Senior Data Scientist at Merck. For those of you who don't know, it's a big pharma company; we make drugs and vaccines. As data scientists, I think we can all agree: if you bring garbage in, you're going to get garbage out. And in my field, pharma, the stakes are really high.
What we use deep learning for
So, what do we use Deep Learning for? Our Deep Learning models classify images like these. These are images of cells, specifically macrophages. So, the images look like this, but sometimes they look like this. Though they look pretty, these have artifacts in them. What are artifacts? Artifacts could be debris, could be hair. Sometimes the liquid dries out, so you don't get a good image. So, artifacts are a bad thing, even though these images look pretty.
So, we need quality control to help us flag these images, so that we have really high-quality images of the cells to train our Deep Learning models, because these images help us find drugs that cure disease. Without really good quality images, our Deep Learning models will spit out garbage.
Macrophages and cell painting
So, why are we collecting these images? Let me go into what macrophages are. Macrophages are basically a type of white blood cell, an immune cell. What's their job? They kill bacteria and dangerous microorganisms, and they stimulate other immune system cells. They start off in a state called M0, and they can go to a state called M1, which is pro-inflammatory, or M2, which is anti-inflammatory.
Now, when that balance is upset, they can cause disease, ranging from autoimmune diseases like MS to neurodegenerative diseases like Alzheimer's. And if they tip the other way, toward M2, they can contribute to cancer. So, we want to find a way to control these macrophages, because if we can modulate them, then we might be able to help cure disease.
So, on the top left, you can see some example images of the different kinds of cells we have: M1, M0, and M2. But how do we take these images? We use a microscope, but you can't just use a microscope; you need to take some really good pictures. So, first, we start off with a technique called cell painting. It's literally what it sounds like: we take different colored dyes, color different parts of the cell with different dyes, and then take images in different channels.
And that lets you see every single part of the cell, as you can see on the bottom, so you get a good, complete picture. And why are we doing this? Well, the idea is that cell morphology, which is a fancy name for the shape, size, and texture of the cell, tells you what's happening to these cells. So, by taking these images and then training a deep learning model, we hope to classify them into different classes.
High-throughput screening pipeline
So, in effect, we're searching for new drugs that alleviate disease by using a technique called high-throughput screening and imaging. What's high-throughput screening? It's basically a very rapid way to test a lot of compounds against a biological target. So, you can test thousands of these compounds very rapidly, and we do that with imaging.
So, this is a busy slide, but I want to walk you through it. We start with some cells, which we get from diseased and healthy patients, and then we add compounds to them, or we change the genome a little using a technique called CRISPR, and then we do high-throughput screening and imaging. Once we have these images, you can extract lots of features from them and then do groupings and clusterings.
So, as part of the drug discovery effort, we use high-throughput screening and imaging to get these images. Now, in practice, this is what our pipeline looks like where I work. Once we get the images from the scientists, along with some metadata about what these images are, we do something called ImageQC, where we look for artifacts or images that are not good. Then we follow up with image segmentation, where we try to find every single cell in an image and its coordinates. Once we have that information, we train a deep learning model, and then we can do further downstream analysis.
So, this whole pipeline is built in Nextflow, a workflow manager, and deployed on AWS. But today, I'm mainly going to focus on the ImageQC part, because that's how you avoid GIGO: you don't want garbage going into deep learning models, because then you get garbage out.
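The stage ordering described above — QC first, then segmentation, then training — can be sketched in plain Python. These are hypothetical stand-in functions, not the real Nextflow processes; the dictionary fields and the score cutoff are made up for illustration:

```python
# Hypothetical stage functions standing in for the real Nextflow processes.
def image_qc(images):
    """Flag artifact images; keep only the ones that pass QC."""
    return [im for im in images if im["score"] > -0.5]

def segment(images):
    """Pretend segmentation: attach (x, y) cell coordinates to each image."""
    return [dict(im, cells=[(0, 0)]) for im in images]

def train_model(images):
    """Stand-in for deep learning training on the cleaned, segmented images."""
    return {"n_training_images": len(images)}

images = [{"id": 1, "score": -0.3}, {"id": 2, "score": -0.8}]  # 2nd = artifact
model = train_model(segment(image_qc(images)))  # artifact image never reaches training
```

The point of the ordering is simply that segmentation and training only ever see images that survived QC.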
The image QC challenge
So, our scientists are taking a lot of images. You know, they look like this, this, and this. So, you get the picture: a lot of images. How many images? Well, an experiment could have 40,000 images on the low end to more than 200,000 images.
So, I talked about QC. Now, imagine going through and checking every single image manually. That's tedious. It's going to be time-consuming. It's basically impossible.
So, how do we do ImageQC? Well, we decided we should automate it. And how do we do that? Well, enter Python. I know this is more of an R crowd, but I'll get to that. We had a team that had already written a tool in Python to do ImageQC. So, once we get the images, we use this in-house tool. It computes a bunch of image metrics for us and then uses a technique called isolation forest to find outliers and inliers, in effect giving us a list of good and bad images.
So, the whole process looks like this. Once we get the images, we compute the image metrics. Then we feed those metrics as features to the isolation forest model. The isolation forest can pick a threshold automatically, or you can specify one yourself. It applies that threshold and then shows us the list of good and bad images.
So, this is another busy slide; don't worry about it. We basically compute a lot of these image metrics, drawn from the computer vision field. Once we have them, we feed them to the isolation forest, which uses these metrics as features for further processing.
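To make "image metrics" concrete, here is a minimal sketch of the kind of per-image statistics such a tool might compute. The specific metrics and their names are my illustration, not the in-house tool's actual feature set:

```python
import numpy as np

def image_metrics(img):
    """A few simple per-image QC metrics (a sketch -- the real tool
    computes many more metrics from the computer vision literature)."""
    img = np.asarray(img, dtype=float)
    # Discrete Laplacian as a focus proxy: blurry or empty fields score low.
    lap = (-4 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return {
        "mean_intensity": img.mean(),           # catches overly dark/bright fields
        "intensity_std": img.std(),             # flat (empty) fields score low
        "laplacian_var": lap.var(),             # out-of-focus detection
        "saturated_frac": (img >= 255).mean(),  # bright debris and artifacts
    }

# Illustrative 8-bit fields: one textured, one flat (as if the liquid dried out).
rng = np.random.default_rng(0)
textured = rng.integers(0, 255, size=(64, 64))
flat = np.full((64, 64), 128)
```

A textured field gets a much higher `laplacian_var` than a flat one, which is exactly the kind of separation the downstream outlier detector feeds on.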
So, what does the isolation forest do, really? Once it has those image metrics, it calculates a score for each image, ranging from minus one to zero. It then applies a threshold, automatic by default, as I mentioned, and finds the outliers and inliers. Now, you can specify a threshold yourself if you think, oh, maybe 3% or 4% of my images are bad. But you see where I'm going with this: it's kind of arbitrary. If you give it a high percentage, you might reject a lot of images that are actually good. So, you want to be able to adjust that threshold.
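In scikit-learn terms (the Q&A later confirms the tool uses scikit-learn), this step can be sketched as follows. The metric values here are synthetic, purely for illustration; `contamination` is the knob that corresponds to "I think ~3% of my images are bad":

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Rows = images, columns = QC metrics (synthetic): most images cluster
# together, while a few artifact images sit far away in metric space.
good = rng.normal(0, 1, size=(200, 4))
bad = rng.normal(8, 1, size=(6, 4))
metrics = np.vstack([good, bad])

# contamination="auto" is the automatic threshold; pass a fraction
# (e.g. 0.03) instead to force a fixed percentage of flagged images.
forest = IsolationForest(contamination="auto", random_state=0).fit(metrics)

# score_samples: lower = more anomalous, in the minus-one-to-zero range
# mentioned in the talk.
scores = forest.score_samples(metrics)
flagged = forest.predict(metrics) == -1  # -1 = outlier, 1 = inlier
```

Fixing `contamination` to a high fraction forces the model to flag that many images regardless of quality, which is why an adjustable threshold matters.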
Well, how do we do that? We just run the Python tool again, right? But we're dealing with non-computational scientists. Telling them to run a Python tool is hard, especially when they're not used to the command line. They end up like the IKEA guy: we give them all the instructions, but they always end up having issues.
Building the Shiny QC tool
So, how do we help them? Because having a computational tool is not enough if people can't use it; it has to be easy for them to use. So, I was thinking, hey, we could pre-compute these metrics using the Python tool, and then it would be really cool to give them a way to adjust the threshold, see which images are getting flagged, and keep adjusting until they're satisfied with the output of the QC. Then they can go forward. And, oh wait, there's something called Shiny that can do that. So, that's where we use Shiny.
So, we have a tool like this; the interface looks like that. On the top, they can adjust the cutoff. Let me walk you through the interface. It shows a list of bad images on the top, good images in the middle, and the best quality images at the bottom. This starts off with the automatic threshold, and you can see on the top there are some images with cells and some with nothing in them. So, you do want to catch those. You can adjust the threshold, and then you see more images that are not great. They can keep doing this until they're happy.
And as you adjust the threshold, it regenerates which images are getting flagged. So, in this way, we give them a tool: they can adjust the threshold and see which images are getting flagged, and once they're satisfied, they generate the set of good images. By the way, this is all deployed using Connect on AWS. That allowed us to serve them a lot of data, and the app still worked great. So, thanks to Connect.
So, just to give you an idea of how many images get flagged: on a plate, we get about 3,000-odd images. If you choose the automatic threshold, it can reject up to 10% of these images, which is really not great, because you might be losing images that are actually good. So, we can keep adjusting that threshold, and you can see the number of flagged images go down.
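The trade-off the slider exposes — tighter cutoff, fewer flagged images — can be sketched directly on the isolation forest scores. The plate size, cluster positions, and cutoff values below are synthetic, chosen only to illustrate the monotone relationship:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# ~One plate of synthetic metric vectors, plus a few artifact-like outliers.
metrics = np.vstack([rng.normal(0, 1, size=(3000, 4)),
                     rng.normal(6, 1, size=(30, 4))])
scores = IsolationForest(random_state=0).fit(metrics).score_samples(metrics)

# Lowering the score cutoff flags fewer images -- this is the trade-off
# the app's slider exposes to the imaging scientists.
flagged_counts = {cut: int((scores < cut).sum()) for cut in (-0.55, -0.60, -0.65)}
```

In the real app the scientist eyeballs the flagged images at each cutoff instead of just counting them, but the mechanism is the same: a single scalar cutoff on the anomaly score.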
So, that's why we decided we should have an interactive, web-based version the imaging scientists could use, to be the human in the loop, so to speak, and check the quality of the images themselves, because without that, they're not going to be satisfied with what our tool is doing.
Training the deep learning model
So, once we have these images, now we can do deep learning. On the left is an example of a 384-well plate; it's basically what we put all the samples in. To train our deep learning models, we take images from the first two columns and the last two columns. These are the control regions on the plate: in a biological experiment, controls are wells where we already know what's in them, so they give us labeled data for our deep learning models.
Now, as you can see in the middle, these are the images that come out of those wells, and some of them have artifacts. So, we use our QC tool; it flags those images, as it should. Once that's done, we remove those images, and then we train the model. The main goal is to make sure the deep learning model is trained on really high-quality images, images that capture actual biology or changes happening in the cells, and not artifacts, because, again, if they have artifacts, you can't trust the results coming out of these models.
So, we use deep learning models to extract features from these images, train a model to learn the difference between the different classes, and then generate a prediction for each well in the sample region, which is all the yellow in the middle. On the left and right are the controls, like M0, M1, and M2, and in the middle are sample wells, where we don't know what's going on: we add compounds to perturb them, and then we try to see whether they mimic the controls on the left or the right. That's how we generate predictions from our deep learning models.
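The "train on labeled controls, predict the sample wells" step can be sketched with any off-the-shelf classifier on extracted features. Here the feature vectors and cluster positions are synthetic stand-ins for the deep learning embeddings, and the classifier choice is mine for illustration, not the talk's actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Pretend embeddings from a feature extractor: one vector per control well.
# Synthetic clusters stand in for the M0 / M1 / M2 morphologies.
centers = {"M0": 0.0, "M1": 3.0, "M2": -3.0}
X_ctrl = np.vstack([rng.normal(c, 1, size=(40, 8)) for c in centers.values()])
y_ctrl = np.repeat(list(centers), 40)

# Fit on the labeled control wells from the plate edges.
clf = LogisticRegression(max_iter=1000).fit(X_ctrl, y_ctrl)

# Sample wells: unknown perturbations; predict which control state they mimic.
X_sample = rng.normal(3.0, 1, size=(5, 8))  # synthetic wells near the M1 cluster
preds = clf.predict(X_sample)
```

Each prediction says which control phenotype a perturbed well most resembles, which is the readout the biologists care about.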
Recap and acknowledgements
So, to recap, combining the existing Python code with a Shiny-based app for thresholding enables us to get really high-quality images into our deep learning models. This is how we did it: we already had a different team working on image analysis, and they had developed this Python tool. Did we need to rewrite it in R? Did we need to redo it all? No, we could just take it. Then we developed the Shiny tool to take the output from the Python tool and give an interface to our imaging scientists so they could use it, and, you know, everybody's happy this way.
So, without reinventing or rewriting anything, we could use the tools that were already written. In this manner, we could get a really good deep learning model whose results we could trust.
So, of course, this wasn't possible with just me; a huge team worked on it. I'm part of the data science team, and we collaborate with the image analytics team and the biology team working on macrophages, and, of course, without IT, you can't get anything done, so a huge shout-out to them. And then, of course, all the imaging scientists who helped us refine the tool, used it, and gave us feedback. So, with that, thank you.
Q&A
I was wondering how long it took you to realize that you needed a Shiny app and how long it took you to make that.
So, it started off with someone doing a Python Jupyter notebook; they were like, let's just give a Jupyter notebook to the scientists. I'm like, no. I personally am not a fan of notebooks; I hate notebooks. You have code in different cells, and you're expecting a non-data scientist to go, oh, run this cell, then run this cell, then run this cell, and then give us an output. It would never work. We tried that a few times, and then we said, you know what? I think I can do this in Shiny, and we have Workbench; we can include the Python code, set up the environment, and just do a Shiny app. So, we stood up the first version of the app very quickly, in a few weeks, and then we redeveloped it with more features. So, long answer to that earlier question: about two months, with a lot of feedback.
So, would a similar approach work in R? Yes, it would. For those not familiar — I realize I kind of skipped over this — we use scikit-learn in Python, and Python already has a lot of image utilities and packages that make it really easy to develop. Those don't exist in R yet, unfortunately, but it works in Python, so we just use that. We use what's available; I'm a major fan of not reinventing the wheel. So, could I do this completely in R? No, not completely.
So, why not just train the deep learning models to find the artifacts? Well, this is biology: nothing works, and biology trumps math and data science. Things keep changing. When we first started this series of experiments, we ran into something called batch effects: if you run an experiment today, run it again a month later, and try to use a deep learning model to classify it, it's not going to work, because someone changed something in the experiment, or something went wrong with the liquid handling, or the assay, or how many cells were used. So, we didn't have enough data to just use deep learning straight off. We tried it; it would work sometimes, but it would miss a lot of things. So, that's a work in progress. Once we have more data, then I can skip this and go back to using deep learning.
So, what scale is the back end of the Shiny tool on AWS, like cost and compute volume? Hard numbers. Cost: fortunately, I don't have to worry about that; IT takes care of it. It's amazing when you work for a big pharma. But I do think about it, and it doesn't cost that much; we can use very cheap spot or EC2 instances. Volume: that was just one experiment. For the plates I was talking about, we collect about 13,000 images; that's one plate. If an experiment has eight plates, multiply that. And over the past two years, we've done about 50 or 60 experiments with varying numbers of plates. So, a lot of data.
So, I wonder if the use of different thresholds would affect the model training. Yeah. So, seriously speaking, deep learning is robust enough to handle the kinds of images I showed you; it can shrug them off. But our goal was to have really trustworthy results coming out of these deep learning models, because the biologists in the room want to know: is it just artifact? Is it noise? Is it just magic? No, no, it's actually learning from your data. That's one of the reasons why. We did train models without this step; they'll still work, but you can't be sure what they're learning from. This way, we can be sure the model is not learning from noise or artifacts in the data; it's learning from actual biology.
We're going to close with this good example of a collaboration between a data scientist and a subject matter expert. Yep, thank you. Thank you for being a great audience in this session.
