Brendan Graham - A Machine Learning Approach to Protect Patients from Blood Tube Mix-Ups
videoimage: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
My name is Brendan Graham and I am a data scientist at the Children's Hospital of Philadelphia. And I'm really excited to talk to you all about a project I've been working on with a few colleagues around how we're using machine learning to prevent patient harm in the hospital.
What is a wrong blood in tube error?
So to set the stage, every day thousands of tubes of blood make their way into the lab to be analyzed. And these tubes of blood serve a really important purpose, which is to inform the care of patients in the hospital. Clinicians need to make decisions about diagnoses and treatment plans, and to monitor the status of their patients. And in the hallway on the fifth floor of the lab, there's a poster reminding us that behind every specimen, behind every tube of blood, is a patient.
And oftentimes that's a sick patient. Let's call this patient John Smith. John Smith's clinician is worried that he might have an infection, so they order some labs to check for it. So John Smith's blood is collected, put into a tube, and it makes its way into the lab to be analyzed. Somebody, one of the lab technicians, reviews those results. Those results are then published to the medical record, where they're accessed by the clinician caring for John Smith. This forms the typical life cycle of a blood test. But sometimes the wrong decision is made, or the wrong treatment plan is implemented, because there are wrong results in the chart.
And those wrong results are not because a machine made a mistake, and not because of a human error by somebody reviewing the results. Rather, the wrong blood was put into the tube in the first place, before it ever even got to the lab. And when that happens, this workflow looks something like this, where now there's a second patient in the mix. Let's call her Jane Smith.
And Jane Smith's blood was accidentally put into the tube labeled for John Smith. And once that happens, that tube proceeds along this pipeline, this workflow. And now we have the clinician making treatment decisions for John Smith based on the blood results for Jane Smith.
And this is what's called a wrong blood in tube error, or a WBIT. And as you can guess, the consequences of a WBIT can be really severe. The wrong medications can be administered, a needed medication might be withheld, and there can even be life-threatening reactions from transfusion complications.
To make this an even harder problem to solve, these errors are often silent, meaning once the blood is put into the tube, there's nothing stopping the downstream processes from proceeding, right? It looks like blood in a tube, and everything can proceed as normal. So this is a really difficult problem to detect, but potentially very severe in terms of patient harm and outcomes.
Current detection methods
But the goal of my talk is not to scare everybody. There are measures in place to prevent this from happening. So at the time of collection, there's a barcode scan that can happen. So on the wristband, there's a barcode. On the tube of blood, there's a barcode. Both are scanned at the time of collection to make sure that a mistake hasn't happened.
But you can't account for all human error, and so mistakes do end up happening. So in the lab itself, there's an automated comparison that takes place. Lab machines can perform what's called a single analyte delta check, which is really just a fancy way of saying: take one component of the blood, one analyte (in this example, a red blood cell count), and compare it between the patient's current sample and their previous sample over some short enough time frame. And if the delta, the difference between those red blood cell counts, is small enough, then we can conclude that it's most likely the same patient's sample, right? Over a short enough time frame, certain characteristics of the blood shouldn't change too much.
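A minimal sketch of that logic in R, assuming illustrative column names and an arbitrary cutoff (not the lab's actual rule):

```r
library(dplyr)

# Toy single-analyte delta check: flag a sample when the red blood cell
# count jumps too far from the same patient's previous sample.
cbc <- tibble(
  patient_id   = c("A", "A", "B", "B"),
  collected_at = as.POSIXct(c("2024-01-01 08:00", "2024-01-01 20:00",
                              "2024-01-01 09:00", "2024-01-01 21:00")),
  rbc          = c(4.5, 4.6, 3.1, 4.9)
)

delta_checked <- cbc |>
  arrange(patient_id, collected_at) |>
  group_by(patient_id) |>
  mutate(
    rbc_delta  = rbc - lag(rbc),                           # current minus previous
    delta_flag = !is.na(rbc_delta) & abs(rbc_delta) > 1.0  # 1.0 is an arbitrary cutoff
  ) |>
  ungroup()
```

Here patient B's second sample would be flagged (a delta of 1.8), while patient A's small change would pass.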
And we were actually really interested in this automated approach, because in these tubes of blood, and in everybody's blood right now, there's more than just red blood cells, right? We have lymphocytes and monocytes and white blood cells and platelets and hemoglobin and all these different components of the blood. So could we use all of them? Could we develop a multi-analyte approach to detect wrong blood in tube errors using machine learning? Could we implement it in the clinical workflow to ultimately prevent patient harm from occurring? With these goals in mind, I'm going to talk through a project that I've been working on with a few colleagues in the labs to do just this, some of the constraints that we came across, and how we overcame them.
Building the dataset
So my job as the data scientist on the team is to start building up our data set. And we wanted to focus on complete blood count tests or CBCs. This is a very commonly used test in the hospital and all over. I would imagine everybody in this room has had a CBC test at some point, most likely.
And so I write some SQL code against our data warehouse and extract the results into a data frame in R to start the analysis. But we're now faced with the reality that for the thing we're trying to predict, these WBITs, there's no indicator in the data that flags when they happened.
As I mentioned before, these are silent errors. And it's almost impossible to go back retroactively and understand in the data when a WBIT error has occurred. There's no logic we can implement that says if XYZ outcome occurs, it's always because a WBIT error happened that caused that outcome. They're silent. And so we turned to a methodology in the literature that other researchers have used, which is to simulate how a WBIT error might occur in practice.
Before I walk through that, I want to orient everybody to what our data set really looked like. Before our simulation, in this fake data set, we have columns for patient ID and when the collection was taken, when the specimens were taken. We also have the actual results from the blood test: white blood cell count, red blood cell count, and a dozen and a half other results. Then for every sample, for every patient, we grab their previous sample and get their previous results. And then we take the difference between those two results across all of the analytes. Remember, this is our multi-analyte approach.
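In dplyr terms, those per-patient, per-analyte deltas can be sketched like this (the analyte columns are a made-up subset, not the real schema):

```r
library(dplyr)

analytes <- c("wbc", "rbc", "hgb", "plt")  # subset of the ~18 CBC analytes

cbc <- tibble(
  patient_id   = c("A", "A", "B", "B"),
  collected_at = as.POSIXct(c("2024-01-01 08:00", "2024-01-02 08:00",
                              "2024-01-01 09:00", "2024-01-02 09:00")),
  wbc = c(7.1, 7.4, 12.0, 11.5),
  rbc = c(4.5, 4.6, 3.1, 3.2),
  hgb = c(13.1, 13.0, 9.8, 10.1),
  plt = c(250, 245, 480, 470)
)

predictors <- cbc |>
  arrange(patient_id, collected_at) |>
  group_by(patient_id) |>
  # One delta column per analyte: current result minus previous result.
  mutate(across(all_of(analytes), ~ .x - lag(.x), .names = "{.col}_delta")) |>
  ungroup() |>
  filter(!is.na(wbc_delta))  # drop each patient's first sample (no previous result)
```

Each remaining row carries one delta per analyte, and those deltas are the model's predictor set.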
And these deltas actually form the predictors in the model. So we have our predictors, but we don't have an outcome. To simulate these outcomes in a way that approximates how they occur in practice, we used the matching algorithm from the matchit package, matching temporally and on department. So if we have a patient in our data set in the neonatal ICU, the NICU, what we want to do is find another patient in the NICU who had their sample taken around the same time. And once we have a match, we overwrite the results of one patient with the matched patient's results, and we introduce a synthetic WBIT error.
And that looks something like this. Let's say we ran our algorithm on this fake data set and matched patient 123 with patient 789, because they had their samples collected only a half hour apart, and, for the sake of example, they're in the same department. We take the results from patient 789 and overwrite the original results of patient 123. And then when we take the deltas, we can see that the white blood cell count delta in the simulated data set is 9.7, compared to 1.6 in the unmodified data set. So we've introduced this WBIT error, which is reflected in this really high delta, and which allows us to create a new column, a WBIT indicator column.
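The project itself uses the matchit package for the matching step; purely to illustrate the idea, here is a hand-rolled sketch of the same logic (same-department donor, nearest collection time, then overwrite and flag), with made-up column names:

```r
# Simulate WBITs: for each sampled row, find a different patient in the
# same department with the closest collection time, then overwrite the
# row's results with that donor's results and mark it as a synthetic WBIT.
simulate_wbits <- function(df, frac = 0.5, analytes = c("wbc", "rbc")) {
  df$wbit     <- FALSE
  df$donor_id <- NA_character_
  targets <- sample(nrow(df), size = floor(frac * nrow(df)))
  for (i in targets) {
    pool <- which(df$department == df$department[i] &
                  df$patient_id != df$patient_id[i])
    if (length(pool) == 0) next  # no eligible donor in this department
    j <- pool[which.min(abs(as.numeric(
      difftime(df$collected_at[pool], df$collected_at[i], units = "mins"))))]
    df[i, analytes] <- df[j, analytes]  # swap in the donor's results
    df$wbit[i]      <- TRUE
    df$donor_id[i]  <- df$patient_id[j]
  }
  df
}
```

After this, deltas recomputed on the corrupted rows look like real WBITs, and `wbit` becomes the outcome column for training.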
And we repeated this approach for our entire data set. In this stylized example, it took us from the situation on the left, where we had no WBITs in our data set, to running the algorithm on 50% of the samples and finding matches. So now we have a perfectly balanced data set in terms of the outcome, which is great for model training. But keep this in the back of your mind, because as we talk about model assessment, this is going to play a pretty big role in how we assess our models.
Modeling with tidymodels
But speaking of models, we move on to doing some modeling. We want to develop a binary classification model: WBIT or no WBIT. And using the tidymodels framework, which I would imagine lots of people are familiar with, this was incredibly easy. We have a couple different types of models ranging in complexity, and a couple different types of preprocessing and predictor sets that we're thinking about using. Using the high-level example code on the left there, mainly the workflow_map() function from tidymodels, we were able to pretty easily work through training and validating all these different combinations.
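The slide code isn't reproduced in this transcript; a minimal sketch of that kind of setup, with made-up training data and made-up predictor names, might look like:

```r
library(tidymodels)

# Made-up training data: per-analyte deltas plus a simulated WBIT outcome.
set.seed(123)
train_data <- tibble(
  wbit      = factor(sample(c("wbit", "ok"), 200, replace = TRUE)),
  wbc_delta = rnorm(200),
  rbc_delta = rnorm(200),
  hgb_delta = rnorm(200)
)

# Two preprocessing recipes crossed with two model specs of differing complexity.
rec_basic <- recipe(wbit ~ ., data = train_data)
rec_norm  <- rec_basic |> step_normalize(all_numeric_predictors())

log_spec  <- logistic_reg() |> set_engine("glm")
tree_spec <- decision_tree(mode = "classification") |> set_engine("rpart")

wf_set <- workflow_set(
  preproc = list(basic = rec_basic, normalized = rec_norm),
  models  = list(logistic = log_spec, tree = tree_spec)
)

folds <- vfold_cv(train_data, v = 5, strata = wbit)

# One workflow_map() call resamples every recipe x model combination.
results <- wf_set |>
  workflow_map("fit_resamples", resamples = folds,
               metrics = metric_set(roc_auc))
```

The payoff is that adding another model type or recipe is one more entry in a list, not another hand-written training loop.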
The full technical details of the modeling are kind of out of scope for this talk, so I'm not going to get too into the weeds, but if you are interested, feel free to stop me. I'd be happy to talk about it some more. So we're at the point now where we have our models, we have our data, we train our models using cross-validation, and we want to see how the models perform.
So here I have some ROC curves for the two different CBC test subtypes. If you're unfamiliar with ROC curves, basically the curves towards the upper left of the plots are performing better. The colors here represent the different model type and preprocessing combinations. So we can see a range of performance profiles in terms of ROC curves, but the best models performed extremely well, which wasn't too surprising: we have this simulated data, we've injected these errors, and it's not surprising that the predictors carry a lot of information about when a WBIT error occurs, because we've overwritten the results with a different patient's.
Thinking about clinical implementation
So it's not totally surprising. But also, this isn't just a model development exercise. We're not just concerned with the fact that it's cool to make models with nice data. We actually want to think about implementing this, right? Who could use these models, and how would they use them in their workflow? What happens if we get a prediction wrong? So we switched our mindset from model development to the clinical workflow, and faced a new set of constraints, specifically around false positives.
So here I have on the screen the classic confusion matrix, a two-by-two table comparing predicted versus actual values. On the top row, we have true positives and false positives; on the bottom row, false negatives and true negatives. And the false positives are highlighted for an important reason. You can think about a hypothetical future scenario where we have this model in production and we make a prediction: we see a sample and we think, oh, this is a WBIT. Now we're living in that top row. Either we're correct and it's a true positive, or we've made a prediction and it's a false positive, not a real WBIT. And in the context of the clinical lab, a false positive translates to a lot of extra work that somebody now has to do. They probably have to rerun that test to make sure they get the same weird result, which delays all the other tests in line, which has downstream impacts on all those patients and the clinicians waiting for those results.
They might have to find the ordering provider, call them, get in touch with them somehow, and say, hey, did you really order this blood test for John Smith? Because there has been anecdotal evidence in the past where a blood result gets published in the chart and the provider says, I never ordered this. So one quick way to check for a WBIT is just confirming, hey, did you actually order this for this patient? Again, that's extra work on the backs of the lab staff. And at the scale that the CHOP lab operates, it has real impacts on turnaround time and overall lab efficiency.
So it's something we want to avoid. Another idea that's been floating in the back of our minds is alert fatigue. People who work in a clinical capacity, whether on the front lines or in the background like the lab staff, are bombarded with medical devices beeping, the PA going off, and pagers going off. And even in the software itself, there are alerts popping up reminding them to do something. What we're talking about here is making yet another alert telling them to do something. So the last thing we want to do is have that alert firing a lot unnecessarily.
Evaluating models on realistic prevalence
So with these two ideas in mind, we were thinking, as we move to evaluate the models, maybe we should move on from the more traditional classification metrics like accuracy and ROC that I showed earlier, and think about positive predictive value, which is the proportion of positive predictions that are true positives. If we can identify a model with a high positive predictive value, we can be as confident as we can that when we raise an alert, it will hopefully not result in a lot of unnecessary extra work. There will always be false positives, but we can try our best to reduce them.
One of the issues that we ran into with this, though, is that PPV is extremely sensitive to the prevalence of the outcome. And if you remember a few slides ago, when I talked about how we took our training data and made it perfectly balanced, in those nice bar charts on the left: if we were to assess our models with the holdout data, with data the models haven't seen yet, and it's structured the same way, we would vastly overestimate the prevalence of the outcome, and we would probably overestimate the model PPV. And that has, like I said, real-world implications for generating a lot of false positives.
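That sensitivity follows directly from the standard identity PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev)); a few lines of R make the point, using made-up sensitivity and specificity values:

```r
# PPV from sensitivity, specificity, and prevalence (Bayes' rule).
ppv_at <- function(sens, spec, prev) {
  tp_rate <- sens * prev              # expected true-positive fraction
  fp_rate <- (1 - spec) * (1 - prev)  # expected false-positive fraction
  tp_rate / (tp_rate + fp_rate)
}

ppv_at(0.99, 0.99, 0.5)       # balanced data: 0.99
ppv_at(0.99, 0.99, 1 / 4000)  # ~1-in-4,000 prevalence: about 0.024
```

The same classifier that looks near-perfect on balanced data would flag roughly forty false positives for every true WBIT at a realistic prevalence.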
So to help get over this constraint, we chose to downsample the WBITs in the holdout data set, such that the WBIT prevalence in our holdout data set matched the real-world prevalence of WBITs reported in the literature. And the reported prevalence is something like one in 3,000 to one in 5,000 samples, so it's pretty low. What that means practically, in our data set, is that we have about 11,000 or so samples in the holdout data set, and we downsample them so that only three WBITs remain out of 11,000. So the idea is to unleash the models on this unbalanced, downsampled, low-prevalence validation data set and then see how the models perform in terms of PPV.
Another thing to think about, though, as we assess the models, is the threshold that we want to raise an alert at. The default 50% threshold is probably too low; it might generate a lot of false positives. But what is the optimal threshold? Is it 90%, 95%? We don't know that right now; it's not obvious. So using the model validation data set, we used the probably package to iterate across a bunch of different probability thresholds and then reassessed the model performance at each threshold, again on this unbalanced, low-prevalence validation data set, which looked something like this.
So again, on this unbalanced data set, using a probability threshold of 0.95, we have 61 false positives and a PPV of 0.05, which is pretty low even at this high threshold. And that makes sense: we have a really low-prevalence outcome. But one of our goals here was to see how high we could push this threshold while still capturing those three true positives that we know are in our validation data set. So again, using the probably package, we were able to iterate across thresholds. Here, bringing it up to 0.99, we see a reduction in the false positives and a slight increase in PPV, to just under 0.09. And as we increase the threshold even more, we see that same pattern: fewer false positives and a PPV of about 0.21, which again is pretty modest, but given the low prevalence of this outcome, we were actually pretty happy with it. And if we increase the threshold even higher, we start to drop one of those three true positives. And that violates one of the criteria that we really care about, which is making sure that we can flag the ones that we know exist in our holdout data set.
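The talk uses the probably package for this sweep (for example its `threshold_perf()` helper); a hand-rolled version of the same idea, run on made-up holdout predictions with three true WBITs among roughly 11,000 negatives, looks like:

```r
# Sweep alert thresholds on holdout predictions, counting true and false
# positives and computing PPV at each cutoff.
sweep_thresholds <- function(truth, prob, thresholds) {
  do.call(rbind, lapply(thresholds, function(thr) {
    pred_pos <- prob >= thr
    tp <- sum(pred_pos & truth)
    fp <- sum(pred_pos & !truth)
    data.frame(threshold = thr, true_pos = tp, false_pos = fp,
               ppv = if (tp + fp > 0) tp / (tp + fp) else NA_real_)
  }))
}

# Made-up predictions: the three real WBITs score high, most negatives near zero.
set.seed(1)
truth <- c(rep(TRUE, 3), rep(FALSE, 10997))
prob  <- c(runif(3, 0.96, 1.0), rbeta(10997, 0.5, 20))

sweep_thresholds(truth, prob, c(0.95, 0.99, 0.995))
```

Raising the threshold can only shrink the set of flagged samples, so false positives fall monotonically; the question is how far you can push it before a true positive drops out.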
Another positive, if unintended, effect of this kind of analysis is that it helped to frame the discussion at higher levels, like hospital administration, on the trade-offs of implementing a model like this. We know there will be false positives, and we know there will be some unnecessary work. But we also hope that we prevent unnecessary patient harm if we're able to catch a WBIT error. So there's a trade-off there, and this sort of analysis across thresholds can at least help inform that discussion: what we estimate the false positive rate to be, versus the benefit of potentially catching these WBIT errors.
Next steps and deployment
Again, this is in the future as we think about implementation. And speaking of implementation, over the next few months we plan to dive deeper into figuring out how a model API that lives on our Connect server interfaces with the lab middleware, the machine that does the analysis. Because the model needs to know, hey, a test is ready for you to check. And then if we make a prediction and we need to raise an alert, we may need to point that API, or that alert, to the EMR software, and somebody has to see that, which is another system. And then maybe a lab technician, or a front-end provider, has to have an alert pop up. So we're really excited to dive into this next set of constraints that we hope we can overcome, implement this in the workflow, and get all these systems talking.
Because as I mentioned before, even though this is a low-prevalence outcome, it's an instance of preventable patient harm. And this is that poster; I took the picture of it in the hallway in the lab. Behind every one of those tubes of blood, there's a patient, and their family, and their providers, who are all counting on this process to go right. And we hope that by implementing these models, we'll be able to deliver on that. So thank you.
Q&A
I think we have time for probably a few questions. Do you have problems with true WBITs that you didn't simulate in your training data introducing noise into your model?
For sure we do, but it's hard to know exactly where they are. That's kind of the whole crux of the problem. So yes, the data is noisy for sure in ways that it's hard to account for.
And how do you avoid error in a first-time visit or single blood collection, for example, like an outpatient?
Yeah, good question. So one inclusion criterion that I did not mention is that we're only focused on inpatients, number one, because that error is more likely to occur in an inpatient setting than if you're just coming in for an office visit. And as far as patients who don't have a previous sample, one of the inclusion criteria we used was that a patient had to have a CBC within the last seven days. In an inpatient setting that's not too restrictive, but there does have to be a sample within some relative timeframe for you to take that difference. And it has to be short enough that there's not a real change in the blood, especially with younger patients; infants just a couple of days old can fluctuate a lot, so we have to keep that timeframe pretty short.
Okay. I think that's all the time that we have, but thank you so much. Thank you.
