Resources

Adam Wang - Why You Should Think Like an End-to-end Data Scientist, and How

video
Oct 31, 2024
19:34

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Today I want to posit the following idea to you all, and that is that most data scientists like us would benefit from thinking more end-to-end. I'll make a case for why I believe that, and I'll also share some principles on how you can start thinking end-to-end and benefit your own organizations. And I want to start by defining what I mean by end-to-end, and that is bringing ideas to life. I think we already saw some compelling examples in this session, but I'll give another one, and that is how we think end-to-end at my company. At NMDP, our mission is to save lives through cell therapy. And I think, like many of you, we come up with lots of ideas to advance the mission of our organization, and for us that really means thinking about our patients and donors. Many of our patients have blood cancers like leukemia, and their best shot at a cure is to find a perfectly matched donor.

And there are many things that go into this matching, one of them being how ready our donors are, how likely they are to be available when called upon. Because unfortunately, for a variety of reasons, not all our donors will be available, and that can make a stressful situation for a searching patient even more stressful, as many of them don't have the luxury of time. So one of the ideas that we had, which came out of a lot of conversations with both our internal and external partners, is: can we predict how ready a donor is going to be for donation before they get called? Bringing this idea to life would mean providing this prediction to our physicians in real time, so that when they're talking to their patients every day, they can run a search and make the most informed decision for their patient. And so thinking end-to-end is really about going from the idea phase, usually just a conversation, all the way to a final product that benefits the users who need the information the most.

The gap between prototype and production

And so it's very challenging, especially when it's your first foray into a new kind of domain. For us, this was our first foray into machine learning. We wanted to build this new donor readiness score to provide this information to our transplant center physicians. On one end, at the start of the process, you have me, the developer, in a rapid ideation phase: coming up with ideas, doing some exploratory data analyses, and getting a local prototype of this donor readiness score. But we also have to think about the other end, and that is our physician end users. They are most concerned about having this donor readiness score available to them in the medical system that they use, and they want to ensure that it's up to date and that they can trust it. We also want to protect our donor information, to protect their privacy. So there's a big gap between a prototype on my laptop and something that's going to be useful to our hundreds of external network partners and end users, inside the medical systems and workflows they use day-to-day. And I think this is a gap a lot of data scientists face: how do you get your work into the hands of those who would benefit most from it?

And I'll start by saying you definitely don't want to say, it works on my machine, I'm good, and ship your machine to hundreds of different people. You might as well teach them R, Python, Docker, and all that stuff while you're at it. It's not going to work.

So what worked for us was to try to loop in other teams. Here we start in our comfort zone: the business, the ideas, the developer, the data scientist, you and me. And I think this is a pretty comfortable workflow where we often get questions and we provide answers. But when it comes to these big new projects and trying to get buy-in from the entire organization, it becomes a little more daunting: how do I make sure my thing is up to date every single day when it's going to be sent to external clients? How do we make sure they trust what we're building for them? That's when we typically think of our engineering teams and try to loop them in, because our data engineers are experts at building scalable data pipelines and protecting our donor information. Our DevOps engineers are experts at scaling compute into the cloud and building the infrastructure that we need. And of course, our software engineers help build the applications that our transplant center partners can use and integrate into their workflows.

So then the question becomes, how do you hand off from your comfort zone to this engineering zone? And it's actually really tricky, because the teams in that engineering zone don't necessarily work together all the time. They're not one cohesive unit, especially when we're introducing new machine learning functionality, so there are many handoffs within the engineering zone itself. It's also really tough because I wasn't a data engineer. I wasn't a DevOps engineer. I didn't really know what would help them the most. I had some R scripts, some Python scripts, some SQL scripts. I knew what the donor readiness score should look like, what goes into it and what comes out. So initially we just tried a clean handoff process: here are the requirements, here is how we're building the donor readiness score locally, can you help us deploy this in production, automate it, and build a custom application to get it into the hands of the end users? And after several months of work, we eventually got a solution that was successful. Our partners were able to view this donor readiness score in their systems.

And so you might be wondering, what's wrong with this approach? Why can't you just hand off your data science work to other teams? That's because nothing you build is ever going to be perfect, especially on the first iteration. There are always going to be things that you want to improve and things you want to add, like improving the donor readiness score because we know of some business process change. And this turned out to be really, really hard, because the donor readiness score that we developed in our comfort zone looks completely different once translated through the engineering zone. To them, our model is a black box. And within each of those handoffs, the data, the DevOps, the software pieces, there are also mini black boxes. So it became really hard for me to make changes to donor readiness. And on the flip side, they didn't necessarily want to make such a big investment, getting four teams together in a room with a product manager, just to make a potentially small change like improving accuracy by two or three percent.

And so we spent a long time trying to think about how we could overcome this and make the handoff smoother. After about a year of trial and error, we landed on a solution: break down these barriers, increase our comfort zone as data scientists, and get more comfortable talking about more parts of the data science life cycle.

Why expanding your comfort zone pays off

And this isn't going to happen overnight. You know, there's no free lunch, except maybe at PositConf. But I would say that the investment in expanding your comfort zone is going to pay off in the long run. And why do I think that? First, you'll have full context of the entire process, and this is extremely helpful because you have a bird's eye view of what's going on. I think it's really important to be able to pinpoint, hey, if there is some update to a donor, how that flows into your database, how that flows to your model, and how your model's output gets sent back to the people you're building it for. Because when issues come up, you can spot check a few cases and really identify where in the pipeline things are going wrong, especially when there are handoffs between four different teams.

Secondly, adding new functionality becomes so much easier if you can start to learn a little bit of engineering language, nothing too serious, just enough to provide context for your requests and the features you want to build. I'm sure we all get questions from the business like, hey, I need this data in two days, and I'm not going to tell you why I need it or what I'm going to do with it. That kind of makes you roll your eyes. You wish you had the context that would make it easier to work with them and provide a solution that works. It's the same thing with our engineering teams: we want to give them enough context to be able to build the best solution that's agreeable to all parties.

And finally, I want to stress that thinking end-to-end means that you're not alone. Sometimes people think, if I have to think end-to-end, that means I have to do everything from start to finish; I'm going to become one giant silo, just working alone. And I can guarantee you, you're not going to replace your DevOps engineers or your data engineers. They're still going to be there. But you'll be working more closely together with them, instead of filling out an IT request, waiting a couple of weeks, hoping it moves through the IT system, and then finally getting to talk to someone about what you're trying to do. You'll be able to work with them at multiple touch points throughout the entire process.


And if those aren't good enough fundamental reasons for you, a practical one is that you're going to be forced to think end-to-end anyhow. I'm sure many of you have gotten an email like this: Adam, I noticed that the donor readiness score is not populating correctly in my application UI, whatever that means, can you look into it? Again, if you have this kind of disjointed structure, it's really easy to just roll your eyes and complain: I don't build applications like a software engineer, let me talk to my software developers. But they're probably going to say, well, we haven't made any release updates; there's probably something going wrong with the compute in the DevOps process. And then they might point fingers back at the engineers, and you're just on a wild goose chase. If, instead, you're able to think about how data flows, from a donor update, to the database, to your model, to your prediction, it's really easy to identify where in that process something has gone wrong. And when you've done your homework and you can build your case, provide some context, show what you've done, and show why you think it's this way, you're going to have a more receptive audience, in this case our DevOps team. And yes, this is based on a true story.

And I want to encourage you not to think of these emails as something terrible. In fact, I think they're kind of a good thing, because they mean that the business stakeholders value your work. They clearly also want to advance the mission of your company by making sure that this thing works, and they trust you to know how to resolve it. Because of that trust, I would encourage you to use these emails, these issues that come up, as an opportunity to start thinking end-to-end, because you have a great reason to figure out what's going on with your donor readiness score pipeline, your model pipeline, your application pipeline.

End-to-end in practice: email alerts

And so I'll go into an example of what doing end-to-end work can look like. In this case, I talked with my DevOps team, and after we figured out this issue, we thought, wouldn't it be awesome if, instead of getting an email from a stakeholder, we could email ourselves and catch this in advance? Implementing email alerts sounds like a great idea. And starting in the traditional data science realm, it is overall pretty simple. We have this store.py module, which essentially makes a prediction based on an event, usually data coming in, and we get our score. Then we have some logic based on what we know of potential errors that have happened in the past: if that prediction fails, we send an email with a useful message, and we can catch other errors, too. Nothing too crazy, pretty typical code that we would work with.
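As a rough sketch, the alerting logic described here might look something like this. This is my illustration, not the actual store.py code; the function names, the placeholder score, and the error messages are all assumptions:

```python
# Illustrative sketch of the "alert instead of fail silently" pattern.
# predict_readiness and send_email are hypothetical stand-ins.

def predict_readiness(event: dict) -> float:
    """Stand-in for the model call; raises on malformed input."""
    if "donor_id" not in event:
        raise ValueError("event is missing donor_id")
    return 0.87  # placeholder score


def send_email(subject: str, body: str) -> None:
    """Stand-in for the real email hook (e.g. a publish to an alert topic)."""
    print(f"ALERT: {subject} -- {body}")


def score_event(event: dict):
    """Score an incoming event, emailing an alert instead of failing silently."""
    try:
        return predict_readiness(event)
    except ValueError as err:
        # A known, anticipated failure mode: send a useful message.
        send_email("Donor readiness prediction failed", str(err))
    except Exception as err:
        # Catch anything else so someone still hears about it.
        send_email("Unexpected scoring error", repr(err))
    return None
```

The point of the pattern is simply that every failure path ends in a notification rather than a quiet `None` in a downstream system.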

But then you start asking questions like, you know, Adam, what does this send email function look like? Don't we have to send emails from our AWS server to our internal NMDP email server? As a data scientist, that's not something you would typically think about, but it's definitely something you would need to, in this case, to get your alerts working properly. But don't panic, because remember, you're not alone. This is a great opportunity to work with your DevOps team. And instead of telling them, hey, go do this for me, go fix this, if you ask them how to implement something, they'll be much more receptive, because it shows that you care. And if you learn how to do it, then the next thing you build, or the next time you improve the donor readiness score, maybe it's something you can do yourself, which saves them resources and expands your skill set as a data scientist.

Because from the DevOps lens, the equivalent code actually looks really simple to them as well. Here we have AWS's Simple Notification Service, SNS, and some Terraform code, .tf files, which essentially build the resources we need in our AWS infrastructure to send emails. There's a bunch of boilerplate that's routine to most DevOps engineers. You create this SNS module, and the main thing they'll probably point out to you is, hey, the custom piece here is the resource that subscribes your email address to the alerts. And if I ever want to change this email, say I'm doing some testing and I expect some errors to come out, I don't have to go through the IT process of requesting a change or review and wait a few weeks just to test a few simple changes. I can just go in, and change my endpoint from the production data alert address at nmdp.org to my personal email.
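The Terraform side being described might look roughly like this. This is a minimal sketch using the standard AWS provider's SNS resources; the resource names, topic name, and endpoint address are my assumptions, not NMDP's actual infrastructure code:

```hcl
# Minimal sketch of SNS email alerting in Terraform.
# Names and addresses here are illustrative only.
resource "aws_sns_topic" "model_alerts" {
  name = "donor-readiness-alerts"
}

# The line a data scientist cares about: the subscription endpoint.
# Swapping this email is how you point alerts at yourself for testing
# instead of the production alert address.
resource "aws_sns_topic_subscription" "email_alert" {
  topic_arn = aws_sns_topic.model_alerts.arn
  protocol  = "email"
  endpoint  = "you@example.org"
}
```

Most of the surrounding infrastructure code is boilerplate; the `endpoint` argument is the one knob worth knowing about for the testing workflow described above.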

And if I ever have to build another machine learning model, which we want to do, we want to keep the momentum going, I can now implement my own email alert. I understand that it's still pretty intimidating to dive into a new field, especially one like DevOps engineering or data engineering that you don't have training in. But I would encourage you to bring your data science skills with you. They're a really powerful tool set.

Using data science skills to debug infrastructure

And I want to illustrate that with another example. We have a Connect server that runs a bunch of automated reports, including some monitoring for our donor readiness score model. One week, we started noticing a bunch of random failures across our 30 or 40 reports, with some pretty cryptic error messages in the logs. We weren't really sure what was going on, so we thought it wouldn't hurt to reach out to our DevOps team. And the first thing they came back with, which is actually pretty reasonable, was: well, there's been no infrastructure change that we know of. Are you sure it's not a Connect issue? Maybe it's a code issue with your R code. Maybe you have too many reports running at the same time now, and that's causing resource conflicts.

And these are definitely valid possibilities. If you can eliminate potential causes, it's going to be much, much easier to troubleshoot. So all we needed to do, with some help, was look at the log messages file on the Linux server hosting our Connect instance. It contains hundreds of thousands of lines of mostly gibberish, but a few nice things, for example, a timestamp of when an event happened, and in this case, success was no; there was some error for whatever reason. My first instinct as a data scientist was, let me just scrape these logs for all the timestamps where success equals no and see when they're happening. So here are about 10,000 error messages plotted, where each vertical line corresponds to the timestamp of a success-equals-no message. Maybe not too helpful yet, but then we combined this with the known timestamps of our random failures, overlaid them, and we had perfect correlation. Just by finding some data and plotting it, we were able to make a really compelling case that something was probably going on on the server. Our DevOps team said, okay, let me reach out to our security team. It turned out there was a security scan every 30 minutes, but an expired certificate was making it fail repeatedly, which caused downstream effects like our Connect reports failing at seemingly random times.
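The scraping step can be sketched in a few lines. The log line format here, an ISO timestamp followed by a `success=no` field, is my assumption about what the server logs look like, not the actual Connect log format:

```python
# Sketch of scraping a server log for failure timestamps.
# The line format (ISO timestamp plus a success field) is assumed.
import re
from datetime import datetime

FAILURE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*success[=:]\s*no")


def failure_timestamps(lines):
    """Return a datetime for every log line marked success=no."""
    stamps = []
    for line in lines:
        m = FAILURE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return stamps
```

Overlaying these timestamps on the known report-failure times, for example as vertical lines with matplotlib's `axvline`, is the kind of plot that exposes an every-30-minutes pattern at a glance.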


Closing thoughts

And so I hope the idea I posited at the beginning of this session has resonated with some of you, and that you can start to think about how seeing the bigger picture, and more aspects of the data science lifecycle, will increase your impact and benefit you and your organization. If you're just trying to get started and you want to take away one message: the next time you get an email saying, hey, there's this potential issue, try to lead the resolution of it. It doesn't mean you have to solve everything by yourself, but try to take an active role and learn a little bit more; adapt your tool set, adapt your skill set. Because it's going to happen again, and it'll make the next time even easier.

And just one little piece of encouragement. If you can learn how neural networks move data around, or things like causal inference and DAGs, I'm extremely confident that you can learn how data moves through your IT systems. And so with that, thank you. If you want to view my slides, they're on my GitHub. And if you want to connect with me on LinkedIn, please do. Happy to take questions whenever. Appreciate your time.

Q&A

Right. I think we have a couple of questions. First one is, how did you go about selecting the criteria for donor readiness? Any tips for selecting criteria when building a metric for something that seems qualitative? Yes, awesome question. We kind of took the scan-the-ether approach: talking to all the different business parties, stakeholders, and operational processes involved, and trying to figure out what we find is correlated with donor readiness, that is, likelihood to be available when called upon. We have a bunch of historical data from calling donors and seeing which ones were available. We also have a great marketing team that tries to keep our donors engaged, with communications and campaigns to really keep engagement up. Those all play an important role in our donor readiness score. All right, I actually think that's the only question we have time for, so let's give Adam another round of applause and thank him for his time.