Jon Nye | Small Team Large Organization: Building Impactful Shiny Dashboards at NIH | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
I'm a science policy analyst at the National Institutes of Health. For those of you that don't know, the NIH is the largest funder of biomedical research in the world. It consists of 27 different institutes, each with a different focus on human health. I work at one of those institutes, the National Institute of Allergy and Infectious Diseases.
This guy may look familiar, you may even own a bobblehead or a pair of Dr. Fauci socks, but he's the director of NIAID and has been for almost the past 40 years. And NIAID's mission is to fund research in infectious diseases, biodefense, and immune-mediated diseases such as asthma and allergy.
So a little bit about myself: I was a cancer researcher for a little over a decade. I started at the University of Arizona Cancer Center and then I moved to the National Cancer Institute in Bethesda, Maryland. I tell you this because, I beg you, please, no infectious disease questions; you'll be disappointed. But while at the NCI, I developed a love for science policy. I then left the lab and started a new position at NIAID in the data analytics and research branch. And there I was introduced to R and all of the tools and packages that I'll be talking about today.
So when I joined NIAID, I joined a very small team; it was me and two other analysts. We were small and scrappy, but we really wanted to have an impact on a large organization like NIAID. And today I'm going to tell you about how we use Shiny to support NIAID and make our team a lot more efficient as well.
Research area tracking dashboard
So the first example I want to give you is a Shiny dashboard to hit research funding targets. So why do we have funding targets for a specific research area? First, we're a government organization. So Congress can tell us, we'd like you to spend this amount of money on this research area. So it can be congressionally mandated. We also sometimes receive targeted funding aside from the NIH budget. One example of this recently was Operation Warp Speed for COVID vaccine development. And finally, NIAID leadership can also set funding goals based on NIAID's mission and priorities at the time.
But when I first got involved in this process, I found out it was a lot more complex than I originally imagined. So in some research areas that we're tracking, there are over 700 different projects. These are grants and contracts that represent a lot of different moving parts. To give you an idea, in some of these areas, we're tracking over half a billion dollars worth of funding. And another added layer of complexity is that at NIH, we receive grant applications throughout the year. So we're stuck making important funding decisions at the beginning of the year when we don't have all of the data available to us. So we don't have all of the grants that we're going to fund that year.
But I wanted to use an analogy to help understand this process. You can imagine that hitting these research funding targets is sort of like a road trip. On a road trip, you have to know exactly where you are at any given point and how you're going to get to your destination; otherwise you're going to get lost. But we all know it's not that easy. Sometimes things come along that make it difficult to get where you want to go. I'm thinking of one specific road trip I took moving from Tucson to Bethesda, Maryland. On this road trip, I got a flat tire. My air conditioning broke in the middle of a heat wave. And I had a very vocal, very disgruntled cat in the backseat for 2,300 miles.
But today I want to talk about the challenges that we faced hitting these research funding targets and how we used Shiny to address those challenges. So first, like I said, there are a lot of different moving parts. There are a lot of people involved in this process throughout the Institute. I immediately found out that there were a lot of competing estimates, right? Everybody had their own idea of where we were. It's like if you're on a road trip and you know you're in Dallas, somebody else thinks you're in El Paso, and another person thinks that you're in Utah. And nobody can agree on where you are at the moment.
So it was really frustrating. Even worse is when people would come to me with funding estimates and have no details. So it's like somebody saying we're in Missouri, and I'm like, there's no chance we're in Missouri. You took a wrong exit somewhere. How did you get there? And they would have no idea. So my team spent a lot of time and effort trying to figure out how they came to that conclusion and comparing it to our estimate too.
And finally, at the beginning of this process, a lot of people were reliant on spreadsheets and emails. And this is a big problem, because you can imagine that the moment you send that email, the information is out of date. At NIH, we're constantly getting new applications and making new awards throughout the year. And people also make mistakes when searching their inbox: maybe you found the email from last week when you needed the one from last month. So it was hard to get everybody on the same page.
So I created the Research Area Tracking tool to help address these questions. But how does this dashboard get us closer to our destination? First, behind the scenes, I have an R Markdown document that's pulling in data from multiple databases. I use dbplyr, so I never have to write SQL again. It transforms the data, formats it, and writes it to a database for our Shiny dashboard to use. And because I published it on RStudio Connect, I can schedule it to run daily so that the information is always up to date.
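The scheduled R Markdown step described here might look something like the sketch below. The connection details, table names, and columns are all assumptions for illustration; the talk doesn't name the actual databases.

```r
# Sketch of the daily ETL step: pull from a source database with dbplyr,
# transform in dplyr, and write a snapshot table for the Shiny app.
# The DSN "grants_db" and all table/column names are hypothetical.
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "grants_db")

# dbplyr translates this pipeline to SQL behind the scenes,
# so no hand-written SQL is needed
snapshot <- tbl(con, "awards") %>%
  filter(fiscal_year == 2022) %>%
  select(project_id, title, research_area, amount, status) %>%
  collect()  # pull the transformed result into R for formatting

# Write the formatted snapshot to the table the dashboard reads
dbWriteTable(con, "tracking_snapshot", snapshot, overwrite = TRUE)
dbDisconnect(con)
```

On RStudio Connect, the R Markdown document containing a chunk like this can be scheduled to render daily, which is what keeps the dashboard's data current.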
So we've immediately made this process a lot better, right? So we have data integration, multiple data sources in one spot. The information is always up to date. And people have one source that they can go to for information, kind of like Google Maps, right?
So next, I wanted to address this problem of having multiple estimates. And it seemed like everybody, people around the Institute had little pieces of the puzzle, but we could never put those together and get the big picture. So in the dashboard, I made all of the data available. So you can see every single project that's included or not included in our estimate. So this is almost like having turn-by-turn directions showing you exactly where we are and how we're going to get to our goal. We've made this data available to be downloaded at any point for any user. We also wanted this to be a platform for collaboration. So you can add and remove projects from the estimate as well. This is an example of that, where you can add a project and add a reason for that change so that other users can see why you made that change.
And so that the data is always up to date, the Shiny dashboard writes all of these user inputs to a database so that whenever a user logs on, it's always up to date and includes all of those user inputs as well. So the benefit of this is that we've increased transparency and collaboration in this whole process and hopefully we've gotten closer to our destination.
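A minimal sketch of how a Shiny server might persist those user inputs, assuming a hypothetical `user_changes` table and input names; the real schema isn't described in the talk:

```r
# Persist each add/remove action so every user sees the same history.
library(shiny)
library(DBI)

server <- function(input, output, session) {
  con <- dbConnect(odbc::odbc(), dsn = "grants_db")  # assumed DSN
  onStop(function() dbDisconnect(con))

  # When a user adds a project to the estimate, record who, what, and why
  observeEvent(input$add_project, {
    dbAppendTable(con, "user_changes", data.frame(
      project_id = input$selected_project,
      action     = "add",
      reason     = input$change_reason,
      user       = session$user,   # Connect supplies the login name
      timestamp  = Sys.time()
    ))
  })
}
```

Because the app reads this table on load, any user who logs on sees an estimate that already reflects everyone else's changes.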
So lastly, I wanted the dashboard to be able to make predictions based on an application's score. So at NIH, when an application comes in, it's basically reviewed by an expert panel of scientists and they give it a score that helps us determine whether we fund it or not. So this dashboard can actually make predictions on how much we're going to fund even before we've made those awards. And since this dashboard is based on Shiny reactivity, all of the graphs, the tables, the high-level numbers, everything can be changed based on user input.
So we've allowed users to adjust that scoring cutoff as well. So if you're in a meeting saying, I want to fund all of the applications up to this score, you can actually change that in the dashboard and see how it affects the estimates so that you can make these important decisions in real time.
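The adjustable cutoff is a natural fit for Shiny reactivity: every output that depends on the cutoff recomputes when the slider moves. A rough sketch, with invented column names and an in-memory `applications` data frame standing in for the database (at NIH, lower scores are better, hence `score <= cutoff`):

```r
library(shiny)
library(dplyr)

ui <- fluidPage(
  sliderInput("cutoff", "Fund applications scoring up to:",
              min = 10, max = 90, value = 30),
  textOutput("projected_total")
)

server <- function(input, output, session) {
  # `applications` would come from the database in the real app
  funded <- reactive({
    applications %>% filter(score <= input$cutoff)
  })

  # Recomputes automatically whenever the slider changes
  output$projected_total <- renderText({
    scales::dollar(sum(funded()$requested_amount))
  })
}
```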
So I hope that this dashboard has made this process a lot easier and maybe a little bit more accurate as well. But I think in the end, my small team benefited the most, right? Because we've been able to automate a lot of these processes and really free us up to take on additional projects, right? Before we were spending so much time manually updating the data, answering questions from colleagues and making changes that we didn't have that much time to do other things.
Machine learning for grant classification
So the next example I want to give you is how we've used machine learning to classify NIAID grants. So just a little background on grants at NIH. There are really two main types. One starts at NIH. So we send out a request for applications, and this is when the NIH asks the scientific community to apply for funding in a very specific, well-defined area. We usually use these to stimulate research in really high-priority areas. The next type of application is the investigator-initiated application. So instead of starting at NIH, we're starting with that investigator or scientist that's running a lab somewhere, right? So these make up by far the majority of our applications, greater than 85%.
And so at NIH, we really have to rely on the scientific community, the creativity and ingenuity that they have, to come up with innovative research ideas. And those ideas can be very broad and cross-cutting or very narrow and well-defined. But when we get these investigator-initiated applications, we don't have much information on them besides that they fall within the broad research areas of NIAID's mission. So you can imagine that this is very difficult, right? When you're dealing with thousands of applications, how do we know how much we've funded and what projects we've funded in a specific research area? We get questions all the time: how much have you spent on HIV vaccine research, or how much have you spent on tuberculosis research, and on what projects? And we need to be able to answer those questions.
So how do we do that? The answer is we have a team for that. So every year, NIAID gets over 9,000 grant applications, and we fund about 20% of these, give or take. We then have a team of about three to six people that go through these awarded grants and apply codes based on the research being proposed. So they have over 1,300 unique codes that they can use to describe that research. And I've included a few examples here on the right. So they look, are they researching a specific disease like Ebola or Zika? Is the research area, you know, new therapeutics or diagnostics and so on?
But you can imagine with thousands of grants, this can be challenging. First, it's very time intensive. They have to go through thousands of these awarded grants and read the title, the abstract, and the specific aims, and describe that research. Because it's so time consuming, we only have the resources to code awarded grants. The 80% of applications that are unfunded, we don't have any information on. And they have a constantly increasing workload. Every year, we're getting more applications and giving out more awards. And especially at the beginning of the pandemic, we saw a huge increase in applications due to COVID research. And finally, because it takes so much time, this usually isn't done until the very end of the year. So we don't have real-time data on this, which would be helpful.
So we thought this was a great opportunity to use supervised machine learning for grant classification. We had a database of thousands of grants that had been manually coded over the past decade that we could use to train machine learning models. So here's an example of a grant that's proposing to research a more sensitive diagnostic for tuberculosis. When I looked up the codes on this grant, you could see the disease was tuberculosis. The research area was diagnostics, and it was a clinically focused grant.
So I just want to give you one example of how we've used machine learning to classify grants based on the research area. So here, it was diagnostics. But there are five categories, including therapeutic, prevention, and vaccine. So we took about 7,500 grants that had previously been manually coded and used them to train three different machine learning models, including logistic regression, support vector machine, and random forest to compare the three. We then tested these models using about 3,200 grants that had also been manually coded. And so what we found was the logistic regression was the most accurate, 88.5%. And it also did the best at predicting this relatively small category of grants in the prevention class.
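In R, a comparison like this might be set up with tidymodels and textrecipes. This is only a sketch: the actual features, preprocessing, and hyperparameters aren't described in the talk, and the data frame and column names (`train_grants`, `test_grants`, `abstract`, `research_area`) are assumptions.

```r
# Sketch: TF-IDF features from grant abstracts, multinomial logistic
# regression (5 research-area classes), accuracy on a held-out test set.
library(tidymodels)
library(textrecipes)

rec <- recipe(research_area ~ abstract, data = train_grants) %>%
  step_tokenize(abstract) %>%
  step_tokenfilter(abstract, max_tokens = 5000) %>%  # keep common tokens
  step_tfidf(abstract)

lr_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(multinom_reg(engine = "glmnet")) %>%
  fit(data = train_grants)

# Evaluate on the manually coded test grants
predict(lr_fit, test_grants) %>%
  bind_cols(test_grants %>% select(research_area)) %>%
  accuracy(truth = research_area, estimate = .pred_class)
```

Swapping `multinom_reg()` for `svm_linear()` or `rand_forest()` is how the same pipeline could compare the three model families mentioned above.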
So we used this logistic regression model to build a Shiny dashboard for coding NIAID grants, and we then expanded it to assist with coding in five additional categories. Behind the scenes, once again, we have an R Markdown document that's pulling in the data from multiple sources, making those machine learning predictions, and writing the data to a database. It's constantly getting updated with new applications and new awards, and this is what our Shiny dashboard uses. And once again, we wanted people to be able to collaborate, so it's writing user inputs to a database for future use.
So here's an example of how we're incorporating user feedback. So you can actually go through each project and say whether it's relevant or not relevant to that specific area. We're including the machine learning score so they know the probability that that project is in that area. And once again, we wanted it to be a platform for collaboration. So you can attach notes to each of these projects for other users or even yourself in the future. Here's an example of that. You see a tiny little envelope when a note is attached to a project. We also have tags involved. So if you don't want to write a whole note, you can just add a small tag or phrase. For example, if somebody has verified that project, you can just quickly add that.
So we're hoping that this project will improve productivity. So if previously it took five minutes to code a grant, we're hoping with a machine learning prediction and a probability estimate that we can cut that time down. And even if it was one to two minutes, you can imagine that with thousands of grants, that's a very significant time savings. We also want to support data-driven decision making. So as I mentioned, 80% of our applications are unfunded and therefore not coded. But with these machine learning predictions, we can apply these codes to all grant applications. And hopefully, we can identify gaps where researchers aren't applying and we can then address those gaps.
So I want to thank my small team for introducing me to R and all of the tools and packages that I've shown you today. I think we've definitely been able to have an outsized impact for such a small team. And I hope we can use the lessons that we've learned and the tools that we've developed to continue to support NIAID in the future with its goal to better understand, treat, and ultimately prevent infectious diseases and to improve human health. So, thank you.
