Resources

GSK's R Journey: From Pilot Projects to Enterprise Adoption | Hosted by Posit

video
Nov 7, 2024
1:02:10


Transcript

This transcript was generated automatically and may contain errors.

Today we have Ben, we've got Becca, we've got Andy. It's really going to be an awesome team to highlight what they've done there: the focus on training, user adoption, the leadership vision, the open source backbone, and all of the change management that has to go into this, creating packages and doing all this phenomenal work. So really excited to bring this webinar to you today. The work that GSK has done has crossed so many boundaries, into the community with CAMIS, the R Validation Hub, working on the Pharmaverse, and the list goes on. It's just amazing to highlight this work today and to hear from this team. And so with that, I'm going to pass it over to Andy, who's going to kick off the webinar.

So thank you, Phil. Thanks for flagging some of that good work; a really nice introduction for us. And thanks, Posit, for inviting us to present today. We've really enjoyed the series, and when we were thinking about what we were going to present on, we'd really enjoyed the presentations from some of our industry peers on their work with R and their submissions. But rather than talk about the submissions work GSK has done at that small scale, today we're going to talk about going from pilot projects to enterprise adoption. We want to not just focus on these really nice examples where we get to a submission with a single study, but look at what we've done generally over the last five or six years to get to the point where we are today.

I'm going to start us off today and give a little bit of that context and background, then hand over to Ben and to Becca to come in and give a little bit more detail on some of the things we've done which have really driven that adoption at the enterprise level. With my job of setting the scene, I guess a good first place to start is with the image that you're seeing in front of you, with the mountains and the base camp. Where GSK is today is probably not base camp, but some camp further up the mountain. We are not at the point where we've completed our journey; we're not at the point where we have fully, 100% succeeded in rolling out across our organization, but we are very, very near.

We've put in a lot of effort and, as Phil called out, been involved in a few initiatives along the way. We've done a lot to get to this point, and that's really the focus of this talk: how we put ourselves in a great position to make that final push. I do want to warn you there are going to be quite a lot of mountaineering references throughout. That's the focus today: how has GSK taken a biostatistics organization of over 1,000 people and helped them on their journey to R?

Leadership commitments and the open source journey

I'm going to take us back about 12 or 14 months, to August 2023. At this point, our biostatistics leadership team made some very big commitments. The first was that, from that point forward, all central tools we built would be built using open source languages. The second was that 50% of our code, and remember this is 50% of the code from an organization of 1,000 people producing code every day, would be open source by the end of 2025. Now, to really get the impact of those statements, you've got to think about where we were five years ago. Five years ago, in fact even a year prior to this, none of our central tools were built using open source languages. Open source was in use, but not for our central tools. In terms of the code, again going back five years, maybe 5% or 10% was produced using open source languages. We've gone essentially from a position of zero, or close to zero, to a point where a year ago we were able to make two very big and bold statements about our use of open source tools, particularly the R language, moving forward.

Before I get into some of the details, and before Ben and Becca come in, just to position this for anyone listening who is not from the pharmaceutical industry: you will know that we deliver medicines and vaccines, and that is a core part of what we're trying to do. In doing so, we're essentially delivering something a lot like this rope on the left-hand side. The rope is a key part of what a climber needs to climb that wall, and in trusting that rope, they want to be sure it is going to be safe and effective. Delivering medicines and vaccines is very similar.

We need to make sure that the medicines and vaccines we develop are safe and effective, and that requires a lot of analysis of the data we collect from our clinical studies. If you're a climber and you go and buy a rope, you don't really expect to just take the manufacturer's word for the safety of that rope. You would hope that at some point an independent body has assessed its manufacturing: that they've checked the processes used by the company making the rope to ensure they meet good standards, and that the rope has been thoroughly tested and been through all of the measures that make it fit for purpose.

Again, linking that back to our industry and how we work, that transparency is also vital for us. We have regulators like the FDA, the Food and Drug Administration in the USA, and other regulators throughout the world who are going to check what we do. All of that boils down to why we have a thousand or so statisticians, programmers, and data scientists: we have a lot of work to do to generate our outputs, the tables and graphics that demonstrate that the medicines and vaccines we produce are safe and effective.

There's a lot of work to be done, and importantly, there is not a lot of time to do it. Like in any job, we need to be fast. But here, bear in mind, we've got patients around the globe looking to us for new medicines, or better medicines than the standard of care they have today. There's also the commercial aspect: we want to be ahead of our competitors. So we have a lot to produce, and not a lot of time to produce it in.

Overcoming resistance to change

Now, ideally, you'd look at this and think, OK, we've got a huge amount of code to write, a lot of analysis to be done, and not a lot of time. It should be the ideal scenario for innovation. Unfortunately, although that can be the case, there is a tendency towards what I call the old boots. So the boots you see on the screen at the moment: I have a pair like this. My parents bought me a pair of boots when I was 14 years old. I still have them today. I will walk in them, I will garden in them. They're a trusted pair of boots that I've had for many, many years. I won't tell you how many years past 14 I am, but it's a long time.

And these boots are great. Now, over time, they're a bit worn. They're maybe not quite as waterproof as they used to be, but they work for me. They worked the last time I took a hike, and I expect the next time they'll broadly work too. Now, out there in the market today, there are boots that are lighter, boots that are more waterproof than the ones I have. Maybe there are better colors, or other design features that improve on them. But what tends to happen in our industry is that, because it worked well last time, people can be a little bit reluctant to change and upgrade those boots, just like me and my hiking boots.

And when I talk about that upgrade, when we relate this back to languages, we're not just talking about a migration from one language to another. We're talking about systems, tools, processes, and ways of working that have to change alongside it. That kind of resistance to shifting is what we're facing when we try to roll out open source adoption, particularly at a company like GSK or any large organization like it.

What it takes to begin the journey

So we're going to make an open source journey. We didn't start in 2023; we started before that. But what got us there? Well, first of all, we need the right equipment. We need the standardized central environments that people can log into and use. We need to think about the management and support of, again, a thousand statisticians, programmers, and data scientists. But before we even get there, before we build these tools, we have to want to climb the mountain.

And that's potentially obvious, but it's really important. You can buy me the rope, the carabiner, the ice axe, the shoes, but unless I really want to go climbing, they'll just sit in a cupboard somewhere and not get used. I have to want to do it. And the same applies to adoption of open source languages in large organizations. It can't just be me; I need colleagues, a group of people around me that really wants to do this before it can happen. And those people are going to need the knowledge, the skills, and some experience before we can do so.

So you might be the enthusiastic R user or Python user within your company, but you need people around you who also have that enthusiasm, knowledge, and skill. For those who don't, we're going to need to train them. Even those who do have that knowledge are going to need further training and to develop new skills, things they wouldn't even have thought about if they were just using R for statistical analyses. The rollout of R at an enterprise level takes new skills and new knowledge that have to be built.

So we need a core group of people who are enthusiastic, who want to do this, and who have the right skills, and then we get on to sponsorship. This is probably about the furthest I'm going to push the climbing analogy, but when people go and climb Everest, they don't just buy the equipment and set out. They have sponsorship. They need insurance, they need all these other things paid for by someone. They need that sponsor behind them to make the mission happen. And again, it's the same when we're talking about enterprise adoption of languages like R and Python. We need the senior leaders. We need somebody who is going to pay for all of the equipment to get us going, and we need people who will not just put the equipment in place so it can sit in the closet, but who will drive that and encourage, or enforce, the usage of these tools and languages in practice.


So it all needs to align. And even if we have all of that, we still need the right conditions before we can move forward. In the climbing analogy, that's the weather: we need the right time of year, and we're not going to go in a blizzard. In our industry, we're talking about years, not months or days, and we need the external industry conditions to be right. There's a reason I'm talking to you about this now and not in 2005. For GSK, we felt the time to go was 2017. For other companies, it may have started earlier or later, but broadly speaking, that's where we feel we got off to a start.

If I step away from the climbing analogies for a second and break this down to what it actually meant in practice for us, this slide represents our open source journey. Going from top to bottom through the rows, we've got the systems and processes I've mentioned a couple of times already; we've got training; then we've got the pilots that we ran with people doing the job on the ground; and lastly, the tools that needed to be put in place. All of these different elements have combined.

Before I hand over to Ben and Becca to dive into a little more of the detail, I just want to flag the systems and processes that we put in place. I did say I really like this webinar series: if you go back far enough through the history of these webinars with Posit, you'll find one from me and a couple of colleagues at GSK talking about one of the systems we put in place, our R platform version 1, as I've put it on here. This is a system we built called Warp, which uses a lot of the Posit products to create that starting point. On our installation, when users log in, there are several hundred packages already available for them to use, so they don't have to go out to CRAN, install R, and install all those packages one by one on their laptop. It provides consistency and a shared user experience, and it can be supported at the enterprise level.
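Warp's internal configuration isn't shown in the talk, but the general mechanism behind a centrally managed R environment like this can be sketched as follows. Everything here is a hypothetical illustration: the repository URL and library path are placeholders, not GSK's actual setup.

```r
# Hypothetical sketch (not Warp's actual configuration): a centrally
# managed platform typically pins every R session to an internal,
# curated repository via a site-wide startup profile, so users never
# install from public CRAN directly.
options(repos = c(
  INTERNAL = "https://packages.internal.example.com/curated/latest"  # placeholder URL
))

# A shared, read-only library means the several hundred curated
# packages are already on the search path when users log in:
# .libPaths()  # e.g. "/opt/r-platform/library" first, user library second

# install.packages("dplyr")  # would resolve against the curated repository
```

The design point is that users get a consistent, supportable environment by default, rather than each laptop accumulating its own mix of package versions.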

The other component Phil actually mentioned in his introduction was the external factors, something like the R Validation Hub, which I had the pleasure of leading for three or four years. The R Validation Hub started around the beginning of that era, the 2018 to 2019 time frame, and was a collaboration whereby lots of pharmaceutical companies got together and started tackling the question: if there are 20,000 packages on CRAN, how do I know whether they all do what I expect them to? How do I choose the right packages to help me with my analysis? That matters for the regulators in a regulated environment like we have at GSK, but it applies in any industry. If you need to be sure that the packages you're choosing, the packages you're installing centrally, do what you expect, what processes can you put them through to check that and make sure your users aren't running code that gives them the wrong result?

The R Validation Hub produced a white paper, and we took that internally; the R for GXP label on here is a process we've built that adapts the thinking behind that white paper. I'll only touch on those lightly, but they're really key components. And with those two key components in place, through various pilots we got to the point where, like some of the recent presentations from our industry peers, we had a full pilot study in R: all of the code, 100% of the code produced in that study, written in the R language, which is a fantastic milestone. But bear in mind that at the enterprise level we have maybe 50 or 60 studies reporting every year. To go from that one study to the many takes a different kind of journey, and I'm going to hand over to Ben at this point to talk about how we went from some of these early pilots into wider-scale adoption.
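The R Validation Hub also maintains an open source package, riskmetric, that operationalizes the kind of evidence-gathering its white paper describes. As a rough sketch of that workflow (GSK's internal R for GXP process is adapted from the white paper and is not shown here, and the package chosen below is arbitrary):

```r
# Sketch of a package risk assessment using the R Validation Hub's
# {riskmetric} package.
library(riskmetric)

assessment <- pkg_ref("dplyr") |>  # gather metadata for one package
  pkg_assess() |>                  # run the individual risk metrics
  pkg_score()                      # convert metrics to numeric risk scores

# Metrics cover things like unit test coverage, bug closure rate,
# download volume, and documentation; an organization then weights
# these to set acceptance thresholds for its validated environments.
```

The point is not a single pass/fail answer but a repeatable, documented process for deciding which of CRAN's thousands of packages belong in a centrally supported environment.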

From pilots to enterprise: the Accelerate R approach

Awesome. All right, perfect. Thanks, Andy. So as Andy mentioned, we did some pilots. Now what? How do we go from those trusty old boots to a more expansive hike? Well, we did what worked last time: we decided to do a lot of training. We created large in-person training initiatives and classes, we provided that training, and we created training documentation for study teams. At one point within GSK Biostats, we had trained 80% or more of all individuals in the organization with something like an intro to tidyverse course. Our thought was: great, we've reached a lot of individuals with training, we've given them documentation, so study teams will begin to use R and create their outputs, because that's what happened last time. But the issue is, it didn't quite work.

So we had to reflect. And the thing we realized is that the gap between an R training class and when you could actually use R on a new study was something like 12 to 18 months. If you've ever experienced that sort of gap between learning something and trying to apply it, you very quickly realize you're going to forget everything and have to redo that training on the fly. So one of our biggest insights was, similar to mountaineering: you don't learn how to climb from a book, and you don't learn R from a book. You actually have to apply it. That's when we started to think about, okay, what's the transition we need to make?

During our reflection period, there were a couple of things we realized. One, not all study teams adopt and transition to R at the same time, which ties back to that 12 to 18 month timeline, so it often requires retraining on the use of R. Study teams have different needs according to their timelines. Also, training needs to be a major focus for study teams, and how we train those individuals is important for success. For example, if a study team already has some R experience, why should they go through an intro to tidyverse training again? Maybe they need something more advanced. The other thing happening while we were making this transition was that a lot of those R experts actually sat outside the business, in a sort of center of excellence. They didn't sit on the pipeline.

So we looked at these insights and decided we needed to change our training strategy. What did we do? We said: no more classes. We moved all our training into on-demand training and documentation, and we decided to spend most of our effort on supporting and mentoring individuals in Biostats. How did we do that? We created a team called Accelerate R, which is basically a small, agile pod of R experts that literally goes and sits side by side with study teams to train them on R. If you're curious about how we did this, I gave a more in-depth presentation on Accelerate R at posit::conf; I've dropped the link in the slides.

So what does a typical Accelerate R engagement look like when we sit side by side? It's not just that we drop in and help; we have a set process. The first thing we do is help individuals with prerequisites: getting system access, GitHub or the SCE, making sure they've done an intro to R training, and figuring out the availability and engagement of the team. If the team doesn't have much availability to actually work with us side by side, we won't do it, because we don't think there's a lot of value in trying to upskill a team that can't be there, available, and engaged.

Next, we do a bunch of onboarding. It's basically a week of training where we focus on GitHub ways of working and delivery-specific R packages. Sometimes we touch on Agile ways of working, but really we're focusing on how the team delivers the agreed set of deliverables using R. We then go through the delivery process itself. We like to assign deliverables to sprints, or two-week intervals, and then we help the study team actually deliver them. This could be where we create templates, do code review, train on specific new or niche packages, or just give a general overview of the tools within GSK. And finally, after delivery, we have a closeout. This was really important for us, because it let us document a lot of example code and templates.

And that has helped us as an organization, because one of the uphill battles R and open source technologies face is that we have a ton of historical legacy code libraries, and it's very easy to pull code from there in order to do things fast. We needed to create example code and templates in R that study teams could refer to, so they could catch up and work at the same speed.

One of the things we learned during Accelerate R is that the learning curve is incredibly steep, especially with R combined with some of the other technology we're introducing, like Git and GitHub. This is a graphic I like to show to explain our typical experience with study teams. If you look at time versus confidence, over time people become more and more confident until they reach a peak, and then all of a sudden they have to dive into the internals of an R package, and they spiral down because they think, oh no, I actually have no idea what I'm doing. And that's fine. We make sure individuals are supported through it, but we know there is going to be a steep learning curve.

So we started Accelerate R, and we had some really good success over about two years. And one of the things we realized doing Accelerate R is that we needed to evolve. We needed to evolve because we realized that training was helpful, but the true impact study teams needed was to have their obstacles removed. So we needed to change the role of Accelerate R. Similar to the old boots, and, you know, I'm sure the boots that Tenzing Norgay and Edmund Hillary were using to climb Mount Everest, we needed to change. We needed to partner with the R engineering team to bring real-time insights and feedback from study teams to that engineering team, so they could create the tools to enable study teams. That was crucial for us as an organization, because it helped us remove obstacles quickly for study teams instead of having long development cycles.


So supportive mentoring is great. I love it, and I've spent a lot of time doing it, but to achieve enterprise adoption, supportive mentoring alone isn't enough. You actually need to deliver the right tools to study teams and individuals so they can create their outputs. Going back to Andy's chart: we have the R platform, we have the processes, we have the supportive mentoring; now we need the tools. So I'm going to hand it over to Becca, who's going to talk about this last workstream around supporting tools.

Engineering tools to close technical gaps

Great. Thank you, Ben. All right. So I sit on our engineering team, and as Ben mentioned at the end of his section, we started to collaborate with the Accelerate R team to provide a better support model for our study teams. What we found is that our teams are really eager to use the latest and greatest tools available in open source. They're really excited. But with new things come inevitable roadblocks when you try to implement them in your traditional workflows. So as an engineering team, our overarching goal is to assess those roadblocks as they come up and create tools and solutions to help people move forward.

And when I talk about roadblocks, what do I mean exactly? We have this vast open source ecosystem available at our fingertips, tons and tons of R packages these days. And in pharma in particular, we have an ecosystem called the Pharmaverse, which is a whole variety of packages built specifically for clinical reporting. So really, there's no better time than now to start using R in our study work.

But every organization is a little bit different. Every study is a little bit different. Every dataset is a little bit different. And so gaps in functionality do come up. When the Accelerate R team encounters a gap in functionality, a tool that doesn't quite do exactly what we need it to do, the engineering team comes in to assess the situation and figure out: okay, what do we have that's working well, and where do we need to augment and build more around it?

Next, I want to talk about a particular use case, a situation that came up that led us to build a new tool. A year or two ago, GSK had created and made available its first production environment for R. This is the R for GXP effort that Andy mentioned at the top. We call these production environments our frozen environments, because their contents are carefully assessed, curated, and tested, and then ultimately put in an environment that's locked down, so users can't modify it. These frozen environments are set up exactly for production work.

And a study team came into Accelerate R, and they were ready to use R in production. One of the initial tasks in the clinical reporting workflow is preparing datasets, and there's a really standard type of dataset transformation that is made much easier by an R package in the Pharmaverse called admiral. So admiral helps support this very standard dataset transformation step. This team was ready to use that package, but they were also interested in a newer extension of it called admiralonco. This was an oncology team working on an oncology study, and admiralonco added some extra bits that would be useful for them specifically. When they inquired about this package, we were excited to tell them it would be coming available in the next frozen environment.
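For listeners outside pharma, the kind of dataset transformation admiral supports looks roughly like this. This is a schematic sketch only: the toy data and the choice of derivation are illustrative, so check the current admiral documentation for exact arguments.

```r
# Sketch of admiral's derivation style: building an analysis dataset
# (here a minimal ADSL) by chaining derive_* verbs over SDTM-style data.
library(dplyr)
library(admiral)

# Toy SDTM DM (demographics) domain with two subjects; a real DM
# domain has many more variables.
dm <- tribble(
  ~USUBJID, ~ARM,      ~RFXSTDTC,
  "01-001", "Drug A",  "2023-01-15",
  "01-002", "Placebo", "2023-01-20"
)

adsl <- dm |>
  # Convert the character --DTC date to a proper Date variable TRTSDT.
  derive_vars_dt(new_vars_prefix = "TRTS", dtc = RFXSTDTC) |>
  # Planned treatment arm for period 01.
  mutate(TRT01P = ARM)
```

Packages like admiralonco extend this same derivation style with functions specific to oncology endpoints, which is why the team wanted it alongside the base package.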

With every frozen environment, we're adding a little bit more and making it a little bit better, but they take time and effort to create, which means waiting a number of months in between. It turned out this team could wait for their final deliverables; they were far enough out that using the future frozen environment would be suitable. But they had a lot to do, and they needed to get started programming now.

This was actually the first time the engineering team was brought in, as part of the Accelerate R model, to talk with a team and figure out how we could support them in the meantime, while they waited for this future frozen environment without getting stalled. In discussing with this team what they might need, first and foremost was reproducibility: reproducible code and a reproducible environment. They weren't able to use a frozen environment, so they needed to install and manage packages themselves in an open R environment. Reproducibility is really critical to ensure that all of their code runs smoothly between people, both now and in the future. And renv is great for reproducibility.
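The core renv workflow mentioned here is small; a minimal sketch of it (this is renv's standard documented API, with admiral chosen only as an example package):

```r
# Minimal {renv} loop for a reproducible project environment.
renv::init()                 # create a project-local library and renv.lock

install.packages("admiral")  # install into the project library, not system-wide

renv::snapshot()             # record the exact package versions in renv.lock

# ...later, or on a colleague's machine, after cloning the project...
renv::restore()              # reinstall exactly the versions in renv.lock
```

Because the lockfile travels with the study code, everyone on the team runs against the same package versions even though no centrally frozen environment exists yet.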

They needed to know which packages were okay to use, which packages would be available in the next frozen environment, and then to have easy access to those packages. Posit Package Manager plays a critical role in this piece. This is really important: this team was working towards something in the future, so they needed the ability to move forward smoothly and then have a clean transition to that environment when it was ready. CRAN snapshots are one of the many pieces that help us here. And finally, with a team that's newer to R and newer to environment management, we needed to make this easy for them and help them stay on track: not using packages they shouldn't be using, and making sure their interim environment matched and aligned as closely as possible with that future frozen environment.
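The CRAN snapshot mechanism mentioned here works by encoding a date into the repository URL. A sketch, using Posit Package Manager's public instance and an arbitrary date for illustration (an enterprise setup would point at its own instance):

```r
# Pin the session to CRAN as it existed on a specific date, served by
# Posit Package Manager. Date and instance below are illustrative.
options(repos = c(
  CRAN = "https://packagemanager.posit.co/cran/2024-01-02"
))

# Every install now resolves against that dated snapshot, so two
# colleagues installing a month apart still get identical versions.
install.packages("dplyr")
```

Pinning to a dated snapshot is what makes it possible to later line an interim environment up with a frozen environment, once the frozen environment's own snapshot date is known.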

Introducing Slushy

All of these bits and pieces got wrapped up into a new package we created called Slushy. Its name is inspired by the fact that it's a frozen environment in a way, but melted down a little, because we're adding a bit more flexibility. The workflow for Slushy looks something like this: at the start of the study, or when they were ready to start programming, the team initialized Slushy with the packages they needed, pinned to a particular CRAN snapshot. Every so often, an update was performed to slide that CRAN snapshot forward, bit by bit, until they didn't need to anymore, because the next frozen environment's snapshot had been decided; then they could just coast to the end with their environment more or less matching.

To support them throughout this process, I'll just highlight one of the supports in Slushy. If you're curious about more details, I presented on Slushy at posit::conf 2023, so you can check out that talk. One of the areas we tried to support is helping people through any updates that might impact their code. As we all know, open source moves very quickly: packages update, new features come, things change. If teams can anticipate any code changes before, during, and after updates, that's really helpful, so they can adjust along the way. We created Slushy for this particular case, but it turned out to be a pretty common scenario, and we're continuing to find new uses for Slushy to this day. It's become a go-to solution for that mix of stability and flexibility. And Slushy is just one example of the many tools and solutions we've built in the moment for our teams to close these technical gaps.
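The talk doesn't show Slushy's actual API, but the pattern it describes, pin to a snapshot and slide it forward, can be sketched with plain renv. This is NOT Slushy's code (see GSK's public GitHub for the real package); the helper function and dates below are hypothetical illustrations.

```r
# Sketch of the "sliding snapshot" pattern Slushy automates, written
# against plain {renv}. Hypothetical helper; dates are illustrative.
use_snapshot <- function(date) {
  # Point the project at CRAN as of the given date...
  options(repos = c(
    CRAN = paste0("https://packagemanager.posit.co/cran/", date)
  ))
  renv::update(prompt = FALSE)    # ...pull packages up to that snapshot...
  renv::snapshot(prompt = FALSE)  # ...and record the result in renv.lock
}

use_snapshot("2023-06-01")  # initialize near study start
# ...weeks later, slide the snapshot forward in small steps...
use_snapshot("2023-08-01")
```

Sliding in small steps, rather than one big jump at the end, is what lets a team surface breaking package changes gradually and land on the future frozen environment's snapshot with minimal disruption.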

Unexpected benefits of open source engagement

But if we pause along our journey and this day-to-day work and look around, what we see is that there are actually some other unexpected gems. There's more than meets the eye, because we have the whole open source world, and the more we lean into it, the more we discover these gems. So next I'm going to talk about some of the side effects of our open source journey, and some of the other benefits we're realizing along the way.

One benefit: when we bring the engineering team into the Accelerate R model, we're bringing on-demand support to the teams, which is really great. But we're also bringing the tool developers and the teams much closer together. What that means is that the teams are getting more insight into our development process. They're understanding how we iterate on functionality and where we store the code. They're of course valuable users, and they're testing things out as they go, stress-testing the tools. If they encounter something that's not working quite right, or not as expected, we work with them to determine whether they've found a bug, and we can thank them for that and help them understand how to file it as an issue. This also means increased exposure and data science skill-building for our study teams, which is really great, and we feel it helps us bring more contributors out into the open source world.

Another benefit is the growth of our tools. Going back to Slushy: we created it as an internal tool for our internal needs, but we discovered it was pretty useful internally, and that inspired us to consider that maybe Slushy would be useful for the industry. After all, the industry is trying to adopt open source, and everyone is trying to bridge stability and flexibility. So we decided to open source Slushy and put it on our public GitHub. That allowed us to, of course, welcome in new users, but also to open up the dialogue and engage with other organizations about this scenario and how best to deal with it, and to engage with Posit, who are building important pieces of infrastructure like Posit Package Manager and renv, so we can share our needs back and, again, open up this dialogue for how best to solve it.

More broadly, we try to ask ourselves this question a lot: we've built something, but would it be useful outside of GSK? Sometimes the answer is no. Sometimes we're building something that's really specific to us, and there's no reason to share it. But oftentimes we are solving problems that are common to others as well. Going back to the Pharmaverse, that's a really great example of people committing to open source to solve these common problems. So we go through this exercise often: when we create something, is there a benefit to others? If so, what do we gain by sharing? Should we put our package out in the world, or maybe just contribute to an existing open source package instead of building our own thing? And there's a lot that can be gained: more users, more stress testing of the tool, which ultimately makes it better, but also increased insight and perspective into things we're not considering. Ultimately, we feel really strongly about contributing to open source, because we've seen how those contributions really pay off, not just for us, but for our industry as a whole.

And finally, I want to mention, and this is really important, that open source is not just packages to us. Open source also means shared access to information. I've talked a lot about how we're trying to close technical gaps with the Accelerator model: we're trying to help people use packages and figure out where the gaps are. There's also an opportunity to close gaps in knowledge. A model of sharing information that I want to highlight, which does this really well, is KAMIS. KAMIS is a working group that we at GSK have involvement in co-leading. What KAMIS puts out in the world is example code for how to do statistical analyses in different programming languages, and also what might be different between them, because there are a lot of nuances and differences that do pop up. And the process to research them and figure out what might be different or the same is quite time consuming and tedious, and requires a lot of expertise at times. So KAMIS is a nice model, and the resources are really building. It's an example of how you can put in a small investment but get a big reward on the other side, because if everyone contributes a little bit, then we have this nice, big shared repository that has better longevity.

So with the Accelerator and engineering teams working together with study teams, we're able to better close gaps that come up along the way. And together with the open source community, we can better conquer the divide. All of these pieces, all of these support layers and layers of engagement with the community, we feel are really important for our overall success. So next, I will pass it to Andy, who's going to remind everyone of where GSK is and where we hope to go in the near future.

Where GSK stands today

Thank you, Becca. So before we go into the Q&A, I've just got a couple of slides to round off, and here's one that I shared earlier. A year ago, our biostatistics leadership team committed to these two pieces, and hopefully, over the last half an hour or 45 minutes, you've seen some of the things we've done to get us to this point from around a year ago: all central tools using R, and 50% of our code in R by the end of 2025. I think where we're at today, a year on from when these commitments were made, is a very, very good position. I think we will easily hit 50%, probably exceed it, and then we'll see where we go beyond that. So it's really nice to be in that position, and it's thanks to all of these various components and things that we've put in place at the enterprise level to get us there.

I wanted to finish with this last slide. I'm clearly obsessed with mountaineering, so one more mountaineering slide seemed fitting. I like this one for two reasons. One, it's showing that the summit is in sight: we are through the point where most of the key issues that we could foresee have been dealt with. We've prepared, and we've got to a point where we're almost there. Now, we haven't had studies submitted with every regulator around the world, and we haven't tried every type of statistical model we've ever run in SAS or anything else. But through things like KAMIS, which Becca was just talking about, we have the tools we need for that second part of the question. We have the kinds of answers we can give to anybody who's scrutinizing what we do. So I think, barring extreme bad weather at this point, we're really confident of making it to the top.

The other reason I wanted to finish on this slide is the path. You can see lots of climbers, and if you've got good eyesight or a big screen, you can probably see some more climbers further along, or even making their way up to the summit. As people tread this path time and time again, the path gets more worn, it gets clearer, and it's more obvious for others to follow. It's been a real pleasure to come and present today and share what we've done. And as I mentioned at the start, we really like seeing all of the other presentations that Posit has put on with others, not just from the pharmaceutical industry but from all industries, because it shows how people are making their own journeys. Putting this kind of information out there, as Becca said, sharing these kinds of stories and the kinds of tools that we build, helps others on that path. And when you get that groundswell of people all pulling in the same direction at the industry level, that's the kind of thing that feeds back in and helps us internally follow that lead and get there as well.

So whilst others will hopefully learn some things from the presentation we've given today, equally, we enjoy seeing the other presentations, because there are always things we learn that will help us as well. And that really is what the whole open source spirit and movement is all about: that kind of sharing and learning for the greater good.

Q&A

All right, awesome job. It's great to feature the work that your team is doing, because so many pharmaceutical companies reach out looking for blueprints or advice on the change management, and I feel like I'm constantly sharing your team's conference talks in various ways through emails. So I'm glad that we can tell the story and show how leadership commitment, user training, and package development can all come together to help support the transition to open source. We have a flood of questions that have been coming in, so let's go ahead and tackle some of these and see how far we get. I really liked this first question, because I get a lot of questions around the why and how for open source. So right off the bat, someone asks: what was the initial driver to move towards open source?

So the joke answer to this is that I rejoined GSK in 2017, which was the start of the timeline, so I was one of those kick-starters. But I'd go back to that slide I shared in the introduction, where I listed out all the things you need in place: the tools, the equipment, the people. I guess I was one of those enthusiastic people who wanted to make the journey. But another one of the key components was that our head of programming at the time decided that they were on board. They'd seen the conditions across industry, they'd seen what other groups were talking about and thinking about, and asked if this was the right time: could we do this, could we put these measures into place? And they were quite proactive in setting objectives for our programmers at the time, which was 400, 500, 600 people, saying: we want you to start learning R, we want to progress beyond where we are today, and this is a future vision for us. So that sponsorship line that I put up was really important as well.

And of course, all of that external influence, the weather, as I put it before, was influencing things at the same time, along with needs like the fact that we were doing a lot more Bayesian analysis. R was starting to be used there, but we didn't have the right systems: people were unable to run those kinds of simulations on their laptops. So we needed the systems as well. Several components all came together at once, and that's why I really emphasize the right time. No one thing will help you make this journey; you need a lot of different things aligning at the same time.

You know, it's amazing. If you look at the presentations by Roche, by Novo Nordisk, and yours, there was some magic in the air around 2017, 2018. It's when R/Pharma came about, and the R Validation Hub. It was definitely a turning point, I think, for a lot of groups. So to keep the questions going, we've got the second one here: why did GSK choose R over other scripting languages like Python or Julia? What do you think?

So I can take this one, and then I'll let Ben and Becca come in on later questions, so you're not just hearing from me. For those who don't know our industry well, I talk about statisticians, programmers, and data scientists. Obviously, Python is a really big language in data science, but a lot of the core of what we're delivering within biostatistics is not predictive models. I mentioned the several hundred outputs, potentially, that we will produce on a clinical study. A lot of that analysis is pre-planned, and it's statistical analysis at its core. And R is, as everyone knows, a very strong language for statistics; that's what it was built for. The latest methods and tools tend to be developed in R first before they appear, potentially, in other languages as well. So R is a natural point to go to. And if you're hiring statisticians, most statisticians these days are trained in R at university, at college, so they pick it up and bring it into industry.

So that's one of the core reasons. To the point around the faster languages: yeah, sure, we see Python used heavily where we do have predictive modeling questions, and that's increasingly a push. When we're getting into generative AI, Python will come up. So it's mainly for the core deliverables that R is the focus. And although I said speed is really important, let me clarify what I mean by speed. Maybe we should have had a picture of a glacier, because a drug, from molecule to market, is 15 years. So when I talk about speed, I'm not talking about real time, where something needs to be there instantly; we're talking about speed in terms of, can I shave days off? So the speed difference between the languages is not really as important as it is if you're doing fast, on-the-fly analytics.

Can I also jump in on this one real fast? The other thing you have to remember is that it's really hard to train people on even one language. If you're going to start making them learn other languages as well, you're just boiling the ocean, and at a certain point, you have to make a decision on where you're going to go. Again, this journey started in 2017, and in 2017, R was the appropriate language for those statistical components. So it's an important thing to remember: you're transitioning 1,000-plus people, and it's really hard to maintain multiple things and to keep those things aligned.

Well, let's keep the questions coming. So Andy, you and your team have been so impactful on the package side and the validation side. We've got a question here on: how does GSK manage packages? Do you have your own package repository, so that everyone uses the same versions of every package? Does one of you want to tackle this question?

I can try to take this one. We do have Package Manager accessible to everyone, but I think the important part of this one is the production, or GxP, environment. These are the frozen environments. These environments have everything pre-installed, so there's no deviating from them: you use that environment and you use what's in it, and that's already been decided and curated, so there's no variation there. For Slushy, we're helping people manage packages out in the open environment so that they can eventually get there. We're trying to give them similar controls and help them better manage the packages they should be using, and the versions, and everything. So that's another option, which we think of as a bit more of an intermediate solution. And then there's more of a playground environment, where, I suppose, anyone can use what they want. But yeah, the main answer is that the frozen environments have everything pre-installed.
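One common way to implement this kind of frozen, curated environment in the R ecosystem (a sketch of the general technique, not GSK's actual configuration) is an renv lockfile, which records the exact version and source of every approved package so that `renv::restore()` reinstalls precisely those versions. The repository URL, R version, and package versions below are illustrative only.

```json
{
  "R": {
    "Version": "4.3.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.example.com/cran/2024-01-15"
      }
    ]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```

Because the lockfile pins both the repository snapshot and each package version, every analyst restoring from it gets an identical library, which is the kind of reproducibility a validated GxP environment needs.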

What does it look like if someone requests a new package or wants to add a package? What is that process like?

Sure. So the way we get a package into a frozen R environment is that the package typically goes through what's known as an endorsement process, which is basically a quick glance at the package to say: all right, does this package actually meet our business needs from a strategy point of view, and do we believe it makes sense from a risk point of view? These packages are primarily brought up by the business, not by a central team, so it's totally business-driven, which is a really important component to make sure we get the right tools to the right study teams. What we're often looking at is someone proposing a specific stats methodology. One example I'm thinking about right now is all the different R interfaces to Stan: rstan versus cmdstanr versus rstanarm versus brms, et cetera. Do we want to select one? Do we want to let everyone just pick which one they want to use? And we'll make a decision. From there, if it's going to be used for the creation of outputs for a regulatory submission, it goes into our validation assessment process, where it goes through basically a package assessment in order to get included in the frozen R environment.

So let's go to another question here, on a topic that's been coming up more often: could you share more about the code review sessions and what they're like?

Yeah. So code review sessions, I think, are really crucial, especially if you are helping teams transition from other tools into R or other open source languages. I think the most important thing about a code review session is that you have to declare a norm upfront that it is not going to be used for performance evaluations. One of the things we've noticed culturally is that people will sometimes look at code review as a negative performance indicator. It is not; it is how we all get better and how we all improve our capability. So it also depends on once you've