Resources

GSK's R Journey: From Pilot Projects to Enterprise Adoption | Hosted by Posit

video
Nov 7, 2024
1:02:10


Transcript

This transcript was generated automatically and may contain errors.

Today we have Ben, we've got Becca, we've got Andy. It's really going to be an awesome team to highlight what they've done there: the focus on training, user adoption, the leadership vision, the open source backbone, and all of the change management that has to go into this, creating packages and doing all this phenomenal work. So really excited to bring this webinar to you today. The work that GSK has done has crossed so many boundaries, into the community with CAMIS, the R Validation Hub, working on the Pharmaverse, and the list goes on. It's just amazing to highlight this work today and to hear from this team. And so with that, I'm going to pass it over to Andy, who's going to kick off the webinar.

So thank you, Phil. Thanks for flagging some of that good work; a really nice introduction for us. And thanks, Posit, for inviting us to present today. We've really enjoyed the series, and when we were thinking about what we were going to present on, we'd really enjoyed the presentations from some of our industry peers on their work with R and their submissions. But rather than talk about the submissions work GSK has done at that small scale, today we're going to talk about going from pilot projects to enterprise adoption. We want to not just focus on these really nice examples where we get to a submission with a single study, but look at what we've done generally over the last five or six years to get to the point where we are today.

I'm going to start us off today and give a little bit of that context and background, then hand over to Ben and to Becca to come in and give a little bit more detail on some of the things we've done which have really driven that adoption at the enterprise level. With my job of setting the scene, I guess a good first place to start is with the image that you're seeing in front of you, with the mountains and the base camp. Where GSK is today is probably not base camp, but some camp further up the mountain. We are not at the point where we've completed our journey; we're not at the point where we have fully, 100% succeeded in rolling out across our organization, but we are very, very near.

We've put in a lot of effort and, as Phil called out, been involved in a few initiatives along the way. We've done a lot to get to this point, and that's really the focus of this talk: how we put ourselves in a great position to make that final push. I do want to warn you there are going to be quite a lot of mountaineering references throughout. That's the focus today: how has GSK taken a biostatistics organization of over 1,000 people and helped them on their journey to R?

Leadership commitments and the open source journey

I'm going to take us back about 12 or 14 months, to August 2023. At this point, our biostatistics leadership team made some very big commitments. The first was that, from that point forward, all central tools we built would be built using open source languages. The second was that 50% of our code, and remember this is 50% of the code from an organization of 1,000 people producing code every day, would be open source by the end of 2025. Now, to really get the impact of those statements, you've got to think about where we were five years ago. Five years ago, in fact even a year prior to this, none of our central tools were built using open source languages. Open source was in use, but not for our central tools. In terms of the code, again going back five years, maybe 5% or 10% was produced using open source languages. We've gone essentially from a position of zero, or close to zero, to a point where a year ago we were able to make two very big and bold statements about our use of open source tools, particularly the R language, moving forward.

Before I get into some of the details, and before Ben and Becca come in, just to position this for anyone listening who is not from the pharmaceutical industry: you will know that we deliver medicines and vaccines, and that is a core part of what we're trying to do. In doing so, we're essentially delivering something a lot like this rope on the left-hand side. The rope is a key part of what a climber needs to climb that wall, and in trusting that rope, they want to be sure it is going to be safe and effective. Delivering medicines and vaccines is very similar.

We need to make sure that the medicines and vaccines we develop are safe and effective, and that requires a lot of analysis of the data we collect from our clinical studies. If you're a climber and you go and buy a rope, you don't really expect to just take the manufacturer's word for the safety of that rope. You would hope that at some point an independent body has assessed its manufacturing: that they've checked the processes used by the company making the rope to ensure they meet good standards, and that the rope has been thoroughly tested and been through all of the measures that make it fit for purpose.

Again, linking that back to our industry and how we work, that transparency is also vital for us. We have regulators like the FDA, the Food and Drug Administration in the USA, and other regulators throughout the world who are going to check what we do. All of that boils down to why we have a thousand or so statisticians, programmers, and data scientists: we have a lot of work to do to generate our outputs, the tables and graphics that demonstrate that the medicines and vaccines we produce are safe and effective.

There's a lot of work to be done, and importantly, there is not a lot of time to do it. Like in any job, we need to be fast. But here, bear in mind, we've got patients around the globe looking to us for new medicines, or better medicines than the standard of care they have today. There's also the commercial aspect: we want to be ahead of our competitors. So we have a lot to produce, and not a lot of time to produce it in.

Overcoming resistance to change

Now, ideally, you'd look at this and think, OK, we've got a huge amount of code to write, a lot of analysis to be done, and not a lot of time. It should be the ideal scenario for innovation. Unfortunately, although that can be the case, there is a tendency towards what I call the old boots. So the boots you see on the screen at the moment: I have a pair like this. My parents bought me a pair of boots when I was 14 years old. I still have them today. I will walk in them, I will garden in them. They're a trusted pair of boots that I've had for many, many years. I won't tell you how many years past 14 I am, but it's a long time.

And these boots are great. Now, over time, they're a bit worn. They're maybe not quite as waterproof as they used to be, but they work for me. They worked the last time I took a hike, and I expect the next time they'll broadly work too. Now, out there in the market today, there are boots that are lighter, boots that are more waterproof than the ones I have. Maybe there are better colors, or other design features that improve on them. But what tends to happen in our industry is that, because it worked well last time, people can be a little bit reluctant to change and upgrade those boots, just like me and my hiking boots.

And when I talk about that upgrade, when we relate this back to languages, we're not just talking about a migration from one language to another. We're talking about systems, tools, processes, and ways of working that have to change alongside it. That kind of resistance to shifting is what we're facing when we try to roll out open source adoption, particularly at a company like GSK or any large organization like it.

What it takes to begin the journey

So we're going to make an open source journey. We didn't start in 2023; we started before that. But what got us there? Well, first of all, we need the right equipment. We need the standardized central environments that people can log into and use. We need to think about the management and support of, again, a thousand statisticians, programmers, and data scientists. But before we even get there, before we build these tools, we have to want to climb the mountain.

And that's potentially obvious, but it's really important. You can buy me the rope, the carabiner, the ice axe, the shoes, but unless I really want to go climbing, they'll just sit in a cupboard somewhere and not get used. I have to want to do it. And the same applies to adoption of open source languages in large organizations. It can't just be me; I need colleagues, a group of people around me that really wants to do this before it can happen. And those people are going to need the knowledge, the skills, and some experience before we can do so.

So you might be the enthusiastic R user or Python user within your company, but you need people around you who also have that enthusiasm, knowledge, and skill. For those who don't, we're going to need to train them. Even those who do have that knowledge are going to need further training and to develop new skills, things they wouldn't even have thought about if they were just using R for statistical analyses. The rollout of R at an enterprise level takes new skills and new knowledge that have to be built.

So we need a core group of people who are enthusiastic, who want to do this, and who have the right skills, and then we get on to sponsorship. This is probably about the furthest I'm going to push the climbing analogy, but when people go and climb Everest, they don't just buy the equipment and set out. They have sponsorship. They need insurance, they need all these other things paid for by someone. They need that sponsor behind them to make the mission happen. And again, it's the same when we're talking about enterprise adoption of languages like R and Python. We need the senior leaders. We need somebody who is going to pay for all of the equipment to get us going, and we need people who will not just put the equipment in place so it can sit in the closet, but who will drive that and encourage, or enforce, the usage of these tools and languages in practice.


So it all needs to align. And even if we have all of that, we still need the right conditions before we can move forward. In the climbing analogy, that's the weather: we need the right time of year, and we're not going to go in a blizzard. In our industry, we're talking about years, not months or days, and we need the external industry conditions to be right. There's a reason I'm talking to you about this now and not in 2005. For GSK, we felt the time to go was 2017. For other companies, it may have started earlier or later, but broadly speaking, that's where we feel we got off to a start.

If I step away from the climbing analogies for a second and break this down to what it actually meant in practice for us, this slide represents our open source journey. Going from top to bottom through the rows, we've got the systems and processes I've mentioned a couple of times already; we've got training; then we've got the pilots that we ran with people doing the job on the ground; and lastly, the tools that needed to be put in place. All of these different elements have combined.

Before I hand over to Ben and Becca to dive into a little more of the detail, I just want to flag the systems and processes that we put in place. I did say I really like this webinar series: if you go back far enough through the history of these webinars with Posit, you'll find one from me and a couple of colleagues at GSK talking about one of the systems we put in place, our R platform version 1, as I've put it on here. This is a system we built called Warp, which uses a lot of the Posit products to create that starting point. On our installation, when users log in, there are several hundred packages already available for them to use, so they don't have to go out to CRAN, install R, and install all those packages one by one on their laptop. It provides consistency and a shared user experience, and it can be supported at the enterprise level.
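Warp's internal configuration isn't shown in the talk, but the general mechanism behind a centrally managed R environment like this can be sketched as follows. Everything here is a hypothetical illustration: the repository URL and library path are placeholders, not GSK's actual setup.

```r
# Hypothetical sketch (not Warp's actual configuration): a centrally
# managed platform typically pins every R session to an internal,
# curated repository via a site-wide startup profile, so users never
# install from public CRAN directly.
options(repos = c(
  INTERNAL = "https://packages.internal.example.com/curated/latest"  # placeholder URL
))

# A shared, read-only library means the several hundred curated
# packages are already on the search path when users log in:
# .libPaths()  # e.g. "/opt/r-platform/library" first, user library second

# install.packages("dplyr")  # would resolve against the curated repository
```

The design point is that users get a consistent, supportable environment by default, rather than each laptop accumulating its own mix of package versions.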

The other component Phil actually mentioned in his introduction was the external factors, something like the R Validation Hub, which I had the pleasure of leading for three or four years. The R Validation Hub started around the beginning of that era, the 2018 to 2019 time frame, and was a collaboration whereby lots of pharmaceutical companies got together and started tackling the question: if there are 20,000 packages on CRAN, how do I know whether they all do what I expect them to? How do I choose the right packages to help me with my analysis? That matters for the regulators in a regulated environment like we have at GSK, but it applies in any industry. If you need to be sure that the packages you're choosing, the packages you're installing centrally, do what you expect, what processes can you put them through to check that and make sure your users aren't running code that gives them the wrong result?

The R Validation Hub produced a white paper, and we took that internally; the R for GXP label on here is a process we've built that adapts the thinking behind that white paper. I'll only touch on those lightly, but they're really key components. And with those two key components in place, through various pilots we got to the point where, like some of the recent presentations from our industry peers, we had a full pilot study in R: all of the code, 100% of the code produced in that study, written in the R language, which is a fantastic milestone. But bear in mind that at the enterprise level we have maybe 50 or 60 studies reporting every year. To go from that one study to the many takes a different kind of journey, and I'm going to hand over to Ben at this point to talk about how we went from some of these early pilots into wider-scale adoption.
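The R Validation Hub also maintains an open source package, riskmetric, that operationalizes the kind of evidence-gathering its white paper describes. As a rough sketch of that workflow (GSK's internal R for GXP process is adapted from the white paper and is not shown here, and the package chosen below is arbitrary):

```r
# Sketch of a package risk assessment using the R Validation Hub's
# {riskmetric} package.
library(riskmetric)

assessment <- pkg_ref("dplyr") |>  # gather metadata for one package
  pkg_assess() |>                  # run the individual risk metrics
  pkg_score()                      # convert metrics to numeric risk scores

# Metrics cover things like unit test coverage, bug closure rate,
# download volume, and documentation; an organization then weights
# these to set acceptance thresholds for its validated environments.
```

The point is not a single pass/fail answer but a repeatable, documented process for deciding which of CRAN's thousands of packages belong in a centrally supported environment.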

From pilots to enterprise: the Accelerate R approach

Awesome. All right, perfect. Thanks, Andy. So as Andy mentioned, we did some pilots. Now what? How do we go from those trusty old boots to a more expansive hike? Well, we did what worked last time: we decided to do a lot of training. We created large in-person training initiatives and classes, we provided that training, and we created training documentation for study teams. At one point within GSK Biostats, we had trained 80% or more of all individuals in the organization with something like an intro to tidyverse course. Our thought was: great, we've reached a lot of individuals with training, we've given them documentation, so study teams will begin to use R and create their outputs, because that's what happened last time. But the issue is, it didn't quite work.

So we had to reflect. And the thing we realized is that the gap between an R training class and when you could actually use R on a new study was something like 12 to 18 months. If you've ever experienced that sort of gap between learning something and trying to apply it, you very quickly realize you're going to forget everything and have to redo that training on the fly. So one of our biggest insights was, similar to mountaineering: you don't learn how to climb from a book, and you don't learn R from a book. You actually have to apply it. That's when we started to think about, okay, what's the transition we need to make?

During our reflection period, there were a couple of things we realized. One, not all study teams adopt and transition to R at the same time, which ties back to that 12 to 18 month timeline, so it often requires retraining on the use of R. Study teams have different needs according to their timelines. Also, training needs to be a major focus for study teams, and how we train those individuals is important for success. For example, if a study team already has some R experience, why should they go through an intro to tidyverse training again? Maybe they need something more advanced. The other thing happening while we were making this transition was that a lot of those R experts actually sat outside the business, in a sort of center of excellence. They didn't sit on the pipeline.

So we looked at these insights and decided we needed to change our training strategy. What did we do? We said: no more classes. We moved all our training into on-demand training and documentation, and we decided to spend most of our effort on supporting and mentoring individuals in Biostats. How did we do that? We created a team called Accelerate R, which is basically a small, agile pod of R experts that literally goes and sits side by side with study teams to train them on R. If you're curious about how we did this, I gave a more in-depth presentation on Accelerate R at posit::conf; I've dropped the link in the slides.

So what does a typical Accelerate R engagement look like when we sit side by side? It's not just that we drop in and help; we have a set process. The first thing we do is help individuals with prerequisites: getting system access, GitHub or the SCE, making sure they've done an intro to R training, and figuring out the availability and engagement of the team. If the team doesn't have much availability to actually work with us side by side, we won't do it, because we don't think there's a lot of value in trying to upskill a team that can't be there, available, and engaged.

Next, we do a bunch of onboarding. It's basically a week of training where we focus on GitHub ways of working and delivery-specific R packages. Sometimes we touch on Agile ways of working, but really we're focusing on how the team delivers the agreed set of deliverables using R. We then go through the delivery process itself. We like to assign deliverables to sprints, or two-week intervals, and then we help the study team actually deliver them. This could be where we create templates, do code review, train on specific new or niche packages, or just give a general overview of the tools within GSK. And finally, after delivery, we have a closeout. This was really important for us, because it let us document a lot of example code and templates.

And that has helped us as an organization, because one of the uphill battles R and open source technologies face is that we have a ton of historical legacy code libraries, and it's very easy to pull code from there in order to do things fast. We needed to create example code and templates in R that study teams could refer to, so they could catch up and work at the same speed.

One of the things we learned during Accelerate R is that the learning curve is incredibly steep, especially with R combined with some of the other technology we're introducing, like Git and GitHub. This is a graphic I like to show to explain our typical experience with study teams. If you look at time versus confidence, over time people become more and more confident until they reach a peak, and then all of a sudden they have to dive into the internals of an R package, and they spiral down because they think, oh no, I actually have no idea what I'm doing. And that's fine. We make sure individuals are supported through it, but we know there is going to be a steep learning curve.

So we started Accelerate R, and we had some really good success over about two years. And one of the things we realized doing Accelerate R is that we needed to evolve. We needed to evolve because we realized that training was helpful, but the true impact study teams needed was to have their obstacles removed. So we needed to change the role of Accelerate R. Similar to the old boots, and, you know, I'm sure the boots that Tenzing Norgay and Edmund Hillary were using to climb Mount Everest, we needed to change. We needed to partner with the R engineering team to bring real-time insights and feedback from study teams to that engineering team, so they could create the tools to enable study teams. That was crucial for us as an organization, because it helped us remove obstacles quickly for study teams instead of having long development cycles.


So supportive mentoring is great. I love it, and I've spent a lot of time doing it, but to achieve enterprise adoption, supportive mentoring alone isn't enough. You actually need to deliver the right tools to study teams and individuals so they can create their outputs. Going back to Andy's chart: we have the R platform, we have the processes, we have the supportive mentoring; now we need the tools. So I'm going to hand it over to Becca, who's going to talk about this last workstream around supporting tools.

Engineering tools to close technical gaps

Great. Thank you, Ben. All right. So I sit on our engineering team, and as Ben mentioned at the end of his section, we started to collaborate with the Accelerate R team to provide a better support model for our study teams. What we found is that our teams are really eager to use the latest and greatest tools available in open source. They're really excited. But with new things come inevitable roadblocks when you try to implement them in your traditional workflows. So as an engineering team, our overarching goal is to assess those roadblocks as they come up and create tools and solutions to help people move forward.

And when I talk about roadblocks, what do I mean exactly? We have this vast open source ecosystem available at our fingertips, tons and tons of R packages these days. And in pharma in particular, we have an ecosystem called the Pharmaverse, which is a whole variety of packages built specifically for clinical reporting. So really, there's no better time than now to start using R in our study work.

But every organization is a little bit different. Every study is a little bit different. Every dataset is a little bit different. And so gaps in functionality do come up. When the Accelerate R team encounters a gap in functionality, a tool that doesn't quite do exactly what we need it to do, the engineering team comes in to assess the situation and figure out: okay, what do we have that's working well, and where do we need to augment and build more around it?

Next, I want to talk about a particular use case, a situation that came up that led us to build a new tool. A year or two ago, GSK had created and made available its first production environment for R. This is the R for GXP effort that Andy mentioned at the top. We call these production environments our frozen environments, because their contents are carefully assessed, curated, and tested, and then ultimately put in an environment that's locked down, so users can't modify it. These frozen environments are set up exactly for production work.

And a study team came into Accelerate R, and they were ready to use R in production. One of the initial tasks in the clinical reporting workflow is preparing datasets, and there's a really standard type of dataset transformation that is made much easier by an R package in the Pharmaverse called admiral. So admiral helps support this very standard dataset transformation step. This team was ready to use that package, but they were also interested in a newer extension of it called admiralonco. This was an oncology team working on an oncology study, and admiralonco added some extra bits that would be useful for them specifically. When they inquired about this package, we were excited to tell them it would be coming available in the next frozen environment.
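For listeners outside pharma, the kind of dataset transformation admiral supports looks roughly like this. This is a schematic sketch only: the toy data and the choice of derivation are illustrative, so check the current admiral documentation for exact arguments.

```r
# Sketch of admiral's derivation style: building an analysis dataset
# (here a minimal ADSL) by chaining derive_* verbs over SDTM-style data.
library(dplyr)
library(admiral)

# Toy SDTM DM (demographics) domain with two subjects; a real DM
# domain has many more variables.
dm <- tribble(
  ~USUBJID, ~ARM,      ~RFXSTDTC,
  "01-001", "Drug A",  "2023-01-15",
  "01-002", "Placebo", "2023-01-20"
)

adsl <- dm |>
  # Convert the character --DTC date to a proper Date variable TRTSDT.
  derive_vars_dt(new_vars_prefix = "TRTS", dtc = RFXSTDTC) |>
  # Planned treatment arm for period 01.
  mutate(TRT01P = ARM)
```

Packages like admiralonco extend this same derivation style with functions specific to oncology endpoints, which is why the team wanted it alongside the base package.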

With every frozen environment, we're adding a little bit more and making it a little bit better, but they take time and effort to create, which means waiting a number of months in between. It turned out this team could wait for their final deliverables; they were far enough out that using the future frozen environment would be suitable. But they had a lot to do, and they needed to get started programming now.

This was actually the first time the engineering team was brought in, as part of the Accelerate R model, to talk with a team and figure out how we could support them in the meantime, while they waited for this future frozen environment without getting stalled. In discussing with this team what they might need, first and foremost was reproducibility: reproducible code and a reproducible environment. They weren't able to use a frozen environment, so they needed to install and manage packages themselves in an open R environment. Reproducibility is really critical to ensure that all of their code runs smoothly between people, both now and in the future. And renv is great for reproducibility.
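The core renv workflow mentioned here is small; a minimal sketch of it (this is renv's standard documented API, with admiral chosen only as an example package):

```r
# Minimal {renv} loop for a reproducible project environment.
renv::init()                 # create a project-local library and renv.lock

install.packages("admiral")  # install into the project library, not system-wide

renv::snapshot()             # record the exact package versions in renv.lock

# ...later, or on a colleague's machine, after cloning the project...
renv::restore()              # reinstall exactly the versions in renv.lock
```

Because the lockfile travels with the study code, everyone on the team runs against the same package versions even though no centrally frozen environment exists yet.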

They needed to know which packages were okay to use, which packages would be available in the next frozen environment, and then to have easy access to those packages. Posit Package Manager plays a critical role in this piece. This is really important: this team was working towards something in the future, so they needed the ability to move forward smoothly and then have a clean transition to that environment when it was ready. CRAN snapshots are one of the many pieces that help us here. And finally, with a team that's newer to R and newer to environment management, we needed to make this easy for them and help them stay on track: not using packages they shouldn't be using, and making sure their interim environment matched and aligned as closely as possible with that future frozen environment.
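The CRAN snapshot mechanism mentioned here works by encoding a date into the repository URL. A sketch, using Posit Package Manager's public instance and an arbitrary date for illustration (an enterprise setup would point at its own instance):

```r
# Pin the session to CRAN as it existed on a specific date, served by
# Posit Package Manager. Date and instance below are illustrative.
options(repos = c(
  CRAN = "https://packagemanager.posit.co/cran/2024-01-02"
))

# Every install now resolves against that dated snapshot, so two
# colleagues installing a month apart still get identical versions.
install.packages("dplyr")
```

Pinning to a dated snapshot is what makes it possible to later line an interim environment up with a frozen environment, once the frozen environment's own snapshot date is known.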

Introducing Slushy

All of these bits and pieces got wrapped up into a new package we created called Slushy. Its name is inspired by the fact that it's a frozen environment in a way, but melted down a little, because we're adding a bit more flexibility. The workflow for Slushy looks something like this: at the start of the study, or when they were ready to start programming, the team initialized Slushy with the packages they needed, pinned to a particular CRAN snapshot. Every so often, an update was performed to slide that CRAN snapshot forward, bit by bit, until they didn't need to anymore, because the next frozen environment's snapshot had been decided; then they could just coast to the end with their environment more or less matching.

To support them throughout this process, I'll just highlight one of the supports in Slushy. If you're curious about more details, I presented on Slushy at posit::conf 2023, so you can check out that talk. One of the areas we tried to support is helping people through any updates that might impact their code. As we all know, open source moves very quickly: packages update, new features come, things change. If teams can anticipate any code changes before, during, and after updates, that's really helpful, so they can adjust along the way. We created Slushy for this particular case, but it turned out to be a pretty common scenario, and we're continuing to find new uses for Slushy to this day. It's become a go-to solution for that mix of stability and flexibility. And Slushy is just one example of the many tools and solutions we've built in the moment for our teams to close these technical gaps.
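The talk doesn't show Slushy's actual API, but the pattern it describes, pin to a snapshot and slide it forward, can be sketched with plain renv. This is NOT Slushy's code (see GSK's public GitHub for the real package); the helper function and dates below are hypothetical illustrations.

```r
# Sketch of the "sliding snapshot" pattern Slushy automates, written
# against plain {renv}. Hypothetical helper; dates are illustrative.
use_snapshot <- function(date) {
  # Point the project at CRAN as of the given date...
  options(repos = c(
    CRAN = paste0("https://packagemanager.posit.co/cran/", date)
  ))
  renv::update(prompt = FALSE)    # ...pull packages up to that snapshot...
  renv::snapshot(prompt = FALSE)  # ...and record the result in renv.lock
}

use_snapshot("2023-06-01")  # initialize near study start
# ...weeks later, slide the snapshot forward in small steps...
use_snapshot("2023-08-01")
```

Sliding in small steps, rather than one big jump at the end, is what lets a team surface breaking package changes gradually and land on the future frozen environment's snapshot with minimal disruption.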

Unexpected benefits of open source engagement

But if we pause along our journey and this day-to-day work and look around, what we see is that there are actually some other unexpected gems. There's more than meets the eye, because we have the whole open source world, and the more we lean into it, the more we discover these gems. So next I'm going to talk about some of the side effects of our open source journey, and some of the other benefits we're realizing along the way.

One benefit: when we bring the engineering team into the Accelerate R model, we're bringing on-demand support to the teams, which is really great. But we're also bringing the tool developers and the teams much closer together. What that means is that the teams are getting more insight into our development process. They're understanding how we iterate on functionality and where we store the code. They're of course valuable users, and they're testing things out as they go, stress-testing the tools. If they encounter something that's not working quite right, or not as expected, we work with them to determine whether they've found a bug, and we can thank them for that and help them understand how to file it as an issue. This also means increased exposure and data science skill-building for our study teams, which is really great, and we feel it helps us bring more contributors out into the open source world.

Another benefit is the growth of our tools. Going back to Slushy: we created it as an internal tool for our internal needs, but we discovered it was pretty useful internally, and that inspired us to consider that maybe Slushy would be useful for the industry. After all, the industry is trying to adopt open source, and everyone is trying to bridge stability and flexibility. So we decided to open source Slushy and put it on our public GitHub. That allowed us to, of course, welcome in new users, but also to open up the dialogue and engage with other organizations about this scenario and how best to deal with it, and to engage with Posit, who are building important pieces of infrastructure like Posit Package Manager and renv, so we can share our needs back and, again, open up this dialogue for how best to solve it.

More broadly, we try to ask ourselves this question a lot: we've built something, but would it be useful outside of GSK? Sometimes the answer is no. Sometimes we're building something that's really specific to us, and there's no reason to share it. But oftentimes we are solving problems that are common to others as well. Going back to the Pharmaverse, that's a really great example of people committing to open source to solve these common problems. So we go through this exercise often: when we create something, is there a benefit to others? If so, what do we gain by sharing? Should we put our package out in the world, or maybe just contribute to an existing open source package instead of building our own thing? And there's a lot that can be gained: more users, more stress testing of the tool, which ultimately makes it better, but also increased insight and perspective into things we're not considering. Ultimately, we feel really strongly about contributing to open source, because we've seen how those contributions really pay off, not just for us, but for our industry as a whole.

And finally, I want to mention, and this is really important, that open source is not just packages to us. Open source also means shared access to information. I've talked a lot about how we're trying to close technical gaps with the Accelerator model: we're trying to help people use packages and figure out where the gaps are. There's also an opportunity to close gaps in knowledge. A model of sharing information that I want to highlight, which does this really well, is KAMIS. KAMIS is a working group that we at GSK have involvement in co-leading. What KAMIS puts out in the world is example code for how to do statistical analyses in different programming languages, and also what might be different between them, because there are a lot of nuances and differences that do pop up. And the process to research them and figure out what might be different or the same is quite time consuming and tedious, and requires a lot of expertise at times. So KAMIS is a nice model, and the resources are really building. It's an example of how you can put in a small investment but get a big reward on the other side, because if everyone contributes a little bit, then we have this nice, big shared repository that has better longevity.

So with the Accelerator and engineering teams working together with study teams, we're able to better close gaps that come up along the way. And together with the open source community, we can better conquer the divide. All of these pieces, all of these support layers and layers of engagement with the community, we feel are really important for our overall success. So next, I will pass it to Andy, who's going to remind everyone of where GSK is and where we hope to go in the near future.

Where GSK stands today

Thank you, Becca. So before we go into the Q&A, I've just got a couple of slides to round off, and here's one that I shared earlier. A year ago, our biostatistics leadership team committed to these two pieces, and hopefully, over the last half an hour or 45 minutes, you've seen some of the things we've done to get us to this point from around a year ago: all central tools using R, and 50% of our code in R by the end of 2025. I think where we're at today, a year on from when these commitments were made, is a very, very good position. I think we will easily hit 50%, probably exceed it, and then we'll see where we go beyond that. So it's really nice to be in that position, and it's thanks to all of these various components and things that we've put in place at the enterprise level to get us there.

I wanted to finish with this last slide. I'm clearly obsessed with mountaineering, so one more mountaineering slide seemed fitting. I like this one for two reasons. One, it's showing that the summit is in sight: we are through the point where most of the key issues that we could foresee have been dealt with. We've prepared, and we've got to a point where we're almost there. Now, we haven't had studies submitted with every regulator around the world, and we haven't tried every type of statistical model we've ever run in SAS or anything else. But through things like KAMIS, which Becca was just talking about, we have the tools we need for that second part of the question. We have the kinds of answers we can give to anybody who's scrutinizing what we do. So I think, barring extreme bad weather at this point, we're really confident of making it to the top.

The other reason I wanted to finish on this slide is the path. You can see lots of climbers, and if you've got good eyesight or a big screen, you can probably see some more climbers further along, or even making their way up to the summit. As people tread this path time and time again, the path gets more worn, it gets clearer, and it's more obvious for others to follow. It's been a real pleasure to come and present today and share what we've done. And as I mentioned at the start, we really like seeing all of the other presentations that Posit has put on with others, not just from the pharmaceutical industry but from all industries, because it shows how people are making their own journeys. Putting this kind of information out there, as Becca said, sharing these kinds of stories and the kinds of tools that we build, helps others on that path. And when you get that groundswell of people all pulling in the same direction at the industry level, that's the kind of thing that feeds back in and helps us internally follow that lead and get there as well.

So whilst others will hopefully learn some things from the presentation we've given today, equally, we enjoy seeing the other presentations, because there are always things we learn that will help us as well. And that really is what the whole open source spirit and movement is all about: that kind of sharing and learning for the greater good.

Q&A

All right, awesome job. It's great to feature the work that your team is doing, because so many pharmaceutical companies reach out looking for blueprints or advice on the change management, and I feel like I'm constantly sharing your team's conference talks in various ways through emails. So I'm glad that we can tell the story and show how leadership commitment, user training, and package development can all come together to help support the transition to open source. We have a flood of questions that have been coming in, so let's go ahead and tackle some of these and see how far we get. I really liked this first question, because I get a lot of questions around the why and how for open source. So right off the bat, someone asks: what was the initial driver to move towards open source?

So the joke answer to this is that I rejoined GSK in 2017, which was the start of the timeline, so I was one of those kick-starters. But I'd go back to that slide I shared in the introduction, where I listed out all the things you need in place: the tools, the equipment, the people. I guess I was one of those enthusiastic people who wanted to make the journey. But another one of the key components was that our head of programming at the time decided that they were on board. They'd seen the conditions across industry, they'd seen what other groups were talking about and thinking about, and asked if this was the right time: could we do this, could we put these measures into place? And they were quite proactive in setting objectives for our programmers at the time, which was 400, 500, 600 people, saying: we want you to start learning R, we want to progress beyond where we are today, and this is a future vision for us. So that sponsorship line that I put up was really important as well.

And of course, all of that external influence, the weather, as I put it before, was influencing things at the same time, along with needs like the fact that we were doing a lot more Bayesian analysis. R was starting to be used there, but we didn't have the right systems: people were unable to run those kinds of simulations on their laptops. So we needed the systems as well. Several components all came together at once, and that's why I really emphasize the right time. No one thing will help you make this journey; you need a lot of different things aligning at the same time.

You know, it's amazing. If you look at the presentations by Roche, by Novo Nordisk, and yours, there was some magic in the air around 2017, 2018. It's when R/Pharma came about, and the R Validation Hub. It was definitely a turning point, I think, for a lot of groups. So to keep the questions going, we've got the second one here: why did GSK choose R over other scripting languages like Python or Julia? What do you think?

So I can take this one, and then I'll let Ben and Becca come in on later questions, so you're not just hearing from me. For those who don't know our industry well, I talk about statisticians, programmers, and data scientists. Obviously, Python is a really big language in data science, but a lot of the core of what we're delivering within biostatistics is not predictive models. I mentioned the several hundred outputs, potentially, that we will produce on a clinical study. A lot of that analysis is pre-planned, and it's statistical analysis at its core. And R is, as everyone knows, a very strong language for statistics; that's what it was built for. The latest methods and tools tend to be developed in R first before they appear, potentially, in other languages as well. So R is a natural point to go to. And if you're hiring statisticians, most statisticians these days are trained in R at university, at college, so they pick it up and bring it into industry.

So that's one of the core reasons. To the point around the faster languages: yeah, sure, we see Python used heavily where we do have predictive modeling questions, and that's increasingly a push. When we're getting into generative AI, Python will come up. So it's mainly for the core deliverables that R is the focus. And although I said speed is really important, let me clarify what I mean by speed. Maybe we should have had a picture of a glacier, because a drug, from molecule to market, is 15 years. So when I talk about speed, I'm not talking about real time, where something needs to be there instantly; we're talking about speed in terms of, can I shave days off? So the speed difference between the languages is not really as important as it is if you're doing fast, on-the-fly analytics.

Can I also jump in on this one real fast? The other thing you have to remember is that it's really hard to train people on even one language. If you're going to start making them learn other languages as well, you're just boiling the ocean, and at a certain point, you have to make a decision on where you're going to go. Again, this journey started in 2017, and in 2017, R was the appropriate language for those statistical components. So it's an important thing to remember: you're transitioning 1,000-plus people, and it's really hard to maintain multiple things and to keep those things aligned.

Well, let's keep the questions coming. So Andy, you and your team have been so impactful on the package side and the validation side. We've got a question here on: how does GSK manage packages? Do you have your own package repository, so that everyone uses the same versions of every package? Does one of you want to tackle this question?

I can try to take this one. We do have Package Manager accessible to everyone, but I think the important part of this one is the production, or GxP, environment. These are the frozen environments. These environments have everything pre-installed, so there's no deviating from them: you use that environment and you use what's in it, and that's already been decided and curated, so there's no variation there. For Slushy, we're helping people manage packages out in the open environment so that they can eventually get there. We're trying to give them similar controls and help them better manage the packages they should be using, and the versions, and everything. So that's another option, which we think of as a bit more of an intermediate solution. And then there's more of a playground environment, where, I suppose, anyone can use what they want. But yeah, the main answer is that the frozen environments have everything pre-installed.
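One common way to implement this kind of frozen, curated environment in the R ecosystem (a sketch of the general technique, not GSK's actual configuration) is an renv lockfile, which records the exact version and source of every approved package so that `renv::restore()` reinstalls precisely those versions. The repository URL, R version, and package versions below are illustrative only.

```json
{
  "R": {
    "Version": "4.3.1",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.example.com/cran/2024-01-15"
      }
    ]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```

Because the lockfile pins both the repository snapshot and each package version, every analyst restoring from it gets an identical library, which is the kind of reproducibility a validated GxP environment needs.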

What does it look like if someone requests a new package or wants to add a package? What is that process like?

Sure. So the way we get a package into a frozen R environment is that the package typically goes through what's known as an endorsement process, which is basically a quick glance at the package to say: all right, does this package actually meet our business needs from a strategy point of view, and do we believe it makes sense from a risk point of view? These packages are primarily brought up by the business, not by a central team, so it's totally business-driven, which is a really important component to make sure we get the right tools to the right study teams. What we're often looking at is someone proposing a specific stats methodology. One example I'm thinking about right now is all the different R interfaces to Stan: rstan versus cmdstanr versus rstanarm versus brms, et cetera. Do we want to select one? Do we want to let everyone just pick which one they want to use? And we'll make a decision. From there, if it's going to be used for the creation of outputs for a regulatory submission, it goes into our validation assessment process, where it goes through basically a package assessment in order to get included in the frozen R environment.

So let's go to another question here, on a topic that's been coming up more often: could you share more about the code review sessions and what they're like?

Yeah. So code review sessions, I think, are really crucial, especially if you are helping teams transition from other tools into R or other open source languages. I think the most important thing about a code review session is that you have to declare a norm upfront that it is not going to be used for performance evaluations. One of the things we've noticed culturally is that people will sometimes look at code review as a negative performance indicator. It is not; it is how we all get better and how we all improve our capability. So it also depends on once you've