Resources

Shifting to an Open-Source Backbone in Clinical Trials with Roche

video
Jan 11, 2023
1:04:19

Transcript

This transcript was generated automatically and may contain errors.

I'm Phil Bowsher, the Director of Life Sciences and Healthcare here at Posit. I'm going to help moderate the session today. We're going to be doing some Q&A; you'll see that popping up on the right-hand side. We'll also be doing Q&A through Slido, and we'll share that link, plus questions coming in from YouTube. So lots of exciting questions coming in, and we'll tackle those at the end of the session today over the last 10 to 15 minutes.

I'm very excited to bring to you today the eighth installment of our Life Sciences Series. Previously, we highlighted BMS, Merck, GSK, Novartis, Johnson & Johnson, and others. You can find these Posit webinars linked below from my team. Today, we have an exciting sequel to our July 2019 webinar by Roche and Genentech that highlighted the role of R in drug discovery, research, and development.

Today, we're going to highlight the awesome work being done by Roche and Genentech in shifting to an open-source backbone in clinical trials. This amazing work is being led by James Black, the Senior Director of Insights Engineering in Data and Statistical Sciences. We'll kick off the presentation today with Kieran Martin, the R Enablement Lead and Statistical Programmer. We'll then pass it over to Thomas Neitmann, Code Collaboration Lead and Data Scientist, as well as the Technical Lead for the Admiral R package. We'll then end with Ning Leng, People and Product Lead in Product Development Data Sciences.

Roche and Genentech are such an inspiration, from the legacy of Robert Gentleman to the countless hours contributed to the open-source drug development community: a true lodestar in our space, working with other pharmas and contributing to efforts such as Admiral in the Pharmaverse, the mmrm package, the FDA pilot submissions, Bioconductor, R Core, rtables, and countless other projects for the community. Over 10 people contributed workshops or talks at the R in Pharma conference in November, and you can find those videos on the R in Pharma YouTube page linked below.

And it's with great honor and pleasure that we bring this session to the community today. And now I'll pass it over to Kieran to kick off the webinar.

Thanks, Phil. So, are my slides out? I'll just wait a moment.

Okay, great. Yeah, so thanks so much, Phil, for that intro. It's really great that our efforts are being seen in the wider community. So today, myself, Thomas, and Ning are going to talk about the next steps we're taking at Roche: the steps we've already been making and what's happening this year.

So I'm going to start by providing an introduction to what exactly we're talking about in terms of shifting to an open-source backbone, how we got to this point, and why we've made the decision to switch right now. Then Thomas is going to get into what that looks like in terms of the platform we're building to do our work on and the tools we're going to use on that platform. Then he's going to hand back to me, and I'm going to talk about how we're getting our talent base ready to switch from one language to another, and also how we're trying to move to a different culture in how we approach data science in general. And then we're going to finish with Ning, who's going to go into a bit more detail about how this business transition is actually going to look in practice.

Setting the scene: what is PD Data Sciences?

So let me start by setting the scene a little bit for what I'm talking about. Phil alluded to this already, but Roche is obviously a really big company, and today we're not here speaking for the whole of Roche. All of us are from a particular department at Roche called PD Data Sciences. If you don't know what that is, that's fine; I'll explain it a little bit now. Our basic job is the reporting of clinical trials.

So Roche obviously conducts loads of different clinical trials all over the world to gather data on the effectiveness of the different products we create to help fight diseases. And our role in PD Data Sciences is to make sure all that data is gathered accurately, following the protocols we've set out, and to summarize it so that we have a good picture of how these drugs work. So as you might imagine, this is a really important job. It's really vital that we get this right, that the safety and efficacy data are accurate, and that we can share them with regulators as soon as possible. Because the more quickly we get this data summarized and presented, the quicker we can get drugs to patients to help them fight illnesses.

The goal: a language-agnostic framework

So what's our goal then? Fundamentally, our goal in PD Data Sciences is to shift towards a language-agnostic framework. What I mean by that is we're moving away from having one solution for how we do our programming for clinical trials. We want the ability to be flexible and move towards the best language for solving our problems. And our expectation is that in future, there won't necessarily be one solution. We'll be picking the best solution from the marketplace.

So most of what we're talking about today is the first step on this journey, but we will be alluding to the wider ideas of this as well. And that first step, I think, has already been mentioned a bit in the advertising for this seminar, which is that this year, from quarter two, novel studies will be using R as their core data science tool. So we'll be analyzing our data using R primarily.

I will say that this is specifically for novel studies. We're not doing a cliff-edge thing where we switch everything to R overnight, but it is a big change. We expect that over the next few years, we'll have more and more studies using R as their core data science tool until we're using it as our primary tool across all of our studies.

And this, as I say, is the first step in a journey that lets us move flexibly later on. So R isn't our destination. It's part of a journey that allows us to move towards whatever the best data science tool is in the future.

Why now? Reasons for the shift

So just a bit of context setting, if you're not super familiar with Pharma: the industry, in general, has typically used commercial software to support filings. I think that's been true across the industry, and it's been true for a really long time. That's the way we've been doing things, and there are lots of historical reasons I won't get into here. But it does mean that our talent and tools, at least historically, were built around that software. You know, we have a code base that's been built up over 20 or 30 years, and a talent base that's been using that code base for all of this time. So that's just to highlight why this shift is big for us.

So why are we doing this? There are loads of different reasons that have inspired us to make this change. I'm going to highlight a few today, but I wouldn't say that this is an exhaustive list. When it comes to R in particular, R is something that had been used at Roche for a while, but I would say the wedge product that got it used a lot more in PD Data Sciences in particular was R Shiny. Shiny really offered capabilities we didn't already have in our existing tools. The ability to make these interactive apps using code so quickly and simply proved a really attractive offering and really encouraged us to start using R, at least behind the scenes, in clinical studies.

And in fact, one of the earliest R packages that was built in-house to support this was a package called Teal, which Thomas is going to talk about later. This is now open source; you can follow that URL to see the package right now. And this was an in-house solution for standardized Shiny apps.

So we had this justification for why we wanted to start using R more. And then there were a few other things that really started to tip the balance towards moving to R. One was just about the talent base. Generally speaking, when we're recruiting new graduates, they're much more likely to know open-source tools such as R, or indeed Python, than commercial software.

But the final point, which I think is a bit more forward-looking, is we wanted to go open source. And the reason for this is we think open source offers loads of opportunities for how we work. The first is just around getting the latest developments more rapidly. Typically, if there's a new statistical method for analyzing data, it will often be implemented in R first of all. So if we want to get access to those methods more quickly to provide better results, then we should be using R.

The other is we knew we wanted to go language agnostic. We knew we wanted that ability to switch between languages. And we know that open source is going to make that easier. Because with no proprietary formats, it's usually easier to switch between contexts. And the obvious example of this is the ability to switch between R and Python. There's been a lot of effort to make those two languages talk to each other nicely.
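As a concrete illustration of that interoperability, here is a minimal sketch using the reticulate package; it assumes Python and numpy are installed on the machine.

```r
# Minimal R/Python interop via reticulate (assumes Python + numpy installed)
library(reticulate)

np <- import("numpy")          # import a Python module into the R session
x <- np$array(c(1, 2, 3, 4))   # pass R data into Python functions
np$mean(x)                     # results come back as ordinary R values
```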

And the last one, which is one I'm really excited about, and Thomas is going to talk about a bit more in a minute, is the ability to collaborate with external partners. So up until a few years ago, all of our code was created entirely in-house. It wasn't really possible for us to work with people from outside of Roche. But if we go open source, we suddenly have this ability to get input from people across the industry, which I think is, in the long run, going to lead to much more efficient and better code and ultimately better outcomes for us and for our patients.

So hopefully that's been a good introduction as to why we're doing what we're doing. And now I'm going to pass over to Thomas, who's going to talk a bit about, as I say, the environment and the tools we're using in that environment.

The Ocean platform: building blocks

Thank you so much, Kieran. So then let's talk about the building blocks of this kind of modern data science platform we set out to create, which is really focused on open source as its backbone.

So first of all, I would just like to quickly dive into a bit more detail on the requirements we as PD data scientists have for our day-to-day work. Generally, we can separate our tasks into either the regulatory side of things or the exploratory side of things. And on the regulatory side, there are really four main deliverables we need to create. The first one is called SDTM. SDTM is an industry standard for the data we collect in clinical trials. So regardless of how the data is collected, in the end, when you submit it to the health authority, it has to be in this data standard such that it's harmonized across companies.

Next up, we have the analysis or ADaM datasets. Here you take the SDTM datasets as source and then transform them in a way that suits your analysis downstream. As a quick example, imagine you collect a questionnaire which has 10 questions and an overall summary score. The 10 questions you would have in your SDTM data, because that's the raw data as collected. But then you would calculate the overall summary score as part of your ADaM transformation.
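To make the questionnaire example concrete, here is a minimal dplyr sketch of such a derivation. The input qs dataset and its variable names follow CDISC conventions but are assumptions for illustration, not code from the talk.

```r
# Sketch: derive an overall summary score per subject and visit from
# item-level SDTM QS records (qs, USUBJID, VISIT, QSSTRESN are illustrative)
library(dplyr)

adqs_total <- qs %>%
  group_by(USUBJID, VISIT) %>%
  summarise(AVAL = sum(QSSTRESN), .groups = "drop") %>%
  mutate(PARAMCD = "QSTOTAL", PARAM = "Questionnaire overall summary score")
```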

Next up, we need to perform a bunch of statistical analyses, and that can be something as simple and descriptive as counting adverse events, or something much more complex, for example, fitting a survival model or a mixed model for repeated measures. And finally, we need to put those statistical analyses into an easily digestible format, and that's generally either tables, listings, or graphs. All these four components combined make up the packages that we submit to health authorities, such as the US FDA, in order to get market approval for the novel therapeutics we're working on. So this is really at the core of our business.
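To illustrate how readily available these methods are in open source, here is a minimal Kaplan-Meier fit using the survival package and the aml dataset that ships with it.

```r
# Kaplan-Meier fit using the survival package's bundled aml dataset
library(survival)

fit <- survfit(Surv(time, status) ~ x, data = aml)
summary(fit)  # survival estimates by maintenance group
```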

But of course, we also have the exploratory side of things. And here, what we need to do is really highly context-dependent. Therefore, we need flexible tools and really the best tool for the job. So having an open-source backbone where you have access to multiple languages, for example, is really a good thing in this context.

We also need to consider that we're getting more and more novel data modalities. For example, imaging data is rapidly becoming more common in our trials, along with digital biomarkers, omics data, et cetera. So it's not just the classical rectangular data anymore, which we used to analyze in our commercial software.

And finally, we want not only to enable our data scientists to easily explore our data, but also people who don't work in data science: our colleagues in safety, or clinical scientists. We want to enable them to easily explore their data, and to do so in a reproducible manner, so that they don't always have to come back to us as data scientists in a long feedback loop. And really, all of this combined necessitated that we create a next-generation statistical computing platform where we could perform all these analyses.

And this is where Ocean comes into play, our one central analytics environment, which is the home for both our exploratory and regulatory work. Ocean is built in the cloud on AWS, and it's a language-agnostic platform, which right now supports R and Python, as well as SAS for our legacy work. But it is easily extendable because it's based on Docker containers. So if you spin up a container with, for example, Julia in it, you would suddenly have access to that language as well.

For version control, we are using Git and GitLab. And we're not only using that for our packages we develop internally, but it actually will be there to store the code for all our studies. So if on a study you program a particular ADaM data set or an adverse event table, the code for that will be stored on GitLab.

Next up, and this is really a big change, even though it sounds somewhat silly: we moved away from the proprietary sas7bdat format, which used to be the only format we used to store our CDISC data. Instead, we opted for an open-source format called Parquet, which is developed by the Apache Foundation. It's a binary format for handling rectangular data, it's highly efficient in both reading and writing, and it has served us very well so far.
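As a quick sketch of what that switch looks like from R, the arrow package reads and writes Parquet, and haven reads the legacy format; the file names here are hypothetical.

```r
# Moving a dataset from the legacy sas7bdat format to Parquet (illustrative)
library(haven)
library(arrow)

adsl <- read_sas("adsl.sas7bdat")      # legacy proprietary format
write_parquet(adsl, "adsl.parquet")    # open, efficient binary format
adsl2 <- read_parquet("adsl.parquet")  # fast round trip
```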

We have an internal RStudio Connect server as part of this platform, where we house our exploratory Shiny apps, and we have a validated R package repository. So while typically you would install packages from CRAN, internally we use this validated package repository. There you'll find a lot of packages that you know from CRAN, such as ggplot2 or dplyr, as well as our own open-source packages plus internal packages. Everything goes through our own really streamlined and efficient internal validation process, which is called autovalidate.

Then we are using another open-source tool called Snakemake to orchestrate our production runs. Snakemake is very clever in a way, because it creates a directed acyclic graph of all the artifacts that you need to create. So think of your SDTMs, your ADaMs, your TLGs. And then it parallelizes as much as possible, taking dependencies into account. So what used to take maybe a whole day, for example, to run everything for a final clinical study report is now done in a matter of hours, if not minutes.
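For readers unfamiliar with Snakemake, here is a hypothetical Snakefile sketch in Snakemake's own rule syntax, with invented file and script names. Snakemake infers the dependency graph from each rule's inputs and outputs, so independent branches run in parallel.

```
# Hypothetical Snakefile: the DAG is derived from each rule's inputs/outputs
rule adsl:
    input:  "sdtm/dm.parquet", "sdtm/ex.parquet"
    output: "adam/adsl.parquet"
    script: "programs/adsl.R"

rule adae:
    input:  "sdtm/ae.parquet", "adam/adsl.parquet"
    output: "adam/adae.parquet"
    script: "programs/adae.R"

rule t_ae:
    input:  "adam/adae.parquet"
    output: "tlg/t_ae.out"
    script: "programs/t_ae.R"
```

Running something like snakemake --cores 8 would then rebuild only what is out of date, in parallel where the dependencies allow.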

Next-gen tools: OAK, Admiral, and RTables

And finally, Ocean is the home of our next-gen tools for clinical reporting, and I will spend the rest of my part talking about these next-gen tools.

So the first one of those is called OAK, and OAK is our solution for automating SDTM mapping. SDTM mapping used to take a considerable amount of our time, and it was a rather labor-intensive process that was also somewhat mundane. So we really opted to streamline this by creating an automation solution. Driven by metadata and our global data standards at Roche, OAK can now automate around 80% of SDTM domains with around 22 reusable mapping algorithms. This, in turn, has saved us a lot of time; we've already seen efficiency gains of at least 50%, so this is really huge. And finally, it's actually a fairly simple tool to use, because at the end of the day it's a single R package you interact with, plus a web app.

But that being said, under the hood it's somewhat more complex, because it has to interact with several systems, and that's actually why it's called the OAK Garden. The first components of this garden are called Pistol and Honeybee. What you have to think about is that we have a metadata repository at Roche, which is a graph database. Inside that, we store the mappings from the raw data as we collect it to the SDTM standard, in a machine-readable format. And that's really key here, because it makes it easy for R, for example, to query this information and make use of it.

Next up, we have our SDTM spec creator, which is called Mint. This is a web application written in React to create the study-specific SDTM mappings. And when I say that, really most of it, at least the 80% from my previous slide, should already be automated there if you adhere to standards. Then in this app, you have additional functionality to add any non-standard mappings, if, for example, you collect new data endpoints which are not yet in our standards. All of this is then saved in a repository called SEPHRON, again in a machine-readable format, namely JSON, such that it can be queried down the road from R. And very importantly, any non-standard mappings you define in the Mint application are saved in SEPHRON so that you can reuse them across different studies.

And finally, we have the actual data transformation engine, which is the OAK R package itself. It's the package people would use in the end, and it's a very streamlined process: you fetch all this metadata from SEPHRON, and then you have a single function call per SDTM domain. Given all this metadata, it will then output your SDTM datasets.
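OAK itself is internal, so its actual API is not public. Purely as an illustration of the "one call per domain" idea described above, a hypothetical usage might look like this; every name below is invented.

```r
# Hypothetical illustration only; this is NOT the actual OAK API
library(oak)  # internal Roche package, not publicly available

spec <- fetch_sdtm_spec(study = "XX12345", domain = "VS")  # metadata from the spec repository
vs <- create_sdtm_domain(raw = raw_vs, spec = spec)        # one call per SDTM domain
```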

So the OAK Garden is not open source, and the reason for that is that it's fairly tailored towards our internal Roche architecture. But that being said, the overall principle is broadly applicable to anyone who has to map SDTM datasets, which is also the reason why we're in active talks with the CDISC Open Source Alliance to create a proof of concept in which we would use CDASH, rather than our own internal data collection standard, as a source. And if that's something you and your organization are potentially interested in collaborating on, then please reach out. I certainly heard a lot of questions regarding SDTM and R at R in Pharma, so this might be a good opportunity to reach out, and maybe a good collaboration can spring out of that.

So that covers the first pillar of our regulatory reporting side, and now it's time for the second, which is the ADaM datasets. Here, our solution is called Admiral, which is an open-source, modular toolkit for ADaM dataset creation in R. What is really unique about Admiral is that, unlike other packages we developed which only later became open source, this did not start out as a closed-source internal project. Really, from the start, we collaborated with another organization, namely GSK, to hit the ground running and build something that would be usable not only for Roche's purposes, but for the industry as a whole. And I think it's fair to say that this has been quite a success. We certainly want to use this model of collaboration going forward.

And when I say success, I think one metric that really hits that home is that what started out as a collaboration between two organizations to create one R package is now actually an ecosystem of R packages, developed by seven pharmaceutical organizations in total. So in addition to Admiral, we now have admiralonco, where Amgen and Bristol Myers Squibb joined us with their oncology expertise. There is the admiralophtha package, which Novartis and Roche are working on together. And finally, GSK is collaborating with Pfizer and Johnson & Johnson on the admiralvaccine package.

And if you and your organization are interested in leveraging R and Admiral for ADaM dataset transformation and you have expertise in a particular therapeutic area, again, please reach out. We would really love to see this ecosystem growing.
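For a flavor of Admiral's modular style, here is a minimal sketch building a couple of ADSL variables from SDTM DM and EX. The argument helpers have shifted across releases (older versions used vars() where newer ones use exprs()), so treat this as indicative rather than definitive.

```r
# Sketch of admiral's modular derivation style (dm and ex are assumed inputs)
library(admiral)
library(dplyr)

adsl <- dm %>%
  # convert the character birth date BRTHDTC into a numeric date BRTHDT
  derive_vars_dt(new_vars_prefix = "BRTH", dtc = BRTHDTC) %>%
  # take each subject's first exposure record as the treatment start
  derive_vars_merged(
    dataset_add = ex,
    by_vars = exprs(STUDYID, USUBJID),
    order = exprs(EXSTDTC),
    mode = "first",
    new_vars = exprs(TRTSDTC = EXSTDTC)
  )
```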

Next up, I want to talk about our statistical engineering team, which really works at the intersection of biostatistics and statistical software engineering. This team focuses on accelerating the adoption of novel statistical methods: all the cutting-edge stuff that comes out of academia, implemented as software. So far, this has mainly focused on R packages, with Stan or C++ occasionally used in the background. But the team is really open to other languages as well, for example Julia; basically anything that is open source and a viable solution for statistical computing.

And really important to highlight is that this team is not working at Roche alone; it's part of a larger industry-wide effort, namely the Software Engineering Working Group of the American Statistical Association, where not only Roche and other pharmaceutical companies but also academic institutions jointly work together to implement the cutting edge of statistics in software. And here are just a couple of examples of packages the statistical engineering team has worked on. The mmrm package down there is actually the first one that came out of the Software Engineering Working Group. Another one I would like to highlight is the rbmi package for reference-based multiple imputation, which won an award from the Royal Statistical Society last year. Once again, this has been a collaborative effort, in this case between Roche and the University of Bath, an academic institution.
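The mmrm package is on CRAN, and its introductory example gives a feel for the interface: a mixed model for repeated measures with an unstructured covariance per subject across visits, fit on the fev_data dataset bundled with the package.

```r
# Fit a mixed model for repeated measures with the mmrm package
library(mmrm)

fit <- mmrm(
  FEV1 ~ RACE + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data  # example dataset shipped with the package
)
summary(fit)
```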

So finally, on the regulatory side of things, we have the creation of our tables, listings, and graphs. And really, when we started doing that in R, the G part, the graph part, was the easy one. We just used ggplot2. It was the best tool available for the job, and in our opinion it still is. It was certainly also a big reason to shift to R in the first place, having such a good tool readily available. However, on the T side of things, the tables, especially if you look a couple of years back, there was no really good solution out there. So what we did is we created our own, which is called rtables.

And rtables really has an expressive and modern API for table generation in R. Very importantly, it is built from the ground up with the needs of clinical reporting in mind. We think this is really key, because the kinds of tables you build in this space tend to be much more hierarchical and complex than, say, a typical table you would find in an academic journal or even on most websites. Right now, rtables supports outputting tables in plain text, Word, PowerPoint, PDF, and HTML, and the development version on GitHub has also added RTF support. This is an open-source package which you can install from either GitHub or CRAN.
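To give a feel for that layout-based API, here is a small example along the lines of the package documentation, using the example ADSL data that ships with rtables.

```r
# A simple rtables layout: columns split by arm, summary rows for AGE
library(rtables)

lyt <- basic_table() %>%
  split_cols_by("ARM") %>%
  analyze("AGE", afun = function(x) {
    in_rows(
      "Mean (SD)" = rcell(c(mean(x), sd(x)), format = "xx.x (xx.x)"),
      "Median" = rcell(median(x), format = "xx.x")
    )
  })

build_table(lyt, ex_adsl)  # ex_adsl is bundled example data
```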

Something that works in tandem with rtables is tern. It's really a complement that focuses on the summary statistics: you can think of it as bridging the statistical analysis you perform, for example with one of the packages we saw on the previous slide, and actually getting the results displayed in the table. And this is yet again an open-source package which is readily available for you to use and check out.

The final piece on the TLG side of things is Chevron, and this is really about streamlining our processes even further. So when we started out doing tables in R, we had an internal TLG catalog, which looked something like this. This is a particular first-event table, and you can see it's just a code dump you would copy-paste into your study folder and then use. But really, copy-pasting is not a good solution. So we opted for something template-based. And right now, you can basically create the same table by using the Chevron run function, indicating the template you want to use, and then feeding in your data. And off you go. So a further gain in efficiency in that way.
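Based purely on that description, a call might look something like the sketch below; the template name and the shape of the data argument are assumptions for illustration, not a verified API.

```r
# Hypothetical sketch of the template-based Chevron workflow
library(chevron)

tlg <- run(aet02, adam_db = list(adsl = adsl, adae = adae))
```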

Teal: exploratory analysis for non-data scientists

So finally, let's talk about the exploratory side of things. I mentioned in the beginning that we want to enable our non-data-scientist colleagues to easily, and importantly reproducibly, analyze their clinical trial data. So we set out on this journey and decided that R Shiny would be a really good solution for that. However, we had a problem, which is: how do we actually get our 600 or so data scientists, who are not Shiny developers, to build these exploratory apps in the first place? We did not want to train all of these 600 people to become Shiny gurus.

Instead, what we opted for is to create this framework called Teal, which Kieran already mentioned in the beginning. This framework really centers around reusable modules, and it abstracts away a lot of the complexity of building Shiny apps from scratch. You don't actually have to care so much about the UI or the server logic in the modules you readily adopt from the catalog.
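Before the screenshot example, here is what a minimal Teal app looks like in code. This assumes a recent release of the framework, since the data-wrapping API has evolved across teal versions.

```r
# Minimal teal app sketch (API details vary across teal versions)
library(teal)

app <- init(
  data = teal_data(iris = iris, mtcars = mtcars),  # named datasets
  modules = modules(example_module())              # a built-in demo module
)

shinyApp(app$ui, app$server)
```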

This is an example of a very simple app created using Teal. You can see on the top you have all these tabs, and each of those tabs represents a particular module. In this case, it's a demographic table module. What all modules have in common is that in the center you have a display. In this case it's a table, but it could also be a plot. Then on the left, in addition, you have this encoding panel where you can customize what you see in this table; for example, here you could select different variables to be displayed. So that's the module-specific part. And as I said, we have a large catalog, I think even over 100 modules, readily available for our internal people and also for you, because this is open source.

But then the Teal framework itself also provides a couple of goodies that work across all modules, the first of which is this filter panel you see here on the right. Depending on which data is used in a particular module, it lists the datasets on top, and then you have the ability to select any variable out of these datasets to drill down into the data. So in this case, we have a particular subgroup based on ADSL. But if you imagine this were some kind of adverse event analysis, you could easily filter down to, let's say, serious adverse events.

Next up, we have this reporter functionality, and this is a really neat one. Anything that you create in any of these modules, you can add to a shopping basket, sort of. And then at the end of the day, you can go to the cart and check out by exporting a PDF report or a PowerPoint report, which will then contain everything you added to the cart: maybe a demographic table, a couple of plots, a response table, whatever you may wish.

And finally, we have this Show R Code button functionality. This is really key for us because it enables the reproducible nature of Teal. Unlike Tableau, for example, or Spotfire, which are also great tools to visualize data and drill down, this is really 100% reproducible: if you click the button, you will get a complete R script, including loading all packages, loading all the data that is needed, filtering the data as you have it in the filter panel here, and then reproducing the exact display that you see. And that's really, we think, key for us.

So as I already mentioned, this has been open source since last year, and the code is available on GitHub for you to look at and to install from there. We have two sample apps which are publicly available, and the picture I showed earlier is one of those, so definitely check those out. And actually, Teal has been used in an R submission pilot to the FDA, which is quite an exciting development: instead of static TLGs, submitting a Shiny app. And there will be more communication about that in the future.

We have an upcoming webinar around Teal at the end of February, where we will dive deep into this particular tool for an hour or so. And finally, we are looking for collaborators to develop this framework further. I think that really goes for all the tools we've shown here, in the spirit of the open-source community: we take a lot of the great stuff that other people have put out there, and we also want to put the packages we develop out there and collaborate with others to develop them further.

And finally, I would just like to mention that everything I've talked about is part of a larger effort in the industry called the Pharmaverse, where different pharmaceutical companies come together and develop R packages for end-to-end clinical reporting. That's really a great development, which we hope gains further traction this year and in the years to come. And with that being said, Kieran, I hand it back to you.

Preparing the talent base

Oh, thanks, Thomas. So Thomas has spent some time talking about the setup of how we're going to do our work at Roche. It's not all finished, but it's very close to finished: a work in progress where we have this environment we're going to work in and we have this tool base. And certainly both of those are going to continuously evolve over time. But we have a good idea of what we're doing there, and we've presented that now.

But the other piece that we really can't forget about when it comes to moving to a new language is looking at our talent base, making sure that they're ready for this change. And it definitely is a big change. It's not just about shifting to R: we're changing the environment we're working in, we're changing the tools, we're using version control. So there's a lot for our talent to learn, and we need to make sure that they're supported.

So what I'm going to talk about for the next 10 minutes or so is some of the measures we're taking to support everyone. And this is all a work in progress; this is something we're learning as we go. I certainly expect to be giving future talks in a year's time where we discuss what worked and what didn't. But here are at least some of the ideas we're throwing out there right now.

So just thinking for a second about who we have working in PD Data Sciences. There's a real mix of different people here. We have people who've been in the industry for 20 or 30 years and have never coded in R at all. So they have a wealth of experience in clinical trials, but not much experience with R in particular, or with Git. And then we have people who are our experts: a lot of people who've joined the company more recently who, as I alluded to earlier, have this R experience but haven't had the opportunity to apply it in the context of clinical reporting.

The other challenge we have is that not everyone is transitioning to R at the same time. Again, as I mentioned in the intro, we're not moving every single person to R all at once; we're doing it in a phased way. And that's good from a management point of view, but it does mean that you can't just give everyone training one day and expect them to be able to apply it the next.

And the other thing that I think is really important when you think about transitioning to a new language: it's not just about learning how to use that language. It's also about becoming fluent in it. We really want our programmers to know what good R code looks like, not just be able to throw something together that just about works.

So as I say, I'm going to go through some of the solutions we've looked at for dealing with these issues. The big one, I think, is around helping guide our users. Because of all those challenges, doing active live training isn't always going to be possible. Realistically, given resource constraints and the fact that everyone's not transitioning at once, it doesn't always make sense to do live training. It's certainly something we'll do, but we need people to be able to self-study. And the problem with self-study is that if you just put someone in a room and tell them to learn a subject, they might be able to do it, but it's going to be challenging. They won't necessarily know where to start. And in particular, they might go on a journey that helps them a bit, but isn't as guided as it should be.

So we've been looking at a few different ways to make this more effective. The first idea is to work out what people don't know. Often when we've sent out initial surveys on people's knowledge of a subject, we've done it in a not very granular way. We ask, you know, how long have you been programming in this language? How good do you think you are at it? And self-assessment at that level can be helpful, but it's often true, particularly when you're starting a subject, that you don't know what you don't know. So there may be gaps in your knowledge, and you're just not aware of them because you're not familiar enough with the language.

And certainly for those of you who are familiar with R, there are lots of pitfalls in the complexity of how R works that, if you're only starting off with R, you're not necessarily going to know about. So that's what the data science profile does. It breaks all of these categories into discrete topics and then builds assertions on the knowledge we expect people to have to be considered proficient in each topic. I've got a screenshot here, which is an example for Git, with a collection of different things I might expect as basic knowledge when you're starting with Git. And basically, this is all based on self-assessment. We're not trying to catch people out here; we're not trying to trap anyone. We want an honest reflection of where they are, and that helps them to understand where they need to focus their learning.

So the next step is giving them the resources to fill the gaps that they have. One important thing we've thought about is that we don't really want to reinvent the wheel. What I mean by that is there can be a temptation, when you've got some time and the ability to create training materials, to go ahead and just start making training resources without checking whether the resource exists already. There's a huge amount of material online, both free and available via commercial license, and some of it is great.

The challenge that a new learner has is they don't necessarily know how to direct themselves through this. So we've already got a bit of a start, and they know the kinds of topics they want to work on. But even then, what are the best resources for them to use, and in what order should they progress through them? So what we've built for this is an in-house solution, which we're calling the Data Science University. It's a tool to curate existing materials. What we do, and we've got an example here, is build learning paths on a particular topic. This is a good-coding-in-R path, which we break into discrete topics; again, that links back to the data science profile and seeing where your gaps are. And for each one, we provide material that we think is of high quality for learning about that subject. That may be an in-house training we've created, but it could well be a great resource that's available for free online.

The other thing I wanted to mention as well: obviously, a couple of these things are in-house, and you may be curious about some of the material we're producing at Roche. Well, I wanted to take a little opportunity to do a sneaky advert for something that's coming out very soon on Coursera. We've actually been building a Genentech-based course called Making Data Science Work for Clinical Reporting. This is going to be part of a wider specialization, and it's going to look at how we think data science should be done in the context of clinical reporting. This course is literally being beta tested this week, so I'm really hopeful that it's going to be out either this month or early in February. So look out for that if you're interested in the kind of material we're producing around learning how to use data science in the context of clinical reporting.

So when it comes to making training resources, one thing that I think is really key, and that we've learned by hard experience, is to make sure they're relevant. One experience we've had, for instance when we've paid for external trainers to come here, is that the training material might be really great, but it doesn't take advantage of the fact that we know what we do for a living, right? We know what our programmers are doing in their day-to-day work. So when we develop trainings, we can make them relevant to what people are actually doing. It definitely involves effort, but I think you get a lot of value by making problems real. One simple way of doing that is picking out realistic data, not using mtcars or iris, which is what you'll find in online resources. And that's what we've done in the trainings we've built in-house. There's actually an open-source package that we've created specifically for generating synthetic CDISC data, so you can make use of that right now; that's the link on the screen there.
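The package is not named in the transcript, only linked on the slide, but assuming it refers to Roche's random.cdisc.data generator, usage would look roughly like this.

```r
# Assumed to be random.cdisc.data, which generates synthetic CDISC-shaped data
library(random.cdisc.data)

adsl <- radsl(N = 100, seed = 42)  # synthetic ADSL for 100 subjects
adae <- radae(adsl, seed = 42)     # synthetic ADAE consistent with that ADSL
```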

So the last point on training is making sure it happens at the point of need. This relates to what I was saying earlier, that not everyone is moving to R all at once. So we're doing our best to focus our training on those who need it. We're training in cohorts rather than trying to train everyone at once, because if you take a training a year before you actually apply it, you've basically wasted your time: you'll have forgotten everything in that period. So our goal is training at the point of need.

So the last point I want to make, which is going to lead into Ning talking about the business transition, is that while I've talked a lot about training here, learning really isn't only about training. Realistically, how you learn is by applying all of that training in your actual day-to-day work. So we want to encourage that kind of learning in context.

And as well as that, we also want to build this culture where you are always looking to improve how you code. This speaks back to the idea of not just having people learn a language, but becoming fluent in it. And there are a few different things we're focusing on to support that. One is about building good frameworks for applying knowledge: in this environment we're building up, making sure we have good frameworks that encourage people to follow best practice just by using them.

The other is around learning via code review and seeing other people's code. Certainly, one of the ways I became a much better R programmer was by reading code written by someone who was a better programmer than me. And I had code that worked, but when I reviewed their code, I realized that I could be doing things much more efficiently and writing code that's much clearer to read and easier to use in the future. So we're really going to encourage this culture of code review, of looking at code and seeing how we can do better.

And one connected idea with that, which was actually inspired by the tidyverse developer days run by Posit, was having these internal hack events, which we called HackR, where we basically took all of those tools we're building that Thomas went through and highlighted issues that we thought people could take on. So we said to our wider data science audience: get involved, you can improve these tools. And this is another way of highlighting this idea of an open-source culture, the idea that these tools that we're using and developing to do our work, they're not made by wizards in the sky. They're something we can work on and improve, and we can collaborate together to produce better code for all of us.

So the last thing, which is connected to that, is learning as teams. I mentioned that we're not transitioning all at once. So when we're bringing teams as a whole into this new environment, we're going to make sure that they learn together, because I think another way of learning quickly is by learning with peers. So we're going to make sure that teams upskill together, and we're going to support people on their journey with additional mentoring, where we share knowledge from individual to individual as well. So I'm going to pass now to Ning, who's going to talk a bit about how that transition is going to work for those teams.

Business transition: the adoption roadmap

Thank you very much, Kieran. Yeah, so I'll take over and talk about how we are making our business transition. As Thomas already alluded to a little bit, Roche is a huge company. Within PD Data Sciences, we have over 1,000 employees, and we are working on over 200 active molecules. So as you can imagine, we have a really diverse group with different skill sets and different backgrounds, and we also have very diverse types of projects.

And in addition to that, as you may already have gathered from Thomas' presentation, the scale of this transition is very large, not only because we have a large team, but also because there are so many new tools available out there, including the R packages as well as some non-R tools, including Git, Docker, cloud computing, and Snakemake. This is the beauty of open source, but you can also imagine that it introduces additional complications for our business transition.

So to ensure a successful business transition, we realized that we really need to make sure that we can empower every single project team to make their own decisions and also to make their own fit-for-purpose transition plan. We're kind of thinking that for each project team, modernizing their data science tools is sort of like renovating their house. As we all know, when families decide to renovate their house, each family has its own needs. Some families may spend more time renovating the kitchen and bathroom, and maybe other families will spend more time on the living room. And people who are purchasing a new house will probably pick out new tools and new materials when remodeling.

Similarly, for analytical projects, we're thinking that during this transition journey, for some of the analytical project teams, maybe they