Ben Arancibia @ GSK | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Well, hi everybody, welcome to the Data Science Hangout. So nice to see everybody here this Thursday, feels like this week's flying by. If we haven't gotten a chance to meet before, I'm Rachel. I am clearly the host of our Data Science Hangout, but I also lead customer marketing at Posit.
The Hangout is our open space to chat about data science leadership, questions you're facing and getting to hear what's going on in the world of data across different industries. So we're here every Thursday at the same time, same place. So if you end up watching this on YouTube at a later date, and you want to join us live, there'll be a link in the details below where you can add it to your calendar too.
But at the Data Science Hangout, we are all dedicated to making this a welcoming environment for everyone. We love hearing from everybody, no matter your years of experience, your titles, industry, or even the languages that you work in. So it is totally okay to just listen in here if you want, maybe you're on your lunch break or taking your dog for a walk or something. But there's also three ways you can jump in and ask questions or provide your own perspective too.
So you can jump in by raising your hand on Zoom. And I'll keep an eye out for raised hands. You could put questions into the Zoom chat. And if it's something you wanted me to read out loud instead, maybe if you're in a coffee shop or something, just put a little star next to it in the Zoom chat. And then lastly, we have a Slido link where you can ask questions anonymously too.
But with that, I am so excited to introduce my co-host for today, Ben Arancibia, Director of Data Science at GSK. And Ben recently spoke at posit::conf on the need for speed and accelerating R and other open source tools for delivering clinical trial submissions. So I'm excited we may get to cover a few of the unanswered questions from Slido during this Hangout too.
So my name is Ben Arancibia, I'm a Director of Data Science at GSK. I sit within our biostatistics organization, specifically within a group called SDSIH, Statistical Data Sciences and Innovation Hub. My role is really focused on what I like to call R enablement. So I spend a lot of time working with different business groups to understand how they can use R for the creation of outputs to regulatory agencies, or talk to them to understand what their needs are in terms of tools, and then help to build tools for those specific business groups.
And then just kind of trying to understand like what are problems that people encounter. So I spend a lot of time actively listening and bringing that feedback to different groups to help kind of solve those problems.
In terms of what I like to do outside of work, I'm a big runner, specifically trail running. I do trail like ultra marathons through mountains, things like that. I have two young kids, so pretty much I'm either at home with them or outside running, you know, on some of the trails around my house.
Ben's background and journey into pharma
So the race I just signed up for is in March of next year. It is 55 kilometers, so around 34 miles, in the mountains near Asheville, North Carolina. So yeah, it's called the Mount Mitchell Heartbreaker. So I hope it doesn't break my heart, but we'll see how it goes.
Well, Ben, something I thought might be interesting to kind of kick off the conversation with as we're getting questions from everybody here is I know that you were a consultant in a previous role. It might be good to hear a little bit about your journey, but I'd also love to hear, like, what was it like transitioning from data science as a consultant to data science in the pharma industry too?
Yeah, so how I became a data scientist is pretty funny. I went to a university, I studied, like, quantitative economics and geospatial information systems and became a GIS developer right out of college working for the United States Postal Service on a federal contract. And I worked on, like, route optimization. And I was like, this is pretty cool. I really like doing this.
And then the term, like, big data came around and people became data scientists. And I thought to myself, oh, wow, that's pretty cool. Maybe I should just do that instead of being a GIS developer and really get into data. So that's what I ended up doing: deciding, all right, I'm going to focus more on data. I can always go back to GIS later, but really focus on trying to understand, like, how do I deal with really large data sets to solve different problems?
So I ended up becoming a consultant. I worked in commercial consulting specifically for cybersecurity clients, building out, like, different tools, platforms, implementing machine learning methodologies, things like that. And it was a really great way to get exposure to lots of different problems as well as being able to build up my skills for things, like, I might not have learned at school.
And eventually what I ended up doing is transitioning into the pharma industry through a great leadership program that GSK has called the ESPRI program, which basically looks to bring in outside talent with a specific knowledge base that GSK wants to expand upon. So my cohort of ESPRI candidates, we're all data scientists, and GSK Biostats wanted to bring data science expertise in-house, and they do it through this kind of three-year leadership program.
So what we do basically is we do rotational programs through different groups within the Biostats organization, bring our data science expertise, and then they teach us the pharmaceutical industry business essentially during those three years. And then at the end of the three years, you find a final position within Biostats.
But it's a great way to think about, oh, this person has some, like, interesting talent that we want, but they don't have, like, the business knowledge. How do we kind of marry the lack of business knowledge with the talent that we want to build? And I think that's one thing that I really love about GSK, and sorry to be like a GSK promoter right now, but being able to have that kind of space to be able to learn and develop is really crucial to being able to create, you know, those good outcomes for the business.
Lessons from consulting: storytelling and communication
What else I learned as a consultant? How to tell a good story. I think that's one thing that constantly when I'm working with data scientists who might be newer is, like, how do you tell your story about what you learned during your analysis or the outcomes or the outputs from something? I'm sure it's a trope by now, but, like, communication and being able to tell a really nice story about what it is that you did, what is it that, you know, would improve a decision maker's, you know, life is crucial.
Less is more. To go back to posit::conf, for everyone's awareness: if you do a talk at posit::conf, you'll go through speaker training. And in one of the first trainings, or maybe the first two, you focus on figuring out, what is your governing idea? Like, what is the one idea that you want everyone to walk out with after seeing your talk? And I think that should be applied to how you communicate with stakeholders.
Less is more. Like, figure out, like, what it is that you want someone to walk away with based out of your meeting. Because if you're trying to show, like, a presentation, that presentation is not the only thing that that stakeholder is going to do in that day. Like, they're going to be receiving, you know, hundreds of emails per day. They are going to have to be making decisions, things like that. So the question is, like, how do you make sure that they get that one piece of crucial information you want them to have and really focus on that? So less is more. What is your one governing idea?
R validation in pharma
So, GSK, we do R validation. A lot of how we do R validation can actually be found on the R Validation Hub, which I think is either a Posit working group or is loosely related to one. So how we do it is really kind of taken from there, from the R Validation Hub.
We, being GSK, were a strong participant in figuring out how to do it. And then we worked with our QA department to make sure that we're checking all the boxes that we need to hit in order to say that, yes, we believe that this is validated.
So, within a biostats organization, you basically have two things that you need to accomplish to move assets or drugs through the pipeline: one, can you trust the efficacy of a drug, and two, can you trust the safety of a drug or an asset? And how you do that is you write statistical code in order to be able to prove that, yes, we trust these statistical tests that we've written.
In order to do that, you need to do something that's called validation for your software, which is essentially to say: we have tested the software and we believe that all the outputs from this software are correct. How you do validation is defined by the organization, but it's really important for you to be able to trust your software.
So, there's a lot of testing that goes into place in order to get that level of confidence within an organization that says, yes, this software can be trusted. And there are different nuances for how you do the testing. So, for example, if you're using ggplot2 for creating graphical outputs, things like that, you can generally test it through code coverage as well as a couple of simple simulations, saying, okay, does it actually create the output that we would expect?
But then there's also validation from a statistical correctness point of view. So, for example, if we're writing or using a package for linear regression, how do you test for that statistical correctness? And that's a little bit more of a thorny problem to figure out, but it tends to require a lot of different simulations, things like that, in order to be able to say, yes, from a statistical correctness point of view, we trust this piece of software.
So, what we do is we host it on Workbench, and we create what's known as a frozen R environment, which is basically an environment with a set number of packages that we have taken through the validation process and said, yes, we trust these packages. Then from that we create, well, it's not truly a container, but basically an execution script that we can take and install on any system that we choose. So the goal is to make it system agnostic.
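A frozen, pinned environment like this is often captured in a lockfile. As an illustrative fragment only (this uses renv's lockfile format as one common way to pin an R package set; the package and version numbers are made up, not GSK's validated set):

```json
{
  "R": {
    "Version": "4.3.1",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cran.r-project.org" }
    ]
  },
  "Packages": {
    "ggplot2": {
      "Package": "ggplot2",
      "Version": "3.4.2",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```

Restoring from a file like this on a fresh machine reinstalls exactly the pinned versions, which is what makes the environment system agnostic.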
Yeah, I mean, there are like two levels to the validation, it always seems: validating the packages inside of R, and then validating the installation of R itself. And that latter one is the one that we've had the most challenges with. How do you prove that R is actually R? That is sort of the philosophy, the spirit of what FDA wants to see, and we find that surprisingly difficult.
If you see some of the talks that they've given, in terms of like the R adoption series and things like that, and also, frankly, if you just look at statistical master's programs, R is now kind of the tool to use. So I think from, call it a culture point of view, that's changing pretty rapidly just because of how the workforce is changing.
Ben's team and the four pillars of responsibility
So my role is really, I like to split it up into, call it four pillars of responsibility. I have a team that focuses on building sort of end-to-end applications for trying to gather insights about data related to either clinical trials or some of our business partners outside of biostatistics.
One I can give a shout out to is called the BPH tool, which is an externally facing tool on how to treat a certain urological condition (it's not a disease, but a set of symptoms). We built that using Shiny, and it's linked to scientific manuscripts. And so one of the things that we try to do now is, when we submit a manuscript, try to also submit an interactive tool or enhanced graphics with the manuscript in order to make it easier for people to understand the data.
So that's one team. I have a team that focuses on building open source packages. One of the big things we're looking at now is Bayesian statistics. And I have a third team that's focused on doing basically R training as well as R enablement. So we have a program called Accelerate R, which is an agile pod of data scientists and R programmers that go basically study to study, sit with a study team for nine to ten weeks, train them on the use of R, and then leave after those ten weeks, but really set those studies up to be able to use R effectively.
And then I have a fourth pillar of responsibility, which is focused on being engaged in data science working groups and different things like that, in order to be able to talk about R as well as to help build a kind of industry standard for how to use R for clinical trials. And we talk about things that, you know, might sound silly, like, how do you round in R versus SAS and Python? But being able to talk very confidently about those differences and why you might get differences in results is a pretty crucial aspect for our industry.
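That rounding difference is easy to demonstrate. A small sketch in Python: R's round() behaves like Python's here, rounding ties to the nearest even digit, while SAS's ROUND function rounds ties away from zero, which can be emulated with the decimal module.

```python
from decimal import Decimal, ROUND_HALF_UP

# R and Python round ties to the nearest even digit ("banker's rounding").
assert round(2.5) == 2
assert round(3.5) == 4

# SAS rounds ties away from zero; decimal's ROUND_HALF_UP mimics that.
sas_style = Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP)
assert sas_style == Decimal("3")
```

So the same dataset can yield outputs that differ in the last displayed digit depending on the language, which is exactly why being able to explain the discrepancy matters in a regulatory setting.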
Scaling R training across 900 people
Yeah, so within Biostats, GSK Biostats, we're about 900 people. So what we started to do initially was just having open in-person as well as virtual trainings with R. I think we got to the point where 80% of Biostats took an introduction to tidyverse R training, which is, you know, 80% of 900, so right around 720 people.
And what we were finding was people just weren't taking up the training and using it on clinical trials. And the big reason, we found out, was because you would take the training and then you wouldn't actually use R until like 12 or 18 months later. So what we ended up doing was deciding, all right, let's just make R training on demand. Like, I'm never going to write a better intro to tidyverse training than, you know, either Hadley's written or someone else has written online. Let's just make it all on demand, when you want to do it, and focus purely on support and mentoring.
So what we do is we have this like small group of people and they go, you know, basically study to study whoever wants to use R at that period of time and we work with them. And the goal is basically train that study team up and create them into like champions and then also give them the confidence to be able to go to other studies and help provide guidance as well.
So we do it through that kind of support and mentoring, but we also have like big community engagement through different like open forums where people can ask questions, things like that. So it's really focused on like spending less time on creating materials for training and documentation and more time on figuring out like, all right, how do we actually do support and mentoring as well as get feedback to build tools that actually will help our users for actually using R on clinical trials.
Overcoming cultural barriers to open source
So I actually don't believe in having people redefine themselves. I actually think about how do you help someone like learn a tool or add a tool to their toolkit in order to like enhance the way that they're working? Because at the end of the day, you are not going to turn like a statistician into a programmer or a computer scientist or things like that unless they really want to become one.
So I think the goal is, like, you don't actually try to redefine them. I think you try to help them add tools to their toolkit in order to be able to enhance their work. So, teaching a statistician Git, for example: you have to explain why it's important for them to know, you know, kind of the hit-by-the-bus rule, if anyone's familiar with that.
So the big thing that actually was the major way that we broke a lot of ground in terms of open source was visualizations. Because one of the big things that R does so well is ggplot and being able to create awesome visualizations with data. And that was something that SAS really, really struggled with. And so the wedge that we figured out, in order to be able to start having the open source conversation, was: look at all these cool visualizations, and look how easy it is to make them compared to other programming languages.
And then a lot of things happened. But another big thing that has really broken through cultural barriers has been a really, really strong industry collaboration called the pharmaverse, where different companies (Roche, GSK, Novartis, J&J, all these different companies) are creating open source packages for the clinical trial development pipeline. So if you as an industry can agree, like, we all trust these open source tools, it really kind of helps to break down those cultural barriers about, why should I use this open source tool?
And it's really easy to say, well, we want to use it because, you know, Roche is using it, or we've developed it with Roche and with, you know, insert other pharma company. And you can say, we helped create it, we should trust it. And so I think that has been one of the things that has helped with those cultural barriers.
I think you can break through, but at the end of the day, like I tried to talk a little bit about earlier, like if people want to use their proprietary tool, fine, if it helps them do their work. Like at the end of the day, like all I care about is like, can you do your work and can you do your work in a timely manner and not have like bottlenecks with platforms? And if the answer is yes, then like, who cares?
The Novo Nordisk submission and the future of open source
Yeah, there are two things that I really took away from that talk. One, it was a really similar story. We all are solving the same problem, albeit in our different ways, based off like our organizational culture or just organizational preferences.
The big thing I took away from it, actually, is that if we are going down the open source route, in terms of R and different R packages, things like that, I think it's actually going to potentially create a stronger culture of open source tools. And by that, I mean, I don't know if specific companies will be able to have internal tools that only they can use, because of the issues that Novo had with using their own internal tools and trying to do a submission: like, how do I get them to be able to install this package, which might not be available on CRAN?
So one of the things that it's made me think a lot about is, how much do we actually want to create internally, because of the hiccups that could cause in the actual submission process? And will that potentially result in a stronger open source ecosystem, just because it makes it easier for regulators, whether it be FDA, EMA, Japan, whatever it might be, to be able to use these tools?
And the big thing that I have been thinking a lot about is, the reason why we, we being the pharma industry, have never encountered this before is because, say, SAS is your tool of choice. Theoretically, SAS is the same from one organization to the other. And I keep harping on this, so I'm sorry, but reproducibility is key. So if your proprietary tool exists in two organizations in the same version X.Y.Z, then you kind of have to make sure that that's also guaranteed if you're using open source tools. And from a philosophical point of view, that probably means we have to open source more, to make sure that everyone has access and we don't have these very secret internal tools that cause issues with submissions.
Career advice and closing thoughts
I think the thing that always stuck with me is it's OK to say no. Like, you should be ruthlessly prioritizing your time and trying to work on things you're passionate about. So it's OK to say no to certain things if you are the wrong fit. Also, if you think about your brand management, it's sometimes nice to be wanted, so saying no sometimes is good for that brand management.
The other piece of advice that was given to me, and I try to live by this value, is: be radically transparent. Meaning, no matter what, there's no such thing as, you know, holding your cards close to your chest. Basically say, here's what I got, what do you have? And really focus on being able to have those open and honest conversations, which I think leads to greater vulnerability, which is how you make connections with people and really are able to be empathetic with people and understand their problems.
Yeah, I really like it because I'm able to understand more about what people are facing. And one learning I've had from it is, no one's trying to do a bad job. It's just there are so many things going on in life that people require different levels of support. They all want to try. They all want to do the new cool thing. It's just everyone's at different stages.
So I think it's about being able to connect with different teams and understand where they are in their development journey in order to be able to do these things. It's really important to be empathetic, to learn, and to be able to take those learnings and bring them to other teams. And I think that's one thing that I find really cool about my job: being able to connect with these different teams, bring learnings from different places to other teams, make those connections, as well as bring the problems that people are facing to my tools team, to be able to say, here are things that I'm hearing.
And I think a great example of this was another talk at posit::conf from GSK. We created this open source tool called Slushy, which is built on top of renv to do environment management. And I don't know if we would have ever gotten to that point without being able to go out to these different teams and actually listen to what problems they're encountering when they want to use R for the creation of their outputs.
If you really want to enable learning, you actually need to put the support behind it. And I think the big thing that I wish I was more focused on in that talk was: you can do as much training as you want, but if you're not actually using it in the creation of an output, you don't really learn how to do things, and there needs to be a really robust support team behind it to help answer those questions. I wish I had hit that more: really focus on that support and mentoring. It's really hard to scale, I realize, because it's, you know, actual hours and not a document that someone can go read. But that support and mentorship is crucial for being able to change and enable an organization to use a tool.
Yeah, so that's what that Accelerate R team is, and we have a giant community forum where people are able to post questions as well. I think we call it the R for Clinical forum, and people will just post questions or post announcements about new packages and things like that. So we have the direct support, but we also have, like, the robust community, with a Teams channel where there's active communication that occurs to answer questions or celebrate wins, things like that.
