Data Science Hangout | Mike Smith, Pfizer | Building an R Center of Excellence

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the Data Science Hangout. I'm Rachel, your host. If you're joining for the first time, thank you for joining us. Special welcome to those who have been joining us week after week. Great to have you here. If this is your first Data Science Hangout, this is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, what's going on in the world of data science.

I'll just add this too, if I transport into the future and you're watching this on YouTube for the first time, you can also join us live and we would love to have you here. So there should be a link that I'll make sure we'll put in the YouTube descriptions too. Just like to reiterate, we love to hear from everyone no matter your level of experience or area of work as well.

But with all that, I'm so happy to be joined by my co-host for today, Mike Smith. Mike is a Senior Director of Statistics at Pfizer. Mike, I'd love to have you introduce yourself and maybe share a bit about your work.

Sure. So I've worked at Pfizer for 29 years this year. I started in the statistics group. I then went to the modelling and simulation group in clinical pharmacology, where I was essentially leading or kind of helping mostly with the simulation side of things. So taking models that other colleagues have built and then doing predictions or simulations for clinical trial design to try and assess operating characteristics and, you know, come up with the most efficient design possible.

So I've then moved back to statistics now, where I am, I call myself a professional geek. Okay, so I have looked after our installation, our package sets, getting that into the hands of my colleagues, kind of looking at what's coming out in the R ecosystem, finding out what packages, you know, essentially being at the bleeding edge of what's coming to help train them, but also, you know, prevent that wheel spin where instead of every colleague having to do, you know, look at different tables packages and going, oh, what's the pros and cons of all these different tables packages? If I do that, then I can provide a kind of best guess for them to come up with that.

Last year, before R/Pharma, I did a little poll and found out that over 1500 colleagues had downloaded R. So if you can imagine, I'm an R subject matter expert for somewhere in the region of, you know, between 600 people who are on my MS Teams channel, and, you know, the 1500 who've downloaded it. So I'm trying to, you know, service that community, build a community, find out what other people are doing and share that.

Lastly, I should say that we're a very decentralized model within Pfizer where for data science and using R, there are pockets of people throughout all of Pfizer who do some really awesome stuff with R. And, you know, so I'm in this new center of excellence team that we're just setting up. I'm trying to kind of build connections between all of those SMEs and also help the teams where there isn't an SME available. So cat herding is definitely on my job description, professional geek.

In fact, we talked about it at last week's meeting, which is about being that translator role between the users, the business people who are out there doing stuff and need solutions with R and the technical side of folks who are building things. I definitely sit in that space. So I, you know, I'm here at data science hangout, but I've never fitted a machine learning model in my life. So if you've got questions about that, you'll need to turn to one of my esteemed colleagues. But for the translation bit and the general kind of cat herding, that's why I'm here.

I love it, Mike. That's so impressive that the survey had that many respondents too. I'm curious how you even launched that out to everybody.

I didn't. I have a means of querying the Microsoft database that sits behind the process that people grab R. And so by querying that directly, I can pull out how many people have got R.

Having a champion there is really what you need because if you're fighting to do this from the bottom, you're going to have a long slog ahead.

Thanks, Mike. So one more question about that then. So you said finding a champion, but is there another tip that you'd give all of us listening in if we wanted to do this?

Well, I was really lucky because I found someone who'd done it before. So Doug had the expertise of who do we need to tell? What do we need to tell them? What is our raison d'etre? What's our purpose for being? And he had all that kind of ready to go. I suspect this is probably something that we could add to your community's website, right?

Yes. I was thinking that would be helpful. What's the rationale for it? Who do you have to speak to? Who do you have to convince? How do you convince them?

Changes over 29 years at Pfizer

I know there was a great question that came in anonymously that I missed at the very beginning, but it was in those 29 years of being at Pfizer, what have been the biggest changes you've seen in the space at that time?

That's a really good question. I've been here long enough that when I started, the attitude to Bayesian analysis was, just don't talk to me about that, Mike. Nobody uses Bayesian methods around here. And that's flipped now so that people are using Bayesian analysis. The primary analysis of the vaccine trial was a Bayesian analysis. So that's great.

And this is kind of related to something that Libby brought up earlier, which is about decision making. We have moved much, much more to a situation where decisions are being made off the back of data. Rather than off the back of project teams kind of saying, well, I saw the last time we had a compound like this or last time we were in this situation, this is what we did. So that's what we should do this time. I mean, it hasn't been like that for a long time, but still, I think it's a big deal that the decision making is now much more, you know, show me a prediction, show me a simulation, you know, help me make this decision using data that we've got and off the back of a model or a prediction or simulation.

Working with IT and open source resistance

Thank you. Andy, I see you asked the question in the chat that it looks like a lot of people have been weighing in on as well. If you'd want to jump in and ask that live.

Hi. Sure. Thanks for taking the question. So any tips on how to work with the IT folks when we've got folks in the team who are actively opposed to the open source tools? And I'll caveat on that. I work in government. So folks in public health are pretty protective of what kinds of tools are installed in even their cloud environments. And so we're having to do a lot of re-informing and unlearning because the R they think of when we talk about it is probably circa 2007 R. And, you know, there's really quite a silo between folks that are gearing themselves towards being data engineers in IT with IT backgrounds where they're basically, you know, managing a data center and the rest of us just using the tools they want to deploy, which has been fairly limited to the Microsoft suite. And, you know, while there's sort of a community welling amongst the analysts across various programs using RStudio and wanting to use enterprise-level tools so that we can do code sharing in a more managed environment than the kind of Wild West we're working with right now. So any tips on that?

Okay. So, yeah, I think actively opposed is a hard one to fight, right? Because you'll then find people will instill fear, uncertainty and doubt. And that's hard to battle against. In terms of my work at Pfizer, for over 20 years now, the FDA has said that they don't endorse any specific tool for analysis. So there's a perception within the company that, oh, if you're submitting to the regulators, to the FDA, it has to be in SAS. They make the request that the transfer files that you pass to them of the data are in SAS format. But they've said for many, many years that they don't care what software you use and they can't endorse any specific software because they're a government agency and they wouldn't stand behind a particular company and their software.

So we've got that going for us. But the other part of it that I think is easier to manage is to take away the kind of wild west side of things. So when we deploy R at Pfizer, we talked earlier about R and the high performance compute grid. What we've got now is a process that says we'll take a certain version of R, a certain set of packages and certain versions of those packages. We build it, we test it, we document it and then it goes under change control so that there are no sudden changes to that instance of R. And it's the same, it's getting easier now because we've got containers for that. So we can test and validate and qualify and document the container and then deploy it in various places.

What that then means is that no one can get in and install packages in that environment unless it goes back through that whole process. So that kind of then takes that wild west away in that framework. And I'm also militant about telling my colleagues not to run anything in production on their desktop for the reasons that we discussed earlier that I can't tell what they've got and I can't tell what state it's in.

So with those two messages, I'm kind of saying, well, look, anything that's on production needs to come from here. You can do what you like on your desktop, but if you then try and run it over here, you can't come and tell me, oh, I need to get this package in there and I need it by tomorrow morning. Because the answer is, well, it needs to go through that whole cycle in order to have the confidence that this snapshot version here is still valid.
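
The rule described above, that a production run may only use packages from the frozen, qualified set, can be sketched as a simple check. This is a minimal illustration in Python, not the actual Pfizer process: the `APPROVED` table, the package versions and the `check_lockfile` helper are all hypothetical, and the lockfile layout assumes the JSON format written by renv (a `Packages` object keyed by package name, each entry carrying a `Version` field).

```python
import json

# Hypothetical frozen package set for one qualified R release.
APPROVED = {"dplyr": "1.0.7", "ggplot2": "3.3.5"}


def check_lockfile(lockfile_text, approved):
    """Return a list of problems found in a renv-style lockfile.

    A problem is a package that is not in the approved set, or one
    whose version differs from the approved version.
    """
    lock = json.loads(lockfile_text)
    problems = []
    for name, entry in sorted(lock.get("Packages", {}).items()):
        version = entry.get("Version")
        if name not in approved:
            problems.append(f"{name} {version}: not in the approved set")
        elif version != approved[name]:
            problems.append(
                f"{name} {version}: approved version is {approved[name]}"
            )
    return problems


# Illustrative lockfile content for a made-up project.
EXAMPLE_LOCKFILE = json.dumps({
    "R": {"Version": "4.1.2"},
    "Packages": {
        "dplyr": {"Package": "dplyr", "Version": "1.0.7"},
        "ggplot2": {"Package": "ggplot2", "Version": "3.4.0"},
        "gt": {"Package": "gt", "Version": "0.3.1"},
    },
})

for problem in check_lockfile(EXAMPLE_LOCKFILE, APPROVED):
    print(problem)
```

An empty result would mean the project sits entirely inside the qualified set. Anything reported would have to go back through the build, test and document cycle before a production run.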

My personal view is that if you lock down something too much, people will find workarounds. So if you get told you cannot use open source, you must use this set of products, if that person wants to, they'll find a way to get around that and to sort it out. So if this way of doing things, the official way, the good way of doing things is easy, then hopefully people will do that rather than this nasty workaround over here.

I like that. Make the good way of doing things easy.

Yeah. So I kind of have a little bit of experience with trying to bring leadership in line to use open source tools, especially from a financial institution that sees everything as risk. And actually giving examples like Pfizer, like Accenture, like NASA, that they use R and RStudio consistently, it opens their mind to be like, oh, maybe it's not so bad. Or, for example, at the beginning, we had this thing with the Microsoft suite too, but a 2017 version of Visual Studio Code has an R tools integration. So until you can prove your point of why it's good and how it's validated, I think there are ways around it. Sometimes it just takes dedication and finding the right combination, I would say.

I had a quick comment. This is Santiago. On the subject of how to get folks to adopt open source, I once met a small team at a bank and they had this struggle, too. And they worked with the regulators because what they were working on was regulated. They used R open source libraries and they worked with the regulators to get everything audited, approved. It took a little bit longer and it was a new process. But by the end of it, they had an open source process that they adopted and were able to replace their old SAS systems with. So it's doable.

Reproducibility and code standards

So the question is, how reproducible is your analysis? What's the probability of running an analysis from one to five, ten years ago and getting the same answer?

In terms of just the analysis, I would say that our probability is high, assuming that that analysis has been run on our high performance compute grid as a batch job. So if I take a job that was run a year ago and I rerun it on the grid, then because that tool is still there, I should be able to completely reproduce that. Looking forward, ten years from now, I'm hoping that we'll still have 90%, 95% probability because we have containers and things are locked down. The place where it gets more challenging is going back ten years from today, because the compute grid is a shared resource across many, many different lines in Pfizer, and the environment that the tools are sitting on may have changed slightly. So compilers may have changed slightly, things like that. I would say that the probability is still high, but it's not as high as it will be ten years from today.

I apologise in advance if I derail the conversation, but I'm just curious as to whether Mike thinks that his role as either a data scientist or statistician has influenced his music.

Yes, it does actually. Thanks for asking that, Stephen. I'll put the link to my Bandcamp later on so everyone can rush out and buy it. Yes, I incorporate random stuff, generative stuff into my music. So yeah, randomness and probability and all of that features highly.

Thanks, Stephen. I see, Ethan, you had asked a question earlier as well, Brown. Friction with introducing new tools. Do you want to jump in?

Yeah, so this is, thank you, Rachel. This is going back to the conversations around building a centralised tool for everyone to use. Did you have any sort of frictions trying to get people to use it? Because if people have already built their own tool, or they're used to doing things their own way, it might be a bit difficult to convince them to start using your tool.

Yeah, so if this is the kind of central container of R and packages, I always have this friction where someone will have developed something with a package that isn't on my set, the official set. And so they'll come and they'll say, hey, I really need this. Okay. If that happens, and it's like a version of a package or a package that it makes sense that we should have included this and we just somehow managed to not include it. It's possible to kind of layer in that package. We would need to look at the dependencies of that package. So there's no point in layering in a new package that breaks all the old stuff because it needs a more up-to-date dplyr. In that case, I would just say to them, well, now is the time to go back to renv, do this and make it reproducible for you in this project as a special case.

As I said, the layering in of new packages involves retesting, re-qualifying, re-documenting, signing off, all of that kind of process. So it's not trivial. And I tend to think that if we get to that point where someone needs it for tomorrow morning, it's because they weren't paying attention earlier. When I said, you must use this version over here to do a production run. So it's possible, it's resource intensive and expensive. But now we can kind of at least say, well, with renv, create your own little project, get your own set of those packages you need, because then it's reproducible for you using renv and we'll deal with making that, incorporating that package into the next release.
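
The dependency check mentioned above (a new package is no use if it needs a more up-to-date dplyr than the frozen set provides) can be sketched as follows. This is an illustration in Python under stated assumptions: the pinned versions, the requirements of the candidate package and the helper names are hypothetical, and real R version strings can be more varied than the plain dotted numbers handled here.

```python
def parse_version(text):
    """Turn a dotted version such as '1.0.10' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(requirements, pinned):
    """List dependencies whose pinned version is missing or too old.

    requirements: {package: minimum version needed by the new package}
    pinned:       {package: version frozen in the qualified release}
    """
    unmet = []
    for name, minimum in sorted(requirements.items()):
        have = pinned.get(name)
        if have is None or parse_version(have) < parse_version(minimum):
            unmet.append((name, minimum, have))
    return unmet


# Hypothetical frozen release and a hypothetical candidate package.
PINNED = {"dplyr": "1.0.7", "rlang": "1.0.2"}
NEW_PACKAGE_NEEDS = {"dplyr": "1.0.10", "rlang": "0.4.0"}

for name, minimum, have in unmet_requirements(NEW_PACKAGE_NEEDS, PINNED):
    print(f"{name}: needs >= {minimum}, release has {have}")
```

Versions are compared as tuples of integers so that 1.0.10 counts as newer than 1.0.7, which a plain string comparison would get wrong. An empty result would suggest the package can be layered in without touching existing versions; anything else points to the renv route for that one project.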

The thing that I sometimes find though, and this is why I try and get people off their desktops, is when person X develops some code, passes it to person Y and person Y goes, oh, it doesn't run. Mike, can you help? Even with renv, it's a pain to have to take all of that over to here and reconstitute it and rerun it and see what the problem is. I would much rather that people kind of work off this standard, because then I know if I use the Docker and you use the Docker, we can share code, we can share projects so easily.

Do you see moving towards RStudio Cloud, because that way you can design the same versions?

That's kind of what we have with the RStudio Server Pro. So when you fire up RStudio Server Pro, it will say, what version of R do you want to run? And basically that's pointing to the container and saying, use this container with this R and these packages.

Code style and documentation culture

Hey, thanks, Rachel. And Mike, this has been an awesome talk. I've loved hearing sort of your experiences and what you're doing over there at Pfizer. But one area where I have seen a lot of R programmers specifically, I mean, that's what I'm working in day in, day out. So this might expand to people programming in other languages also. But I've noticed a really big problem with larger corporations where you have small armies of R programmers. There's not always a consistent use of syntax or framing of the code or your apps or whatever it might be, making it borderline impossible to knowledge transfer and have someone come in and expect them to take over a complex app build that was done. And so I didn't know what might be the solution to this. I hate telling people we need to all code using, let's say, the tidyverse style guide or whatever it might be, because I don't want to know. I don't want to be able to differentiate between individuals when I'm reviewing code. My code should be indistinguishable from your code. But just in general, what are your thoughts on that?

You're right. The heterogeneity between people is massive. We have statisticians who will turn around and say, I'm not a programmer. And it's almost like, so you don't need to beat me with a stick about style guides because I'm not a programmer. There are clinical pharmacologists and pharmacometricians who will also tell me I'm not a programmer. But they are programming to get the work done for today and get it out the door and answer the problems and move forward. All I can do for them is to encourage them and say, look, if you use a style guide, then when this person comes to review your code, it will be so much easier for them to see what you're doing. And again, commenting and all the rest of that.

And in a sense, I encourage people to kind of use markdown for those statisticians and pharmacometricians who are doing work, because then you've got why am I doing this as well as the code that says, and here's how I'm doing it. When it comes to the programming, programming people, so the statistical programmers who are coding up for building visualizations and tables and reports, it matters more that those people follow the style guide. I mean, it's literally their job to write code that is maintainable and will last the distance and is easily reviewed. So in that instance, it's kind of, well, people ought to be following a style guide, especially if that code is going to be reused, because then it needs to be commented, tested, qualified, all the rest of it. For everyone else, I can suggest, but I can't really mandate that people follow a particular style.

I'll comment on that. And a strategy that I end up kind of using is, yeah, obviously it's great for the team as a whole, but it's also great for that person, for their future self. And everyone can think of a time where they're like, they look back at work that they've done. They're like, what was I thinking here? What did I do? And I think when you kind of pose it that way, you just can encourage them. But I think when they are a better version of themselves, three months from now, six months from now, that's also a helpful kind of way to kind of push it.

A thousand percent set. I mean, I've said that, you know, if I can, the polite way of saying it, and I'm kind of rephrasing what I've heard JD Long tweet in the last 48 hours is, right. When I open up code, even a month later, I often think, what the heck was I thinking of? You know, if I open it up a year later, it's like, I don't even recognize who wrote this mess. And that's where things like Markdown help, because you're leaving that breadcrumb trail that says, this is what I was thinking of, this is where I'm getting data from, why I'm filtering out this subgroup of people.

My quip that I made at RStudioConf a few years back was, you know, if you're writing more comments than code, then write them in RMarkdown. If you're writing more code than comments, write more comments and do it in RMarkdown. And I think that's leading by example too, where if you're building that community, when other people see your code, then they kind of are like, wow, it's so much more readable, so much more easy, and you reduce that mental overhead, I think.

I'm at an age where I wander into a room and forget why I came in, you know, so that's usually only a minute.

Yeah, I was going to say, what we have is at the enterprise level, we have a commitment to clean code. And so, you know, they host workshops or they do lunch and learns, and they really try to get, you know, people to code better. It comes from higher up. So you might be able to find a champion or find someone to advocate for you on better code development. And worst case scenario is you just, you know, have a GitHub action or something that lints their code or styles it, you know, and at least there you have something that's a little bit better to read.

Yep. And there's, you know, Ctrl+I, is it, within RStudio will indent your code. Even that's a kind thing.

Yeah, can you hear me? I can. I'm very curious if you're in an airport and heading somewhere. And I'm a time zone behind, so I've missed most of the chat. I was going to ask, you may have already covered this, but for code standardization, do you have any good resources you can recommend for further reading?

No, is the very short answer. I would have to pass that back to the community here.

So for writing standardized code. If anyone has tips, feel free to put them into the chat or just come on, come on live too.

Sharing snippets is a good start. So RStudio has features where you can write snippets of code, so you can type a few words, press tab or shift tab, it will auto complete. You can share that snippet across the team.