Data Science Hangout | Mike Smith, Pfizer | Building an R Center of Excellence

video
Apr 7, 2022
1:18:50

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the Data Science Hangout. I'm Rachel, your host. If you're joining for the first time, thank you for joining us. Special welcome to those who have been joining us week after week. Great to have you here. If this is your first Data Science Hangout, this is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, what's going on in the world of data science.

I'll just add this too, if I transport into the future and you're watching this on YouTube for the first time, you can also join us live and we would love to have you here. So there should be a link that I'll make sure we'll put in the YouTube descriptions too. Just like to reiterate, we love to hear from everyone no matter your level of experience or area of work as well.

But with all that, I'm so happy to be joined by my co-host for today, Mike Smith. Mike is a Senior Director of Statistics at Pfizer. Mike, I'd love to have you introduce yourself and maybe share a bit about your work.

Sure. So I've worked at Pfizer for 29 years this year. I started in the statistics group. I then went to the modelling and simulation group in clinical pharmacology, where I was essentially leading or kind of helping mostly with the simulation side of things. So taking models that other colleagues have built and then doing predictions or simulations for clinical trial design to try and assess operating characteristics and, you know, come up with the most efficient design possible.

So I've then moved back to statistics now, where I am, I call myself a professional geek. Okay, so I look after our R installation, our package sets, getting that into the hands of my colleagues, kind of looking at what's coming out in the R ecosystem, finding out what packages, you know, essentially being at the bleeding edge of what's coming to help train them, but also, you know, prevent that wheel spin where instead of every colleague having to do, you know, look at different tables packages and going, oh, what's the pros and cons of all these different tables packages? If I do that, then I can provide a kind of best guess for them to come up with that.

Last year, before R/Pharma, I did a little poll and found out that over 1500 colleagues had downloaded R. So if you can imagine, I'm an R subject matter expert for somewhere in the region of, you know, between 600 people who are on my MS Teams channel, and, you know, the 1500 who've downloaded it. So I'm trying to, you know, service that community, build a community, find out what other people are doing and share that.

Lastly, I should say that we're a very decentralized model within Pfizer where for data science and using R, there are pockets of people throughout all of Pfizer who do some really awesome stuff with R. And, you know, so I'm in this new center of excellence team that we're just setting up. I'm trying to kind of build connections between all of those SMEs and also help the teams where there isn't an SME available. So cat herding is definitely on my job description, professional geek.

In fact, we talked about it at last week's meeting, which is about being that translator role between the users, the business people who are out there doing stuff and need solutions with R and the technical side of folks who are building things. I definitely sit in that space. So I, you know, I'm here at data science hangout, but I've never fitted a machine learning model in my life. So if you've got questions about that, you'll need to turn to one of my esteemed colleagues. But for the translation bit and the general kind of cat herding, that's why I'm here.

I love it, Mike. That's so impressive that the survey had that many respondents too. I'm curious how you even launched that out to everybody.

I didn't. I have a means of querying the Microsoft database that sits behind the process by which people grab R. And so by querying that directly, I can pull out how many people have got R.

Infrastructure and reproducibility

I see. That's awesome. So I see I had asked a question in the chat if you want to jump in.

Yeah, sure. Nice to meet you, Mike. That's awesome to hear. Congrats on 29 years. That's very impressive and inspiring. So I guess the question would be, that's a lot of R users. And is there some central infrastructure that you guys leverage to make sure that your environments are kind of consistent between all users, or do all the teams kind of have their own little infrastructure?

We've had R installed on our high performance compute grid now for many, many years. So like 15 years or so. So for production use, that's where colleagues should be running things. And within the last three years, really, we've got RStudio products. So we have RStudio Server Pro.

I'm kind of saying to colleagues, look, if you load up R on your desktop and you download packages from wherever, and I don't know what you've got, and I don't know what versions you've got, that's effectively your sandbox. The other bit that goes with this is that anything that we submit to regulatory agencies, we need to know what version you run, what packages you used, and it needs to be reproducible in the long term. So the regulators can and sometimes do come back to us 10 years later and say, can you rerun that analysis, but look at this subset or add in this data or, you know, questions like that. So we need to be able to run stuff from 10, 15 years ago.

So that's why I'm kind of saying to them, look, you know, leave your desktop for playing. Because when you go to a conference tomorrow, and you have to load up a set of packages for this workshop you're on, you'll forget to swap back to the standard set.

Right. Wow, 10 years. A long time.

Yeah, I mean, the other funny side of that is that on the HPC grid, we have R version 1.9.1. Right, but I don't know how to program it anymore. I'm so used to tidyverse and new ways of doing things that I'm gonna be like, ah, it's gonna be a problem.
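The reproducibility requirement described above (knowing which R version and which package versions produced an analysis) can be illustrated with a small sketch. This is not Pfizer's process: `record_environment` is a hypothetical helper, the package names passed to it are placeholders, and only base R functions (`R.version.string`, `utils::packageVersion`) are used.

```r
# Minimal sketch: capture the R version and the versions of the packages an
# analysis depends on, so the environment can be described years later.
# `record_environment` is a hypothetical helper, not an existing function.
record_environment <- function(packages) {
  versions <- vapply(
    packages,
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  list(
    r_version = R.version.string,
    recorded_on = format(Sys.Date()),
    packages = data.frame(
      package = packages,
      version = unname(versions),
      stringsAsFactors = FALSE
    )
  )
}

# Placeholder package list; a real analysis would list what it actually loads.
env <- record_environment(c("stats", "utils"))
print(env$r_version)
print(env$packages)
```

Saving a record like this next to the analysis output is one simple way to answer "what version did you run, what packages did you use" when the question comes back later.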

Data science vs. statistics roles

Thanks, Mike. I see there's a few anonymous questions coming in right now as well. And one is, what are the differences between being a director of statistics versus, let's say, a director of data science?

I'm a senior director within the statistics department. So if you're a senior director within the data science group, you might be doing, as I said, data science is used broadly in many, many different places at Pfizer for all kinds of different needs. You know, like in drug development terms, they're building machine learning models to look at aspects like, when you actually compress the tablet, you know, does that hold together? Machine learning is used there, machine learning is used in real world data. So outside of the clinical trial process.

So really, the kind of distinction between, you know, me sitting here doing my stuff being a professional geek in stats versus sitting anywhere else within Pfizer doing data science, you know, it's, we all kind of do the things that we need to do. There are many people whose role will be called data scientist. But we don't necessarily have a data science department. If that helps answer that question.

Yeah, and I think that kind of touches on your question in the chat as well, Frank, if you want to jump in.

So for sure, it touches on it. Maybe a slightly different way to think about it. But I don't think there's going to be a really concrete answer here. The difference between data analytics and data science, or like what's called statistics and data science. People have said to me like, who cares, right? Like they're both, they're using these tools and these methodologies, and we're all solving problems, making decisions using data. It matters, though, right? Like, especially at large corporations like Pfizer or like Target, you use these terms and people automatically, right, especially non-data people, like have these automatic perceptions of what they are. And they're like, okay, this many dollars over here, this many dollars over here, like more headcount for you, less for you.

So, and like, the problem is, though, everyone had, we all know, like everyone on this call, I think knows, it's so messy, right? Like the delineation is so messy. And I was curious, I don't know, if you have perspective on that, cool. If not, I can understand that, too.

Yeah, I mean, the place it seems to matter, Frank, is in job descriptions when you're posting for new positions. I think, kind of my experience is that there are people throughout the organization who are making decisions with data or helping others make decisions with data. But, you know, those could be called statisticians, those could be called data scientists, you know, clinical pharmacologists and pharmacometricians that I talked about previously are building models, right? It's not necessarily, you know, deep learning or XGBoost, but it's nonlinear mixed effects modeling and doing that kind of stuff. So, I think once you're through the door, that distinction really melts away.

Right, as you said that, I thought, is it worth thinking about it, right? If I'm interviewing for a data X role, is it worth asking, how closely do I work with the users, the operators, the decision, whatever, like call it whatever you want, the decision makers, and maybe that gives the person interviewing better context for maybe, like realistically what they'll be doing, right? The data scientist that never gets to work on ML is going to be confused, right?

Yeah, yeah. No, that's right. So, I guess I'm just conscious of the fact that if you go and type in data scientist into the Pfizer careers website, you're going to get a whole heap of jobs. But you're also going to look at some of them and go, but that's not data scientist as I understand it. Right? Yeah. And, you know, you're right, when you're doing your job, you don't want to get into a job and realize that no one has ever used machine learning, never used machine learning models in this space at all before. And that's, you know, well, why was I hired here? But at the same time, you know, people are in many pockets using data to make decisions. And I think, you know, that's a, that's a common skill set, regardless of whether you've ever fitted a machine learning model.

Hiring for the R Centre of Excellence

There are a lot of great questions coming in. I see, Ethan, you asked a great question around hiring, if you want to jump in and ask that.

Yeah, hello, Mike. Nice to meet you. I just want to ask a question around hiring, because I remember one of your posts was about hiring practice. And I think you recently implemented some of those hiring practices at Pfizer, mainly around problem solving. So I'm just wondering what sort of skill set or expertise matters more. At Pfizer, do you look for med students who have a bit of knowledge, and you train them up in R? Or do you look for people who are proficient in R with a bit of interest in clinical work and medicine, and you build that expertise afterwards?

That's a really good question. I think it depends on the role. Right. So as I mentioned, the clinical pharmacology, pharmacometrics and stats groups, they're likely to come in with either clinical pharmacology or pharmacy degrees and learn R as they're doing it. Same with statisticians. But right at the minute, I'm looking for technical R specialists who know how to turn an expression of a problem into an R thing, whether it's a function or a package or a Shiny app or a Markdown report. So it really is context specific. We're such a massive organisation that, you know, if you come in with a physics degree, but you have a long history of developing really cool R packages on GitHub, you know, then I want to talk to you. But for stats roles and typically within the stats department, within the ClinPharm department, they're looking for people with that specific background or degree. And then they'll learn R and various other things.

I think seeing that R Centre of Excellence in the job positions helps a bit too for getting the word out there. I see there was an anonymous question too that was what are the standard gaps in background knowledge in a new data science hire?

Right. Well, I mean, obviously, we're all looking for the unicorns, right, who have all of the skill set. And it is a tricky problem when you're hiring and you're trying to weigh up the strengths and weaknesses of each of the candidates or trying to rank them. With the roles I have in mind for the R Centre of Excellence, obviously, strong R skills are prerequisite for that. But then after that, you get into these things of, well, has this person got experience building a community? Because obviously, that's another big feature of this role. Have they got experience at problem solving? So the kind of translator role, if you will, that we talked about before of when you hear a problem, your mind's thinking over, how do I solve that problem using things I know about packages or apps or functions?

And so if the person is really strong on one or more of those things, then you can often fill in the blanks. And, again, I'm sorry, but it's largely role dependent. There are very few people that arrive fully formed and that you can just go, yeah, you, get to work. And I would just say that I think that the people that we bring in, you know, you would try and help them plug the gaps that they have. But I think it's hard if someone's completely missing something. If they have some experience, but you know, it could be increased or better, then you can work on that. And, you know, as long as the kind of core things you're looking for are there. I mean, I'd be delighted to hear how other people who are hiring try and make that call. Because it seems to me it's very hard.

Just to add, in my experience, the issue is finding people who can wrangle data rather than do machine learning. Everybody wants to do machine learning. You have to understand the data before you can learn from it.

Yeah, yeah, that's, that's certainly true.

SAS vs. R and open source adoption

Okay, what a weird angle. Yeah, so if I remember correctly, my question was about any pushback that you've had with sort of introducing R into the workspace as opposed to the sort of assumed giant within pharma, SAS, assuming that there's many people in there that are very amazing SAS programmers who still blow my mind today. But have you had to work with that sort of, well, we already have SAS and SAS programmers, why R?

Yeah, that's certainly true. I mean, it's, it's fortunately that decision kind of has to come from the managerial level. So the kind of top level within the programming organization has, you know, if they make that call that we are going to transition from SAS to R, you know, I can't push from the bottom up because the number of people is just too big. But if that person decides and it comes down from, from that level, then you've still got a battle, because there are some people who, you know, feel, well, I'm a competent SAS programmer. I know nothing about R. How the heck do I make a switch? And that's where, you know, you need to train and keep training and, you know, try and help those people make the switch.

There are many good kind of SAS to R training courses available now. And, you know, there are many R training courses, which are really excellent. One of the places that I find you can do a really quick win is if the SAS programmer knows PROC SQL, then if you show them dplyr, it's kind of like, well, okay, those two things look broadly similar. So you just swap those in and you're off. And the other place, of course, is in the back end of stuff, of visualizations and Markdown and, you know, things like that. You know, for a SAS programmer, those are things that SAS doesn't do quite so well. And so you can win hearts and minds in that sense. And then you're kind of gradually migrating towards it.
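To make the PROC SQL and dplyr comparison concrete, here is a minimal sketch. The `trial` data frame and its columns are hypothetical, invented only for illustration, and the PROC SQL query in the comment is an approximate equivalent rather than code from the speaker.

```r
library(dplyr)

# Hypothetical toy data standing in for a clinical data set.
trial <- data.frame(
  arm = c("placebo", "placebo", "active", "active", "active"),
  age = c(54, 61, 47, 66, 58),
  response = c(0.8, 1.1, 2.3, 1.9, 2.6)
)

# Roughly what a SAS programmer might write in PROC SQL:
#   proc sql;
#     select arm, count(*) as n, mean(response) as mean_response
#     from trial
#     where age >= 50
#     group by arm
#     order by arm;
#   quit;

# The same query expressed as a dplyr pipeline.
summary_by_arm <- trial %>%
  filter(age >= 50) %>%             # WHERE
  group_by(arm) %>%                 # GROUP BY
  summarise(                        # SELECT with aggregates
    n = n(),
    mean_response = mean(response),
    .groups = "drop"
  ) %>%
  arrange(arm)                      # ORDER BY

print(summary_by_arm)
```

Each dplyr verb lines up with one SQL clause, which is why the two tend to look familiar to someone who already thinks in PROC SQL.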

Following up on that too, there was an anonymous question that was what's the proportion of SAS versus SPSS versus R versus Python at your organization?

Well, within the stats organization, it's about 50-50 SAS and R. Being a statistician, of course, I'd like a breakdown by, you know, years of experience, because it seems to me that people who are coming in now to the organization are much more likely to have seen R at college and, you know, done stuff at college with R compared to SAS. Because it seems to me that, you know, in the way that we used to work, we would bring people in and then train them in SAS because the cost of entry there is fairly high. I can't speak about the rest of the organization, but yeah, that's what I've seen just within the stats group.

Thank you. Another question that was asked a bit earlier was what is the trend of data science jobs in big pharma like Pfizer? Will adopting R in big pharma rather than SAS increase the data science jobs?

I don't know that it's related to the software. You know, the thing that's indisputable is that the amount of data we're getting is just getting more and more, and more and more groups within the organization are turning to people who can wrangle that data and make decisions and present back evidence for making a decision using tools, whatever those tools are. And back to the original point that Frank made, is that person a data scientist? You know, that's for someone writing the job role, but we're getting much more data, much more is being asked of us with that data. And being a regulated industry, we need to be able to present that data back to regulators and to underwrite our claims and provide evidence for them to make a decision. I'm sorry if that's horribly vague.

Shiny and tooling choices

But Sam, I see you asked about Shiny. Do you want to jump in and ask that question?

But Sam asked, how do you see Shiny being used at Pfizer in the next few years?

It's increased a lot over the last five years, certainly. I'm seeing many groups who feel that everything needs a Shiny app. Every solution is Shiny shaped. Because the instant that people want to do something interactive, that's where they turn. But with Shiny comes an overhead, right? Because you've got to host it somewhere, you've got to maintain it. And then you've got to think about how to not just maintain that code base, but then improve on it and make sure that it works with the next generation of whatever.

So I'm trying to kind of be a counterbalance to that to say, is the interactivity you're looking for served via like Plotly or via a parameterized R Markdown report or something like that?

But the interactivity that people have with Shiny means that you can set something up and provide guardrails for whoever's consuming that app to look at their data more, slice it, dice it, look at subsets and just understand what's going on in the data much better. So, yeah, I can see the appeal, but I'm also trying to say, well, is that really the right tool for this situation?

Do you also have Tableau or some other kind of visualization platform? So there's a balance there, right? Because you don't want to start replicating stuff in Shiny that can clearly be done in another platform that you're honestly already paying for.

Hey, Zeus, I see you have your hand raised if you want to jump in.

Yeah, I just had a question. So you were seeing this push to relabel some data scientists as research software engineers. And so that's what I would describe myself. So I'm at a U.S. national lab now, and we have people that come to us that have data science problems. And so we help code that product or that problem in R and maybe deploy it until we have an RStudio Connect server. Would you feel that's kind of the best way to describe what data scientists, some data scientists are becoming? Or do you think adding this term research software engineer into the word cloud is just going to be more noise?

And of course, there's also MLOps, which kind of describes it too.

Yeah, I know what you're talking about. And I think that, you know, you could also describe large parts of what I see some colleagues doing as research software engineer. That would describe quite well what some people are doing. But again, it's back to terminology. And it's, you know, if you describe your role as a research software engineer, then will anyone outside of that group really know what you're doing? It may also, to be honest, it may depend on what part of the organization you're sitting in. Right. So if you're more aligned to an IT department, you know, people might be like, yeah, I totally understand where you're coming from and what you're doing. If you were within statistics and you described yourself as a research software engineer, people might say, well, what are you doing sitting in statistics?

You know, and it's just, I guess, with a company the size of Pfizer, you know, and with even the size of the statistics group, which is 200 people, you know, you then have to try and tell the people within the organization what it is you do.

Yeah. Well, you know, I know in pharmaceuticals and a lot of my classmates went on to do this and they became SAS programmers, right, at pharmaceuticals. And you can consider that a software engineer, right? They really don't do any statistics anymore. They just help people write SAS code. Or, you know, if you had someone that had some R code and they wanted to package it and, you know, they don't know how to make a package, they don't know how to deploy that package, they don't know how to host this package, that's where a research software engineer would step in.

Yeah. And I'm sure there are people within the organization who are essentially doing that. In fact, one of my colleagues that I work closely with is probably doing something very much like that job, but I'm not sure what their official job title is.

Building the R Centre of Excellence

Yeah, I was, let me scroll up to what I actually said. Yeah, I was wondering if you'd worked with any synthesis chemists who had turned into data scientists and particularly asking in the sense of what do you tend to lack from a data point of view? I mean, from my point of view, you know, my entire background is in chemistry. I would like to stick with that, but turn it into more of a data perspective on chemistry.

Yeah. We had a discussion internally with some folks in the pharmaceutical R&D group, or pharmaceutical sciences group, who are the colleagues who are designing the, you know, turning the molecules into tablets essentially. And, you know, the skill set they were looking for in people doing data science, there was a massive overlap with what the statistics part of the world were looking for, you know, in terms of managing code, writing the appropriate tools, maintaining them, rolling them out, as well as machine learning bits.

I think it depends on whether you're saying, I want to become a data scientist and then come and work over here within a completely different realm, you know, do data science, but in finance or do data science somewhere else. As we discussed last week, I think where we were talking about is it easy to, you know, come with or the question that was asked earlier, if you come with the domain knowledge, you know, we can teach you some bits over here, but if you move to some other bits, then you're relying on like the data science bit being really super strong and then learn the domain.

Now, I personally think that the domain knowledge I've picked up allows me to more effectively solve problems within my domain. Right. So, you know, when a scientist comes in and can't really express what they want in terms of, you know, nuts and bolts, is that a Shiny app? Is that a Markdown? That bit doesn't matter to me. It's like, tell me what it is you're going to do with this, you know, user story. Then I can think of, you know, how do I solve that problem?

That's good. That's helpful. Thanks.

Thanks, Mike. And thanks, Nessa, for the question too. I see Sam, you asked one in the chat and I can read it for you. I actually wanted to ask Mike the same question. You said it's really cool to see Pfizer's commitment to R programming via the R Centre of Excellence. What led to the creation of this initiative within Pfizer?

So. As I said previously, we are very kind of decentralised, disparate folks. So there are subject matter experts throughout the organisation. And what we saw was that it's hard sometimes to get an effective strategy across people. You know, because it's such a big company, if six people within statistics, that's like six out of 200, are doing some cool stuff in stats, then it's easy for each other to know what they're doing. And do we have six people who are all trying to write the same function or write access to the same data? Could we not then say, well, let's solve this problem once and then make that into a package and serve that out to everybody so that then that streamlines their workflow for the future?

We also saw that there are places in the organisation, and as we were kind of discussing earlier, that lots of people want data sciencey type stuff or research software engineer type stuff of build some stuff. If they don't have an R subject matter expert out there, we want to be able to help them solve their problem and kind of set them up with a proof of concept or build something quick, pass it on to them and say, well, here's an R Markdown report. If you need to change it for a different endpoint or for a different data set, then change it here. But other than that, you can then see which bits to tweak.

Because we're trying to serve such a big organisation, we're just seeing that having a centralised place that can be strategic. Well, let me put this positively. There's a benefit in being able to solve problems strategically. So we're not just, you know, get the problem, solve it, get it out the door, move on to the next. But we're trying to say, what pieces of Lego can we capture so that the next time we see a problem like this, we can lift that up and go, OK, we're just cobbling together bits here to solve problems quicker or to break things down like you would in building a package to say, if we put these fundamental functions into place to access data or to render this as a this thing, then we can offer those to the organisation via a package or via something that allows other people to solve that problem for themselves.

Because if you don't have that centralised place, then I fear that many people would just go, I just need to solve today's problem. This needs to be done by tomorrow. I don't have a week to spend on this. But if we've spent a week earlier, then you could get it done much quicker.

I guess part of my question there, too, is that adding on to what Sam asked is that I think a lot of big companies know they need something like this. And like they see all these challenges, but how do you actually go about making it happen and getting approval to make that a real life thing?

I was very lucky that in September last year, Doug Robinson joined Pfizer from Novartis and Doug had set up exactly like this RCOE in Novartis. So when he walked through the door, I pinned him against the wall and said, we need what you had. We then went to the head of SAP programming and said, how about it? We'd like to do something like this. Fortunately, she said 110% behind you. So then she could then be our primary sponsor and set the next layer up to say this is what we'd like to do. But it's still quite the thing when you make the pitch to the global head of biometrics and data management to say we'd like to set up this group and have them endorse it and say, yep, we need to do that.

It's a great thing. Having a champion there is really what you need because if you're fighting to do this from the bottom, you're going to have a long slog ahead.

Thanks, Mike. So one more question about that then. So you said finding a champion, but is there another tip that you'd give all of us listening in if we wanted to do this?

Well, I was really lucky because I found someone who'd done it before. So Doug had the expertise of who do we need to tell? What do we need to tell them? What is our raison d'etre? What's our purpose for being? And he had all that kind of ready to go. I suspect this is probably something that we could add to your community's website, right?

Yes. I was thinking that would be helpful. What's the rationale for it? Who do you have to speak to? Who do you have to convince? How do you convince them?

Changes over 29 years at Pfizer

I know there was a great question that came in anonymously that I missed at the very beginning, but it was in those 29 years of being at Pfizer, what have been the biggest changes you've seen in the space at that time?

That's a really good question. I've been here long enough that when I started, Bayesian analysis was, just don't talk to me about that, Mike. Nobody uses Bayesian methods around here. And that's flipped now so that people are using Bayesian analysis. The primary analysis of the vaccine trial was a Bayesian analysis. So that's great.

And this is kind of related to something that Libby brought up earlier, which is about decision making. We have moved much, much more to a situation where decisions are being made off the back of data. Rather than off the back of project teams kind of saying, well, I saw the last time we had a compound like this or last time we were in this situation, this is what we did. So that's what we should do this time. I mean, it hasn't been like that for a long time, but still, I think it's a big deal that the decision making is now much more, you know, show me a prediction, show me a simulation, you know, help me make this decision using data that we've got and off the back of a model or a prediction or simulation.

Working with IT and open source resistance

Thank you. Andy, I see you asked the question in the chat that it looks like a lot of people have been weighing in on as well. If you'd want to jump in and ask that live.

Hi. Sure. Thanks for taking the question. So any tips on how to work with the IT folks when we've got folks in the team who are actively opposed to the open source tools? And I'll caveat on that. I work in government. So folks in public health are pretty protective of what kinds of tools are installed in even their cloud environments. And so we're having to do a lot of re-informing and unlearning because the R they think of when we talk about it is probably circa 2007 R. And, you know, there's really quite a silo between folks that are gearing themselves towards being data engineers in IT with IT backgrounds where they're basically, you know, managing a data center and the rest of us just using the tools they want to deploy, which has been fairly limited to the Microsoft suite. And, you know, while there's sort of a community welling amongst the analysts across various programs using RStudio and wanting to use enterprise-level tools so that we can do code sharing in a more managed environment than the kind of Wild West we're working with right now. So any tips on that?

Okay. So, yeah, I think actively opposed is a hard one to fight, right? Because you'll then find people will still spread fear, uncertainty and doubt. And that's hard to battle against. In terms of my work at Pfizer, for over 20 years now, the FDA has said that they don't endorse any specific tool for analysis. Still, there's a perception within the company that, oh, if you're submitting to the regulators, to the FDA, it has to be in SAS. They make the request that the transfer files that you pass to them of the data are in SAS format. But they've said for many, many years that they don't care what software you use and they can't endorse any specific software because they're a government agency and they wouldn't stand behind a particular company and their software.

So we've got that going for us. But the other part of it that I think is easier to manage is to take away the kind of wild west side of things. So when we deploy R at Pfizer, we talked earlier about R on the high performance compute grid. What we've got now is a process that says we'll take a certain version of R, a certain set of packages and certain versions of those packages. We build it, we test it, we document it and then it goes under change control so that there are no sudden changes to that instance of R. And it's getting easier now because we've got containers for that. So we can test and validate and qualify and document the container and then deploy it in various places.

What that then means is that no one can get in and install packages in that environment unless it goes back through that whole process. So that kind of then takes that wild west away in that framework. And I'm also militant about telling my colleagues not to run anything in production on their desktop, for the reasons that we discussed earlier: I can't tell what they've got and I can't tell what state it's in.
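As an illustration of the locked-down environment described above (not something shown in the talk), here is a minimal sketch in Python of checking an environment against a pinned package manifest. The package names, the versions, and the `find_drift` helper are all hypothetical; a real qualification process would involve far more than a version comparison.

```python
# Hypothetical manifest for a qualified release: package name -> pinned version.
# These names and versions are illustrative only, not an actual package set.
LOCKED = {"dplyr": "1.0.8", "ggplot2": "3.3.5", "gt": "0.4.0"}


def find_drift(locked, installed):
    """Return {package: (status, locked_version, installed_version)} for every mismatch."""
    drift = {}
    for name, version in locked.items():
        found = installed.get(name)
        if found is None:
            # A pinned package is absent from the environment.
            drift[name] = ("missing", version, None)
        elif found != version:
            # The package is present but not at the pinned version.
            drift[name] = ("version_changed", version, found)
    for name, found in installed.items():
        if name not in locked:
            # The package was installed outside the change-controlled process.
            drift[name] = ("unapproved", None, found)
    return drift


# A hypothetical desktop environment that has drifted from the release.
desktop = {"dplyr": "1.0.9", "ggplot2": "3.3.5", "tables": "0.9.6"}
report = find_drift(LOCKED, desktop)
for name in sorted(report):
    print(name, report[name])
```

An empty report means the environment matches the manifest exactly; anything else is the kind of unknown state that makes a desktop run hard to support.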

So with those two messages, I'm kind of saying, well, look, anything that's on production needs to come from here. You can do what you like on your desktop, but if you then try and run it over here, you can't come and tell me, oh, I need to get this package in there and I need it by tomorrow morning. Because the answer is, well, it needs to go through that whole cycle in order to have the confidence that this snapshot version here is still valid.

My personal view is that if you lock down something too much, people will find workarounds. So if you get told you cannot use open source, you must use this set of products, if that person wants to, they'll find a way to get around that and to sort it out. So if this way of doing things, the official way, the good way of doing things is easy, then hopefully people will do that rather than this nasty workaround over here.

I like that. Make the good way of doing things easy.

Yeah. So I kind of have a little bit of experience with trying to bring leadership into line to use open source tools, especially from a financial institution that sees everything as risk. And actually giving examples like Pfizer, like Accenture, like NASA, that they use R and RStudio consistently, it opens their mind to be like, oh, maybe it's not so bad. Or, for example, at the beginning, we had this thing with the Microsoft suite too, but a 2017 version of Visual Studio Code has an R tools integration. So until you can prove your point of why it's good and how it's validated, I think there are ways around it. Sometimes it just takes dedication and finding the right combination, I would say.

I had a quick comment. This is Santiago. On the subject of how to get folks to adopt open source, I once met a small team at a bank and they had this struggle, too. And they worked with the regulators because what they were working on was regulated. They used R open source libraries and they worked with the regulators to get everything audited and approved. It took a little bit longer and it was a new process. But by the end of it, they had an open source process that they adopted and were able to replace their old SAS systems with. So it's doable.

Reproducibility and code standards

So the question is, how reproducible is your analysis? What's the probability of running an analysis from one to five, ten years ago and getting the same answer?

In terms of just the analysis, I would say that our probability is high, assuming that that analysis has been run on our high performance compute grid as a batch job. So if I take a job that was run a year ago and I rerun it on the grid, then because that tool is still there, I should be able to completely reproduce that. As you go further forward, ten years from now, I'm hoping that we'll still have 90%, 95% probability because we have containers and things are locked down. The place where it gets more challenging is from ten years ago to now, because the compute grid is a shared resource across many, many different lines in Pfizer, so the environment that the tools are sitting on may have changed slightly. So compilers may have changed slightly, things like that. I would say that the probability is still high, but it's not as high as ten years from today.
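One way to make "getting the same answer" concrete is to compare a fingerprint of the rerun output with one archived from the original run. The sketch below is an assumption for illustration, not a description of Pfizer's process: the `fingerprint` helper, the result labels, and the six-decimal rounding are all hypothetical. Rounding before hashing is one simple way to tolerate the tiny numerical differences that a changed compiler can introduce.

```python
import hashlib


def fingerprint(rows):
    """Hash (label, value) result rows after rounding each value to 6 decimals."""
    digest = hashlib.sha256()
    for label, value in rows:
        # Fixed-width formatting keeps the hash stable across tiny float noise.
        digest.update(f"{label}={value:.6f}\n".encode("utf-8"))
    return digest.hexdigest()


# Hypothetical summary statistics from an archived run and two reruns.
archived = [("mean_auc", 12.3456789), ("sd_auc", 1.2345678)]
rerun_same = [("mean_auc", 12.34567890001), ("sd_auc", 1.2345678)]
rerun_changed = [("mean_auc", 12.35), ("sd_auc", 1.2345678)]

print(fingerprint(archived) == fingerprint(rerun_same))
print(fingerprint(archived) == fingerprint(rerun_changed))
```

A value that happens to sit right on a rounding boundary can still flip the hash, so a real check would more likely compare values within an explicit tolerance.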

I apologise in advance if I derail the conversation, but I'm just curious as to whether Mike thinks that his role as either a data scientist or statistician has influenced his music.

Yes, it does actually. Thanks for asking that, Stephen. I'll put the link to my Bandcamp later on so everyone can rush out and buy it. Yes, I incorporate random stuff, generative stuff into my music. So yeah, randomness and probability and all of that features highly.

Thanks, Stephen. I see, Ethan Brown, you had asked a question earlier as well, about friction with introducing new tools. Do you want to jump in?

Yeah, so this is, thank you, Rachel. This is going back to the conversations around building a centralised tool for everyone to use. Did you have any sort of frictions trying to get people to use it? Because if people have already built their own tool, or they're used to doing things their own way, it might be a bit difficult to convince them to start using your tool.

Yeah, so if this is the kind of central container of R and packages, I always have this friction where someone will have developed something with a package that isn't in my set, the official set. And so they'll come and they'll say, hey, I really need this. Okay. If that happens, and it's a version of a package or a package that it makes sense we should have included and we just somehow managed to not include it, it's possible to kind of layer in that package. We would need to look at the dependencies of that package. So there's no point in layering in a new package that breaks all the old stuff because it needs a more up-to-date dplyr. In that case, I would just say to them, well, now is the time to go back to renv, do this and make it reproducible for you in this project as a special case.

As I said, the layering in of new packages involves retesting, re-qualifying, re-documenting, signing off, all of that kind of process. So it's not trivial. And I tend to think that if we get to that point where someone needs it for tomorrow morning, it's because they weren't paying attention earlier, when I said, you must use this version over here to do a production run. So it's possible, but it's resource intensive and expensive. But now we can kind of at least say, well, with renv, create your own little project, get your own set of those packages you need, because then it's reproducible for you using renv, and we'll deal with incorporating that package into the next release.
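The dependency check mentioned above, where a new package is not worth layering in if it needs a newer dplyr than the locked set provides, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the locked versions, the candidate's minimum requirements, and the helper names are hypothetical, and real R dependency resolution (for example as done by renv) handles many more cases.

```python
def parse_version(text):
    """Turn '1.0.10' or '1.0-10' into (1, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.replace("-", ".").split("."))


def unmet_requirements(requires, locked):
    """List the candidate's minimum-version requirements the locked set cannot meet."""
    problems = []
    for dep, minimum in requires.items():
        have = locked.get(dep)
        if have is None:
            problems.append(f"{dep}: not in locked set (needs >= {minimum})")
        elif parse_version(have) < parse_version(minimum):
            problems.append(f"{dep}: locked at {have}, needs >= {minimum}")
    return problems


# Hypothetical locked release and hypothetical requirements of a requested package.
locked = {"dplyr": "1.0.2", "rlang": "0.4.10"}
candidate_requires = {"dplyr": "1.0.8", "rlang": "0.4.0"}

problems = unmet_requirements(candidate_requires, locked)
if problems:
    decision = "use a project-level environment for now"
else:
    decision = "candidate for layering into the release"
print(problems, decision)
```

Comparing parsed tuples rather than raw strings matters here: as text, "0.4.10" sorts before "0.4.9", which would give the wrong answer.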

The thing that I sometimes find though, and this is why I try and get people off their desktops, is when person X develops some code, passes it to person Y and person Y goes, oh, it doesn't run. Mike, can you help? Even with renv, it's a pain to have to take all of that over to here and reconstitute it and rerun it and see what the problem is. I would much rather that people kind of work off this standard, because then I know if I use the Docker and you use the Docker, we can share code, we can share projects so easily.

Do you see moving towards RStudio Cloud, because that way you can design the same versions?

That's kind of what we have with the RStudio Server Pro. So when you fire up RStudio Server Pro, it will say, what version of R do you want to run? And basically that's pointing to the container and saying, use this container with this R and these packages.

Code style and documentation culture

Hey, thanks, Rachel. And Mike, this has been an awesome talk. I've loved hearing sort of your experiences and what you're doing over there at Pfizer. But one area where I have seen a lot of R programmers specifically, I mean, that's what I'm working in day in, day out. So this might expand to people programming in other languages also. But I've noticed a really big problem with larger corporations where you have small armies of R programmers. There's not always a consistent use of syntax or framing of the code or your apps or whatever it might be, making it borderline impossible to knowledge transfer and have someone come in and expect them to take over a complex app build that was done. And so I didn't know what might be the solution to this. I hate telling people we need to all code using, let's say, the tidyverse style guide or whatever it might be, because I don't want to know. I don't want to be able to differentiate between individuals when I'm reviewing code. My code should be indistinguishable from your code. But just in general, what are your thoughts on that?

You're right. The heterogeneity between people is massive. We have statisticians who will turn around and say, I'm not a programmer. And it's almost like, so you don't need to beat me with a stick about style guides because I'm not a programmer. There are clinical pharmacologists and pharmacometricians who will also tell me I'm not a programmer. But they are programming to get the work done for today and get it out the door and answer the problems and move forward. All I can do for them is to encourage them and say, look, if you use a style guide, then when this person comes to review your code, it will be so much easier for them to see what you're doing. And again, commenting and all the rest of that.

And in a sense, I encourage people to kind of use markdown for those statisticians and pharmacometricians who are doing work, because then you've got why am I doing this as well as the code that says, and here's how I'm doing it. When it comes to the programming, programming people, so the statistical programmers who are coding up for building visualizations and tables and reports, it matters more that those people follow the style guide. I mean, it's literally their job to write code that is maintainable and will last the distance and is easily reviewed. So in that instance, it's kind of, well, people ought to be following a style guide, especially if that code is going to be reused, because then it needs to be commented, tested, qualified, all the rest of it. For everyone else, I can suggest, but I can't really mandate that people follow a particular style.

I'll comment on that. And a strategy that I end up kind of using is, yeah, obviously it's great for the team as a whole, but it's also great for that person, for their future self. And everyone can think of a time where they're like, they look back at work that they've done. They're like, what was I thinking here? What did I do? And I think when you kind of pose it that way, you just can encourage them. But I think when they are a better version of themselves, three months from now, six months from now, that's also a helpful kind of way to kind of push it.

A thousand percent, yes. I mean, I've said that, you know, if I can, the polite way of saying it, and I'm kind of rephrasing what I've heard JD Long tweet in the last 48 hours is, right. When I open up code, even a month later, I often think, what the heck was I thinking of? You know, if I open it up a year later, it's like, I don't even recognize who wrote this mess. And that's where things like Markdown help, because you're leaving that breadcrumb trail that says, this is what I was thinking of, this is where I'm getting data from, why I'm filtering out this subgroup of people.

My quip that I made at RStudioConf a few years back was, you know, if you're writing more comments than code, then write them in RMarkdown. If you're writing more code than comments, write more comments and do it in RMarkdown. And I think that's leading by example too, where if you're building that community, when other people see your code, then they kind of are like, wow, it's so much more readable, so much easier, and you reduce that mental overhead, I think.

I'm at an age where I wander into a room and forget why I came in, you know, so that's usually only a minute.

Yeah, I was going to say, what we have is at the enterprise level, we have a commitment to clean code. And so, you know, they host workshops or they do lunch and learns, and they really try to get, you know, people to code better. It comes from the top down. So you might be able to find a champion or find someone to advocate for you on better code development. And worst case scenario is you just, you know, have a GitHub action or something that lints their code or styles it, you know, and at least there you have something that's a little bit better to read.

Yep. And there's, you know, Ctrl+I, is it, within RStudio, which will indent your code. Even that's a kind thing.

Yeah, can you hear me?

I can. I'm very curious if you're in an airport and heading somewhere.

And I'm a time zone behind, so I've missed most of the chat. I was going to ask, you may have already covered this, but for code standardization, do you have any good resources you can recommend for further reading?

No, is the very short answer. I would have to pass that back to the community here.

So for writing standardized code. If anyone has tips, feel free to put them into the chat or just come on, come on live too.

Sharing snippets is a good start. So RStudio has features where you can write snippets of code, so you can type a few words, press tab or shift tab, it will auto complete. You can share that snippet across the team.