E. David Aja | You should be using renv | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
My name is David Aja. I'm a member of the solutions engineering team at RStudio, or if you prefer, I posit solutions. I help data scientists get their ideas into production, and in the course of doing that work, I see a lot of broken things. And so that's why today I'll be talking about why I think you should use renv. Probably.
So there are two parts to this talk. The first part is going to be about why I think you should do this, and then in the second part we'll talk about how. But let's start with why. And I think the reason that I'm kind of up here giving this talk is because I think we have a standards problem. And the way I would express that standards problem is that I don't know if we have a good answer to the following question, which is what makes a project an R project?
We're probably going to need to define some terms in order to make our way through that. So let's start with project. What is a project? This is the definition you get if you Google it, right? We've got an individual or collaborative enterprise planned to achieve a particular aim. I think that's a pretty helpful definition. But sometimes I like to try to define things functionally, right? A thing is what it does. And so what do projects do? Projects ship.
This is a slightly unusual way of talking. It's kind of a legacy of our history of writing software to physical media and then mailing it to your house. But this is kind of the paradigm that we have in mind when we talk about shipping projects, right? We've worked on something. I have it. I need to get it to you, and I want you to have a particular experience.
Some of you may be thinking about a particular file extension, where it seems like the answer is written right there. And so I'll just say I think the .Rproj file, the RStudio project file, is not going to be sufficient for our purposes here. And part of the reason that's true is because what's in the RStudio project file is editor configuration. And editor configuration is important to you. But when you send data analysis results to someone, they typically don't ask you how many spaces you put in a tab, right? And so the idea is you don't ship your editor.
A thing I think it's important to say, right? I'm making a lot of declarative statements here. Part of this is from a history of screwing things up pretty catastrophically. So I have actually shipped my editor. And then I ended up debugging my code in our general counsel's office. And that's not where I want any of you to be, right?
So I think we have to look for some better ways of doing work, right? So, you know, the RStudio project file is going to be necessary if you use RStudio. And I think an important thing to acknowledge, even though this is RStudioConf, is that not everyone does use RStudio, right? The RStudio project file is necessary. But for the purpose of helping us understand what a project is, it's not going to be sufficient, right?
The case for dependency isolation
So we have to look elsewhere for a definition of what a project is. And one of the things that I found really inspiring when I was trying to figure out how to more reliably deliver software was a website called 12factor.net. This website was published by people who work on the deployment platform Heroku, which helped people deploy software-as-a-service applications to the public. But a lot of the lessons that they published there, I think, are useful even in the data professional context. And one of the things they really stress is the importance of environment isolation, right? So a twelve-factor app never relies on the implicit existence of system-wide packages. And there are a couple of things that they say are important to making this happen, right? So you have to declare all your dependencies, and then you have to isolate them from other things on the system. And you have to make sure that no implicit dependencies leak in.
And I want to dwell on the "leak in" thing for a moment, because I'm going to give a sort of stylized example from my past to show you how this kind of thing can happen, right? So let's say I was working on a project. I did a lot of work on the 2020 census for an advertising agency I worked at at the time. And I delivered some dashboard, and people looked at the results, and it was good. And then someone asked me to do some data analysis at the census tract level. It needed a look-back of between one and seven days. There were like 80,000 census tracts. We were analyzing, at that point, several months' worth of data. And so I was like, OK, I need to do something. This might be a good case for the latest dtplyr. And so I installed dtplyr. I was able to work on the analysis, and everything was fine.
Except that, by installing dtplyr, I accidentally upgraded the version of dplyr that was being used in my dashboard, because the thing I haven't highlighted so far is that both projects depended on a shared user library. The answer is not in production.
And so I think I'm going to borrow Julia's excellent metaphor from yesterday. We want to find a way of working that stops it from being possible for us to lock our keys in the car.
How to use renv
So let's talk about how we do that, right? The first step is going to be to isolate your dependencies. And so in order to do that, you're going to need to install the renv package. And then you're going to initialize it. You have a couple of options for doing that. You can call the init function. There's also a little box you can check as you're creating a new project in the RStudio IDE. And that will overwrite your library paths, right? So by default, you have a user library and a system library. And renv will take that over and say, OK, now you have a project-specific library, and then your system library.
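To make the library-path change concrete, here is a minimal sketch in R. It assumes renv has already been installed with install.packages("renv"); the guard around renv::init() is only there so the snippet does nothing in a non-interactive session or when renv is missing, and the variable name paths_before is just for illustration.

```r
# Before renv: R searches these library locations in order,
# typically a user library followed by the system library.
paths_before <- .libPaths()
print(paths_before)

# Initialize renv for the project in the current working directory.
# This is normally typed once at the console; it is guarded here so the
# sketch is a no-op unless renv is installed and the session is interactive.
if (interactive() && requireNamespace("renv", quietly = TRUE)) {
  renv::init()
  # Once the project is loaded (RStudio may restart the session first),
  # the first entry should point at a project-specific library.
  print(.libPaths())
}
```

Comparing the two printed vectors is a quick way to confirm that the project is no longer reading from the shared user library.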
And so here's what that looks like if we go back in time and I properly isolate my libraries. My dashboard keeps using the user library, and the project library gets the upgraded versions. And that means that if I start working on new projects, say I start doing experimental things with new packages and tidymodels, I'm not really concerned about breaking other things in my workflow. And the other thing that's nice to note here is that you can adopt this incrementally. So you don't have to move all of your projects at once. You can see that my dashboard is still pointing at my user library. I can go and upgrade that surgically later on if I want to.
And you might be asking yourself, OK, well, you're making me maintain this isolated library. Does that mean I need to reinstall every package from scratch each time? The answer is no. Renv maintains a package cache. So once you install a package, it's available to you to reuse in other projects if they have the same dependency. So that first installation might take a little bit of time. But subsequent installations will go much faster.
Once we've isolated the dependencies on our system, the next thing we need to do is to discover what they are and write that down somewhere. And so calling the renv snapshot function is going to give us some information about which packages we have, the versions we use, and where we got those packages from. Because as we've heard, shared drives, RStudio Package Manager, R-universe, all of these are different places where you can get R packages. And so you want to be able to keep track of where you got them, as well as the version of R you were using for your project.
And so you can see here, this is a package I've installed from the R universe. And so we're keeping track of the information about the git commit. And we're writing all of this down in the renv lock file. And I think the renv lock file is the project artifact that we're looking for.
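Because the lock file is plain JSON, it can be inspected like any other JSON document. The sketch below parses a trimmed, made-up lock file to show the general shape: an R section with the R version and repositories, and a Packages section with one record per package. The version numbers and repository URL are illustrative values, not taken from the talk, and the jsonlite package is assumed to be installed. Records for packages installed from other sources typically carry extra fields, such as the commit they were built from.

```r
# A trimmed, illustrative lock file; every value here is made up.
lockfile_text <- '{
  "R": {
    "Version": "4.2.1",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cloud.r-project.org" }
    ]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.0.9",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}'

# Parse the JSON into nested lists (assumes jsonlite is available).
lock <- jsonlite::fromJSON(lockfile_text, simplifyVector = FALSE)

# The R version recorded for the project.
r_version <- lock$R$Version

# A named character vector mapping package name to recorded version.
pkg_versions <- vapply(lock$Packages, function(p) p$Version, character(1))

print(r_version)
print(pkg_versions)
```

For a real project, the same code should work with the path to the project's renv.lock in place of lockfile_text.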
Part of the thing that it's going to enable us to do is to collaborate more effectively. And so renv is itself very helpful in giving you status messages about what you need to do next. And this is going to be really helpful for contexts where either you're picking up an old project or you're sharing a project with a teammate who doesn't have the same package set on their computer.
So you can see in this case, I've pulled down a lock file. And packages are in the lock file, but they're not yet on my system. And renv is going to tell me what to do. And this is a similar kind of situation, where I started using a package, but I haven't installed it yet. Calling renv status is, again, going to tell me what to do to get back in sync. And if I remove a package from my project, then I want to remove those dependencies as well, because I don't want to ship unnecessary things to my deployment target, whatever that is.
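As a rough sketch of that check, the helper below wraps renv::status(), which compares the project library, the lock file, and the packages the code appears to use, and then prints a suggestion for what to run next. The helper name check_project is hypothetical; only renv::status() itself comes from renv.

```r
# Hypothetical helper: report whether a project is in sync with its lock file.
check_project <- function(project = ".") {
  if (!requireNamespace("renv", quietly = TRUE)) {
    message("renv is not installed; run install.packages(\"renv\") first.")
    return(invisible(NULL))
  }
  # Prints any differences between the library, the lock file, and the
  # packages used in the project's code, along with a suggested next step.
  renv::status(project = project)
}

# Typical follow-ups, depending on what status reports:
#   renv::restore()   # install what the lock file records but the library lacks
#   renv::snapshot()  # update the lock file after adding or removing packages
```

Calling check_project() from the project root right after pulling a teammate's changes is one way to find out about a mismatch before anything breaks.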
And so this is going to be the workflow. We're going to initialize a project in renv, we'll install the packages we need, we snapshot, we ship, and then later we can restore that environment. Once we have the project in renv, install, snapshot, ship, and restore.
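The whole loop can be sketched as two small functions, one for the person who ships the project and one for whoever picks it up later. The function names are hypothetical, and in practice these calls are usually typed at the console one at a time rather than wrapped up like this. The sketch assumes renv is installed and that the working directory is the project root.

```r
# Author side: isolate the project, install what it needs, record the result.
author_steps <- function(packages) {
  # bare = TRUE creates an empty project library without trying to
  # discover and install dependencies; restart = FALSE keeps the current
  # session alive so the following calls can run.
  renv::init(bare = TRUE, restart = FALSE)
  renv::install(packages)
  # By default, snapshot() records the packages the project's code
  # actually references, so the code should already use them.
  renv::snapshot()
}

# Collaborator side (or future you): rebuild the library from renv.lock.
collaborator_steps <- function() {
  renv::restore()
}

# Example usage, run from the project root:
#   author_steps(c("dplyr", "dtplyr"))
#   ... commit renv.lock alongside the code, then ship ...
#   collaborator_steps()
```

The "ship" step is deliberately left as a comment: what matters is that the lock file travels with the code, whatever the deployment target is.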
When renv may not apply
I did say "probably" at the beginning. There are a couple of arguments for why this may not apply to you: one I'll call a bad argument, and one better one. The bad one: "I don't intend to reuse this code. This is too much overhead." I think one of the things about being a data professional is that you are, in fact, going to have to reuse the code. Someone is going to ask you, hey, I got numbers from finance that don't match yours. What's going on? Can you rediscover that? And you're going to be in that cycle of iteration.
And so this is a tweet I pulled from somebody who was talking about this as I was avoiding working on this talk. And I think it really perfectly encapsulates that if you just treat things like they're throwaway, they're going to come back, and you should just work like your work is valuable.
The better argument is that if you are purely a package developer, then you are in a somewhat different circumstance, because the execution target is not necessarily getting someone to call your plumber API. It's CRAN reverse dependency checks. Renv does have some tools that you can use to isolate environments if you need to do specialized testing. And the article on that in the renv documentation is pretty excellent, as is the rest of renv's documentation.
So in summary, I think we have an opportunity here to coalesce around a project standard. I encourage you all to use renv. And I want to say a special thank you to Kevin, without whose work none of this would be possible. Thank you.
