Javier Luraschi | Datasets in Reproducible Research with 'pins' | RStudio (2020)

video
Nov 1, 2020
3:05

Transcript

This transcript was generated automatically and may contain errors.

Yeah, thank you everyone, and welcome to the Strange Downloads Lightning Talk, where you're going to learn how to use the pins package to make sure your data science workflow remains reproducible.

Now I know that most of you take reproducibility for granted. You think or you want to live in a world where everything is reproducible, and you know, to be fair, we have like great tools like rmarkdown, which was designed from the very beginning to be reproducible.

However, you know, you want to be able to copy paste stuff anywhere, right? You want to be able to grab R code and place it on a different R session. You want to be able to grab that same code and put it on a different machine, and that all should work really nicely. And that's the case. And if that's the world where you live, you're very lucky.

The upside down of data science

Well, let me tell you now about a very, very dark place, which I happen to call the upside down of data science. So in this world, not everything is as reproducible as it may seem, and you might find that there's code that requires very strange downloads.

So for instance, in this case, we have a package that supports Python and R, and when you copy paste the code, it basically doesn't work, right? It's going to fail, because one thing that you find over and over is that there's like a local file named whatever.csv, and when you run the code, it doesn't exist. So you need to scroll all the way up and figure out, like, how exactly to download the file, then you download it, then you put it on a specific path, you change the code to point to that path, and when you want to rerun the code in a different session or a different machine, you do it all over again.

Introducing the pins package

So is there a better way to do this? Well, today we're going to find a way of closing this awful portal to the upside down of data science with a new pins package, which basically, all it does, it allows you to download a remote resource locally.

So all you have to do is say pin, and then you have a URL, and it basically converts the remote URL into a local URL, and that's about it. Then you can make use of that resource, and not only that, but the pins package allows you to cache the resource, so if you rerun the code, we're not going to be redownloading this over and over again, and if you happen to lose internet connection and you run this code, the package is smart enough to not rerun the code and just use the cached version and make sure your code doesn't break.
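
The caching behavior described here can be sketched in base R. This is only an illustration of the idea, not how pins is implemented: the helper name `cached_download`, its `cache_dir` and `refresh` arguments, and the cache location are all assumptions made for the example.

```r
# Hypothetical helper: illustrates the caching idea only,
# it is NOT the pins implementation.
cached_download <- function(url,
                            cache_dir = file.path(tempdir(), "download-cache"),
                            refresh = FALSE) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_path <- file.path(cache_dir, basename(url))

  # Reuse the cached copy unless a refresh was explicitly requested.
  if (file.exists(local_path) && !refresh) {
    return(local_path)
  }

  # Download to a temporary file first, so a failed download
  # never overwrites a good cached copy.
  tmp <- tempfile(tmpdir = cache_dir)
  ok <- tryCatch(
    {
      status <- suppressWarnings(
        utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
      )
      status == 0
    },
    error = function(e) FALSE
  )

  if (ok) {
    file.copy(tmp, local_path, overwrite = TRUE)
    unlink(tmp)
  } else {
    unlink(tmp)
    if (!file.exists(local_path)) {
      stop("Download failed and no cached copy exists for: ", url)
    }
    # Offline or server error: fall back to the copy we already have.
    warning("Download failed; using cached copy of ", url)
  }

  local_path
}
```

With a helper like this, `read.csv(cached_download(url))` keeps working across reruns and without a connection once the file has been fetched, which is the role the talk gives to calling pin on a URL.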

But what else? I'm sure a lot of you work with data sets, and sometimes you tidy your data set, and then you wish you could share that with others, and the pins package also allows you to do that. You can say pin with a specific board, and then all you have to do is register your board, so for instance, you have boards for Kaggle, GitHub, RStudio Connect, Azure, Google Cloud, and S3, and whenever you pin a local data set, you basically are sharing it in these remote cloud providers or products.

That's about it, and let's see if I can get a little demo to work. No, it won't work, so thank you. That's it.