December 2022 Webinar: The R Workflow – Dr Ryan Johnson from Posit
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone and welcome to our December NHSR Webinar, The R Workflow. My name is Lynne Howard and today I'm pleased to be joined by Ryan Johnson, who is a Customer Success Representative at Posit, formerly known as RStudio. Today's webinar is being recorded and will be available later on the NHSR Community website and YouTube page. If you have any questions, please put them in the chat and Ryan will either answer them as we're going along, or at the end of the session, depending on the amount of detail that is needed to answer them.
If you think of any questions after the webinar, please do feel free to contact us, either using our Twitter page or via our very popular Slack channel, both of which again will be shared in the chat, or you can find the links on the community web page, and community members will be more than happy to help you. At the end of the webinar, we're going to use Mentimeter to gather feedback, and that link again will be shared in the chat. Please do let us know how you found the session, and we do appreciate your feedback, so thank you. So without further ado, over to you, Ryan.
Great. Thank you, Lynne. Great intro. I apologize everyone for starting a little bit late. Hopefully we'll be able to play some catch up here and still sneak it in within the hour. But again, if we have questions as we go through, just feel free to pop those into the Teams chat. So today's session is going to be on the R workflow, and I know we're going to be collecting some survey data afterwards. I'd be really interested to hear what everyone thinks about this presentation because it's brand new. I actually had a previous version of it, but I didn't really like it that much, so I redid it for this presentation, so I'd certainly appreciate any feedback. It's going to be a bit of a whirlwind, so we're going to talk about a lot of different things, but the whole goal is just to expose you to a bunch of different topics and tools that you can use to improve your workflows in R.
Overview of the typical R workflow
All right, so we're going to dive right into it. I don't want to waste any time with any introductory stuff. We're just going to dive right into the typical workflow, and so this presentation will be focusing in on R, but I would wager a guess that most workflows are going to follow this trajectory, where you typically start with some dataset that can be a raw dataset, something like that, and then you're going to be analyzing it. Now, that analysis could be cleaning it up, it could be performing some machine learning, what have you, and then ultimately taking all those insights from the analysis and reporting it, and that report can look very different, whether it's a web application or static report.
So, focusing in just on the data to start, so with this workflow, there's a lot of questions about the data. Not all data is the same. So, what exactly is the data? What format? Is it a CSV file, a text file? Is it going to be some massive, you know, compressed file? Is it structured, or is it unstructured data? Next question is kind of where is the data? Is this going to be a public dataset, so something that really anyone in the world can access, or is it going to be private just to you or your group or your team? Are you working with local data, so something that's on your physical computer or server, or do you have to access this data in a remote location on some other server in the cloud? Is it in a database, so structured data in a nice clean dataset where you can just make calls to that database, or is it in a data lake, where everything is kind of thrown in together? Maybe it's behind an API, so you have to make specific calls to an API to get the exact data that you want.
How big is the data? Is this going to be a small dataset that can be easily consumed by your analyses, or is this going to be some massive dataset where you're going to have to think about some clever ways to deal with this huge data, and does the data change? You know, there's a lot of different datasets out there. Some are static. You know, today, it'll be the same dataset tomorrow, a year from now, 10 years from now, or this could be very fluid, so think about, for example, COVID data, and this changes basically every single day, so you have to think about stuff like that.
So, next in the workflow, you have your analyses, and this really can vary depending on what your goals are, so what type of analysis you're running. Are you doing that typical extract, transform, load, or ETL workflow where you're pulling in some raw data, doing something to it, and then loading it back to, like, a database or somewhere else? Are you doing some cleaning, modeling, simulations, data visualizations? Lots of different analyses. But a really important thing to think about is the compute power. Is this something that you can run on your individual laptop, you know, maybe just pumping out some quick plots, or is this going to be some massive, you know, simulations on multiple nodes, huge parallel processing jobs? These are things you have to think about as well.
And then, finally, we have reporting, probably the most fun part of the workflow, and this is ultimately how you're going to be informing the folks that, you know, want to consume the analyses you just performed. So, when you report something, you know, how is that report going to be delivered? Is it going to be hosted on a server, such as something like Posit Connect, which we'll talk about? Is this going to be something you email folks? You know, a lot of team members just love to live in their inbox, and sometimes you want to deliver those insights via an email. Or maybe you make your findings or your models accessible via an API, which we'll also talk a little bit about today. Similar to the dataset, does the report also need to be updated? Is this going to be a single report, where, again, it's going to have insights that are kind of a one and done, that won't change over time, or do you need to constantly refresh this report? And then, finally, what type of report are you going to be delivering? You know, static reports, things like R Markdown, Quarto, Jupyter Notebooks, these are all what we consider static reports, but then we also have web apps. And the web app that we're going to focus in on today is going to be built using the Shiny framework, which I would wager a lot of folks on the line are at least familiar with, but if not, that's okay, we're going to talk all about Shiny.
Introducing the demo Shiny application
So, to keep things nice and simple, this is the application we're going to be working with today, and hopefully, this application looks familiar, because this is the built-in Shiny application that comes with RStudio, all right? And so, what we're going to do here in a second is we're going to create this Shiny application and just walk through all the components to make sure everyone is familiar with this.
So, if you want to follow along, if you have an instance of RStudio open, whether that be in Workbench, or on your desktop, or in Posit Cloud, for example, feel free to follow along with this portion, or you can just sit back and relax, take it all in. So, I'm going to be accessing RStudio from within Posit Workbench, which is one of our professional tools, so it's a server-based implementation of the RStudio IDE. So, I'm going to click on Workbench.
All right, so we have this welcome screen here, and I'm going to go ahead and open up a brand new session, all right? So, I click on new session, and I have the option of all these various IDEs, so VS Code, Jupyter Notebook, JupyterLab, or RStudio. We're going to stick with RStudio, and we'll just keep everything else the same. So, I'm going to go ahead and start this session, and give it a few seconds to kick on, and it should plop us right into that RStudio session.
And while we go through this workflow, I'm also going to be touching upon some other best practices for when you're working on projects, all right, within RStudio, or really kind of any environment. And one of those best practices, so here we are within the RStudio IDE, we have a console on the left-hand side, our environment pane in the top right, and our file browser down here at the bottom. So, a good practice any time you're going to be working with a new project, or maybe you already have a pre-existing project, is always to leverage something known as RStudio projects. Now, if you look in the top right corner here, you can see project none, and we want to change that. We're going to make this a new project, all right. Now, why would you want to create a project? It's just a great way to keep everything isolated within kind of a centralized environment, all right. So, the scripts, the plots, reports, the packages, all that stuff is going to be contained within a single project. So, it keeps one project very isolated from another project, so you don't get any cross-talk and cause any conflicts.
So, let's go ahead and create a new project here within RStudio. We get a few options here. We can start with a brand new directory, which is what we're going to do. You can also, if you have some scripts and some data already in a directory somewhere on your file system, create a project within that directory, or you can pull in a new project from version control, something like GitHub, for example. I'm going to stick with a brand new directory. We're really going to start fresh here. Select new project. I'm just going to call this nhsrworkflow. You do get the option to create a Git repository. So, if you're going to be using version control, which we would highly recommend, you'd want to click this box. But for this demo, we're going to leave it unchecked. And then you have the option to use renv, which is a really great tool for keeping track of all the packages within this project. And again, it's highly recommended you do that as well, but we're going to leave it unchecked just for simplicity's sake for this demo. And we'll hit create project.
All right. So, now we're in a fresh RStudio project. And I know that because if you look in the top right corner here, you see nhsrworkflow. And you can also see my current path, which is listed right here in the top left. You can see I'm within that working directory. So, this is my new home directory for this project. And this is where we're going to put all of our scripts and analyses for this R Workflow. All right. So, we've created this fresh RStudio project. Let's go ahead and create that Shiny application. So, in the top left corner, you'll see this little green plus. This is a great way to just get started with various scripts, APIs, applications, documents, which we'll talk about here. You can also see Shiny web app. So, I'm going to go ahead and click on this. And I'm just going to call this test app nhs. I'm going to hit create.
All right. So, if you're not familiar with Shiny, this is a Shiny application over here on the left hand side. First and foremost, it is R code. All right. So, that's something that we believe very heavily at Posit is that all data science should be code based. Shiny is no exception here.
So, when you have a Shiny application open within RStudio, you do get this additional button right here to run this application locally within RStudio. So, I can do that.
And here we have the rendered Shiny application. Let me make my screen a little bit bigger. There we go. All right. So, here is my rendered Shiny application. And I can slide this far to the left. I can slide it to the right. And you can see it changes the number of bins in our histogram. All right. So, a very simple Shiny application, but it does demonstrate the power of Shiny, the ability to interact with your data and kind of get these live results.
Walking through the Shiny app code
So, because we're going to be leveraging this Shiny application for this demo, I want to make sure that we have a firm understanding of all the code that's going on behind the scenes. So, I'm just going to quickly run through it. We're going to start right here at line 10. Everything above line 10 is just comments from the author, so those aren't actually executed. You can see on line 10, we're going to load the Shiny package. After we do that, we define the user interface. All right. Now, for this session, we're not going to be really focusing in on the user interface. But I just want to quickly go over it so you know what's going on. All right. So, we're going to leverage a fluid page, a sidebar layout format. So, the first thing we have right here is our title panel. So, you can see that's reflected right here.
And then we only have one input, and that's going to be the slider input. We're giving it the input ID of bins. Then everything else is just going to be unique to the slider input, like the name, so number of bins, the min, max, and the default value, which you can see is set to 30. And then we have our main panel, which you can see is showing our histogram. All right. So, that's going to be shown right here as this plot. Again, for the most part, we're not going to be focusing on the user interface. But now you know a little bit more about it. But we are going to spend some time talking about the server. Server is where all of the code that runs behind the scenes of your Shiny application lives. So, we have a few things here. We're creating a plot. All right. We're saving it as this plot, which again gets sent to the user interface up here.
And then within this render plot, we have some data. We're going to use a built-in data set called faithful. Sorry to interrupt you, Ryan. Your screen seems to be cut off on the left-hand side. Okay. Let me see if I can move it a little bit. Does that help at all, or is it still?
No, that's not moved. Yep. We can see everything. Okay. Okay. Thank you. All right. So, we'll just have to give it a little squish, but that's okay. All right. Yeah. If anything like that happens again, definitely feel free to interrupt me. All right. So, just running through again all the code in our server function. So, we have our data set. This is the faithful Geyser data set. We'll talk about this in detail. We're going to save it to a variable called X. We have another line right here on line 41. This is really going to be our analysis, which we'll talk more about in detail. And then we have the actual code to draw the histogram using this function. And that's pretty much it. Down here at the very bottom is just a call to that Shiny app function to just run the Shiny application.
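For reference, the default application walked through above looks roughly like this. This is a sketch of RStudio's built-in Shiny template; the exact comments and styling may differ slightly between RStudio versions:

```r
library(shiny)

# User interface: one slider input and one plot output
ui <- fluidPage(
  titlePanel("Old Faithful Geyser Data"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30)
    ),
    mainPanel(plotOutput("distPlot"))
  )
)

# Server: recompute the histogram whenever the slider moves
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x    <- faithful[, 2]  # waiting times between eruptions
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = "darkgray", border = "white")
  })
}

# Bundle the UI and server into an app object (run with runApp(app))
app <- shinyApp(ui = ui, server = server)
```

Sliding the bins input re-runs only the `renderPlot()` block, which is what makes the histogram update live.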
When to engineer your workflow
So, now that we have a little bit more of an understanding of what's going on behind the scenes of the Shiny application, a hypothetical question for everyone here on the line. No need to answer it. Just something to think about. Let's say your business or your company, whatever you're building, absolutely 100% depended on this application running correctly and quickly. Would you be comfortable if your team or your company depended or relied on this application that we just showed you? So, this is something to think about.
Now, I would make an argument to say, yes, sure. I think this application is pretty good. Why is that? Because it's consistent. The data doesn't change. I'll show you what that data looks like here in a second. It's very simple. It just has one input, one output, one data set. And it's pretty fast. So, you saw as I slid the bar to the left and right, it responded pretty quickly. These are all characteristics of a good production-ready Shiny application.
But not all apps are going to be this simple. So, when we talk about this specific application, what are we actually talking about? So, we have a report. And this report is going to be a Shiny application. And within that Shiny application, so when we were going over that code, we had all the code for the analysis, and we had all the code that imports the data set. So, everything's basically contained within the Shiny application. And that's totally fine. You know, for this application, because it's fast, because it's simple, because it's consistent, you don't need to change this application. So, it's really important to know when you should and shouldn't overengineer applications, or any type of reporting for that matter. So, a good rule of thumb is that, you know, don't overengineer workflows if you don't need to.
But like I mentioned in the last slide, you know, not all apps are going to be this simple. So, it is important to know when to not necessarily overengineer, but to think about different ways to engineer your applications, your reports, so that you can scale accordingly.
The faithful Geyser dataset
So, let me just get everyone up to speed on what exactly this data is. So, looking again at our server function in our Shiny application, we're using something known as faithful. All right? So, this is a data set built into R. So, when you download R onto your computer or server, the faithful data set's already there for you. And it's just a good data set to play around with and try out some visualizations. And specifically, we're going to be extracting the second column as a vector and saving it to the value of x. So, this is just kind of a snippet of what this data looks like. I'm showing the first 14 rows. And we have two columns in this data set. You can see eruptions, which is the first column, and the second one is this waiting column. The first column is the duration in minutes of each eruption of Old Faithful, which is a big geyser in Yellowstone National Park in the western United States, and the time in minutes until the next eruption is shown over here in the second column.
So, as a reminder, we're going to be extracting the second column as our data for this application. And I'm showing you that data right here. So, this is the second column of the Faithful Geyser data set extracted as a vector. And you can see it's a little over 270 numeric values in length. And that's it. It's a pretty simple data set. You can see the numbers right here. It all fits nicely onto the screen.
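The dataset exploration described above can be reproduced in a few lines of base R, since faithful ships with every R installation:

```r
# faithful is built into base R: 272 eruptions of the Old Faithful geyser,
# with eruption duration and waiting time, both in minutes
head(faithful)

# The app uses only the second column: waiting time to the next eruption
x <- faithful[, 2]

length(x)  # 272 numeric values
range(x)   # waiting times span 43 to 96 minutes
```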
But again, going back to that previous slide, what if this data changed every single day? Like maybe they just continued to add data every single time Old Faithful geyser erupted. What if the data wasn't built into R? Maybe it was stored somewhere else and you needed to import it into your workloads. What if others wanted to share this data with your other teammates? Sure, it's pretty easy when the data is built into R, but what about if it's not? And then what if this data was not this small? Maybe it was actually millions and millions of rows in length and hundreds of gigabytes in size. That definitely kind of changes how you're going to approach this data set.
Introducing Pins and Posit Connect
So that brings us to our first workflow. So rather than simply having a data set built into the Shiny application, we're going to take that data set and we're going to save it as something known as a pin. And this pin, we're going to pin it to something known as Posit Connect, which is one of our professional tools. This is kind of our publishing platform, which we'll talk more about here in a second. But let's talk a little bit about pins. Maybe some of you on the call have heard of it, and maybe some of you haven't. And that's okay. I think pins is a really underutilized tool which can help improve a lot of your workflows. So pins is an open source R package, just like Shiny, something that we've developed here at Posit. And what it allows you to do is publish, or pin, data, models, any other R object to a board, right? And that makes it really easy to share across projects and also with your colleagues. And so, like I just mentioned, you can pin these objects to boards, and these boards could be a variety of things. But for this workflow, we're going to leverage Posit Connect as our board. So just like you take a piece of paper and pin it to a cork board, you can take your data and pin it to a Connect board. It makes it easy to update your data, and you can version it. So it just makes your data a little bit more flexible.
So we're going to pin our data to Posit Connect, but we need an additional tool to basically house all the code in order to do this. And we're going to leverage something known as Quarto. So Quarto is something we're really excited about. It's a brand new tool that we announced at our conference back in July, I believe, June or July. And it's very similar to R Markdown. So if anyone on the call is familiar with R Markdown, you can basically consider it R Markdown 2.0. But it's really tailored to scientific and technical publishing. And what's unique about it, as opposed to R Markdown, is you can create these using whatever language you want. So you can use R, which is what we're going to do, but you can use Python, Julia, Observable, and you can use whatever IDE you want as well. So we're going to stick within the RStudio IDE, but you can also create Quarto documents using VS Code, Jupyter, or any other text editor.
And similar to Pins, you can also take Quarto documents and host them on Posit Connect and set them up for job scheduling, which is a really cool workflow and something we'll actually, I'll talk about here in a second.
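A minimal Quarto document for this workflow might look like the skeleton below. The title matches the name used in the demo, but the chunk contents are illustrative, not the exact file from the webinar:

````markdown
---
title: "NHS pin data"
format: html
---

Explanatory text written in Markdown goes here.

```{r}
library(pins)

# Extract the waiting-time column, as in the Shiny app
x <- faithful[, 2]
```
````

Rendering the document executes the R chunks top to bottom, which is what makes it usable as a schedulable job on Connect.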
Creating a Quarto document to pin data
All right, so what we're going to go ahead and do here is we're going to take our data set, that second column from the faithful geyser data set, and we're going to pin it to Posit Connect. And we're going to do that using Quarto. All right, so I'm going to come back here to the RStudio IDE. So we have our Shiny application. I'm going to go ahead and close out of this, and I'm going to open up a Quarto document. So you can see in these starter scripts, we have Quarto document. I'll go ahead and select this. I'm just going to call it NHS pin data, and hit create.
And here is our Quarto document. And you can see by default, it's leveraging our visual editor mode, which just makes working with these documents really nice and pretty. But you can also edit them using source mode, which looks much more like your typical R Markdown. But we'll stick with visual because I do think it's nice to play around with. It does come kind of pre-built with some code and some text in there, but I'm just going to go ahead and delete all of this so we can start fresh. All right, so let's go ahead and take that faithful geyser dataset, and we're going to create a pin and pin it to Posit Connect. So we're going to step through this bit by bit. The first thing we need to do is load our packages. All right, so I'm going to go ahead and insert an R code chunk here, and I'm going to load the pins package with library(pins). So because we're taking this faithful geyser dataset and pinning it, we need to make sure we have pins. I'm going to go ahead and run this and just make sure I have pins in my environment. Looks like it loaded just fine, but if it didn't, you'd just have to install it.
All right, after that, we're going to go ahead and filter and save our data. I'll add another R code chunk here, and, just like in our Shiny application, we're going to save our data as an x variable and assign it the second column of the faithful geyser dataset. So I can run this and just make sure that looks good. You can see in my environment pane I have x, and I can print it down here. Yep, that all looks correct.

So the goal now is to take this data and pin it to Posit Connect. That's going to be the next section here, pin to Posit Connect. So I'm going to go ahead and copy a few things from that GitHub repository I shared in the chat. And the first thing here is our board. So I mentioned that with pins, you need a place to actually pin your pins, and in this example, we're going to be using Posit Connect. I have the URL to our demo server of Posit Connect right here. We call it Colorado. I have no idea why, it's just what we call it. So this is the actual Posit Connect server we're going to be using. And you do need to supply a Connect API key just so that Connect knows who's pinning this data set. So we're going to use the board_rsconnect() function from pins to basically register this board. So I'll hit play here, and you can see it's connecting to Posit Connect, and that looks correct. All right, so we've now registered the board, that's good.

And now we're going to go ahead and write the pin, using a very intuitive function called pin_write(). All we have to do is supply the board, so we'll just leave that as board, the data set x, and then we can give it a name as well. So I'm going to go ahead and call it faithful geyser data. And that's it. So I'll go ahead and run that code chunk, and you can see it's going to write the data as an RDS file to the faithful geyser data pin. So that's it.
Think of this as like saving it to like a Dropbox or an S3 bucket. We're just taking a data set and we're saving it to Posit Connect so that others can use it or you could potentially use it in other workflows.
Now I'm going to switch over to Posit Connect here. I'm just going to refresh and show you what this pin looks like. All right, so here is that data set we just pinned. I can click on it. It's not going to show you much, but what it does show you, which is really helpful, is the code you need to import this data set into another script or another workflow: loading pins, registering the board, and, instead of pin_write(), using pin_read(). All right, so we're going to use this here in a second, but this is what a pin looks like once it's hosted on Connect.
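The pin write-then-read round trip can be sketched without a Connect server by using a local temporary board. Here board_temp() stands in for the board_rsconnect() / board_connect() call used in the demo, and the pin name mirrors the one given in the webinar:

```r
library(pins)

# Local temporary board as a stand-in for Posit Connect; against Connect
# you would register the board with a server URL and an API key instead
board <- board_temp()

# The same data the demo pins: the waiting-time column of faithful
x <- faithful[, 2]

# Write the vector as an RDS pin, then read it back by name
pin_write(board, x, name = "faithful_geyser_data", type = "rds")
y <- pin_read(board, "faithful_geyser_data")

identical(x, y)  # TRUE: the pin round-trips the data exactly
```

Swapping the board object is the only change needed to move from a local test to a shared Connect deployment, which is the point of the boards abstraction.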
Publishing to Posit Connect and job scheduling
Now actually, let me go back to this document really quick. So this is a Quarto document. Now this data set, it doesn't change. Every time I run this command, it's going to be the same data set. But again, think about what if your data changed every single day and you might want to rewrite this pin every single day so it's updated with that new data. So I'm going to go ahead and save this document. I'm going to call it test Quarto pin geyser. I'm going to hit save. Now the first thing I want to do here is I want to publish this Quarto document to Posit Connect. Just like we published a pin, we're going to publish it to Connect, but we're going to use a kind of a canonical publishing workflow. So we're going to click on this little blue button right here. And we want to publish this to RStudio or Posit Connect. Publish document with the source code. So if you want to set it up for job scheduling, you do need to make sure you include the source code. It's going to ask, you know, what Connect server. So we're going to use that Colorado Connect, which I mentioned before. We can leave the title the same. We're just going to publish the single Quarto document, which is that QMD ending. We'll hit publish. So if you've never published anything to Connect before, that's pretty much it. Once I hit publish, RStudio takes care of the rest. It's going to capture my environment. So what packages I'm using, what versions of those packages, what version of R am I using. It sends all that information to Posit Connect. Connect reads it, replicates my environment, and then publishes this Quarto document.
So let's give it a few more seconds to run.
And then once it's done, it should automatically pop open in Connect. And here we have that Quarto document now hosted on Posit Connect. And what I'll do first and foremost is I'm going to open this up to everyone here on the line. So I'm going to set the sharing settings to anyone, no login required, and hit save. I'm going to grab the URL here at the top, come back into the chat, and paste it here. So now everyone here on the line can see that Quarto document we just created.
And once we have it here, one of the important features I wanted to demonstrate is job scheduling. So you can see over here on the right-hand side, we have the schedule tab. And let's say I want to update this pin every single morning. So I can schedule it, select my time zone, when I want it to start, and run daily. I can run it every day, every weekday Monday through Friday, or every other day. And that all looks good. Hit save. And now this pin will automatically be updated every single morning at 8:41 AM.
All right, so just a powerful way to kind of improve your workflows, especially if you have data that needs to constantly be updated. You can set it up to run using Quarto. You can do this with R Markdown as well, and even Jupyter Notebooks.
Building a Plumber API for the analysis
Okay, so coming back to our slides, and apologies if I'm going a little quick here. I know we only have about 20 minutes left. But this is our starting point. So we had our Shiny application, and we had all of our analyses and the data within the Shiny application. And what have we done so far? Well, effectively, we've taken the data and moved it outside of the Shiny application. All right, so now this data lives in a pin hosted on Posit Connect, with the help of Quarto.
So let's move on to our analyses. And I mentioned previously, there's really only one analysis in the Shiny application, and that is the calculation of this bins variable. So you can see we're using the seq function: it takes the min value of the x data, so that faithful geyser data set, and the max value, so min and max, and it generates a vector of length input bins. So whatever that slider bar is set to, it's going to be that number of bins.
So what does this actually look like? So we have our Shiny application right here, and we have the number of bins set to seven. All right, we can see this number right here is going to be set to seven, and then we get this vector right here, which is actually of length seven plus one, and I'll explain why. All right, so there's bins, and this example of seven computes to this numeric vector. And there are actually eight values here, because they correspond to every single border of these bins. So starting over here on the left-hand side, this left bin edge, that's numeric value 43, and you have one, two, three, four, five, six, seven, all the way to the right-hand side, which is eight. And those all again correspond to this bin vector.
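The bins calculation just described can be run directly in base R. With the slider at 7, seq() returns 7 + 1 = 8 break points, one for each bin edge:

```r
x <- faithful[, 2]  # waiting times, minimum 43 and maximum 96 minutes

# Slider set to 7 bins -> 7 + 1 = 8 evenly spaced break points
bins <- seq(min(x), max(x), length.out = 7 + 1)
bins
# 43.00000 50.57143 58.14286 65.71429 73.28571 80.85714 88.42857 96.00000

# hist() then uses these values as the histogram's bin borders
hist(x, breaks = bins, col = "darkgray", border = "white")
```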
Not a very compute-heavy analysis, but just for the sake of conversation, what if these analyses were more compute-intensive? So you had this massive simulation or machine learning model you're computing, that could take a long time to run, and it could leverage a lot of CPU and memory. And what if you wanted to access the results using a different language? So maybe you've created a model, for example, but you want to use your model from Python or Julia or something like that. How could you do that? So we're going to do that using another tool called Plumber. Plumber is a way for you to create APIs using nothing but R code. So if you're like me, when I first started my R coding journey, the concept of an API was so foreign and so scary to me that I didn't even want to touch it. But Plumber makes things really easy. We're going to go through an example here of creating an API. And really, ultimately, what you're doing is taking your normal R code that you've already written, and you're decorating it. And I'll show you what the decorations look like here in a second. But the one thing you do need to know for creating a Plumber API is how to write an R function. We're going to go over that here now.
All right, so let me just go ahead, and we're going to create and publish a Plumber API. So I'm going to come back here to RStudio, and within RStudio Workbench, we're going to close out of this Quarto document. I'll clean my screen here using Ctrl-L. And let's go ahead and create a Plumber API. So starting with that same starter script dropdown menu, you can see we have Plumber API right here. So go ahead and click on this. I'll call this NHS. And the goal of this API is to compute the bins. So I'll just call that, and I'll put API here at the end. And hit Create.
All right. So this is an example Plumber API. Similar to when you create a Quarto document or a Shiny application using this dropdown, it has some stuff pre-populated in here, but we're not going to worry too much about that. So I'm going to go ahead and delete pretty much all of it except for library(plumber). There are also these comments up here, just for the author; we really don't need those either, so I'm going to go and delete those too. So we're really just starting with library(plumber). The first thing, as I mentioned before, is that in order to create an API using Plumber, you do need to write a function. The whole goal of this function is to calculate the number of bins for your histogram. So I'm going to just create an example function. We'll call it foo for right now, and we're going to use the function function to create this function. All right, that's a lot of "function"s. It's only going to take one argument, and that's going to be the number of bins. Then once you've defined your arguments, we basically open up the body of our function. So we use these curly brackets, and all the code you compute is going to happen within those curly brackets. The first thing we want to do is obtain that pinned dataset. I mentioned before we have that pin on Posit Connect, so let's go ahead and access that.
Now before we do that, we first have to connect to that board again. So I'm going to come up here and copy and paste some code. We're going to library pins, making sure the pins package is loaded, and this is basically the same code that we had in our last Quarto document; we just want to make sure we have the Connect board registered. Once we have that, we can read in pinned data, so we'll do that in the next step. For this dataset, we're going to stick with x, and we're going to do pins::pin_read, which is very similar to pin_write. To pin_read we pass the board, and then we do need to give it the name of our pinned dataset. So if I come back here, you can see this is the name of our pinned dataset: it's my first name followed by the name we gave it. So go ahead and copy that and paste it here. All right, so that's going to pull in the data from the pin on Posit Connect rather than pulling it from that built-in dataset. Now once we have that, we want to calculate the bin breaks. This is the code we extracted from that Shiny application: we're going to use that seq function, we're going to find the min value of x and the max value of x, and the length out is going to be the number of bins, so n_bins. And I just want to make sure this is numeric, because one important thing with APIs is that inputs are sometimes fed in as character values, so I want to make sure this is converted to numeric.
All right, and then we also want to make sure we add that plus one, since the breaks include both borders of every bin.
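Putting those steps together, the function might look something like the sketch below. The pin name `"ryan/nhs_data"` is a placeholder (the real name is your Connect username followed by the pin name), and it assumes the pin stores the numeric waiting-time vector; adapt both to your own Connect setup:

```r
library(pins)

# Register the Posit Connect board (by default this reads the
# CONNECT_SERVER and CONNECT_API_KEY environment variables)
board <- pins::board_connect()

foo <- function(n_bins) {
  # Pull the pinned data from Posit Connect instead of the built-in dataset
  # ("ryan/nhs_data" is an illustrative pin name)
  x <- pins::pin_read(board, "ryan/nhs_data")

  # Calculate the bin breaks; API inputs can arrive as character strings,
  # so coerce n_bins to numeric, and add one to get every bin border
  seq(min(x), max(x), length.out = as.numeric(n_bins) + 1)
}
```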
All right, and that should be pretty much it. So let's make sure this works as intended. I'm going to go ahead and source this foo function, and if I look in my environment pane here, you can see the function foo is now in there. And I can try it out down here. So let's run foo(7). All right, we get a little message here, don't worry too much about that, but you can see we have returned the numeric vector of our bin breaks. So that works out well; this function is performing as we intended. I'm going to go ahead and delete the name
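For reference, once the function works, decorating it for Plumber might look like the following. The `/bins` route name is my own choice for illustration, not from the session, and `"ryan/nhs_data"` is again a placeholder pin name:

```r
library(plumber)
library(pins)

board <- pins::board_connect()  # reads CONNECT_SERVER / CONNECT_API_KEY

#* Compute histogram bin breaks from the pinned dataset
#* @param n_bins Number of bins requested
#* @get /bins
function(n_bins) {
  x <- pins::pin_read(board, "ryan/nhs_data")  # placeholder pin name
  seq(min(x), max(x), length.out = as.numeric(n_bins) + 1)
}
```

With this file saved, clicking Run API in RStudio (or calling `plumber::plumb()` on the file) serves the endpoint, and a request to `/bins?n_bins=7` would return the bin breaks as JSON.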
