Resources

Gagandeep Singh & Xu Fei | Yes, you can use Python with RStudio Team! | RStudio (2022)

video
Oct 24, 2022
18:23

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, everyone. I'm Gagandeep Singh, and I'm joined by my colleague, Xu Fei. We are solutions engineers at RStudio, soon to be called Posit. And we are here to answer a question that we get asked a lot, which is, can you use Python with RStudio Team? And the answer is, of course, yes, you can.

Very recently, I had to answer this question, so I'm going to start by sharing a story with you. I'm an Indian national. In order to be here in the U.S. to attend this conference, I need to get a visa. And as part of the immigration process, I have to get an interview with an officer, where they ask you questions like, where are you going? What will you be doing? So I told them that I'm going to this conference, I'm giving a talk about using Python with RStudio. The officer said, oh, I know Python, but I didn't know you could use it with RStudio.

Well, I told him that's exactly what my talk is about. Unfortunately, Officer Johnson could not be here with us today. But if you're in the audience, we're assuming that you're a multilingual data scientist who has access to RStudio products and wants to use Python with them. Or you're an R user who works with a lot of Python colleagues; you may or may not have access to our products, but you're looking for ways to collaborate with them.

The bike share use case

So it is one thing to say that you can use Python in RStudio products. We're here to show you how it works with a use case. We're in the D.C. area right now, and there is an active bike share program outside called Capital Bike Share. We went out and took the picture yesterday, and this is the station right outside of the hotel. As you can see, there are bikes docked into the bike station. When you go to the company's website, you will see a map that shows you all the bike stations. When you click on one of the stations, it will show you how many bikes are available right now from a live data feed.

Now, what it doesn't tell you is how many bikes are going to be available in the future. As we know, bike availability varies a lot by time of day and location, and if we want a smooth trip experience, we want to be able to predict how many bikes will be available. Because there's a live data feed, we can actually grab the data, store it somewhere, and build a predictive model so that we know how many bikes are going to be available in the future.

If you were here at this very conference back in 2020, you may have seen a talk from our colleague Alex that uses a similar data set and does everything with R packages. That project has been running successfully for over two years, so a lot of data has been processed and stored as interoperable data assets, which we'll get back to in a second.

When I joined the company last year from more of a Python data science background, Gagan and I started talking: since we have all this data, can we do something different? Can we do something entirely in Python, using our pro products? We wanted to do this for two reasons. Number one, just because we can. Well, that's not really a good reason. The second reason is that we thought this could be a very common scenario for our pro product customers: a team that works in one language wants to collaborate with a team that works in a different language, using interoperable data assets deployed entirely in the pro product ecosystem.

We were also thinking that the conference was going to be in D.C., so we'd actually get to go to D.C. and ride those bikes. And here we are.

Workflow overview

So let's recap the workflow for the bike prediction use case. As with any other project, it starts with the data source, which in this case is the Capital Bikeshare API, and the data is refreshed every day. The objective is to import this data, transform it, and save it either in a database or in an interoperable data asset like pins, depending on the frequency of the update. Then we consume this data and build a prediction model for the number of bikes at each bike station. Then we move this model to a deployable location where its output, the predicted number of bikes, can be shared with other applications. And speaking of other applications, we have to build a way to share the predictions with our users.

As with the previous talk, this work is already available and running successfully, all in R. So when Xu Fei and I looked at this use case, we realized that we already have access to this data, in the form of the database and the pins that the R ETL process keeps updating. In order to address the needs of a team with a similar workflow, where part of the work is in one language and part is in another, we decided to build the model and the dashboards in Python, all within our RStudio products.

So let's take a look at the application itself. It's developed in Shiny for Python, which you heard about yesterday, and deployed on RStudio Connect, which we'll get into in a second. You see a map of the D.C. area with circles containing numbers; these represent clusters of bike stations. If you click on one, the map zooms in, and eventually you'll see blue circles that represent the stations themselves. The bigger the circle, the more bikes are available, and if you click on one of the circles, you get a graph of the predicted number of bikes available over the next 24 hours.

RStudio Team products

So you've heard of RStudio Team and the RStudio professional products. What are they? Why do we need them? And how does Python fit in here? RStudio Team is actually an umbrella term for three professional products, namely Workbench, Connect, and Package Manager.

As Python data scientists, we need an environment in which to write our project's code. Workbench offers a centralized server-based environment that lets you code in your favorite data science IDEs, such as Jupyter and VS Code.

As your project matures, you may want to start sharing your content and deploy it somewhere, and that's where RStudio Connect comes in. It's designed exactly for that, and it lets you deploy a very wide variety of data science assets, especially interoperable ones such as pins and APIs, reports like Quarto documents and Jupyter notebooks, and interactive dashboards like Plotly Dash, Streamlit, and, of course, our favorite and the latest, Shiny for Python. And if you want to install packages from a centralized location, rather than directly from PyPI, you can use Package Manager for that.

What we're going to show you in the next few minutes is a workflow: developing the model in Workbench using existing interoperable data objects, deploying it to Connect, then coming back to the application itself, making a change, and updating the live Connect deployment.

Demo: model building in Workbench

So let's see the demo. As Xu Fei mentioned, we already have access to the data, so we're going to start with the model building process. In order to start working, I need access to an IDE, so I go to our demo Workbench, and as Xu Fei said, it supports multiple IDEs. A controversial opinion: I like to start my work in notebooks, especially when I'm doing model building and data exploration, so I'm going to choose JupyterLab here and launch a JupyterLab session. I've already launched it, and this is the notebook I'm using to train the model and deploy it.

I'm beginning with the basics: importing all the packages I need, setting up my environment, and providing API keys and database passwords. I've also built some functions for repeatable code that I'll use later, and this is where I make a connection to the database, which stores all the bike data. This database is updated by the R ETL process, so I don't need to do the data cleaning and exploration. I can directly import this data as a data frame in my Python environment, create training and testing data sets, and start building my model.
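A minimal sketch of this step, with an in-memory SQLite database standing in for the real one (the actual connection URL, table name, and columns aren't shown in the talk, so the ones below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

# Stand-in for the database that the R ETL process keeps updated.
engine = create_engine("sqlite://")  # in-memory; the real URL is elided

# Seed it with a few rows shaped roughly like bike availability data.
seed = pd.DataFrame({
    "station_id": [31000, 31000, 31001, 31001],
    "hour": [8, 9, 8, 9],
    "bikes_available": [12, 7, 3, 5],
})
seed.to_sql("bike_availability", engine, index=False)

# Import the table directly as a data frame -- no cleaning needed,
# because the R ETL process has already done it.
df = pd.read_sql("SELECT * FROM bike_availability", engine)

# Split into training and testing sets.
train, test = train_test_split(df, test_size=0.25, random_state=42)
```

The point is the handoff: the Python side consumes the database exactly as the R pipeline left it.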

For this work we chose a random forest model; another iteration of this work could build different kinds of models and compare their performance. But let's stick with the random forest for this one. Now that I've built the model, I need to test it, so I create my testing set and compare the results with the actual data.
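A sketch of the fit-and-evaluate step with scikit-learn, using synthetic data in place of the real bike features (the actual feature set isn't shown in the talk):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: availability loosely tied to the hour of day.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "station_id": rng.integers(31000, 31010, 500),
    "hour": rng.integers(0, 24, 500),
    "weekday": rng.integers(0, 7, 500),
})
y = 10 + 5 * np.sin(X["hour"] / 24 * 2 * np.pi) + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the random forest on the training set.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compare predictions on the held-out set with the actual values.
preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
```

Swapping in a different regressor here is what the "compare their performance" iteration would look like.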

What I have done here so far is what we do every day as data scientists, right? And maybe you all can do it better than I did. But the next challenge for me is to move this model out of my notebook into a place where other applications can consume the predictions it's making. And this is where I'm going to use a combination of Vetiver and RStudio Connect.

Julia introduced Vetiver in her keynote yesterday, and Isabel gave a great talk about using Vetiver in MLOps, so I'm going to skip explaining Vetiver and show you how I use it in my work. To start, I convert the random forest model I just built into a Vetiver model so that Vetiver can interact with it. Once I've converted it, I pin this Vetiver model to an RStudio Connect board. Yes, this is the same pins package that is available in R, but it's now also available in Python, and Vetiver uses it for model versioning. Once I've pinned the model to RStudio Connect, I deploy it as a FastAPI endpoint on Connect itself, using the built-in Vetiver function deploy_rsconnect. I pass the name of the pin directly to the function, and it converts the model into a FastAPI application and deploys it directly to my RStudio Connect instance. So you can see, in under 10 lines of code, I was able to move this model from my notebook to Connect using Vetiver.

I've done the deployment ahead of time to save us time here. Let's see what it looks like. This is my deployed model, running as a FastAPI on RStudio Connect. I get access to the docs that FastAPI provides: I can see the API URL, see what features it takes, make a sample prediction, and play with it here. And because this API is running on RStudio Connect, I also get all the features that Connect provides: I can decide who to share this API with, change its runtime credentials, and manage it here.

Scheduling model updates

What I've done so far is run the model once on the current state of the database, but the R process updates this database every day, because the number of bikes changes every day, so the data keeps getting refreshed. My next task is to update my model with the latest data so that it always reflects the current data.

For this, I'm again using RStudio Connect, which gives you the ability to deploy a Jupyter notebook and schedule it. I'm going to deploy this notebook using Git-based deployment, one of the many deployment options in RStudio Connect. I'm using Git-based deployment to make sure that the code in my Git repo stays consistent with what is available on RStudio Connect.

To save time, since this is a short talk, I've already deployed this notebook on RStudio Connect, and this is what it looks like: the same notebook I had in my IDE. I'm going to use Connect's scheduling feature to create a schedule that runs this notebook every time the data is updated.

Demo: live deployment of the Shiny app

All right. So far we have seen the process of developing the model in Workbench, deploying it to Connect, and letting it run as a notebook on a regular basis. For this part, I'm going to show you the app itself; we'll make a change in the app, live-deploy it to our Connect server, and see how it works.

Take a look at the app again. This time I want you to pay attention to the color of these circles: they look kind of blueish. Keep that in mind; it's going to be relevant in a second.

I'm going to show you something here. We made this application in Shiny for Python. It's very exciting to use this package, and I learned a lot from it. In fact, I found the documentation and the online examples, those kind of magical WASM tutorials, really helpful. If you want to learn more about it, Winston has a talk coming up right after ours, so definitely stay for that.

What I want to highlight here in the code is the interoperability that really saved us a lot of time. As Gagan mentioned, the pins package is available in both R and Python; this is the Python pins, and I'm reading a pin generated by my colleagues Sam and Alex. That pin is produced by a Quarto document running on a schedule: it takes the bike station information, processes it, stores it in the pin, and updates it every day. So I don't have to do anything; I just read it into a data frame and use that data frame in my application, which is really handy for me. At the same time, I use the Vetiver endpoint to call Gagan's API and predict the number of bikes.

Next, I'm going to show you the update. Now, I mentioned the color, right? I don't really like the dark blue color; it looks a little depressing to me. Gagan, do you have any suggestions?

Yeah, let's make it orange, one of our new colors.

Okay, so let's change it to orange. All right, change made. What I'll do now is deploy to the Connect server. I'm going to run it right now, and I'll explain in a second what it means. You probably can't see everything, so I'm going to explain in general concepts. First, when you deploy to Connect: Connect supports a very large list of content types that you can host and deploy. The content we chose is Shiny, but if you want to use Streamlit or Plotly Dash, those interactive dashboards are totally supported as well.

On the other hand, you need to specify where the Connect server is, so I'm going to go to the app and refresh a little bit. You also have to pay attention to the virtual environment, because in Python it's very important to isolate your environment so that you have sufficient package isolation. When you send the command, it uses the rsconnect-python package, which allows you to programmatically deploy content to Connect from the command line. It wraps your content into a bundle first and uploads the bundle, which includes the application itself and the requirements.txt. Connect then looks for a compatible Python version on the server, uses that version to run the app, and uses the requirements.txt to rebuild the environment.

Each deployment is sandboxed: it's robust against future deployments, it does not impact previous deployments, and it's guaranteed to keep running, which is really handy for us. At the end you can see success, with the little green link here. Let's go back to the app, zoom in a little more, and see the orange circles. If you click on one of them, you'll see the predictions. All right: we made a change, and it's live on Connect again.
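As a rough sketch, the command-line flow just described might look like this with the rsconnect-python CLI. The server URL, nickname, and app directory are placeholders, and registering the server requires a real API key, so this is illustrative rather than runnable as-is:

```shell
# Register the Connect server once under a nickname (URL and key are placeholders).
rsconnect add --server https://connect.example.com --api-key "$CONNECT_API_KEY" --name demo

# Bundle the app directory (app.py plus requirements.txt) and deploy it.
rsconnect deploy shiny ./bike-share-app --name demo --title "Bike Share Predictions"
```

Connect rebuilds the environment from requirements.txt on the server, which is the sandboxing behavior described above.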

Wrap-up and next steps

So, let's take a look back at what we just did. We started with our base ETL jobs, which have been running and processing data on Connect. We built Python models in Workbench using Vetiver, pins, and all the good things, and deployed them to Connect so they keep running. And we used the content deployed on Connect, the API, in our Shiny for Python dashboard to serve our purpose: showing us the predicted number of bikes in the future.

Now, what does it mean for you? If you're already working in Python and you have access to the RStudio Team professional products, you can start using your favorite Python tools today. There's really no need to wait. And if you work with Python colleagues, we just showed you a very simple example, but I hope it can serve as a starting point for collaborating and extending capabilities across Python and R teams, all within what RStudio Team can offer. We definitely hope you can take it to the next level beyond what we did.

So what's next for you after this talk? We introduced a lot about our pro products today. Xu Fei and I, and most of our customer-facing team, are at the lounge right outside the room, so come see us and have a chat. All the assets we built, and all the code behind them, are public. If you have access to our pro products and want to start using Python, there's documentation on enabling Jupyter and VS Code sessions. And as Xu Fei was saying, you can deploy many different content types on Connect, so the deployment guide is also available. We also used some really exciting Python packages that were introduced as part of the conference, and there's great documentation available for those. That's our talk. Thank you so much.