RStudio Pro Product Lightning Series Meetup ⚡️
Transcript
This transcript was generated automatically and may contain errors.
Hi friends, happy Tuesday. Welcome back to the RStudio Enterprise Community Meetup. I hope everyone's having a great start to the week. If you've just joined, feel free to say hi through the chat window, and maybe share where you're calling in from. Today we're doing something a little different: we have three of our product managers with us for a lightning series, to hear what's new, ask questions, and provide feedback.
If you like this format, please let us know. I'm hoping the recording will be a helpful resource for others as well. So we'll have three lightning talks for you today. One, sharing internal packages with RStudio Package Manager. The second, running RStudio workloads in the cloud with Amazon SageMaker. And third, content execution in Kubernetes with RStudio Connect.
A special welcome if this is your first time joining us. This is a friendly and open meetup environment for teams to share use cases, teach lessons learned, and just meet each other and ask questions. The meetups happen every Tuesday at noon Eastern time. I'll share a link in the chat where you can find out about upcoming meetups too; it's just rstd.io/community-events. Together, we're all dedicated to making this an inclusive and open environment for everybody, no matter your experience or industry background.
So during the event, you are able to also ask anonymous questions if you wish through a short link, which I'll share here on the screen in a second as well. But we'll try to answer as many questions as possible and want to hear lots of questions from you all. Each of our product managers joining us here today provided a question as well, so that we can learn more from you too. And so that will be at the same Slido link. There's an anonymous poll there as well. But with that said, I would love to bring my colleagues Joe, James and Kelly on stage with me here as well to introduce themselves.
James, would you want to get started in introducing yourself?
My name is James Blair. I work at RStudio as a product manager for cloud integration, and I'm happy to be with you all today.
Awesome. Kelly, you want to go next?
Sure. I'm Kelly. I'm the product manager for RStudio Connect.
Yeah, I'm Joe Robertson. I am the product manager for the RStudio Package Manager.
RStudio Package Manager: sharing internal packages
All right. Well, hi, everyone. I wanted to do a quick lightning talk about sharing internal packages with RStudio Package Manager. I apologize in advance if I run into any technical difficulties.
So for those who aren't aware, Package Manager is one of our tools for centralizing the governance of your data science packages. It handles everything from mirroring public repositories internally: whether you're using CRAN, Bioconductor, or the public PyPI, you can bring that inside your network and host it from a server locally.
You can also curate custom subsets of those public repositories to make a new, curated subset repository. One of our primary features is being able to time travel to older versions of packages. We keep daily snapshots of all of these major repositories, so if you need to, for reproducibility or any other reason, you can easily roll back to a specific date of CRAN, for example, and reproduce your project and code against the versions of the packages that were in place when you developed them.
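To make the time-travel idea concrete, here's a minimal sketch of how a dated snapshot URL is composed. It assumes the public server's `/cran/<YYYY-MM-DD>` path scheme; a private server would follow the same pattern under its own address.

```python
from datetime import date

# Public Package Manager base URL; a private server follows the same
# /cran/<YYYY-MM-DD> snapshot scheme under its own address.
BASE_URL = "https://packagemanager.rstudio.com"

def cran_snapshot_url(snapshot: date) -> str:
    """Build a repository URL for CRAN frozen at a specific date."""
    return f"{BASE_URL}/cran/{snapshot.isoformat()}"

# In R you would then point at the snapshot, e.g. in .Rprofile:
#   options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/2021-06-15"))
print(cran_snapshot_url(date(2021, 6, 15)))
```

With a URL like that configured as your CRAN repository, `install.packages()` resolves every package against the state of CRAN on that date.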
Most importantly for today, though, we're going to talk about how you can share internally developed packages using Package Manager. The concept revolves around what we call local package sources: you build your packages however you normally would, using R directly or any of the usual tools, and then add them to your local source in Package Manager.
They're then stored there and visible as a separate repository, or the local source can be joined to a CRAN mirror source to make one unified repository that contains both your public and private packages. That makes things easy, because all of your users can access both CRAN packages and your local packages from a single repository URL in R, and similarly for Python with PyPI.
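As a sketch of that single-URL idea, here's what each client ends up pointing at for a unified repository. The base address and the repository name `internal-all` are hypothetical, and the path layout mirrors the public server's; your server's configuration may differ.

```python
def client_urls(base: str, repo: str) -> dict:
    """One URL per client for a unified repository.

    Mirrors the public server's layout: CRAN-style packages are served
    under /<repo>/latest and a PEP 503 index under /<repo>/latest/simple.
    """
    return {
        # R: options(repos = c(CRAN = url))
        "r_repos": f"{base}/{repo}/latest",
        # pip: pip config set global.index-url <url>
        "pip_index": f"{base}/{repo}/latest/simple",
    }

urls = client_urls("https://pkg.example.com", "internal-all")
print(urls["r_repos"])    # https://pkg.example.com/internal-all/latest
print(urls["pip_index"])  # https://pkg.example.com/internal-all/latest/simple
```

The point is that users configure exactly one address per client, and both public and internal packages resolve through it.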
Local sources also have more advantages than some alternatives, including support we just recently added for custom pre-built binary packages. For those who are new to the world of binary packages: most packages you receive from CRAN, especially on Linux, are source packages, which means that to install one, your local R installation has to download the source and build the package at install time.
CRAN provides pre-built binary packages for Windows and macOS, which you can just download and install without worrying about building. Through Package Manager, we extend that support to Linux as well: we build all of the CRAN packages ourselves for about eight different Linux distributions, so your users can install packages easily no matter what distribution or platform they're on. We also support packages built for multiple R versions, so if some of your users are on older versions of R, you can support all of them simultaneously, across multiple platforms.
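As one concrete illustration of the Linux binary support: the public server exposes distro-specific binary repositories through a `__linux__/<distro>` path segment. A small helper like this composes that URL; the distro codename `focal` (Ubuntu 20.04) is just an example.

```python
def linux_binary_repo(distro: str, snapshot: str = "latest") -> str:
    """CRAN repo URL serving pre-built Linux binaries for one distribution.

    The __linux__/<distro> path segment is how the public Package Manager
    exposes distro-specific binaries; snapshot can also be a dated
    YYYY-MM-DD snapshot instead of 'latest'.
    """
    return f"https://packagemanager.rstudio.com/cran/__linux__/{distro}/{snapshot}"

# Ubuntu 20.04 ("focal") users would set this as their CRAN repos URL:
print(linux_binary_repo("focal"))
```

Because the binaries are pre-built per distribution, `install.packages()` against such a URL skips the compile step entirely.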
One way we make this easier than the manual method of building your own packages and uploading them to Package Manager is through Git Builders, a feature we include within Package Manager. These are simple package builders that you can point at a Git project to monitor it for changes, whether that's a change in a tag or just any push to the repository. Whenever a change happens, we rebuild the package from the new source and publish it straight into Package Manager, making it available to whoever is using that local source repository.
It's simple, out-of-the-box functionality that makes it easy to integrate when you have users publishing updates to Git and you want to bring those in and make them available. One limitation of Git Builders is that they're best for simple, quick builds. They only handle source packages, so the binary packages I was just mentioning require yet another manual step.
However, we've just released some new features that allow better integration with your own build pipelines, giving you a far more customizable way to build packages. We've released tools to integrate remotely with wherever your build pipeline lives. There are several common options out there: GitHub has its own workflows and actions that can build packages automatically in the background and publish them, and there are common continuous integration and continuous delivery pipelines like Jenkins that can be brought in as well.
With all of these, you can plug our remote publishing API directly into your build pipeline and then build whatever you want, source or binary, and add whatever build and validation features you like: checks and validation, dependency scanning, any sort of automated testing. Do basically anything you would ever want to do to make sure you create the best packages, and then, automatically at the end of that process, publish the updated version of your packages to Package Manager.
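As a rough sketch of what that final pipeline step might assemble, here's a hypothetical helper that builds the publish command a CI job would run. The `rspm add` shape with `--source` and `--path` follows the server CLI, but treat the exact flag names, the source name, and the token mechanism as placeholders to check against your version's documentation.

```python
def rspm_publish_command(tarball: str, source: str) -> list:
    """Command a CI job would run to publish one built package tarball.

    Flag names here are illustrative and may differ by Package Manager
    version. The remote CLI authenticates with an admin-issued API token,
    typically supplied via an environment variable rather than on the
    command line.
    """
    return ["rspm", "add", f"--source={source}", f"--path={tarball}"]

# e.g. the last step of a Jenkins or GitHub Actions job, after R CMD build:
print(" ".join(rspm_publish_command("mypkg_0.1.0.tar.gz", "internal")))
```

Building the command as a list (rather than a shell string) keeps filenames with spaces or special characters safe when it's eventually passed to a process runner.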
Basically, it allows you to take advantage of the full functionality available in these build pipelines, including things like custom build triggers. As I mentioned, our Git Builders can trigger based on, say, updates to a Git repository; the build triggers in these pipelines include things like nightly builds, where every night a new version is built and published. They also offer a wide range of reporting and notifications, so you can tell people, hey, there's a new version of this package that's been built and published, and here are the changes. Everything you would want in a full build pipeline.
So, all that said, why does this matter? Why should you care? Most of you probably know that developing packages is the best way to share your reusable code with others. Many of you are probably consumers of packages, but you also develop your own things that it would be great to share with others on your team, or even outside it. Both R and Python have built great ecosystems around packaging up your code into a reusable format that you can give to others.
Package Manager helps you easily make those packages available within your organization and, as I pointed out, in a unified repository, which makes it easy for the people actually using your packages to access, browse, and find them. Package Manager also has a user interface you can view from your web browser, so you can curate a repository of these packages, complete with documentation and descriptions of why they might be useful for the people trying to find them, and really build your own CRAN or PyPI.
In summary, local sources are currently available for R packages, and we're working on Python package support that will be coming very soon, before the end of the year, which is very exciting. Our goal over the upcoming months is to provide the same level of support for Python packages that we currently do for R packages. And definitely feel free to check out packagemanager.rstudio.com, our public Package Manager instance, where we host public mirrors of CRAN, Bioconductor, and PyPI that you can easily configure with R or Python and see some of the benefits.
The public site is free and automatically includes things like our snapshots and time traveling to earlier versions. If you have any more questions, feel free to reach out to RStudio, and we're happy to give you more information and show you more of the features of Package Manager. Thank you, everybody, and thanks for the questions.
Q&A: Package Manager
I see Laura asked a great question on Slido. Laura said: keep packages on a local source, or add them to a CRAN mirror source? Is there a major headache, annoyance, or difference on the user or package maintainer side for internal packages?
Well, for internal packages, and I hope I understand the question here, I guess it's whether you keep packages on a separate local source that you curate yourself, versus merging the local source with a CRAN mirror into a unified repository. From a user perspective, it's definitely easier if you have one unified URL, because you only have to give your users one repository URL to configure in their R client or RStudio, wherever they're using R.
The one advantage of keeping it separate is that you have a repository of just your local packages, which can make discoverability a little easier: users can see that these are all of your local packages, versus a unified repository that has everything. But then they also have to configure multiple repository URLs to get both the local package source and the CRAN package source. That's usually a one-time deal, though, and it can make it easier to keep things separate.
The next question is: how does the external publishing feature work? Is the CI pipeline tool pushing it via an API, and is that feature documented already? Yes, it is. We've released basically a remote version of the command-line interface that we use on the server for administering Package Manager. It's a downloadable remote CLI, secured via API tokens that you generate: a server administrator generates an API token, gives it to whoever wants to use the publishing API, and then they can hook that up remotely and use the same commands on the remote CLI that you would use on the server itself to publish to Package Manager. That is documented in our documentation, and I'll find a link and post it into the chat here.
RStudio Workbench on Amazon SageMaker
Well, I'm excited to be here. So like Rachel said, we're going to talk briefly about running RStudio workloads in Amazon SageMaker. And just as like a really quick introduction, we'll be focused on RStudio Workbench today. If you've never used RStudio Workbench before, the easiest way to think about it is it's a version of RStudio essentially that's accessible through a browser that's running on infrastructure elsewhere. That could be infrastructure within your own organization. It could be infrastructure in the cloud, like Amazon SageMaker or something else. But it's typically something that you would access through your browser.
And one of the distinct advantages of RStudio Workbench is that it gives you access to a much larger collection of compute resources than you would have on your own local desktop or laptop installation. We'll highlight that when we talk about the integration with SageMaker today. The other piece of this is obviously Amazon SageMaker. I'm not going to go terribly in depth into the SageMaker platform as a whole; again, we'll be focused on the RStudio Workbench integration. But Amazon SageMaker is an entire collection of different tools and resources on the Amazon Web Services platform that's centered around the tasks of machine learning and data science.
So there's tools for model training and monitoring and deployment and prediction and batch jobs and all kinds of different things. And then at the core of all that, there are a couple of different development environments that you can use to orchestrate some of this work. And RStudio Workbench is one of those environments that can be made available. So with that little bit of an introduction, I want to give just a little bit of background about RStudio Workbench on SageMaker. It's been around for about a year. So it launched in November of 2021. We've seen really good adoption with it. We've seen a lot of customers happy with the engagement, happy with integration, and it's worked really well for them.
The thing that's most interesting, I think, about this integration is that Amazon worked with us, and we worked with them, to build a custom implementation of what we call the Job Launcher, a component of RStudio Workbench that allows external architectures to run sessions and workloads. Natively, the Job Launcher supports Kubernetes and Slurm, which we've seen a lot of organizations adopt. But SageMaker implemented their own unique version of this that's backed by EC2 as an environment.
The other core things that I think really highlight the strengths of RStudio Workbench on SageMaker are the fact that it's low-maintenance infrastructure. So you set up your SageMaker environment, but then you don't need to set up additional components to support the workloads that are being run. SageMaker actually takes care of all of that, which is just a nice convenience factor. And the other component here is that you have the ability to create these quote-unquote right-size sessions. So users can go in and say, hey, I need a lot of CPU and a lot of memory because I'm doing something really big. Or another user might go and say, look, I'm not doing a huge scale analysis. I don't need a bunch of resources. And so they can be more selective about the resources that they need. And that can both save cost over time, but also serve as an enabler for your data science teams because individuals can perform analyses regardless of the resource requirements of those analyses.
So here's a little bit of an idea of the infrastructure and architecture behind this. If this isn't important to you, don't worry about it; it's not critical to understand in order to use the platform. But I do find it a little interesting. Our happy users over here on the left-hand side come into RStudio Workbench. There's a central server running RStudio Workbench, and that server gets set up as part of your SageMaker domain setup process.
This is licensed through AWS License Manager. So if you wanted to run RStudio Workbench in SageMaker, you do need an RStudio Workbench license. And that license is delivered and managed through the AWS License Manager. There's a couple of different ways to do that. I'm not actually going to dive into the details here in this conversation. But just note that you do need a license and it gets delivered through AWS License Manager. And then individual user sessions, so these users come in, they say, hey, I need to run a session of RStudio. And then these individual sessions run in their own separate dedicated compute instances that are running a Docker container that supports the session execution of RStudio Workbench.
We're going to take a look at what all this means in practice, but in any case, it's helpful to have a mental model of what's happening behind the scenes, so maybe this proves useful. If it doesn't, like I said, it isn't necessary to understand in order to use the platform. Okay. So with that bit of background, we're going to jump into SageMaker now and take a look at what this looks like in practice.
This is my SageMaker control panel within AWS. I have users here, and I can see the different services and apps that are available; over here on the right-hand side, I have RStudio. For an individual user, so I'm James, I'll come in under the James user and click launch app, and I'll see that RStudio is one of the applications I can launch inside of SageMaker. If I select it, in fact, let's just do it here, it'll open up the RStudio Workbench homepage, which looks like this.
Again, if you've used RStudio Workbench previously or currently, this page likely looks familiar to you. If you've only used the open source RStudio desktop, this page is probably unfamiliar, because the desktop doesn't have an equivalent; the desktop setup is a little different, and you're not typically running concurrent sessions launched from the same place, which is a common use case in RStudio Workbench. The idea is I can come in here, and you can see I have a couple of sessions already running. We'll take a look at those in a moment, but typically this would be blank, and the first thing I would do is start a new session.
Once I pull this open, I can name my session, whatever the case is, and I can define the editor that I'm going to use. On SageMaker, the only editor available to me is the RStudio editor, and that's the familiar editor if you've used RStudio in any context: desktop, server, whatever the case is. The cluster is SageMaker; again, I don't have the option to choose anything else here because I'm working inside the SageMaker environment.
And then down below under options, here's where I have two things that I can really adjust. The first thing is the instance type. So this is the EC2 environment that I want to run the session on. So I have a very small kind of default environment that contains just a couple of virtual CPUs and four gigabytes of RAM. I think it's a pretty small instance, but for some workloads that might be totally sufficient. And then I can scale all the way up to some of these. Some of these are GPU instances. If I was trying to do GPU workloads, some of these compute optimized instances have up to 128 CPU cores and hundreds of gigabytes of memory available. And I can choose as the user, what kind of resources I need for my session before I start the session.
So I make the choice for my instance type. The other option that I have is the image that I want to run. There's currently one base image that SageMaker provides, which is this image that I have selected here. It's actually starting to get a bit old. So we're working with the SageMaker team to see if we can get an update here for the default. But there's also the opportunity if you find that this default is insufficient, maybe it doesn't have the right version of R or it doesn't have the right packages installed, or it doesn't have the right system dependencies, whatever the case is, you have the ability to define your own Docker image and make that available within SageMaker.
So you can be really flexible around what type of operating system and environment your session is running in by creating these custom images and then configuring them to be made available to users within RStudio on SageMaker. Once I've gone through this whole process, I would click start session. It actually takes a little bit of time to start a new session inside of SageMaker because of all the behind the scenes work that needs to be done. We have to provision an EC2 environment. We have to copy an image into that environment. We have to run the image as a container. And so it can take a couple of minutes to start a session. So I'm actually just going to use the sessions that I already have started just so that we can avoid waiting around.
But if I look, for example, at this RStudio session and look at the info here, I can see that this is running on, it should tell me right here, this is running on a T3 medium instance, and it gives me some additional details about the environment that this is running on, if I'm interested in it. If I click into the session, this is where we will all likely feel at home, right? This is the RStudio IDE that we have likely gained quite a bit of experience with. Again, whether that experience has come from the desktop or something that's browser-based, like what we're using here, doesn't matter. The user experience is essentially the same. I can write R code. I can write R Markdown documents. I can execute that code in the console. I can view plots and navigate my files, and I can interact with Git, and I can view connections and make connections to databases. Anything that I would expect to be able to do, I can do inside of this environment.
But I wanted to highlight, if I look over here and I look at, let's just do this, if we look at this particular session, we can see that I have two CPU cores available to me, because this is a pretty small instance that I've chosen. And then if we look at the available memory, I have about four gigabytes of memory on this particular server. And the thing that's important to note is that this is like my own little server. It's like I've created my own little tiny environment in the cloud, and it's mine and mine alone. So these resources, even though they might be small, aren't going to be constrained by other users doing the same thing.
They're just mine to consume, which is really nice. What that means is if I start a session, and then my coworker also starts a session, and they do something unexpected, and their session ends up getting stuck, they consume more memory than they have available, or they max out the CPU, or whatever they might do, then that's something that they'll have to work through. They'll have to close down that session and maybe start a new one with more resources. But it doesn't affect me. If somebody else's session goes a little bit awry, it's not going to affect mine, because they're all totally independent. There's no shared underlying infrastructure here, other than the home server, and it's not actually executing anything. It's just farming the execution out to these different environments.
Okay, so here's that example. Now, let's come back to the homepage for a moment, and let's look at this larger session. Again, notice that these are two separate sessions that I'm running as one user from RStudio Workbench inside of SageMaker. And if we remember right, we had two available CPU cores in the previous session, and we had about four gigabytes of RAM. If we look at the same statistics here, let's just run it in real time so that we know that we're getting the right results. In this session, I have 48 cores available, and I have 92 available gigabytes of memory. So quite a vast difference between these two sessions. And this isn't even one of the largest sessions that I could have created.
Right, again, I can go up to over 100 cores of available CPU power and several hundred gigabytes of memory if I need to. And again, this is just my environment. So if I decide as the data scientist or as the statistician that I need a lot of power for my analysis, maybe I'm dealing with really large data or I'm planning on doing something massively in parallel, whatever the case is, I can choose the right instance when I set my session up so that I have support for the work that I'm going to be doing.
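The resource checks being run in the demo use R, but the same inspection can be sketched in a couple of lines of Python; the `/proc/meminfo` read assumes a Linux session host, which is why it's guarded.

```python
import os

# Each session sees only its own instance's resources; nothing here is
# shared with other users' sessions.
print(f"CPU cores: {os.cpu_count()}")

# Total memory, read from /proc/meminfo (present on Linux session hosts).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kb = int(line.split()[1])
                print(f"Memory: {kb / 1024 ** 2:.1f} GiB")
                break
```

Run inside a small session this would report a couple of cores and a few GiB; inside a compute-optimized instance, the same two checks report the much larger numbers shown in the demo.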
The last thing I want to show, coming back to our other session for a moment, is that because I'm inside the AWS ecosystem, and more specifically inside SageMaker, if I wanted to do something like kick off an asynchronous training job against data I had stored in S3, I can do all of that using the SageMaker SDK. Right here, let me expand this a little. This portion of the R Markdown document loads the reticulate package, an R package for interfacing between R and Python. There's not really a well-developed R SDK or R package for working with SageMaker, but there is a really well-developed, Amazon-supported Python package. So I'm going to use reticulate to bring in that Python SageMaker package, and then I can use it to interact with my session.
And I can do things like create training jobs and create pipelines and execute training jobs and submit batch predictions. And I can orchestrate all of that from within RStudio here on the SageMaker platform. And the thing that's kind of fun is if I, in fact, if I just run this chunk of code here, it'll run through and everything, and then it will print out this role identifier down here that I'll use later on in this script to orchestrate some things.
But if you look, and I apologize for the scrolling back and forth, but if you look, I'm at the very beginning of my script here, I'm at the very beginning of my session, and I haven't actually passed in any credentials, right? I haven't said like, here's my token or here's my key or here's my password. And the reason for that is because I'm already running on Amazon infrastructure. And when SageMaker provisioned the little EC2 server for my session, it supplied that server with an IAM role that matches the role that I have as a user. So whenever I then use that particular server to interact with other services within SageMaker or within AWS, like I'm doing here, it automatically picks up those credentials and just applies the appropriate permissions.
And that's just, that's nothing more than just a convenience factor, meaning that once I get into this environment, if I wanted to query something that was stored in S3 or interact with other SageMaker services or submit queries to Athena or whatever the case is, I can do all those operations without needing to constantly supply some sort of token or identifier to authenticate myself to those other services. My identity just kind of follows me around in this particular case.
Okay, I'm going to wrap things up there and come back to the slides. I don't really have a concluding slide, but I wanted to come back to these points and highlight once again that one of the distinct advantages of this environment is that, as an organization, I don't have to maintain a lot of custom infrastructure to support flexible compute. I set up my SageMaker domain and then let SageMaker handle everything else; I don't need to worry about how to spin up instances when users need them or how to tear them down. SageMaker takes care of all that for me. And it allows my individual data scientists to operate with resources that match the workload they're doing, whether those resources are small or large; the burden falls to the individual data scientist to make the correct decision: do I need a ton of CPU and a ton of memory, or can I get away with a smaller instance?
Q&A: RStudio Workbench on SageMaker
Yeah, there's not a good single spot for those. It's kind of an interesting intersection of two things. There's the RStudio Workbench component, which goes through its regular release cycle, and we provide extensive release notes with every release of RStudio Workbench. But then SageMaker updates on their own cycle; SageMaker right now, for example, is several versions behind on Workbench, and they're working on rolling out a more recent release as we speak. So there are those two components. There is a section, and I'll find it in a moment and pass it over to Rachel so it can be surfaced in the chat: a piece of documentation that Amazon maintains that highlights what's different or unique about RStudio Workbench on SageMaker.
And so the real answer to this question is kind of a combination of two things. One is checking those differences to see kind of what Amazon's reporting in terms of like what things they've not necessarily changed, but there's certain features that they've disabled. And there's certain things that are a little bit different about how RStudio Workbench operates in SageMaker. And then the other piece is identifying what version of Workbench is being run in SageMaker and then checking the release notes that we put out that correspond to that version. And that will let you know kind of what's changed with that particular version.
I am aware that Amazon doesn't do a great job of making it really transparent in terms of what version of Workbench they're running. As far as I know, the only way to find out is to go into Workbench on SageMaker and check the version from there, which isn't ideal. And so I brought this up with them and they have plans, or at least they've shared with me that there's some plans to provide that version number in their documentation. So it's a little bit easier to do that exercise of comparing back to the release notes. So basically the short answer here is there's not a great single source for that. There's a couple of different places I would look to kind of piece that story together.
Yeah, excellent question. So SageMaker itself as a platform doesn't have any sort of auto-shutoff feature, and the same is true on the SageMaker Studio side. SageMaker Studio is a Jupyter-based platform that Amazon rolled out several years ago when the SageMaker platform first came onto the scene, and that's where a lot of SageMaker development has happened historically. And now there's this new RStudio component that you can also use. In both cases, these ephemeral EC2 instances can be created pretty easily by end users to do different tasks and analyses, but there's no auto-termination functionality right now. It's certainly something I've brought up with the Amazon team and something they're aware of. For now, it falls back on the end user to clean up after themselves by shutting down their session. So from RStudio, you close out of the session, and then from the homepage you can choose to quit the session completely. Once that happens, the EC2 instance is terminated, and you don't continue to incur cost at that point. It's just a matter of training users and helping them understand that usage pattern.
Nick asked, how is persistent data handled for these kinds of deployments?
Awesome. I'm so glad this question was asked because I meant to cover this and realized I totally didn't. So Amazon, I think, has done a really awesome job in this particular department. The whole integration, I think, is great, but this in particular shows a lot of thought. What happens is when you create a SageMaker domain, there's a persistent EFS mount that gets created as part of that domain, and that mount stores user home directories. This is all behind-the-scenes stuff. It's not anything you need to specify or create yourself. It's just part of the process of what happens.
And so as an individual user in RStudio, what this means is if I go in and clone a repository or download a package and install it, or whatever the case is, unless I'm doing something atypical, all that stuff is going to end up in my home directory. And that home directory is automatically persistent. So if I start a session and install a package and then close the session, and two weeks later I start a new session, that package is still going to be there. And work that I was doing in my home directory or a subdirectory of it is also going to be there. That all just follows me around, and I don't have to do anything else to make that happen. It's just part of the experience that Amazon provides.
Content execution in Kubernetes with RStudio Connect
Yeah. Thanks. All right. Thank you for inviting me. When Rachel asked what's new with Connect and what we could talk about for the meetup this week, I was in the fortunate position of having a lot of options, because we've released a bunch of cool things in Connect this year. We tend to release monthly. So if you want to follow along with what has landed in the product on a month-by-month basis, you can do that by looking at our release notes, which are kept on the product documentation page.
But I wanted to highlight and call special attention to this body of work on off-host content execution for Connect on Kubernetes, because it represents an investment project that we've been working on literally since I became a product manager here. So going on two years. And we hit a huge milestone this summer when we announced our public beta release at the RStudio conference. And I wanted to go through what that means for us and talk about how we're moving towards general availability for this feature set as well.
So that'll be what I cover today. I'll back up, though, a little bit to start and talk about what Connect is, if you haven't heard of Connect. It's our platform for publishing all of the things that you build in R and Python, your data products, to a server that can be used to execute your content and share it with other people inside of your organization. So you can create things like Shiny apps, interactive Python applications, APIs, documents. You can run those things on a schedule. You can share them with people. And Connect helps you do all of those things in a way that you can manage as a data scientist without having to worry about standing up the necessary infrastructure yourself once you get over the hurdle of putting Connect online itself.
But today, if you are using RStudio Connect, even though you can run multiple versions of R and multiple versions of Python for your different content, and you can have multiple package sets supporting your different content items, with those content items sandboxed apart from one another so that my Shiny application doesn't mess with Joe's Python app, all of those things, fundamentally, are running as local processes on the same host or container as the Connect server itself. And that's what we're calling attention to on this slide, because it sets up what comes next for Connect.
So what we've enabled now, with the RStudio Launcher product packaged inside of Connect itself, is a method of off-host content execution specifically in Kubernetes environments. Using this paradigm, Kubernetes is used to execute content, and jobs run external to the container that's running Connect itself. So we've added this layer of separation.
Why you might want to do that is to make use of native execution paradigms. You might want that additional layer of separation between your content items. You might want to run different underlying container images or operating systems, even, to support different content items. It could make migrating from operating system to operating system easier in the future. A lot of great benefits come from adopting this kind of execution paradigm. So it can be very powerful both from a user perspective as a data scientist and also from an admin perspective. If you or your organization has decided to adopt Kubernetes, we now support that type of environment execution in a much more native way.
So if you are of the admin perspective, here's the nitty-gritty. The Connect requirements for running off-host execution are a modern version of Connect itself. It should be running, hopefully, the latest version of Connect, which is our September release, but at the very least use our July release, which is when we announced the public beta. You need a valid license and a TLS certificate, you need to use a Postgres database, and you need some sort of file storage, either NFS or EFS, to set this up.
You'll also need, obviously, a working Kubernetes cluster, the kubectl command line tool, and Helm, which is a package manager for Kubernetes. If you want to see a wild speedrun installation that I attempted at the RStudio conference, you can watch my RStudio Conf 2022 talk, where I go through all of the documentation for a Helm chart installation on EKS from start to finish. That was all captured. I probably need to rerecord it because I had COVID while I was trying to give that talk remotely. So we'll get a cleaner version recorded and put up on YouTube pretty soon, but it shows you, in a very short amount of time, how quickly you can get up and running with Connect on Kubernetes.
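To make those requirements a bit more concrete, a Helm-based install is driven by a values file. The sketch below is purely illustrative: the key names approximate my reading of the public rstudio/rstudio-connect Helm chart and may have drifted, and the hostnames and license key are made up, so check the chart's README for the current schema before relying on any of it.

```yaml
# Illustrative values.yaml for a Connect-on-Kubernetes install.
# Key names approximate the rstudio/rstudio-connect Helm chart;
# verify against the chart's README. Hostnames/license are fake.
license:
  key: "XXXX-XXXX-XXXX-XXXX"    # a valid Connect license is required
launcher:
  enabled: true                  # turn on Kubernetes job execution
sharedStorage:
  create: true                   # NFS/EFS-backed volume for content
config:
  Database:
    Provider: "Postgres"         # Postgres is required for this setup
  Postgres:
    URL: "postgres://connect@db.internal.example/connect"
```

With a file like that in hand, the install itself is typically just adding the chart repository (`helm repo add rstudio https://helm.rstudio.com`) and running `helm install` against it with your values file.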
And then I did want to cover what public beta means. This is an open invitation to try it out and provide feedback to us. Support is fully available, just as it is for regular Connect today, but because the full feature set we want to enable isn't available yet, we're saying that use in production isn't recommended at this time. But we'd love to talk to you if you're interested, if your IT admins are interested in having this conversation about moving your Connect instance to Kubernetes. We'd love to have that architecture review and intro call with you. We're currently doing some more hands-on guided installations, both so that our audience can get a feel for installations and so that our internal teams can get more training on this new paradigm for Connect as well.
So, if you are interested in kicking the tires on the beta, now's a great time because you get even more attention from our solutions engineering and support teams as we're all working together to learn about how this will all work as we move towards general availability. When will it be generally available is a question that I've been getting a lot and we don't have a date for that yet. We'd love for you to sign up to get product information emails and that'll be the first place that you'll likely hear about it when the project goes GA.
There are a couple of things that we want to get done before going GA: content-level resource requests and limits, content-level service account authorization, an image management API, and some additional admin tooling. But once we get all of those things landed, GA will follow shortly after that.
The last thing I wanted to make sure I talk about is that this does not fundamentally change the user experience for you as a publisher. So, just because we're saying that like now we have the ability to run content on Kubernetes in a more native off-host execution paradigm does not mean that you as a data scientist, if you're watching today, need to go out and teach yourself a bunch about what Kubernetes is and how to set it up yourself. Those are things that you should rely on your IT team for. And for you, the user experience largely is unchanged.
So, if you're used to interacting with RStudio Connect, this is Connect running on Kubernetes. If you go to the documentation page and scroll down, where you used to see your available versions of R, Python, and Quarto, you'll now see a list of available execution images. These are the images that your admin group will make available for you to use alongside your Connect instance.
So, what would happen here if I publish using my typical, normal publishing pattern, either push-button or Git-backed deployment, is that Connect will use its version-matching algorithm to select an image on my behalf: matching the appropriate version of R or Python, picking one out of this list if there are several available, and associating that with my piece of content.
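As a toy illustration of what "version matching" means here, the sketch below picks an image by preferring an exact runtime-version match and falling back to the same major.minor series. This is not Connect's actual algorithm, and the image names are invented; it just shows the general idea.

```python
# Toy illustration of image selection by runtime version -- NOT
# Connect's actual matching algorithm. Prefer an exact R version
# match; otherwise fall back to an image in the same major.minor
# series; otherwise report no match.
def pick_image(wanted, images):
    """images: dict mapping image name -> R version it provides."""
    exact = [name for name, ver in images.items() if ver == wanted]
    if exact:
        return exact[0]
    major_minor = wanted.rsplit(".", 1)[0]   # e.g. "4.1.1" -> "4.1"
    close = [name for name, ver in images.items()
             if ver.rsplit(".", 1)[0] == major_minor]
    return close[0] if close else None

images = {
    "r-session:3.4.4": "3.4.4",   # hypothetical image names
    "r-session:4.1.3": "4.1.3",
}
print(pick_image("4.1.3", images))  # exact match -> r-session:4.1.3
print(pick_image("4.1.1", images))  # falls back to the 4.1.x image
```

The real matcher considers more than this (Python and Quarto versions, multiple candidate images), but the shape of the decision is similar.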
If I wanted to target a specific image, say the image for R 3.4.4 rather than this one, I could do that by using an image flag, either in the rsconnect deploy function or by specifying it in my manifest file directly when I go to write the manifest, using either the rsconnect package inside of RStudio with R or the rsconnect-python package with Python.
So, really, not a whole lot has changed here. If you start using off-host execution, you'll see some user interface changes over the next several months as we allow more content-specific features to be specified inside the UI itself, inside the dashboard, or at publish time. But other than that, things are largely unchanged today, and we'll grow and add features iteratively from here. So that's all I had to cover, just a quick overview. Again, there's a longer-form version of this talk available on the conference website if you want to see the full speedrun installation. That's available, too. And that's all I have.
Q&A: Connect and Kubernetes
But one of the questions is shown on the screen here. Can the Kubernetes implementation use Fargate or EKS? I don't know about Fargate, but definitely EKS, yes. EKS is what I did the speedrun installation on.
Yeah. I mean, I think that is a question that is up to you and your organization. Do you have the resources to run a Connect server yourself? This is a self-managed product, so you need someone at the IT admin level with Linux administration capabilities to run a server product on behalf of a data science team. If you do have that available, Connect offers a lot more content types, a lot more power with scheduling, and a lot more visibility into how you're using the product and who is accessing your content items. You have a server API that you control, from publishing all the way to admin functionality and auditing. So you get a lot more with Connect, but shinyapps.io is amazingly powerful if you don't have that available to you.
You still get some basic auth. You have to be okay with how you're managing your data; if you have data privacy issues, shinyapps.io might just be off the table entirely for you to start with. But if you don't have those issues, if you don't have an IT team, or if you have a smaller use case for a shorter amount of time, shinyapps.io is incredibly powerful and I recommend it to anybody. You get a couple of apps for free on shinyapps.io, so it's a great option for undergraduate or graduate students or nonprofits who are just getting started with data products on data science teams. I think it's amazing.
One is, what is the recommended architecture to run Connect with Kubernetes? Connect on a VM with content execution in Kubernetes, versus running everything in Kubernetes?
We recommend, strongly recommend, running everything in Kubernetes, including the Connect container image itself. The pattern that we had originally launched for Workbench, where you have Workbench outside of Kubernetes and you're launching sessions into Kubernetes, does not hold for Connect. We have implemented Launcher a little bit differently for Connect, and we prefer that you use the Helm charts to run the installation process from start to finish. That sets up everything inside of Kubernetes for you. The Helm charts are amazing and really easy to use, so if you have the ability to keep everything inside of Kubernetes, I strongly recommend that you do that.
Another question over on Slido, and this is the first time I've actually seen Posit in a question. Is Posit working on a good way for RStudio Connect admins to easily add, configure, or manage multiple versions, such as multiple Python and R versions?
I'm not sure what angle to answer this question from. What is the pain point that you're hoping we address with multiple versions of R and Python? Right now we have this story where you're just layering in more and more versions of R and Python. I think we could do a better job of helping you identify when to remove older versions of R and Python if you feel that's necessary to move your content forward. But there are some server API tools that we make available. You can run reports to determine which runtimes are being used on your server, to get a sense for who has what content, how it's being used, and whether it's time to pull a version of R or Python off the server, if that's important. Again, to some folks it's not; they keep versions of R and Python on their servers for a long time. But I'm interested in hearing a follow-up from this person: was that what they expected me to say, or did I misinterpret the question entirely?
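To make that server API report idea a little more concrete, here's a hedged sketch: Connect's Server API exposes a content listing whose items include the R and Python versions content runs on, so you can tally which runtimes are still in use. The HTTP call is shown but commented out, with a canned sample response standing in, so the tallying itself runs anywhere; the server URL is made up.

```python
# Sketch: tally which R runtimes content on a Connect server still
# uses, via the Server API's /v1/content endpoint. The request is
# commented out and replaced with a canned sample, so the tallying
# logic below is runnable as-is; the server URL is hypothetical.
from collections import Counter

# import requests
# items = requests.get(
#     "https://connect.example.com/__api__/v1/content",
#     headers={"Authorization": "Key " + api_key},
# ).json()

items = [  # sample of the relevant fields the endpoint returns
    {"name": "sales-app",   "r_version": "4.1.3", "py_version": None},
    {"name": "etl-report",  "r_version": "3.6.3", "py_version": None},
    {"name": "scoring-api", "r_version": None,    "py_version": "3.9.5"},
]

# Count only items that actually declare an R runtime.
r_use = Counter(i["r_version"] for i in items if i["r_version"])
print(r_use.most_common())  # which R versions are still in use
```

The same pattern works for `py_version` if the question is about retiring old Python installs instead.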
Yeah, I'd absolutely love to hear from the audience what you'd like to see Connect be able to host in the future. I think what we'll be focused on in the short term is some feature parity work between R and Python. You've seen us make a lot of investments into the Python space and the Python community to help make Python users feel like we care about them too. We're not just RStudio with the R stuff; we want to support Python users also. So we're really interested in taking some of the core functionality that's available for R users and making it more accessible to Python folks. We'll be doing some content work around that, and then integrations are going to be a big focus for us going forward. And I'll just leave it at that nebulous word.
Thank you for answering the poll. I see that 11 people answered the survey there. If you didn't, and are curious what the heck I'm talking about, on the meetup questions Slido we also put a poll where each of the product managers here, Joe, James, and Kelly, put a question to the audience too. So I really appreciate your feedback there as well.
But thank you all so much for sharing these updates with us and sharing what's new. And thank you to everyone listening in for all the great questions as well. I hope this was helpful. I'm curious to hear your feedback, and whether you'd like to see more of these as well. I really enjoyed it. Thank you so much, Joe, James, and Kelly.
Yes, thanks for having us. Thank you, everyone. Thanks, Rachel. Thanks, everybody. I'm just going to do one more plug to say, if you are curious about joining future events, you can also use the short link here. We have meetups every Tuesday at noon and a data science hangout every Thursday at noon, and we love getting to meet all of you at these events. So, we'd love to see you at future ones as well. Have a great rest of the day, everybody. Bye.
