RStudio Pro Product Lightning Series Meetup ⚡️
Transcript
This transcript was generated automatically and may contain errors.
Hi friends, happy Tuesday. Welcome back to the RStudio Enterprise Community Meetup. I hope everyone's having a great start to the week. If you've just joined, feel free to say hi through the chat window, and maybe share where you're calling in from. Today we're doing something a little different: we have three of our product managers with us for a lightning series, to hear what's new, ask questions, and provide feedback.
If you like this format, please let us know. I'm hoping the recording will be a helpful resource for others as well. So we'll have three lightning talks for you today. One, sharing internal packages with RStudio Package Manager. The second, running RStudio workloads in the cloud with Amazon SageMaker. And third, content execution in Kubernetes with RStudio Connect.
A special welcome if this is your first time joining us. This is a friendly and open meetup environment for teams to share use cases, teach lessons learned, and just meet each other and ask questions. The meetups happen every Tuesday at noon Eastern time. I'll share a link in the chat where you can find out about upcoming meetups too; it's just rstd.io/community-events. Together, we're all dedicated to making this an inclusive and open environment for everybody, no matter your experience or industry background.
So during the event, you are able to also ask anonymous questions if you wish through a short link, which I'll share here on the screen in a second as well. But we'll try to answer as many questions as possible and want to hear lots of questions from you all. Each of our product managers joining us here today provided a question as well, so that we can learn more from you too. And so that will be at the same Slido link. There's an anonymous poll there as well. But with that said, I would love to bring my colleagues Joe, James and Kelly on stage with me here as well to introduce themselves.
James, would you want to get started in introducing yourself?
My name is James Blair. I work at RStudio as a product manager for cloud integration, and I'm happy to be with you all today.
Awesome. Kelly, you want to go next?
Sure. I'm Kelly. I'm the product manager for RStudio Connect.
Yeah, I'm Joe Robertson. I am the product manager for the RStudio Package Manager.
RStudio Package Manager: sharing internal packages
All right. Well, hi, everyone. I wanted to do a quick lightning talk about sharing internal packages with RStudio Package Manager. I apologize in advance if I run into any technical difficulties.
So for those who aren't aware, Package Manager is one of our tools for centralizing the governance of your data science packages. It handles everything from mirroring public repositories internally: whether you're using CRAN, Bioconductor, or the public PyPI, you can bring that inside your network and host it from a server locally.
You can also curate custom subsets of those public repositories to make a new, curated subset repository. One of our primary features is being able to time travel to older versions of packages. We keep daily snapshots of all of these major repositories, so if you need to, for reproducibility or any other reason, you can easily roll back to a specific date of CRAN, for example, and reproduce your project and code against the versions of the packages that were in place when you developed them.
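To make the time-travel idea concrete, here's a minimal sketch of how a dated snapshot URL is composed. It assumes the public server's `/cran/<YYYY-MM-DD>` path scheme; a private server would follow the same pattern under its own address.

```python
from datetime import date

# Public Package Manager base URL; a private server follows the same
# /cran/<YYYY-MM-DD> snapshot scheme under its own address.
BASE_URL = "https://packagemanager.rstudio.com"

def cran_snapshot_url(snapshot: date) -> str:
    """Build a repository URL for CRAN frozen at a specific date."""
    return f"{BASE_URL}/cran/{snapshot.isoformat()}"

# In R you would then point at the snapshot, e.g. in .Rprofile:
#   options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/2021-06-15"))
print(cran_snapshot_url(date(2021, 6, 15)))
```

With a URL like that configured as your CRAN repository, `install.packages()` resolves every package against the state of CRAN on that date.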
Most importantly for today, though, we're going to talk about how you can share internally developed packages using Package Manager. The concept revolves around what we call local package sources: you build your packages however you normally would, using R directly or any of the usual tools, and then add them to your local source in Package Manager.
They're then stored there and visible as a separate repository, or the local source can be joined to a CRAN mirror source to make one unified repository that contains both your public and private packages. That makes things easy, because all of your users can access both CRAN packages and your local packages from a single repository URL in R, and similarly for Python with PyPI.
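As a sketch of that single-URL idea, here's what each client ends up pointing at for a unified repository. The base address and the repository name `internal-all` are hypothetical, and the path layout mirrors the public server's; your server's configuration may differ.

```python
def client_urls(base: str, repo: str) -> dict:
    """One URL per client for a unified repository.

    Mirrors the public server's layout: CRAN-style packages are served
    under /<repo>/latest and a PEP 503 index under /<repo>/latest/simple.
    """
    return {
        # R: options(repos = c(CRAN = url))
        "r_repos": f"{base}/{repo}/latest",
        # pip: pip config set global.index-url <url>
        "pip_index": f"{base}/{repo}/latest/simple",
    }

urls = client_urls("https://pkg.example.com", "internal-all")
print(urls["r_repos"])    # https://pkg.example.com/internal-all/latest
print(urls["pip_index"])  # https://pkg.example.com/internal-all/latest/simple
```

The point is that users configure exactly one address per client, and both public and internal packages resolve through it.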
Local sources also have more advantages than some alternatives, including support we just recently added for custom pre-built binary packages. For those who are new to the world of binary packages: most packages you receive from CRAN, especially on Linux, are source packages, which means that to install one, your local R installation has to download the source and build the package at install time.
CRAN provides pre-built binary packages for Windows and macOS, which you can just download and install without worrying about building. Through Package Manager, we extend that support to Linux as well: we build all of the CRAN packages ourselves for about eight different Linux distributions, so your users can install packages easily no matter what distribution or platform they're on. We also support packages built for multiple R versions, so if some of your users are on older versions of R, you can support all of them simultaneously, across multiple platforms.
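As one concrete illustration of the Linux binary support: the public server exposes distro-specific binary repositories through a `__linux__/<distro>` path segment. A small helper like this composes that URL; the distro codename `focal` (Ubuntu 20.04) is just an example.

```python
def linux_binary_repo(distro: str, snapshot: str = "latest") -> str:
    """CRAN repo URL serving pre-built Linux binaries for one distribution.

    The __linux__/<distro> path segment is how the public Package Manager
    exposes distro-specific binaries; snapshot can also be a dated
    YYYY-MM-DD snapshot instead of 'latest'.
    """
    return f"https://packagemanager.rstudio.com/cran/__linux__/{distro}/{snapshot}"

# Ubuntu 20.04 ("focal") users would set this as their CRAN repos URL:
print(linux_binary_repo("focal"))
```

Because the binaries are pre-built per distribution, `install.packages()` against such a URL skips the compile step entirely.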
One way we make this easier than the manual method of building your own packages and uploading them to Package Manager is through Git Builders, a feature we include within Package Manager. These are simple package builders that you can point at a Git project to monitor it for changes, whether that's a change in a tag or just any push to the repository. Whenever a change happens, we rebuild the package from the new source and publish it straight into Package Manager, making it available to whoever is using that local source repository.
It's simple, out-of-the-box functionality that makes it easy to integrate when you have users publishing updates to Git and you want to bring those in and make them available. One limitation of Git Builders is that they're best for simple, quick builds. They only handle source packages, so the binary packages I was just mentioning require yet another manual step.
However, we've just released some new features that allow better integration with your own build pipelines, giving you a far more customizable way to build packages. We've released tools to integrate remotely with wherever your build pipeline lives. There are several common options out there: GitHub has its own workflows and actions that can build packages automatically in the background and publish them, and there are common continuous integration and continuous delivery pipelines like Jenkins that can be brought in as well.
With all of these, you can plug our remote publishing API directly into your build pipeline and then build whatever you want, source or binary, and add whatever build and validation features you like: checks and validation, dependency scanning, any sort of automated testing. Do basically anything you would ever want to do to make sure you create the best packages, and then, automatically at the end of that process, publish the updated version of your packages to Package Manager.
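As a rough sketch of what that final pipeline step might assemble, here's a hypothetical helper that builds the publish command a CI job would run. The `rspm add` shape with `--source` and `--path` follows the server CLI, but treat the exact flag names, the source name, and the token mechanism as placeholders to check against your version's documentation.

```python
def rspm_publish_command(tarball: str, source: str) -> list:
    """Command a CI job would run to publish one built package tarball.

    Flag names here are illustrative and may differ by Package Manager
    version. The remote CLI authenticates with an admin-issued API token,
    typically supplied via an environment variable rather than on the
    command line.
    """
    return ["rspm", "add", f"--source={source}", f"--path={tarball}"]

# e.g. the last step of a Jenkins or GitHub Actions job, after R CMD build:
print(" ".join(rspm_publish_command("mypkg_0.1.0.tar.gz", "internal")))
```

Building the command as a list (rather than a shell string) keeps filenames with spaces or special characters safe when it's eventually passed to a process runner.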
Basically, it allows you to take advantage of the full functionality available in these build pipelines, including things like custom build triggers. As I mentioned, our Git Builders can trigger based on, say, updates to a Git repository; the build triggers in these pipelines include things like nightly builds, where every night a new version is built and published. They also offer a wide range of reporting and notifications, so you can tell people, hey, there's a new version of this package that's been built and published, and here are the changes. Everything you would want in a full build pipeline.
So, all that said, why does this matter? Why should you care? Most of you probably know that developing packages is the best way to share your reusable code with others. Many of you are probably consumers of packages, but you also develop your own things that it would be great to share with others on your team, or even outside it. Both R and Python have built great ecosystems around packaging up your code into a reusable format that you can give to others.
Package Manager helps you easily make those packages available within your organization and, as I pointed out, in a unified repository, which makes it easy for the people actually using your packages to access, browse, and find them. Package Manager also has a user interface you can view from your web browser, so you can curate a repository of these packages, complete with documentation and descriptions of why they might be useful for the people trying to find them, and really build your own CRAN or PyPI.
In summary, local sources are currently available for R packages, and we're working on Python package support that will be coming very soon, before the end of the year, which is very exciting. Our goal over the upcoming months is to provide the same level of support for Python packages that we currently do for R packages. And definitely feel free to check out packagemanager.rstudio.com, our public Package Manager instance, where we host public mirrors of CRAN, Bioconductor, and PyPI that you can easily configure with R or Python and see some of the benefits.
The public site is free and automatically includes things like our snapshots and time traveling to earlier versions. If you have any more questions, feel free to reach out to RStudio, and we're happy to give you more information and show you more of the features of Package Manager. Thank you, everybody, and thanks for the questions.
Q&A: Package Manager
I see Laura asked a great question on Slido. Laura said: keep packages on a local source, or add them to a CRAN mirror source? Is there a major headache, annoyance, or difference on the user or package maintainer side for internal packages?
Well, for internal packages, and I hope I understand the question here, I guess it's whether you keep packages on a separate local source that you curate yourself, versus merging the local source with a CRAN mirror into a unified repository. From a user perspective, it's definitely easier if you have one unified URL, because you only have to give your users one repository URL to configure in their R client or RStudio, wherever they're using R.
The one advantage of keeping it separate is that you have a repository of just your local packages, which can make discoverability a little easier: users can see that these are all of your local packages, versus a unified repository that has everything. But then they also have to configure multiple repository URLs to get both the local package source and the CRAN package source. That's usually a one-time deal, though, and it can make it easier to keep things separate.
The next question is: how does the external publishing feature work? Is the CI pipeline tool pushing it via an API, and is that feature documented already? Yes, it is. We've released basically a remote version of the command-line interface that we use on the server for administering Package Manager. It's a downloadable remote CLI, secured via API tokens that you generate: a server administrator generates an API token, gives it to whoever wants to use the publishing API, and then they can hook that up remotely and use the same commands on the remote CLI that you would use on the server itself to publish to Package Manager. That is documented in our documentation, and I'll find a link and post it into the chat here.
RStudio Workbench on Amazon SageMaker
Well, I'm excited to be here. So like Rachel said, we're going to talk briefly about running RStudio workloads in Amazon SageMaker. And just as like a really quick introduction, we'll be focused on RStudio Workbench today. If you've never used RStudio Workbench before, the easiest way to think about it is it's a version of RStudio essentially that's accessible through a browser that's running on infrastructure elsewhere. That could be infrastructure within your own organization. It could be infrastructure in the cloud, like Amazon SageMaker or something else. But it's typically something that you would access through your browser.
And one of the distinct advantages of RStudio Workbench is that it gives you access to a much larger collection of compute resources than you would have on your own local desktop or laptop installation. We'll highlight that when we talk about the integration with SageMaker today. The other piece of this is obviously Amazon SageMaker. I'm not going to go terribly in depth into the SageMaker platform as a whole; again, we'll be focused on the RStudio Workbench integration. But Amazon SageMaker is an entire collection of different tools and resources on the Amazon Web Services platform that's centered around the tasks of machine learning and data science.
So there's tools for model training and monitoring and deployment and prediction and batch jobs and all kinds of different things. And then at the core of all that, there are a couple of different development environments that you can use to orchestrate some of this work. And RStudio Workbench is one of those environments that can be made available. So with that little bit of an introduction, I want to give just a little bit of background about RStudio Workbench on SageMaker. It's been around for about a year. So it launched in November of 2021. We've seen really good adoption with it. We've seen a lot of customers happy with the engagement, happy with integration, and it's worked really well for them.
The thing that's most interesting, I think, about this integration is that Amazon worked with us, and we worked with them, to build a custom implementation of what we call the Job Launcher, a component of RStudio Workbench that allows external architectures to run sessions and workloads. Natively, the Job Launcher supports Kubernetes and Slurm, which we've seen a lot of organizations adopt. But SageMaker implemented their own unique version of this that's backed by EC2 as an environment.
The other core things that I think really highlight the strengths of RStudio Workbench on SageMaker are the fact that it's low-maintenance infrastructure. So you set up your SageMaker environment, but then you don't need to set up additional components to support the workloads that are being run. SageMaker actually takes care of all of that, which is just a nice convenience factor. And the other component here is that you have the ability to create these quote-unquote right-size sessions. So users can go in and say, hey, I need a lot of CPU and a lot of memory because I'm doing something really big. Or another user might go and say, look, I'm not doing a huge scale analysis. I don't need a bunch of resources. And so they can be more selective about the resources that they need. And that can both save cost over time, but also serve as an enabler for your data science teams because individuals can perform analyses regardless of the resource requirements of those analyses.
So here's a little bit of an idea of the infrastructure and architecture behind this. If this isn't important to you, don't worry about it; it's not critical to understand in order to use the platform. But I do find it a little interesting. Our happy users over here on the left-hand side come into RStudio Workbench. There's a central server running RStudio Workbench, and that server gets set up as part of your SageMaker domain setup process.
This is licensed through AWS License Manager. So if you wanted to run RStudio Workbench in SageMaker, you do need an RStudio Workbench license. And that license is delivered and managed through the AWS License Manager. There's a couple of different ways to do that. I'm not actually going to dive into the details here in this conversation. But just note that you do need a license and it gets delivered through AWS License Manager. And then individual user sessions, so these users come in, they say, hey, I need to run a session of RStudio. And then these individual sessions run in their own separate dedicated compute instances that are running a Docker container that supports the session execution of RStudio Workbench.
We're going to take a look at what all this means in practice, but in any case, it's helpful to have a mental model of what's happening behind the scenes, so maybe this proves useful. If it doesn't, like I said, it isn't necessary to understand in order to use the platform. Okay. So with that bit of background, we're going to jump into SageMaker now and take a look at what this looks like in practice.
This is my SageMaker control panel within AWS. I have users here, and I can see the different services and apps that are available; over here on the right-hand side, I have RStudio. For an individual user, so I'm James, I'll come in under the James user and click launch app, and I'll see that RStudio is one of the applications I can launch inside of SageMaker. If I select it, in fact, let's just do it here, it'll open up the RStudio Workbench homepage, which looks like this.
Again, if you've used RStudio Workbench previously or currently, this page likely looks familiar to you. If you've only used the open source RStudio desktop, this page is probably unfamiliar, because the desktop doesn't have an equivalent; the desktop setup is a little different, and you're not typically running concurrent sessions launched from the same place, which is a common use case in RStudio Workbench. The idea is I can come in here, and you can see I have a couple of sessions already running. We'll take a look at those in a moment, but typically this would be blank, and the first thing I would do is start a new session.
Once I pull this open, I can name my session, whatever the case is, and I can define the editor that I'm going to use. On SageMaker, the only editor available to me is the RStudio editor, and that's the familiar editor if you've used RStudio in any context: desktop, server, whatever the case is. The cluster is SageMaker; again, I don't have the option to choose anything else here because I'm working inside the SageMaker environment.
And then down below under options, here's where I have two things that I can really adjust. The first thing is the instance type. So this is the EC2 environment that I want to run the session on. So I have a very small kind of default environment that contains just a couple of virtual CPUs and four gigabytes of RAM. I think it's a pretty small instance, but for some workloads that might be totally sufficient. And then I can scale all the way up to some of these. Some of these are GPU instances. If I was trying to do GPU workloads, some of these compute optimized instances have up to 128 CPU cores and hundreds of gigabytes of memory available. And I can choose as the user, what kind of resources I need for my session before I start the session.
So I make the choice for my instance type. The other option that I have is the image that I want to run. There's currently one base image that SageMaker provides, which is this image that I have selected here. It's actually starting to get a bit old. So we're working with the SageMaker team to see if we can get an update here for the default. But there's also the opportunity if you find that this default is insufficient, maybe it doesn't have the right version of R or it doesn't have the right packages installed, or it doesn't have the right system dependencies, whatever the case is, you have the ability to define your own Docker image and make that available within SageMaker.
So you can be really flexible around what type of operating system and environment your session is running in by creating these custom images and then configuring them to be made available to users within RStudio on SageMaker. Once I've gone through this whole process, I would click start session. It actually takes a little bit of time to start a new session inside of SageMaker because of all the behind the scenes work that needs to be done. We have to provision an EC2 environment. We have to copy an image into that environment. We have to run the image as a container. And so it can take a couple of minutes to start a session. So I'm actually just going to use the sessions that I already have started just so that we can avoid waiting around.
But if I look, for example, at this RStudio session and look at the info here, I can see that this is running on, it should tell me right here, this is running on a T3 medium instance, and it gives me some additional details about the environment that this is running on, if I'm interested in it. If I click into the session, this is where we will all likely feel at home, right? This is the RStudio IDE that we have likely gained quite a bit of experience with. Again, whether that experience has come from the desktop or something that's browser-based, like what we're using here, doesn't matter. The user experience is essentially the same. I can write R code. I can write R Markdown documents. I can execute that code in the console. I can view plots and navigate my files, and I can interact with Git, and I can view connections and make connections to databases. Anything that I would expect to be able to do, I can do inside of this environment.
But I wanted to highlight, if I look over here and I look at, let's just do this, if we look at this particular session, we can see that I have two CPU cores available to me, because this is a pretty small instance that I've chosen. And then if we look at the available memory, I have about four gigabytes of memory on this particular server. And the thing that's important to note is that this is like my own little server. It's like I've created my own little tiny environment in the cloud, and it's mine and mine alone. So these resources, even though they might be small, aren't going to be constrained by other users doing the same thing.
They're just mine to consume, which is really nice. What that means is if I start a session, and then my coworker also starts a session, and they do something unexpected, and their session ends up getting stuck, they consume more memory than they have available, or they max out the CPU, or whatever they might do, then that's something that they'll have to work through. They'll have to close down that session and maybe start a new one with more resources. But it doesn't affect me. If somebody else's session goes a little bit awry, it's not going to affect mine, because they're all totally independent. There's no shared underlying infrastructure here, other than the home server, and it's not actually executing anything. It's just farming the execution out to these different environments.
Okay, so here's that example. Now, let's come back to the homepage for a moment, and let's look at this larger session. Again, notice that these are two separate sessions that I'm running as one user from RStudio Workbench inside of SageMaker. And if we remember right, we had two available CPU cores in the previous session, and we had about four gigabytes of RAM. If we look at the same statistics here, let's just run it in real time so that we know that we're getting the right results. In this session, I have 48 cores available, and I have 92 available gigabytes of memory. So quite a vast difference between these two sessions. And this isn't even one of the largest sessions that I could have created.
Right, again, I can go up to over 100 cores of available CPU power and several hundred gigabytes of memory if I need to. And again, this is just my environment. So if I decide as the data scientist or as the statistician that I need a lot of power for my analysis, maybe I'm dealing with really large data or I'm planning on doing something massively in parallel, whatever the case is, I can choose the right instance when I set my session up so that I have support for the work that I'm going to be doing.
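The resource checks being run in the demo use R, but the same inspection can be sketched in a couple of lines of Python; the `/proc/meminfo` read assumes a Linux session host, which is why it's guarded.

```python
import os

# Each session sees only its own instance's resources; nothing here is
# shared with other users' sessions.
print(f"CPU cores: {os.cpu_count()}")

# Total memory, read from /proc/meminfo (present on Linux session hosts).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kb = int(line.split()[1])
                print(f"Memory: {kb / 1024 ** 2:.1f} GiB")
                break
```

Run inside a small session this would report a couple of cores and a few GiB; inside a compute-optimized instance, the same two checks report the much larger numbers shown in the demo.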
The last thing I want to show, coming back to our other session for a moment, is that because I'm inside the AWS ecosystem, and more specifically inside SageMaker, if I wanted to do something like kick off an asynchronous training job against data I had stored in S3, I can do all of that using the SageMaker SDK. Right here, let me expand this a little. This portion of the R Markdown document loads the reticulate package, an R package for interfacing between R and Python. There's not really a well-developed R SDK or R package for working with SageMaker, but there is a really well-developed, Amazon-supported Python package. So I'm going to use reticulate to bring in that Python SageMaker package, and then I can use it to interact with my session.
And I can do things like create training jobs and create pipelines and execute training jobs and submit batch predictions. And I can orchestrate all of that from within RStudio here on the SageMaker platform. And the thing that's kind of fun is if I, in fact, if I just run this chunk of code here, it'll run through and everything, and then it will print out this role identifier down here that I'll use later on in this script to orchestrate some things.
But if you look, and I apologize for the scrolling back and forth, but if you look, I'm at the very beginning of my script here, I'm at the very beginning of my session, and I haven't actually passed in any credentials, right? I haven't said like, here's my token or here's my key or here's my password. And the reason for that is because I'm already running on Amazon infrastructure. And when SageMaker provisioned the little EC2 server for my session, it supplied that server with an IAM role that matches the role that I have as a user. So whenever I then use that particular server to interact with other services within SageMaker or within AWS, like I'm doing here, it automatically picks up those credentials and just applies the appropriate permissions.
And that's just, that's nothing more than just a convenience factor, meaning that once I get into this environment, if I wanted to query something that was stored in S3 or interact with other SageMaker services or submit queries to Athena or whatever the case is, I can do all those operations without needing to constantly supply some sort of token or identifier to authenticate myself to those other services. My identity just kind of follows me around in this particular case.
Okay, I'm going to wrap things up there and come back to the slides. I don't really have a concluding slide, but I wanted to come back to these points and highlight once again that one of the distinct advantages of this environment is that, as an organization, I don't have to maintain a lot of custom infrastructure to support flexible compute. I set up my SageMaker domain and then let SageMaker handle everything else; I don't need to worry about how to spin up instances when users need them or how to tear them down. SageMaker takes care of all that for me. And it allows my individual data scientists to operate with resources that match the workload they're doing, whether those resources are small or large; the burden falls to the individual data scientist to make the correct decision: do I need a ton of CPU and a ton of memory, or can I get away with a smaller instance?
Q&A: RStudio Workbench on SageMaker
Yeah, there's not a good single spot for those. It's kind of an interesting intersection of two things. There's the RStudio Workbench component, which goes through its regular release cycle, and we provide extensive release notes with every release of RStudio Workbench. But then SageMaker updates on their own cycle; SageMaker right now, for example, is several versions behind on Workbench, and they're working on rolling out a more recent release as we speak. So there are those two components. There is a section, and I'll find it in a moment and pass it over to Rachel so it can be surfaced in the chat: a piece of documentation that Amazon maintains that highlights what's different or unique about RStudio Workbench on SageMaker.
And so the real answer to this question is kind of a combination of two things. One is checking those differences to see kind of what Amazon's reporting in terms of like what things they've not necessarily changed, but there's certain features that they've disabled. And there's certain things that are a little bit different about how RStudio Workbench operates in SageMaker. And then the other piece is identifying what version of Workbench is being run in SageMaker and then checking the release notes that we put out that correspond to that version. And that will let you know kind of what's changed with that particular version.
I am aware that Amazon doesn't do a great job of making it really transparent in terms of what version of Workbench they're running. As far as I know, the only way to find out is to go into Workbench on SageMaker and check the version from there, which isn't ideal. And so I brought this up with them and they have plans, or at least they've shared with me that there's some plans to provide that version number in their documentation. So it's a little bit easier to do that exercise of comparing back to the release notes. So basically the short answer here is there's not a great single source for that. There's a couple of different places I would look to kind of piece that story together.
Yeah, excellent question. So SageMaker itself as a platform doesn't have any sort of auto-shutoff feature, and the same is true on the SageMaker Studio side. SageMaker Studio is a Jupyter-based platform that Amazon rolled out several years ago when the SageMaker platform first came onto the scene, and that's where a lot of SageMaker development has happened historically. And now there's this new RStudio component that you can also use. In both cases, these ephemeral EC2 instances can be created pretty easily by end users to do different tasks and analyses, but there's no auto-termination functionality right now. It's certainly something I've brought up with the Amazon team and something they're aware of. For now, it falls back on the end user to clean up after themselves by shutting down their session. So from RStudio, you close out of the session, and then from the homepage you can choose to quit the session completely. Once that happens, the EC2 instance is terminated, and you don't continue to incur cost at that point. It's just a matter of training users and helping them understand that usage pattern.
Nick asked, how is persistent data handled for these kinds of deployments?
Awesome. I'm so glad this question was asked because I meant to cover this and realized I totally didn't. So Amazon, I think, has done a really awesome job in this particular department. The whole integration, I think, is great, but this in particular shows a lot of thought. What happens is when you create a SageMaker domain, there's a persistent EFS mount that gets created as part of that domain, and that mount stores user home directories. This is all behind-the-scenes stuff. It's not anything you need to specify or create yourself. It's just part of the process of what happens.
And so as an individual user in RStudio, what this means is if I go in and clone a repository or download a package and install it, or whatever the case is, unless I'm doing something atypical, all that stuff is going to end up in my home directory. And that home directory is automatically persistent. So if I start a session and install a package and then close the session, and two weeks later I start a new session, that package is still going to be there. And work that I was doing in my home directory or a subdirectory of it is also going to be there. That all just follows me around, and I don't have to do anything else to make that happen. It's just part of the experience that Amazon provides.
Content execution in Kubernetes with RStudio Connect
Yeah. Thanks. All right. Thank you for inviting me. When Rachel asked what's new with Connect and what we could talk about for the meetup this week, I was in the fortunate position of having a lot of options, because we've released a bunch of cool things in Connect this year. We tend to release monthly. So if you want to follow along with what has landed in the product on a month-by-month basis, you can do that by looking at our release notes, which are kept on the product documentation page.
But I wanted to highlight and call special attention to this body of work on off-host content execution for Connect on Kubernetes, because it represents an investment project that we've been working on literally since I became a product manager here. So going on two years. And we hit a huge milestone this summer when we announced our public beta release at the RStudio conference. And I wanted to go through what that means for us and talk about how we're moving towards general availability for this feature set as well.
So that'll be what I cover today. I'll back up, though, a little bit to start and talk about what Connect is, if you haven't heard of Connect. It's our platform for publishing all of the things that you build in R and Python, your data products, to a server that can be used to execute your content and share it with other people inside of your organization. So you can create things like Shiny apps, interactive Python applications, APIs, documents. You can run those things on a schedule. You can share them with people. And Connect helps you do all of those things in a way that you can manage as a data scientist without having to worry about standing up the necessary infrastructure yourself once you get over the hurdle of putting Connect online itself.
But today, if you are using RStudio Connect, even though you can run multiple versions of R and multiple versions of Python for your different content, and you can have multiple package sets supporting your different content items, with those content items sandboxed apart from one another so that my Shiny application doesn't mess with Joe's Python app, all of those things, fundamentally, are running as local processes on the same host or container as the Connect server itself. And that's what we're calling attention to on this slide, because it sets up what comes next for Connect.
So what we've enabled now, with the RStudio Launcher product packaged inside of Connect itself, is a method of off-host content execution specifically in Kubernetes environments. Using this paradigm, Kubernetes is used to execute content, and jobs run external to the container that's running Connect itself. So we've added this layer of separation.
Why you might want to do that is to make use of native execution paradigms. You might want that additional layer of separation between your content items. You might want to run different underlying container images or operating systems, even, to support different content items. It could make migrating from operating system to operating system easier in the future. A lot of great benefits come from adopting this kind of execution paradigm. So it can be very powerful both from a user perspective as a data scientist and also from an admin perspective. If you or your organization has decided to adopt Kubernetes, we now support that type of environment execution in a much more native way.
So if you are of the admin perspective, here's the nitty-gritty. The Connect requirements for running off-host execution are a modern version of Connect itself. It should be running, hopefully, the latest version of Connect, which is our September release, but at the very least use our July release, which is when we announced the public beta. You need a valid license and a TLS certificate, you need to use a Postgres database, and you need some sort of file storage, either NFS or EFS, to set this up.
You'll also need, obviously, a working Kubernetes cluster, the kubectl command line tool, and Helm, which is a package manager for Kubernetes. If you want to see a wild speedrun installation that I attempted at the RStudio conference, you can watch my RStudio Conf 2022 talk, where I go through all of the documentation for a Helm chart installation on EKS from start to finish. That was all captured. I probably need to rerecord it because I had COVID while I was trying to give that talk remotely. So we'll get a cleaner version recorded and put up on YouTube pretty soon, but it shows you, in a very short amount of time, how quickly you can get up and running with Connect on Kubernetes.
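To make those requirements a bit more concrete, a Helm-based install is driven by a values file. The sketch below is purely illustrative: the key names approximate my reading of the public rstudio/rstudio-connect Helm chart and may have drifted, and the hostnames and license key are made up, so check the chart's README for the current schema before relying on any of it.

```yaml
# Illustrative values.yaml for a Connect-on-Kubernetes install.
# Key names approximate the rstudio/rstudio-connect Helm chart;
# verify against the chart's README. Hostnames/license are fake.
license:
  key: "XXXX-XXXX-XXXX-XXXX"    # a valid Connect license is required
launcher:
  enabled: true                  # turn on Kubernetes job execution
sharedStorage:
  create: true                   # NFS/EFS-backed volume for content
config:
  Database:
    Provider: "Postgres"         # Postgres is required for this setup
  Postgres:
    URL: "postgres://connect@db.internal.example/connect"
```

With a file like that in hand, the install itself is typically just adding the chart repository (`helm repo add rstudio https://helm.rstudio.com`) and running `helm install` against it with your values file.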
And then I did want to cover what public beta means. This is an open invitation to try it out and provide feedback to us. Support is fully available, just as it is for regular Connect today, but because the full feature set we want to enable isn't available yet, we're saying that use in production isn't recommended at this time. But we'd love to talk to you if you're interested, if your IT admins are interested in having this conversation about moving your Connect instance to Kubernetes. We'd love to have that architecture review and intro call with you. We're currently doing some more hands-on guided installations, both so that our audience can get a feel for installations and so that our internal teams can get more training on this new paradigm for Connect as well.
So, if you are interested in kicking the tires on the beta, now's a great time because you get even more attention from our solutions engineering and support teams as we're all working together to learn about how this will all work as we move towards general availability. When will it be generally available is a question that I've been getting a lot and we don't have a date for that yet. We'd love for you to sign up to get product information emails and that'll be the first place that you'll likely hear about it when the project goes GA.
There are a couple of things that we want to get done before going GA: content-level resource requests and limits, content-level service account authorization, an image management API, and some additional admin tooling. But once we get all of those things landed, GA will follow shortly after that.
The last thing I wanted to make sure I talk about is that this does not fundamentally change the user experience for you as a publisher. So, just because we're saying that like now we have the ability to run content on Kubernetes in a more native off-host execution paradigm does not mean that you as a data scientist, if you're watching today, need to go out and teach yourself a bunch about what Kubernetes is and how to set it up yourself. Those are things that you should rely on your IT team for. And for you, the user experience largely is unchanged.
So, if you're used to interacting with RStudio Connect, this is Connect running on Kubernetes. If you go to the documentation page and scroll down, where you used to see your available versions of R, Python, and Quarto, you'll now see a list of available execution images. These are the images that your admin group will make available for you to use alongside your Connect instance.
So, what would happen here if I publish using my typical, normal publishing pattern, either push-button or Git-backed deployment, is that Connect will use its version-matching algorithm to select an image on my behalf: matching the appropriate version of R or Python, picking one out of this list if there are several available, and associating that with my piece of content.
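As a toy illustration of what "version matching" means here, the sketch below picks an image by preferring an exact runtime-version match and falling back to the same major.minor series. This is not Connect's actual algorithm, and the image names are invented; it just shows the general idea.

```python
# Toy illustration of image selection by runtime version -- NOT
# Connect's actual matching algorithm. Prefer an exact R version
# match; otherwise fall back to an image in the same major.minor
# series; otherwise report no match.
def pick_image(wanted, images):
    """images: dict mapping image name -> R version it provides."""
    exact = [name for name, ver in images.items() if ver == wanted]
    if exact:
        return exact[0]
    major_minor = wanted.rsplit(".", 1)[0]   # e.g. "4.1.1" -> "4.1"
    close = [name for name, ver in images.items()
             if ver.rsplit(".", 1)[0] == major_minor]
    return close[0] if close else None

images = {
    "r-session:3.4.4": "3.4.4",   # hypothetical image names
    "r-session:4.1.3": "4.1.3",
}
print(pick_image("4.1.3", images))  # exact match -> r-session:4.1.3
print(pick_image("4.1.1", images))  # falls back to the 4.1.x image
```

The real matcher considers more than this (Python and Quarto versions, multiple candidate images), but the shape of the decision is similar.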
If I wanted to target a specific image, say the image for R 3.4.4 rather than this one, I could do that by using an image flag, either in the rsconnect deploy function or by specifying it in my manifest file directly when I go to write the manifest, using either the rsconnect package inside of RStudio with R or the rsconnect-python package with Python.
So, really, not a whole lot has changed here. If you start using off-host execution, you'll see some user interface changes over the next several months as we allow more content-specific features to be specified inside the UI itself, inside the dashboard, or at publish time. But other than that, things are largely unchanged today, and we'll grow and add features iteratively from here. So that's all I had to cover, just a quick overview. Again, there's a longer-form version of this talk available on the conference website if you want to see the full speedrun installation. That's available, too. And that's all I have.
Q&A: Connect and Kubernetes
But one of the questions is shown on the screen here. Can the Kubernetes implementation use Fargate or EKS? I don't know about Fargate, but definitely EKS, yes. EKS is what I did the speedrun installation on.
Yeah. I mean, I think that is a question that is up to you and your organization. Do you have the resources to run a Connect server yourself? This is a self-managed product, so you need someone at the IT admin level with Linux administration capabilities to run a server product on behalf of a data science team. If you do have that available, Connect offers a lot more content types, a lot more power with scheduling, and a lot more visibility into how you're using the product and who is accessing your content items. You have a server API that you control, from publishing all the way to admin functionality and auditing. So you get a lot more with Connect, but shinyapps.io is amazingly powerful if you don't have that available to you.
You still get some basic auth. You have to be okay with how you're managing your data; if you have data privacy issues, shinyapps.io might just be off the table entirely for you to start with. But if you don't have those issues, if you don't have an IT team, or if you have a smaller use case for a shorter amount of time, shinyapps.io is incredibly powerful and I recommend it to anybody. You get a couple of apps for free on shinyapps.io, so it's a great option for undergraduate or graduate students or nonprofits who are just getting started with data products on data science teams. I think it's amazing.
One is, what is the recommended architecture to run Connect with Kubernetes? Connect on a VM with content execution in Kubernetes, versus running everything in Kubernetes?
We recommend, strongly recommend, running everything in Kubernetes, including the Connect container image itself. The pattern that we had originally launched for Workbench, where you have Workbench outside of Kubernetes and you're launching sessions into Kubernetes, does not hold for Connect. We have implemented Launcher a little bit differently for Connect, and we prefer that you use the Helm charts to run the installation process from start to finish. That sets up everything inside of Kubernetes for you. The Helm charts are amazing and really easy to use, so if you have the ability to keep everything inside of Kubernetes, I strongly recommend that you do that.
Another question over on Slido, and this is the first time I've actually seen Posit in a question. Is Posit working on a good way for RStudio Connect admins to easily add, configure, or manage multiple versions, such as multiple Python and R versions?
I'm not sure what angle to answer this question from. What is the pain point that you're hoping we address with multiple versions of R and Python? Right now we have this story where you're just layering in more and more versions of R and Python. I think we could do a better job of helping you identify when to remove older versions of R and Python if you feel that's necessary to move your content forward. But there are some server API tools that we make available. You can run reports to determine which runtimes are being used on your server, to get a sense for who has what content, how it's being used, and whether it's time to pull a version of R or Python off the server, if that's important. Again, to some folks it's not; they keep versions of R and Python on their servers for a long time. But I'm interested in hearing a follow-up from this person: was that what they expected me to say, or did I misinterpret the question entirely?
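To make that server API report idea a little more concrete, here's a hedged sketch: Connect's Server API exposes a content listing whose items include the R and Python versions content runs on, so you can tally which runtimes are still in use. The HTTP call is shown but commented out, with a canned sample response standing in, so the tallying itself runs anywhere; the server URL is made up.

```python
# Sketch: tally which R runtimes content on a Connect server still
# uses, via the Server API's /v1/content endpoint. The request is
# commented out and replaced with a canned sample, so the tallying
# logic below is runnable as-is; the server URL is hypothetical.
from collections import Counter

# import requests
# items = requests.get(
#     "https://connect.example.com/__api__/v1/content",
#     headers={"Authorization": "Key " + api_key},
# ).json()

items = [  # sample of the relevant fields the endpoint returns
    {"name": "sales-app",   "r_version": "4.1.3", "py_version": None},
    {"name": "etl-report",  "r_version": "3.6.3", "py_version": None},
    {"name": "scoring-api", "r_version": None,    "py_version": "3.9.5"},
]

# Count only items that actually declare an R runtime.
r_use = Counter(i["r_version"] for i in items if i["r_version"])
print(r_use.most_common())  # which R versions are still in use
```

The same pattern works for `py_version` if the question is about retiring old Python installs instead.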
Yeah, I'd absolutely love to hear from the audience what you'd like to see Connect be able to host in the future. I think what we'll be focused on in the short term is some feature parity work between R and Python. You've seen us make a lot of investments into the Python space and the Python community to help make Python users feel like we care about them too. We're not just RStudio with the R stuff; we want to support Python users also. So we're really interested in taking some of the core functionality that's available for R users and making it more accessible to Python folks. We'll be doing some content work around that, and then integrations are going to be a big focus for us going forward. And I'll just leave it at that nebulous word.
Thank you for answering the poll. I see that 11 people answered the survey there. If you didn't, and are curious what the heck I'm talking about, on the meetup questions Slido we also put a poll where each of the product managers here, Joe, James, and Kelly, put a question to the audience too. So I really appreciate your feedback there as well.
But thank you all so much for sharing these updates with us and sharing what's new. And thank you to everyone listening in for all the great questions as well. I hope this was helpful. I'm curious to hear your feedback, and whether you'd like to see more of these as well. I really enjoyed it. Thank you so much, Joe, James, and Kelly.
Yes, thanks for having us. Thank you, everyone. Thanks, Rachel. Thanks, everybody. I'm just going to do one more plug to say, if you are curious about joining future events, you can also use the short link here. We have meetups every Tuesday at noon and a data science hangout every Thursday at noon, and we love getting to meet all of you at these events. So, we'd love to see you at future ones as well. Have a great rest of the day, everybody. Bye.
