Resources

David Maguire | Robust R Deployments: Building a Pipeline from RStudio to Production | Posit (2022)

video
Oct 24, 2022
14:17


Transcript

This transcript was generated automatically and may contain errors.

Hello, everyone. My name is David Maguire. I am a data scientist at DV01. We are a fintech startup, and we provide analytics for investors in fixed income.

So something that I've heard a few times over my career is that R is bad for production. Sometimes engineers say, oh, this looks great, but let's put it in another language for production. They say R is ad hoc, it's for analysis, there's no package management. But today, I want to dispel some of those notions, and I want to show you that R is great for production.

The Tapecracker application

So first I'd like to talk a little bit about what I work on. So I work on Tapecracker, which is a web application that has a few components. There's a JavaScript frontend, there's a Scala backend, and then we have a machine learning microservice in R, which provides predictions that are used by the two other components of the application.

And this application is used live by customers. They can get on at any time. They expect their results immediately. So there's very high requirements for availability and dependability. Our sales team demos this app live to clients, to prospective clients, so it has to work every time. If there's a failure, we could lose a prospective client. So this machine learning microservice has to be very robustly engineered to satisfy what our customers expect of us.

What is an R microservice?

So I'm going to tell you how you can raise an R microservice to thrive in the wild. So here we have our R microservice, this cute little giraffe. The first thing we're going to do is put our R microservice through school. We're going to teach it what to do once we release it out into the wild. The end result is that once we release it, it can interact seamlessly with other microservices and other applications written in different languages, and just have a fun time.

So let's talk a little bit more concretely. What is an R microservice? So it's basically R code behind plumber and Docker. So we take our R code, our machine learning model, and we put it behind a plumber API. This allows any other application to hit our machine learning model, and it only has to provide the expected input, and it gets an output. So any other service doesn't need to know anything about the R language.
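As a minimal sketch of what that looks like (the endpoint, input, and doubling "model" here are hypothetical illustrations, not DV01's actual code):

```r
# plumber.R -- a minimal prediction endpoint (hypothetical example)
library(plumber)

# In a real service you would load a trained model here, e.g.:
# model <- readRDS("model.rds")

#* Return a prediction for the supplied input
#* @param x A numeric input value
#* @post /predict
function(x) {
  x <- as.numeric(x)
  # Stand-in for a real call like predict(model, newdata = ...)
  list(prediction = x * 2)
}
```

Running `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)` serves this over HTTP, so any client can POST JSON to `/predict` and get a prediction back with no knowledge of R.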

And then we wrap that plumber API in a Docker image. This makes our API portable: once it's in a Docker image, we can deploy it into the cloud in a variety of destinations.
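A Dockerfile for such a service might look roughly like this. This is a hedged sketch built on the public `rstudio/plumber` base image, whose entrypoint runs plumber on the file given as the command; the file names and the extra package are assumptions:

```dockerfile
# Hypothetical Dockerfile for the plumber microservice
FROM rstudio/plumber

# Install any extra R packages the model needs (illustrative)
RUN R -e "install.packages('randomForest')"

# Copy the API definition and model artifact into the image
COPY plumber.R /app/plumber.R
COPY model.rds /app/model.rds

EXPOSE 8000
CMD ["/app/plumber.R"]
```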

So a little more concretely, in our circumstance at DV01, the microservice's wild is our web application. It provides a lot of information that investors need to make decisions. We have a JavaScript frontend, we have a Scala backend, and behind that we have a data engineering pipeline in Scala. Our R machine learning microservice sits on top of that and is able to provide results to each of the components in that pipeline.

So this is a great setup for a data scientist, because I can focus on what I'm good at, which is building models and building data science projects, and then the software engineers that I work with, they can build the end-to-end system.

DevOps principles for robust deployments

So in pursuit of building robust R deployments, I want to introduce three DevOps principles. The first is continuous integration, and that's where we build and test our packages. The second is continuous deployment. This is where we take our tested packages and release them into production systems. And then the third is cluster computing, which is a common destination for applications in production in the cloud.

Continuous integration

So let's start with continuous integration. This is a simple continuous integration workflow. So we start at the top left, where a data scientist is working in R, in RStudio, making updates to the model, making updates to the code. So whenever a data scientist makes an update to the code base, they will push that to a version tracking tool such as GitHub. At that point, the application will be built from the source code, and then we're going to run automated tests on that.

And this happens automatically. Once I push, all that happens, and then I'll get a report back. If the tests fail, I know I have to go back and fix something. If the tests pass, I can progress: I can merge my changes into the master branch, and then we will have an application that is ready for deployment.

So let's look a little closer at this first part of the CI pipeline. In the CI build step, we have data scientists working in RStudio, writing R code. We git push, and that triggers GitHub Actions to kick off the continuous integration pipeline. There are many options for CI tools; we use GitHub Actions at DV01. GitHub Actions then builds a Docker image based on the updated code, and what's important here is that I'm pushing my code frequently.
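A GitHub Actions workflow for this kind of pipeline might look roughly like the following. This is a sketch, not DV01's actual configuration; the image name and the use of `testthat::test_local()` inside the container are assumptions:

```yaml
# .github/workflows/ci.yml (hypothetical)
name: CI
on: push

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the Docker image from the updated code
        run: docker build -t my-org/r-microservice:${{ github.sha }} .
      - name: Run the package's automated tests inside the container
        run: docker run my-org/r-microservice:${{ github.sha }} R -e "testthat::test_local()"
```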

That means I can get instant feedback on each piece of code that I edit, so I know where changes happen. If I push 500 lines of code at once and something fails, it's going to take me a long time to figure out what went wrong, but if I push a couple of lines at a time, I can isolate the problem very easily.

Automated testing pyramid

So now let's look at the next step: running the automated tests. Within continuous integration, we have three broad categories of tests, arranged in a pyramid structure. The base is unit tests; these are the most numerous tests that we have. Above those, we have integration tests, which are still a good amount but fewer than unit tests. And finally, there are end-to-end tests. The testthat package is very useful for writing these types of tests.

So starting at unit tests, this is the base of our testing pyramid, and the goal here is to test functions, individual lines of code, and really look at each component of your R package. Here I have a trivial example of a unit test: I define an add_20 function, and then I use testthat to assert that adding 20 to 10 equals 30. This is going to run every time I push updates to my package, so if I were to make any change to this function that would break it, I would get immediate feedback. These are the type of tests that you should have a lot of.
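In code, that trivial example looks something like this (a sketch of the test described above):

```r
library(testthat)

# The function under test
add_20 <- function(x) {
  x + 20
}

# This test runs automatically on every push via CI
test_that("add_20 adds 20 to its input", {
  expect_equal(add_20(10), 30)
})
```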

For example, it's a really good idea to create unit tests for your features, to make sure that all the features that go into your models are tested and never break or produce unexpected output. If you're imputing missing values at prediction time, that would be a good thing to unit test. You could also test data structures, for instance making sure that functions return data frames with the expected structure. Writing a lot of these unit tests ensures that your code doesn't break in unexpected ways, and it helps you identify failures.
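Hedged sketches of the kinds of unit tests just described might look like this; `impute_median` and `build_features` are hypothetical helpers invented for illustration, not functions from the talk:

```r
library(testthat)

# Hypothetical helpers standing in for real feature-engineering code
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
build_features <- function(df) {
  data.frame(loan_age = df$age, balance = impute_median(df$balance))
}

# Feature test: missing values must be handled at prediction time
test_that("missing values are imputed", {
  expect_false(anyNA(impute_median(c(1, NA, 3))))
})

# Structure test: the function returns a data frame with expected columns
test_that("build_features returns the expected structure", {
  out <- build_features(data.frame(age = 1:3, balance = c(100, NA, 300)))
  expect_s3_class(out, "data.frame")
  expect_named(out, c("loan_age", "balance"))
})
```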

So the next level is integration tests. So for integration tests, the goal is to make sure that our microservice can interact with other services that it will need to, so we may be interacting with databases or APIs, so we're going to write tests to make sure that our R container can interact with every service it needs.
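A hedged sketch of such integration tests, assuming the microservice talks to a Postgres database and an internal API; the hostnames and service names are illustrative placeholders:

```r
library(testthat)

# Can the R container reach its database?
test_that("the service can connect to the database", {
  con <- DBI::dbConnect(RPostgres::Postgres(),
                        host = "db.internal", dbname = "loans")
  on.exit(DBI::dbDisconnect(con))
  expect_true(DBI::dbIsValid(con))
})

# Can the R container reach another internal service?
test_that("the service can reach the pricing API", {
  resp <- httr2::request("http://pricing.internal/health") |>
    httr2::req_perform()
  expect_equal(httr2::resp_status(resp), 200)
})
```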

And finally, there are end-to-end tests, where we're looking at the microservices application as a whole; in our case, that means the whole web app. These don't really tell us much about our R microservice specifically, but they're necessary for a holistic view of everything. Where we actually get useful feedback about our R package is the two previous categories of tests.

Now when we're writing tests, we want to write them at the right time. If you start writing tests too early, you can waste a lot of time, but they save a lot of time in production. So if you start a new project, when you're in the exploratory phase, doing EDA, initial experiments, if you start writing a lot of tests, you're going to slow yourself down. But as you progress, as you get your feature set more concrete, as you get closer to production, you should start introducing first unit tests, later integration tests, and once you're really close to production, to releasing a microservice, you're going to want to make sure that the end-to-end tests are properly configured.

Continuous deployment

So next we're going to talk about continuous deployment. At the end of the continuous integration pipeline, we have our container that's ready for deployment. This pipeline is going to take that container and push it through our different environments. Here at DV01, we have staging, release, and production environments. There are many ways to do continuous deployment, but this is just one example.

So once the container is ready, it's going to be pushed to staging, and in staging, there will be some end-to-end tests that run, and we'll just make sure that the whole application looks good as a whole. If that goes well, it moves on to release, and then finally it will go to production. And this is important because we have a customer-facing application, and we have zero tolerance for any failures or for downtime. So we really need to make sure that we have a lot of steps in place to catch any errors that might come up.


And each of these environments, staging, release, and production, is a full replica of the production setup. It has our web application, it has the R microservice, as well as all the other services that make up the application.

Cluster computing with Kubernetes

So let's talk a little bit about the last concept, which is cluster computing. So in the cloud, a common way to run jobs is in Kubernetes. So here we have our production cluster, in which we have three replicas of our machine learning application. So it's important that we're not just running it once, we're running it several times so we have redundancy.

So when another microservice comes along and says, hey, R, give me some predictions, it's going to first go to a load balancer. The load balancer is going to look at all the replicas of our R machine learning microservice, and it's going to route it to one that is free and that is error-free. And then the R microservice will respond back. And then the full microservice application can continue doing its work.

And so what's important here, as I mentioned, is that this is a replica set. In Kubernetes, you have replica sets, which basically say that for this application you're always going to have, in this case, three instances running. If one of the replicas goes bad, it will be torn down, and a new one will be spun up immediately. If we're getting high traffic, we can route new traffic to a free replica of our microservice. So this is very important: it gives us redundancy, it allows us to handle high request loads, and it also allows scaling.
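In Kubernetes terms, that replica behavior typically comes from a Deployment, which manages a ReplicaSet under the hood. A hedged sketch with hypothetical names and image:

```yaml
# deployment.yaml (hypothetical)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: r-ml-microservice
spec:
  replicas: 3            # always keep three instances running
  selector:
    matchLabels:
      app: r-ml-microservice
  template:
    metadata:
      labels:
        app: r-ml-microservice
    spec:
      containers:
        - name: api
          image: my-org/r-microservice:latest
          ports:
            - containerPort: 8000
```

If a pod crashes or fails its health checks, Kubernetes tears it down and spins up a replacement to keep the replica count at three.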

Conclusion

So in conclusion, I just want to show that R is great for production with proper CI and CD practices. Continuous integration helps you build and vet your R packages. Continuous deployment helps you release updates seamlessly. plumber is an important component for a microservice, because it packages your R code behind an API, where any service that has no knowledge of R can get results from your R code. And then Docker creates these portable software packages that you can run on your laptop, you can run in the cloud. And then Kubernetes is essentially an orchestration engine that will take your dockerized application and run it in the cloud. It helps you maintain redundancy, it has health checks, it helps make sure that your application doesn't crash.

And thank you. If anyone has some questions, I'd be happy to take them.