Resources

Tidymodels prediction workflows inside databases with orbital and Snowflake

video
Nov 27, 2024
21:49

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, everyone. Happy almost Thanksgiving. Welcome to this month's installment of Data Science Workflows with Posit Team. For those of you who haven't met me, and since this is my first time giving one of these demos that's probably all of you: I'm Nick Pelican, a senior solution architect here at Posit. I'm here to tell you about one of the coolest R packages I've seen yet, and that's orbital.

So to lay out today's agenda, I'm going to give you a quick introduction to tidymodels, orbital, and then I'm going to show you how you can use it in your own R modeling workflows. Let's get started.

What is orbital?

First up, what is orbital? Orbital is an incredible package. It was introduced at this year's posit::conf in a talk by Emil Hvitfeldt from the tidymodels team, and I'd highly encourage you to go check out the talks; they're now available on YouTube.

The goal of orbital is to enable running predictions from tidymodels workflows directly inside databases. You can think of it as roughly a three-step process. First, use the tidymodels tools you're familiar with, like recipes and parsnip, to create an R model. Then use orbital: what orbital does is take all the steps of that tidymodels workflow and convert them into SQL that a database like Snowflake can understand.

You can then use Snowflake, or any other database, to either run the predictions of that model, or take that SQL and deploy the model directly to the database as a native object.

Just a quick refresher for those of you who aren't familiar with workflows, or maybe haven't used it in a while. workflows is part of the tidymodels universe, and it lets you bundle together your data preprocessing steps, your modeling steps, and your post-processing steps into one R object. So you can combine everything needed to run your model, whether that's feature engineering, model fitting, or post-processing, into a single portable R object.

And the secret sauce of orbital is that it does that conversion over the entire workflow object. It'll convert all of your feature engineering steps, it'll convert your model steps; basically anything you can fit into workflows, you can put into orbital, and it'll turn it into SQL that can then be run on a database.

Why use orbital?

You might be asking yourself: why would I use orbital? Why would I want to put my model predictions into a database?

Number one, putting your model predictions in your database lets you share your models with anyone super easily. Because you're converting these models to SQL, you're putting them into a language your database can understand, and into a language that almost anyone can access. So if you take your model, convert it to SQL using orbital, and deploy it onto Snowflake, anyone else on your team, whether they're using something like Python or something that doesn't use code at all, like Power BI, can now access the outputs of your model.

And number two, probably most importantly, it makes your model predictions really, really fast. One of the really cool things about databases is that they have a ton of compute power behind them, and they're really good at things like feature engineering and data preprocessing. And because your data is already there, running predictions directly in the database removes the step of downloading the data onto a different machine, like an R server. Instead, the scoring happens directly in the database, which tends to make it incredibly fast.


Demo: fitting and deploying a model

So let's get into it. I'm going to open up Posit Workbench here; I'm using the Posit Workbench Snowflake Native App. I've got my RStudio session open, and here I have a Quarto document that fits a model, converts it with orbital, and deploys it directly to Snowflake. Let's run it together and check out what it does.

So first up, I'm going to connect to Snowflake and load in all my packages. One of the really cool things about running Posit Workbench on Snowflake is that it takes care of all the Snowflake authentication for you. You can see I'm connected to Snowflake: no need to worry about credentials. I put in my connection string and I'm done.
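The connection step might look something like this minimal sketch. The DSN name here is a hypothetical placeholder; the point is that on Posit Workbench's Snowflake Native App, authentication is handled for you, so no credentials appear in the code.

```r
# A minimal sketch of the connection step. The DSN name ("snowflake")
# is a hypothetical placeholder; Workbench handles authentication,
# so no username or password is needed.
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "snowflake")
```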

Next up, I'm going to connect to my data. For this demonstration, I'm using the Lending Club data, a publicly available data set you can find on Kaggle. It contains about two and a half million rows, specifically loans in the southern U.S. states. Let's actually get a count and see how many rows we're working with: about 2.3 million rows here.

And because I don't want to fit this model against all 2.3 million rows, I'm going to download a small sample, 5,000 rows in this case. I'm also really only interested in 2016: the data we're working with spans 2014 to 2018, and I'm going to limit it down as if I'm fitting this model in the year 2016. What this model is specifically concerned with is predicting the interest rate of a given loan. And here I'm going to cheat a little bit to keep this demo short: I'm not going to show you all the feature engineering and feature discovery steps I did beforehand. Instead, I'm going to pick four features that I know will work in this model.
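The counting and sampling steps described above might look roughly like this sketch. The table and column names (`LENDING_CLUB`, `issue_year`, `int_rate`) are assumptions standing in for the real schema.

```r
# Sketch of pointing at the data and pulling a local sample.
# Table and column names are hypothetical stand-ins.
lending_club <- tbl(con, "LENDING_CLUB")

lending_club |> count()  # roughly 2.3 million rows

lending_club_prep <- lending_club |>
  filter(issue_year == 2016) |>            # pretend it's 2016
  head(5000) |>                            # small sample for local fitting
  collect() |>
  mutate(int_rate = as.numeric(int_rate))  # make sure the outcome is numeric
```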

So: the term of the loan, credit utilization, credit open to buy, and the all-utilization variable. I'm also going to do some conversion here just to make sure that the interest rate is a numeric variable, and then download that. Next up, I'm going to fit my tidymodels workflow. First off, I create a recipe: a series of preprocessing steps that I feed to tidymodels. Note that those recipe preprocessing steps are what orbital eventually converts into SQL.

So first up, in my recipe, I've defined the variable I'm trying to predict, in this case interest rate, and I'm predicting it against all the other variables in my data set, using the Lending Club prep data. I'm going to dummy the term variable; I know there are only two different values for term, so it becomes "does term equal 60 months or not", and everything else stays a numeric vector. I'm also going to normalize all of my numeric variables. And I know I've got some missing values in open to buy and utilization, so I'm going to use mean imputation there. Then at the end of everything, if I've got any missing values left, especially something like a missing interest rate, I want to make sure I drop those rows. In this case I'm fitting a linear model, which doesn't work well with missing values, so I want to make sure I'm not trying to fit against anything missing.
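A recipe matching that description might be sketched like this; the predictor names (`term`, `open_to_buy`, `util`) are assumptions, not the exact columns from the demo.

```r
# Sketch of the recipe described above; column names are hypothetical.
library(tidymodels)

rec <- recipe(int_rate ~ ., data = lending_club_prep) |>
  step_dummy(term) |>                          # term == "60 months" or not
  step_normalize(all_numeric_predictors()) |>  # center and scale
  step_impute_mean(open_to_buy, util) |>       # mean imputation for missings
  step_naomit(all_outcomes())                  # drop rows missing int_rate
```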

Next up, I'm also going to indicate to workflows that I'm using linear regression. Then I create my workflow, add my linear model, and add my preprocessing recipe. Let's do that. You can see I've got this preprocessing recipe with the dummy step, the normalize step, the imputation step, and a final filter. And the model here is just a simple linear model.
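Bundling the recipe and model spec into a workflow is a couple of lines; this sketch assumes the `rec` recipe from the step above.

```r
# Bundle the preprocessing recipe and a linear regression spec
# into one portable workflow object.
lm_spec <- linear_reg()  # default engine is "lm"

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(lm_spec)
```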

Now let's fit the model. Again, this fitting is happening in my RStudio session running on Posit Workbench. So what I'm doing here is fitting the model and also grabbing some metrics from it: in this case root mean squared error, mean absolute error, and the R squared of the model. First I fit the model, then I calculate the metrics against the data I pulled earlier, and then I display them. You can see my RMSE is 4.34, my MAE is 3.3, and my R squared is 0.225. Not great, but it'll work for the purposes of demonstration. This model is not going to win any awards, but we're here to talk about cool tech, not model fitting.
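The fit-and-evaluate step could be sketched like this, assuming the `wf` workflow and sampled data from above; the metric values quoted are the ones from the demo, not something this sketch guarantees.

```r
# Fit in the local R session and compute the three metrics.
wf_fit <- fit(wf, data = lending_club_prep)

reg_metrics <- metric_set(rmse, mae, rsq)

augment(wf_fit, new_data = lending_club_prep) |>
  reg_metrics(truth = int_rate, estimate = .pred)
# In the demo: RMSE 4.34, MAE 3.3, R squared 0.225
```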

Model versioning with vetiver and Posit Connect

And then I'm going to use the vetiver package and Posit Connect to take care of versioning my model. Versioning a model is incredibly important: as you're creating models, experimenting with them, and tracking model performance, one of the most important things you can do is track which version of the model is active and track the performance of that specific version. So what I'm going to do here is connect to Posit Connect using board_connect. I'm then going to create a vetiver model using the fitted workflow object I created just above, name it using the model name I set above, and add the metrics I calculated as model metadata. Then I use vetiver_pin_write to write that to Posit Connect.
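The versioning step might look like this sketch; the model name and the `model_metrics` object are hypothetical placeholders for the name and metrics tibble set up earlier in the demo.

```r
# Sketch of versioning with vetiver and pins.
library(pins)
library(vetiver)

board <- board_connect()  # Workbench handles the Connect credentials

model_name <- "interest_rate_prediction"  # hypothetical name

v <- vetiver_model(
  wf_fit,
  model_name,
  metadata = list(metrics = model_metrics)  # metrics computed above
)

vetiver_pin_write(board, v)
```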

You can see here I'm calling up the model versions, so I'm seeing all the different versions of this model that I've already deployed. You can see version 11 is active; that's the model I just fit. Leading up to this, I built a couple of other model versions. Those versions are not active, but they are saved for posterity: if I ever want to roll back to them, they're available on my Posit Connect server.

And I also want to grab the version of the model that I just deployed. This is going to come in handy later for some of the interactions we're going to do with Snowflake. Let's run that and check out what model version we're on now: that's version 11.
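Grabbing the deployed version might be sketched as below; the exact columns returned by `pin_versions()` depend on the board, so treat the `active` column as an assumption.

```r
# Sketch of grabbing the active version. pin_versions() returns a data
# frame of versions; the "active" column is an assumption here.
versions <- pin_versions(board, model_name)
model_version <- versions$version[versions$active][1]
```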

Converting the model with orbital and deploying to Snowflake

So now I'm at the stage where I've created this model and evaluated it, and I'm ready to deploy it. First up, I'm going to use orbital to convert it. I create an orbital object by just calling orbital() on the fitted workflow, and then I call orbital_sql(), which shows me the SQL statement being generated from this model. What you can see here are all those preprocessing steps from above: the dummying of the term variable, all of my normalization steps, and the mean imputation. That's what's happening right here.
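The conversion itself is just two calls, sketched here assuming the fitted workflow and connection from above.

```r
# Convert the fitted workflow and inspect the generated SQL.
library(orbital)

orbital_obj <- orbital(wf_fit)
orbital_sql(orbital_obj, con)  # SQL rendered for this connection's dialect
```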

And the really cool thing is you can actually see it creating this .pred column in SQL. That's the model itself: these are the model coefficients from my fitted tidymodels workflow, all converted to SQL.

Then I can take that orbital object and use the predict function against my lending club data. Again, this lending club data object is the entire loan data table; previously we were only working on the sampled data set I pulled down earlier. I can run this, and the compute function here runs the query and saves the result as a temporary table called temp interest rate predictions. You can see that already ran, in about two seconds. That was my model running against all 2.3 million rows of that table. One of the coolest things about orbital is how fast it is.
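That step could be sketched like this, assuming the `orbital_obj` and lazy `lending_club` table from above; the temporary table name follows the demo.

```r
# Run predictions inside Snowflake and materialize them as a
# temporary table, without pulling any rows into R.
preds <- predict(orbital_obj, lending_club) |>
  compute(name = "temp_interest_rate_predictions", temporary = TRUE)

preds |> count()  # about 2.3 million predictions
```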


Let's take a look at what those predictions look like. So I'm accessing the predictions table that I just built. And those are the predictions again, running directly in Snowflake. That predictions table exists in Snowflake. It's already there.

And let's take a look at actually how many predictions just happened. So what I'm going to do is I'm going to use the count function. Again, we've got 2.3 million predictions that just happened in about two seconds.

Deploying the model as a Snowflake view

So now I'm ready to deploy this to other folks in my organization, and I've got a couple of different options for that. One option is to just keep this predictions table: in compute, I can set temporary equals false, and that will write the predictions table back to Snowflake as a permanent table. So far, I've only created a temporary table.

Or, a really cool option is instead to write this model prediction function to Snowflake as a view. If you're not familiar with views, a view is basically a saved SQL query; it's a table that exists as a SQL query. You can save one into pretty much any database, Snowflake in particular, and anytime the view is called, that query runs.

So what I'm going to do is create some SQL to create a Snowflake view. I take my full lending club table and then use this orbital_inline function. This is a really cool feature of orbital: instead of converting the model object to SQL, orbital_inline converts it to dplyr statements. And because I'm using dbplyr here, those dplyr statements can then themselves be converted into SQL.

So you see me add all the functions needed to run this model, and then what I'm building here is a join table: all I'm selecting from these predictions is an ID column, the column I'll use to join the table I'm about to create back to the original source data table, plus the predictions column. Then I run this remote_query function from the dbplyr package; this is what actually converts the dplyr code, standard R code, into SQL.
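Rendering that join-table query as SQL might look like this sketch; the `id` column name is an assumption about the source table.

```r
# Sketch of building the join-table query and rendering it as SQL.
view_query <- lending_club |>
  mutate(!!!orbital_inline(orbital_obj)) |>  # model steps as dplyr code
  select(id, .pred) |>                       # join key plus predictions
  remote_query()                             # dbplyr renders this as SQL
```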

Let me just run this really quickly and show you what it creates. You can see it's written out a ton of SQL, including all of the steps required to run my model. Then I'm going to take that SQL and use it to create a view in my Snowflake database.

So first I'm just going to name my view: the model name I set up before, plus the model version I got from vetiver. Then all I've got to do is use glue_sql to put "create or replace view", my view name, and "as" in front of that big long SQL statement. That's all it takes to create a view. Then I pipe that SQL into dbExecute, and it's in Snowflake now.
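Wrapping the rendered query in a view definition might be sketched like this; the naming convention is hypothetical, and `DBI::SQL()` keeps the rendered query from being escaped as a string literal.

```r
# Sketch: wrap the rendered query in CREATE OR REPLACE VIEW.
library(glue)

view_name <- glue("{model_name}_v{model_version}")  # hypothetical naming

dbExecute(con, glue_sql(
  "CREATE OR REPLACE VIEW {`view_name`} AS {DBI::SQL(view_query)}",
  .con = con
))
```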

Let's run that. And zero rows changed, which is what we expect from creating a view. Let's check out what that view looks like: using my connection to Snowflake, I'm going to access the view, get the first 500 rows, and collect them into my local R session.

And then I also want to create a "latest" view. This is going to be a view that lets people access whatever the latest version of my model is without having to worry about which version is active. If somebody queries the latest view, they'll always be on whatever version of the model is currently active; they won't be getting data from a previous model that might not be as good.

So what I'm going to do here is create that latest view, and its name is going to be the name of my model, underscore latest. Then I create or replace a view with that name that's just SELECT * FROM the versioned view. In SQL, that's basically just creating an alias: this latest view is an alias that will always resolve to the most current version of my deployed model. Let's deploy that. And zero rows changed.
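The alias view, plus a quick check of its contents, might be sketched like this, reusing the hypothetical naming from above.

```r
# Sketch of the "latest" alias view and a quick peek at its contents.
latest_view_name <- glue("{model_name}_latest")

dbExecute(con, glue_sql(
  "CREATE OR REPLACE VIEW {`latest_view_name`} AS SELECT * FROM {`view_name`}",
  .con = con
))

tbl(con, latest_view_name) |> head(500) |> collect()
```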

Now let's check out what that latest view looks like: again using my connection to Snowflake, I access the latest view, get the first 500 rows, and collect them into my local R session. And that's it. So now I have a join table that I can share with other people in my organization using Snowflake, whether they're using R, Python, Power BI, or SQL itself. Anyone can now access the outputs of my data science work, which is a super cool capability.

A cool example of this: if I go into the Snowflake UI, Snowflake has a SQL editor called Snowsight. What I've done here is write a quick query to access that model data I deployed earlier. I'm selecting just the ID, the term, the loan amount, and the predicted interest rate from the view I just deployed. And you can see I'm using it the way you'd expect a join table to be used: I'm joining my loan data, as A, to that view I created a second ago, as B. I'm also running a filter on it. This is something any BI analyst could run. If I hit run, it takes a second to warm up, and that's it: there are the predictions from my model, running in Snowflake as just a native object. Again, this can be accessed by anybody else in the organization, which I think is a super cool capability.

Wrapping up

Just to wrap up: what else can you do, and where can you go to get more information? First up, I'd recommend you check out our blog post. Isabella from our developer relations team and I just wrote a blog post that's a long-form version of what you just saw. I'd highly encourage you to check it out; it also covers a couple of other great features you can take advantage of in the Posit stack, like using Connect to do your model monitoring. We'll link it in the video description as well.

And stay tuned for more updates; we've just scratched the surface of what the orbital package can do. We've already had some customers have incredible success using orbital, and we couldn't be more excited about that. So please, get in the comments, get in the chat, and ask any questions you might have. I'm super excited to answer them, and I'll see you next time. Thanks, everyone.