Resources

Tidymodels prediction workflows inside databases with orbital and Snowflake

video
Nov 27, 2024
21:49

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, everyone. Happy almost Thanksgiving. Welcome to this month's installment of Data Science Workflows with Posit Team. For those of you who haven't met me, and since this is my first time giving one of these demos that's probably all of you: I'm Nick Pelican, a senior solution architect here at Posit. I'm here to tell you about one of the coolest R packages I've seen yet, and that's orbital.

So to lay out today's agenda, I'm going to give you a quick introduction to tidymodels, orbital, and then I'm going to show you how you can use it in your own R modeling workflows. Let's get started.

What is orbital?

First up, what is orbital? Orbital is an incredible package. It was introduced at this year's posit::conf in a talk by Emil Hvitfeldt from the tidymodels team, and I'd highly encourage you to go check out the talks; they're now available on YouTube.

The goal of orbital is to enable running predictions from tidymodels workflows directly inside databases. You can think of it as roughly a three-step process. First, use the tidymodels tools you're familiar with, like recipes and parsnip, to create an R model. Then use orbital: what orbital does is take all the steps of that tidymodels workflow and convert them into SQL that a database like Snowflake can understand.

You can then use Snowflake, or any other database, to either run the predictions of that model, or take that SQL and deploy the model directly to the database as a native object.

Just a quick refresher for those of you who aren't familiar with workflows, or maybe haven't used it in a while. workflows is part of the tidymodels universe, and it lets you bundle together your data preprocessing steps, your modeling steps, and your post-processing steps into one R object. So you can combine everything needed to run your model, whether that's feature engineering, model fitting, or post-processing, into a single portable R object.

And the secret sauce of orbital is that it does that conversion over the entire workflow object. It'll convert all of your feature engineering steps, it'll convert your model steps; basically anything you can fit into workflows, you can put into orbital, and it'll turn it into SQL that can then be run on a database.

Why use orbital?

You might be asking yourself: why would I use orbital? Why would I want to put my model predictions into a database?

Number one, putting your model predictions in your database lets you share your models with anyone super easily. Because you're converting these models to SQL, you're putting them into a language your database can understand, and into a language that almost anyone can access. So if you take your model, convert it to SQL using orbital, and deploy it onto Snowflake, anyone else on your team, whether they're using something like Python or something that doesn't use code at all, like Power BI, can now access the outputs of your model.

And number two, probably most importantly, it makes your model predictions really, really fast. One of the really cool things about databases is that they have a ton of compute power behind them, and they're really good at things like feature engineering and data preprocessing. And because your data is already there, running predictions directly in the database removes the step of downloading the data onto a different machine, like an R server. Instead, the scoring happens directly in the database, which tends to make it incredibly fast.


Demo: fitting and deploying a model

So let's get into it. I'm going to open up Posit Workbench here; I'm using the Posit Workbench Snowflake Native App. I've got my RStudio session open, and here I have a Quarto document that fits a model, converts it with orbital, and deploys it directly to Snowflake. Let's run it together and check out what it does.

So first up, I'm going to connect to Snowflake and load in all my packages. One of the really cool things about running Posit Workbench on Snowflake is that it takes care of all the Snowflake authentication for you. You can see I'm connected to Snowflake: no need to worry about credentials. I put in my connection string and I'm done.
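The connection step might look something like this minimal sketch. The DSN name here is a hypothetical placeholder; the point is that on Posit Workbench's Snowflake Native App, authentication is handled for you, so no credentials appear in the code.

```r
# A minimal sketch of the connection step. The DSN name ("snowflake")
# is a hypothetical placeholder; Workbench handles authentication,
# so no username or password is needed.
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "snowflake")
```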

Next up, I'm going to connect to my data. For this demonstration, I'm using the Lending Club data, a publicly available data set you can find on Kaggle. It contains about two and a half million rows, specifically loans in the southern U.S. states. Let's actually get a count and see how many rows we're working with: about 2.3 million rows here.

And because I don't want to fit this model against all 2.3 million rows, I'm going to download a small sample, 5,000 rows in this case. I'm also really only interested in 2016: the data we're working with spans 2014 to 2018, and I'm going to limit it down as if I'm fitting this model in the year 2016. What this model is specifically concerned with is predicting the interest rate of a given loan. And here I'm going to cheat a little bit to keep this demo short: I'm not going to show you all the feature engineering and feature discovery steps I did beforehand. Instead, I'm going to pick four features that I know will work in this model.
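The counting and sampling steps described above might look roughly like this sketch. The table and column names (`LENDING_CLUB`, `issue_year`, `int_rate`) are assumptions standing in for the real schema.

```r
# Sketch of pointing at the data and pulling a local sample.
# Table and column names are hypothetical stand-ins.
lending_club <- tbl(con, "LENDING_CLUB")

lending_club |> count()  # roughly 2.3 million rows

lending_club_prep <- lending_club |>
  filter(issue_year == 2016) |>            # pretend it's 2016
  head(5000) |>                            # small sample for local fitting
  collect() |>
  mutate(int_rate = as.numeric(int_rate))  # make sure the outcome is numeric
```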

So: the term of the loan, credit utilization, credit open to buy, and the all-utilization variable. I'm also going to do some conversion here just to make sure that the interest rate is a numeric variable, and then download that. Next up, I'm going to fit my tidymodels workflow. First off, I create a recipe: a series of preprocessing steps that I feed to tidymodels. Note that those recipe preprocessing steps are what orbital eventually converts into SQL.

So first up, in my recipe, I've defined the variable I'm trying to predict, in this case interest rate, and I'm predicting it against all the other variables in my data set, using the Lending Club prep data. I'm going to dummy the term variable; I know there are only two different values for term, so it becomes "does term equal 60 months or not", and everything else stays a numeric vector. I'm also going to normalize all of my numeric variables. And I know I've got some missing values in open to buy and utilization, so I'm going to use mean imputation there. Then at the end of everything, if I've got any missing values left, especially something like a missing interest rate, I want to make sure I drop those rows. In this case I'm fitting a linear model, which doesn't work well with missing values, so I want to make sure I'm not trying to fit against anything missing.
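A recipe matching that description might be sketched like this; the predictor names (`term`, `open_to_buy`, `util`) are assumptions, not the exact columns from the demo.

```r
# Sketch of the recipe described above; column names are hypothetical.
library(tidymodels)

rec <- recipe(int_rate ~ ., data = lending_club_prep) |>
  step_dummy(term) |>                          # term == "60 months" or not
  step_normalize(all_numeric_predictors()) |>  # center and scale
  step_impute_mean(open_to_buy, util) |>       # mean imputation for missings
  step_naomit(all_outcomes())                  # drop rows missing int_rate
```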

Next up, I'm also going to indicate to workflows that I'm using linear regression. Then I create my workflow, add my linear model, and add my preprocessing recipe. Let's do that. You can see I've got this preprocessing recipe with the dummy step, the normalize step, the imputation step, and a final filter. And the model here is just a simple linear model.
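Bundling the recipe and model spec into a workflow is a couple of lines; this sketch assumes the `rec` recipe from the step above.

```r
# Bundle the preprocessing recipe and a linear regression spec
# into one portable workflow object.
lm_spec <- linear_reg()  # default engine is "lm"

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(lm_spec)
```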

Now let's fit the model. Again, this fitting is happening in my RStudio session running on Posit Workbench. So what I'm doing here is fitting the model and also grabbing some metrics from it: in this case root mean squared error, mean absolute error, and the R squared of the model. First I fit the model, then I calculate the metrics against the data I pulled earlier, and then I display them. You can see my RMSE is 4.34, my MAE is 3.3, and my R squared is 0.225. Not great, but it'll work for the purposes of demonstration. This model is not going to win any awards, but we're here to talk about cool tech, not model fitting.
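The fit-and-evaluate step could be sketched like this, assuming the `wf` workflow and sampled data from above; the metric values quoted are the ones from the demo, not something this sketch guarantees.

```r
# Fit in the local R session and compute the three metrics.
wf_fit <- fit(wf, data = lending_club_prep)

reg_metrics <- metric_set(rmse, mae, rsq)

augment(wf_fit, new_data = lending_club_prep) |>
  reg_metrics(truth = int_rate, estimate = .pred)
# In the demo: RMSE 4.34, MAE 3.3, R squared 0.225
```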

Model versioning with vetiver and Posit Connect

And then I'm going to use the vetiver package and Posit Connect to take care of versioning my model. Versioning a model is incredibly important: as you're creating models, experimenting with them, and tracking model performance, one of the most important things you can do is track which version of the model is active and track the performance of that specific version. So what I'm going to do here is connect to Posit Connect using board_connect. I'm then going to create a vetiver model using the fitted workflow object I created just above, name it using the model name I set above, and add the metrics I calculated as model metadata. Then I use vetiver_pin_write to write that to Posit Connect.
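The versioning step might look like this sketch; the model name and the `model_metrics` object are hypothetical placeholders for the name and metrics tibble set up earlier in the demo.

```r
# Sketch of versioning with vetiver and pins.
library(pins)
library(vetiver)

board <- board_connect()  # Workbench handles the Connect credentials

model_name <- "interest_rate_prediction"  # hypothetical name

v <- vetiver_model(
  wf_fit,
  model_name,
  metadata = list(metrics = model_metrics)  # metrics computed above
)

vetiver_pin_write(board, v)
```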

You can see here I'm calling up the model versions, so I'm seeing all the different versions of this model that I've already deployed. You can see version 11 is active; that's the model I just fit. Leading up to this, I built a couple of other model versions. Those versions are not active, but they are saved for posterity: if I ever want to roll back to them, they're available on my Posit Connect server.

And I also want to grab the version of the model that I just deployed. This is going to come in handy later for some of the interactions we're going to do with Snowflake. Let's run that and check out what model version we're on now: that's version 11.
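Grabbing the deployed version might be sketched as below; the exact columns returned by `pin_versions()` depend on the board, so treat the `active` column as an assumption.

```r
# Sketch of grabbing the active version. pin_versions() returns a data
# frame of versions; the "active" column is an assumption here.
versions <- pin_versions(board, model_name)
model_version <- versions$version[versions$active][1]
```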

Converting the model with orbital and deploying to Snowflake

So now I'm at the stage where I've created this model and evaluated it, and I'm ready to deploy it. First up, I'm going to use orbital to convert it. I create an orbital object by just calling orbital() on the fitted workflow, and then I call orbital_sql(), which shows me the SQL statement being generated from this model. What you can see here are all those preprocessing steps from above: the dummying of the term variable, all of my normalization steps, and the mean imputation. That's what's happening right here.
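The conversion itself is just two calls, sketched here assuming the fitted workflow and connection from above.

```r
# Convert the fitted workflow and inspect the generated SQL.
library(orbital)

orbital_obj <- orbital(wf_fit)
orbital_sql(orbital_obj, con)  # SQL rendered for this connection's dialect
```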

And the really cool thing is you can actually see it creating this .pred column in SQL. That's the model itself: these are the model coefficients from my fitted tidymodels workflow, all converted to SQL.

Then I can take that orbital object and use the predict function against my lending club data. Again, this lending club data object is the entire loan data table; previously we were only working on the sampled data set I pulled down earlier. I can run this, and the compute function here runs the query and saves the result as a temporary table called temp interest rate predictions. You can see that already ran, in about two seconds. That was my model running against all 2.3 million rows of that table. One of the coolest things about orbital is how fast it is.
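That step could be sketched like this, assuming the `orbital_obj` and lazy `lending_club` table from above; the temporary table name follows the demo.

```r
# Run predictions inside Snowflake and materialize them as a
# temporary table, without pulling any rows into R.
preds <- predict(orbital_obj, lending_club) |>
  compute(name = "temp_interest_rate_predictions", temporary = TRUE)

preds |> count()  # about 2.3 million predictions
```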


Let's take a look at what those predictions look like. So I'm accessing the predictions table that I just built. And those are the predictions again, running directly in Snowflake. That predictions table exists in Snowflake. It's already there.

And let's take a look at actually how many predictions just happened. So what I'm going to do is I'm going to use the count function. Again, we've got 2.3 million predictions that just happened in about two seconds.

Deploying the model as a Snowflake view

So now I'm ready to deploy this to other folks in my organization, and I've got a couple of different options for that. One option is to just keep this predictions table: in compute, I can set temporary equals false, and that will write the predictions table back to Snowflake as a permanent table. So far, I've only created a temporary table.

Or, a really cool option is instead to write this model prediction function to Snowflake as a view. If you're not familiar with views, a view is basically a saved SQL query; it's a table that exists as a SQL query. You can save one into pretty much any database, Snowflake in particular, and anytime the view is called, that query runs.

So what I'm going to do is create some SQL to create a Snowflake view. I take my full lending club table and then use this orbital_inline function. This is a really cool feature of orbital: instead of converting the model object to SQL, orbital_inline converts it to dplyr statements. And because I'm using dbplyr here, those dplyr statements can then themselves be converted into SQL.

So you see me add all the functions needed to run this model, and then what I'm building here is a join table: all I'm selecting from these predictions is an ID column, the column I'll use to join the table I'm about to create back to the original source data table, plus the predictions column. Then I run this remote_query function from the dbplyr package; this is what actually converts the dplyr code, standard R code, into SQL.
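Rendering that join-table query as SQL might look like this sketch; the `id` column name is an assumption about the source table.

```r
# Sketch of building the join-table query and rendering it as SQL.
view_query <- lending_club |>
  mutate(!!!orbital_inline(orbital_obj)) |>  # model steps as dplyr code
  select(id, .pred) |>                       # join key plus predictions
  remote_query()                             # dbplyr renders this as SQL
```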

Let me just run this really quickly and show you what it creates. You can see it's written out a ton of SQL, including all of the steps required to run my model. Then I'm going to take that SQL and use it to create a view in my Snowflake database.

So first I'm just going to name my view: the model name I set up before, plus the model version I got from vetiver. Then all I've got to do is use glue_sql to put "create or replace view", my view name, and "as" in front of that big long SQL statement. That's all it takes to create a view. Then I pipe that SQL into dbExecute, and it's in Snowflake now.
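Wrapping the rendered query in a view definition might be sketched like this; the naming convention is hypothetical, and `DBI::SQL()` keeps the rendered query from being escaped as a string literal.

```r
# Sketch: wrap the rendered query in CREATE OR REPLACE VIEW.
library(glue)

view_name <- glue("{model_name}_v{model_version}")  # hypothetical naming

dbExecute(con, glue_sql(
  "CREATE OR REPLACE VIEW {`view_name`} AS {DBI::SQL(view_query)}",
  .con = con
))
```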

Let's run that. And zero rows changed, which is what we expect from creating a view. Let's check out what that view looks like: using my connection to Snowflake, I'm going to access the view, get the first 500 rows, and collect them into my local R session.

And then I also want to create a "latest" view. This is going to be a view that lets people access whatever the latest version of my model is without having to worry about which version is active. If somebody queries the latest view, they'll always be on whatever version of the model is currently active; they won't be getting data from a previous model that might not be as good.

So what I'm going to do here is create that latest view, and its name is going to be the name of my model, underscore latest. Then I create or replace a view with that name that's just SELECT * FROM the versioned view. In SQL, that's basically just creating an alias: this latest view is an alias that will always resolve to the most current version of my deployed model. Let's deploy that. And zero rows changed.
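The alias view, plus a quick check of its contents, might be sketched like this, reusing the hypothetical naming from above.

```r
# Sketch of the "latest" alias view and a quick peek at its contents.
latest_view_name <- glue("{model_name}_latest")

dbExecute(con, glue_sql(
  "CREATE OR REPLACE VIEW {`latest_view_name`} AS SELECT * FROM {`view_name`}",
  .con = con
))

tbl(con, latest_view_name) |> head(500) |> collect()
```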

Now let's check out what that latest view looks like: again using my connection to Snowflake, I access the latest view, get the first 500 rows, and collect them into my local R session. And that's it. So now I have a join table that I can share with other people in my organization using Snowflake, whether they're using R, Python, Power BI, or SQL itself. Anyone can now access the outputs of my data science work, which is a super cool capability.

A cool example of this: if I go into the Snowflake UI, Snowflake has a SQL editor called Snowsight. What I've done here is write a quick query to access that model data I deployed earlier. I'm selecting just the ID, the term, the loan amount, and the predicted interest rate from the view I just deployed. And you can see I'm using it the way you'd expect a join table to be used: I'm joining my loan data, as A, to that view I created a second ago, as B. I'm also running a filter on it. This is something any BI analyst could run. If I hit run, it takes a second to warm up, and that's it: there are the predictions from my model, running in Snowflake as just a native object. Again, this can be accessed by anybody else in the organization, which I think is a super cool capability.

Wrapping up

Just to wrap up: what else can you do, and where can you go to get more information? First up, I'd recommend you check out our blog post. Isabella from our developer relations team and I just wrote a blog post that's a long-form version of what you just saw. I'd highly encourage you to check it out; it also covers a couple of other great features you can take advantage of in the Posit stack, like using Connect to do your model monitoring. We'll link it in the video description as well.

And stay tuned for more updates; we've just scratched the surface of what the orbital package can do. We've already had some customers have incredible success using orbital, and we couldn't be more excited about that. So please, get in the comments, get in the chat, and ask any questions you might have. I'm super excited to answer them, and I'll see you next time. Thanks, everyone.