Resources

Predicting Lending Rates with Databricks, tidymodels, and Posit Team

video
Jun 26, 2024
34:05

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

I'm Garrett, and I'm going to share an example with you today about how you could do analysis with R on data that lives in Databricks. So we're going to start with a dataset that lives on Databricks. I'm going to show you how I ran some tidymodels code on it without going too much into the weeds of tidymodels. And then I'm going to show you how I used the model that resulted in a Shiny app that also continued to talk to the data that lived back in Databricks so a user could gain information about that data that I wanted to put at their fingertips.

Now, everything we're going to do today is in this GitHub repo: github.com/posit-dev/databricks-finance-app. I'll save this URL myself; I'll use it in a few minutes.

And then if you're following along later, or would prefer to go into the modeling in more detail, or just read at your own pace, there is a guide in the readme of this repo that will take you to this page here, which maybe I'll actually open up in its own tab, and we'll take a look at later.

Because this will tell us what we're doing when we get through all the modeling. All I want us to know right now is what our end goal is, and that is we are trying to build a Shiny app that looks something like this app, where a user can enter five pieces of information, which will take them less than a minute, and then ask the app to predict what sort of rate they would get if they applied for a loan right now. The app gives them a rate based on a model that we're going to build together that uses data from lendingclub.com that is on Databricks. The data is all about loan applicants, their loans, the rates they got, and other information, so we're going to use that to predict rates. But since it is a prediction, I'm also going to show everyone who puts their information in the distribution of rates that people with similar information got, so there won't be any surprises about the ultimate rate should they decide to go ahead and do the full application for their loan.

Overview of the workflow

All right, so that's sort of how we're thinking about using data today, but now I want to look at the details. First of all, I want to show you that this is in fact a GitHub repo, and we're going to mostly focus on the script.R file until we get to the app. The place I'm going to run that code is Posit Workbench. That's Posit's professional server product: it lets you use a server version of the RStudio IDE, VS Code, and so on, with some collaboration features built in. When you log in, it looks like this. I'm going to start a new session, and I'll select the RStudio IDE. Since it's a demo, I think I'll splurge and use a large instance. I'll start that session. It'll take a few minutes to launch, but when it opens, you'll see the familiar RStudio IDE. When I run code there, though, it's going to be running on a Posit Workbench server, not on my desktop, and there are advantages to that.

But that's just one of two places we'll be running code today, so keep an eye out; I'll tell you later where else we're running code. Now, this is the four-pane RStudio IDE, and I've made one little change to it. Normally I use it with the console down here, the way it comes, but if you go to this icon here, you can move the console from the left to the right, or you can go to Pane Layout and make all sorts of changes. I think if I put the console, where I run code, over here on the right, I can put a script on the left, and that'll make it easier for us to go through this code together.

Now, the code I want to go through and run lives on GitHub in that repo, so the way I'm going to get it is to start an RStudio project. This is something you can do in any IDE. Go to Project and start a new project. My project is going to be based on a version control repo, so I'll click Version Control. It's a Git repo, and I have that URL I copied earlier, so I'll paste it in there. Then I choose a directory name, and I'll create that project.

RStudio is going to download, or clone, all the files in that repo, put them where I asked in my file browser, and start a new project, which makes it easy for me to come back to this project and get right into this flow of work after I've worked on several other things. So let's give that a minute to restart the session. And here we go: we have all the files that were in that GitHub repo, and I'm going to open script.R. As you can see, this is a heavily annotated script, so if I go too fast today, or you want to spend your own time on this, you should be able to do everything I'm doing between this script and the guide that shows the final product. You'll be in good hands.

Loading packages and connecting to Databricks

I have some packages that I want to load, plus a couple of others that I realize I didn't put on this list: DBI and dbplyr. The reason I want to make sure to load these packages is that they are the packages you would use in R to work with data that lives in a database, and when you work with data that lives in Databricks, you can take advantage of the same things. The workflow I'm going to describe today is the workflow we've designed to help our users work with data that lives elsewhere, wherever that may be. And as with R generally, we try to use the same syntax across different use cases. So let me load these packages with library().

The first thing I want to do, if I'm going to do any analysis, is connect to my data, and today that data lives on Databricks. I have an account on Databricks where that data is, so I'll sign into my account. Here's Databricks. I'll show you in a minute where that data is, but we need to tell RStudio that I have an account on Databricks, which account that is, and where my data is, so RStudio can help R find that data and we'll be off and running. The way we're ultimately going to connect to that data is with this first function here, dbConnect(). It's a function that comes from the DBI package, used together with the databricks() function from odbc, which is built to help R connect to Databricks. The databricks() function looks for environment variables that allow you to connect to your account on Databricks. You could pass those values as arguments to the function, but that's not a good idea if you're saving your code on GitHub like this: I wouldn't want to write my token or API key into a script and then put it up on GitHub where it could be scraped. Instead, I'm going to use environment variables. The databricks() function searches for specific environment variables that contain information related to Databricks, and if it finds them, it will use them to build that connection.
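Pieced together, that connection call looks something like this. This is a sketch: HTTP_PATH is the custom variable name chosen in this video, and httpPath is the argument name used by recent versions of odbc, so check the version you have installed.

```r
library(DBI)

# odbc::databricks() picks up the host and token from environment
# variables on its own; the warehouse path is passed explicitly.
con <- dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("HTTP_PATH")
)
```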

So that sounds kind of abstract, but let's make it real. If you want to make an environment variable with R, you can do that by creating a .Renviron file. That's a hidden file, so sometimes it's tricky to find and open, but there's a function in the usethis package called edit_r_environ(). If you run it, it will either open your .Renviron file or, if you don't have one already, make an empty one and open that for you. So I'm going to run that. And here I have my file. It's already pre-populated, but let's pretend it's not: I'm going to erase some of these entries and then rebuild them with you so you'll know what I'm up to.

There are four environment variables that this file defines: the Databricks host, the Databricks cluster ID, the Databricks token, and HTTP_PATH. The first three use the names that the databricks() function is going to look for; if I try to run that function without them defined, it will tell me it couldn't find this name and that name. And if you want to know what those names are, I also put them up here in the script so you'll be able to find them. The last variable, HTTP_PATH, is a name I'm making up myself; I'll ask R to go fetch it for the httpPath argument of the function. Its value is a bit sensitive (not as sensitive as the token), but I decided to keep it out of the script anyway.
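For reference, the finished .Renviron file looks roughly like this, with placeholder values standing in for the real ones. The exact spellings of the first three names are whatever the odbc documentation specifies; these are illustrative.

```
DATABRICKS_HOST=example-workspace.cloud.databricks.com
DATABRICKS_CLUSTER_ID=0123-456789-abcdefgh
DATABRICKS_TOKEN=dapi0123456789abcdef
HTTP_PATH=/sql/1.0/warehouses/0123456789abcdef
```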

All right. So now I need to define those variables. Let's talk about what they are; I'll give you a sense for why you might need them to connect to Databricks and where you find their values. The Databricks host, the first variable, is the server you use to get out to Databricks. If you go to your Databricks account, it will be not this whole URL here, but everything up to the .com. For me, that's rstudio-partner-posit.cloud.databricks.com, or what have you. So I'm just going to paste that in there, and now R will know what my host is. The cluster is my compute. If you use Databricks, you know that you have to arrange for compute to run the code you send into Databricks. I have a cluster set up for myself; that's where I run the code. It looks like it's running, because I have the option to terminate it; if it weren't running, I'd have the option to start it. It's important to have it running, because I'm going to use it in a second from R. If I want the name, or ID, of this cluster, it's kind of weird, but I get it from the URL: after "clusters", there is an ID, a random string of numbers and letters that runs from the slash to the question mark. Let's copy and paste that.

And I'll put that into the cluster variable. Now, you could pass this information to Databricks, but that alone wouldn't let you run compute on my cluster; you have to pay for the compute you run. I need some way of telling Databricks that I am me and that I'm authorized to run compute on this cluster, and I can do that with a token. I thought about how to show this and keep things secure, and then I realized it's actually going to work out: I can show you the whole process so you can follow along, and it will still be secure. The way you get your token is you go to Databricks, go to your profile, and go to Settings.

Then, on the second sidebar, go to Developer (making tokens is a developer sort of thing to do), and we want to manage our access tokens. I have a variety of tokens, since I've been doing a variety of things, and I want to generate a new token to use today for this exploration. So I'll click Generate token, and I'm going to make it last for one day. Since I'm pre-recording this video a day ahead, by the time you see this, this token won't work for you, and that's why I'm able to show it to you right now. But this is what the process will look like for you: you'll have to use your own server and generate your own token. So let's pick a name for this token; I'll call it "workflow demo". Then I'll generate it and copy it, and, as Databricks says, I won't be able to see it again, so make sure you copy your token while you have it. Now I'm going to save it to this token variable. So there I have the token that should authenticate me to the server. The last thing I need to supply is the HTTP path. When you use dbConnect() with Databricks, that is the path to where your data actually lives on Databricks; the cluster ID doesn't say where the data is, so you pass the path as a separate argument.

All right. The data I'm using is in a SQL warehouse called Posit Warehouse. If I navigate to it (SQL Warehouses, then Posit Warehouse) and go to Connection Details, it will tell me the HTTP path to this warehouse. So I'm going to copy that, and I'll paste it here in my .Renviron file. Think about it: this is the information R needs to follow the map to the data I actually want to use, plus all the passwords it needs to get through to the final location. I'm going to save the file, and then I'm going to use it. The way R uses the .Renviron file is that when you start a new R session, it loads this file and creates all the environment variables defined in it. That means if you just created this file right now, in the middle of an R session, these variables won't exist for you yet; you have to start a new session, which will load the file at the beginning and create those variables. So after you make this file, you'll need to restart R by going to Session > Restart R. That's what I'm going to do. So I'm restarting R, and in the process it's scooping up this .Renviron file and defining these variables globally in the background.

If I ever want to access one of those variables, I can use the Sys.getenv() function with the name of the variable. Like I said, the databricks() function accesses three of those by default if it finds them, and then for HTTP_PATH, I'm manually telling R to find that variable. All right. This is a new session, so those variables should exist, which means I should be able to run this code and connect to the data in Databricks. So I'm going to run that, and it says: could not find function "dbConnect". That's a little awkward. It's from the DBI package; maybe I didn't load that.

Let's give it another try. Okay. Oh, yes: I restarted my R session, so I should reload my packages. So it ran that time, and we can tell it ran because it finished with no error, but also because this Connections pane opened up. This shows all of the data in that SQL warehouse, and I can drill down into it. In fact, down here, this lending_club table is the table we're going to use today, and the pane tells me all the variables in the table, their types, and so on. So it's a good way to explore the data on Databricks from R. Now, all that data is still on Databricks. I haven't imported any of it into R yet, but I can explore what's on Databricks through this pane.

Working with data using dbplyr

All right. Now let's work with that lending_club table. You'll notice it's in the hive_metastore catalog, in the default schema (I apologize if I'm not getting my Databricks terminology exactly right), and the table itself is called lending_club. I'm going to need those pieces of information to access the table. Now that I've built the connection object, I can use it to refer to specific tables in the database I'm connecting to. What this line of code tells R is: build a tbl using this connection, and find, within that connection, the table in the hive_metastore catalog, in the default schema, named lending_club. I'm going to save that as lending_club_dat, and now I can use lending_club_dat as if it were that table.
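In code, that reference looks something like this, assuming the catalog, schema, and table names shown in the Connections pane:

```r
library(dplyr)
library(dbplyr)

# A lazy reference to the remote table; no rows are imported yet.
# in_catalog() spells out Databricks' catalog.schema.table namespace.
lending_club_dat <- tbl(
  con,
  in_catalog("hive_metastore", "default", "lending_club")
)
```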

If I look at it, I'll see the top of that table in a tbl printout; essentially, I can see some of the values in there. If I wanted to do something to it, like taking every column that starts with the word "annual" and making it numeric, I'd do that with the mutate() function from dplyr. Then maybe after that I just want to select the interest rate and the columns that start with "annual", and then look at the first six rows, or what have you. That's what this R code would do, and I can run it. Oh, yes, it's going to tell me I still haven't loaded all of my packages, so let's get that out of the way. Now I'm running code on that table, and I get a result back. It's taken that large table, found the two columns that begin with "annual", made them numeric (there are a lot of missing values in one of them), and given me back just the three columns I asked for.

But the interesting thing is that I have yet to import the data I'm working on into R. lending_club_dat is not a data frame full of data; it's actually a SQL query that refers to a table of data, and I can see that by running show_query() on lending_club_dat. This is the SQL that returns the full table. When I run code on lending_club_dat, an R package in the background called dbplyr, which I have loaded, takes that R code, translates it to SQL, and appends it to the SQL stored in lending_club_dat. I can see the result of that by adding show_query() to the end of the pipeline: what I just did was basically this SQL. Now, I'm an R user, not a SQL user, so it's handy for me to be able to generate SQL like this without writing it. But there's also another reason we're doing this. Representing a table (an imaginary table, we might call it) with the SQL you would use to build it is called a view, and lending_club_dat is sort of like a view for us; we're using it that way. We can treat it like a real table, but to get its result, we have to run the SQL that makes the table and then apply all our extra code to what comes back.
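The pipeline just described, with show_query() appended to peek at the generated SQL, is roughly the following; int_rate is my guess at the interest-rate column's name.

```r
lending_club_dat |>
  mutate(across(starts_with("annual"), as.numeric)) |>
  select(int_rate, starts_with("annual")) |>
  head(6) |>
  show_query()  # prints the SQL; remove this line to see the rows instead
```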

This is how we work with data, in a database or in Databricks, that we don't want to bring into R: we use these view-like objects, we write R code against them, and we let the dbplyr package translate that code to SQL in the background, where we don't really have to worry about it. When we execute that code, R sends the SQL through the connection to the data warehouse, which runs it there. So this code we're running, and this is the important part, is not being run in R; it's being sent to Databricks and run in Databricks. And Databricks, depending on the resources you put there, can run code a lot faster than R. That's the first advantage of this odbc-plus-dbplyr setup: the code runs in Databricks, so it runs quite fast. The second advantage is that you can keep your data in Databricks. Databricks is made to store very, very large amounts of data; R isn't. Pulling all that data into R would be painful, so being able to keep it in Databricks, run the code there, and pull in just the results, the tidbits we want, is really efficient. We can use bigger data than we normally could in R, and our code runs faster than it normally would in R. And dbplyr will translate almost all of the tidyverse transformations, and much of base R and its math functions, into SQL for you.

But there are some R functions dbplyr won't know how to translate. You could write an R function today that no one's ever heard of and use it, but dbplyr won't be able to translate it. And there are some more sophisticated functions, like the modeling we'll do with tidymodels, that work in R but can't easily be translated to SQL. So every once in a while when we're working like this, we'll want to pull part of the data, or the results we've created, into R as a traditional data frame and use it as if it were an R data frame. There's one very important function that lets us do that: the collect() function.

So we can write as much code as we want against one of these views, and when we want to pull the results into R, we append collect() to the pipeline. R will send the code over, the code will be executed, and then R will pull the result back and either print it on the screen or save it to an object if you arrange for that, which I'm going to do. So when I run this, the code is sent over, it's run in Databricks, the result comes back, and it's saved in lending_club_subset.

Now, if I look at lending_club_subset, we can see the results of that code. It's subtle, but this is a real R table, as we can see from the metadata up here: it's a tibble, six by three. Whereas when we looked before, that was not a table; it said we were looking at a Spark SQL table that came from this database. If we don't collect and we ask to look at the result, R brings back about the first 10 lines of the table that stayed in Databricks; if our result is only six rows long, we get to see the whole thing, but it's still not in R. If we want to bring it into R, we use the collect() function, and it will bring every row, not just the first 10. So, best practice here, if you're planning to use data in Databricks this way, is to do as much computation as you can before you run collect(). Normally that means we bring smaller amounts of data back with collect(), and everything runs faster because we're using smaller data inside of R. Not only that, everything you write before collect() should run faster because it's running in Databricks. collect() is sort of the pivot point.
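As a sketch of that pattern, here is a summary computed entirely in Databricks, with only the small result collected into R; grade and int_rate are assumed column names, not necessarily the repo's.

```r
# Group and summarise remotely, then pull back just the tiny summary table
rate_by_grade <- lending_club_dat |>
  group_by(grade) |>
  summarise(mean_rate = mean(as.numeric(int_rate), na.rm = TRUE)) |>
  collect()  # only the per-grade summary crosses the wire into R
```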

Cleaning data and fitting the model

All right, so that is how we're going to use data in Databricks. We've built this connection, I've used the dplyr tbl() function to refer to my dataset in Databricks, and everything downstream from this object just builds SQL code that will get run on Databricks. At some point I'll decide I've run everything I want to run on Databricks, collect what's left over into R, and start modeling with R's tools in R. So let's begin. The first thing I want to do is clean the data. This dataset has a lot of columns I just can't use for prediction; they describe things that happen in the life of a loan after you apply, like whether you defaulted on the loan. But we're trying to predict, from the information available when you apply for a loan, what the bank will say the interest rate is. So first I remove all the columns I want to ignore, to build a training dataset. Then I change the type of a lot of columns to numeric, since they're basically all stored as strings in the database. I create some new variables that I think might be useful predictors. And then at the very end, I take what's left over and collect it into a table in R using the collect() function.
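A condensed sketch of that cleaning pipeline follows; the column names here are hypothetical stand-ins for illustration, not the ones in the repo.

```r
# Hypothetical column lists for illustration
ignore_cols  <- c("loan_status", "total_pymnt", "recoveries")
numeric_cols <- c("annual_inc", "dti", "revol_util")

lending_club_dat <- tbl(con, "lending_club") |>
  select(-any_of(ignore_cols)) |>                       # drop post-application info
  mutate(across(all_of(numeric_cols), as.numeric)) |>   # strings -> numbers
  mutate(income_to_debt = annual_inc / (dti + 1)) |>    # an example derived predictor
  collect()                                             # bring the result into R
```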

So let's run this code. I'll call the result lending_club_dat; that will be the table we use going forward. There's quite a bit more code running here, so it goes a little slower, but it didn't take too long. And now I have a huge table, and this is just a plain table in R, 130 rows long. This is the data I'm going to use for my training and test datasets.

But there are some things I'd like to do to this data that are easier to do in R than in Databricks. So I'm going to start with this table, and now that it all exists in R, I'll run some functions: I'll recode some variables, impute some means, and replace missing values with what I think are reasonable guesses. Then I'll make some lists of column names that I can use later. At this point you've cleaned your data, and you might want to double-check that things make sense. First I can check whether I've handled all the missing values by seeing how many missing values are in each column. It's kind of hard to go through all of these, so instead I'll filter down to just the columns that have more than zero missing values. And I see that there are actually three columns with missing values in them, for example bc_open_to_buy and bc_util down here. Those are columns where, when I fit my model, I'll have to impute some value for the missing values. That's good to know. I also want to make sure there's no factor with fewer than two levels.

All right, now it's time to model. We're not going to spend a lot of time on the modeling, but suffice it to say: you build training and test sets; you come up with a recipe of preprocessing steps to apply with your model; you specify what the model is. In this example, we're going to use a LASSO model to do some variable selection, and we're going to tune its penalty parameter with cross-validation; this code will do that. Finally, we put together our final model with the best penalty parameter cross-validation yielded, and we fit it. If you want to know how to do this and it interests you, go back to that article and read through all the steps to hear why we do each one, see the code, and follow along yourself. But explaining tidymodels would really take us far afield from explaining Databricks, so I'm not going to go into it. I'm going to jump forward to the point where we have a model that we want to use in our Shiny app. So let me just run all of this code: fit a model, make some plots, and whatnot.
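For the curious, here is a hedged outline of what such tidymodels code can look like; the repo's actual recipe has more steps, and the column names and grid sizes here are guesses.

```r
library(tidymodels)

set.seed(123)
split <- initial_split(lending_club_dat, prop = 0.8)
folds <- vfold_cv(training(split), v = 10)

rec <- recipe(int_rate ~ ., data = training(split)) |>
  step_impute_mean(all_numeric_predictors()) |>   # fill the NAs we found earlier
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# mixture = 1 makes glmnet a pure LASSO; the penalty is tuned by cross-validation
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf  <- workflow() |> add_recipe(rec) |> add_model(lasso_spec)
res <- tune_grid(wf, resamples = folds, grid = grid_regular(penalty(), levels = 50))

final_fit <- wf |>
  finalize_workflow(select_best(res, metric = "rmse")) |>
  fit(training(split))
```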

This is the cross-validation step; it's fitting, I think, something like 500 models, so it might take a couple of seconds. All right. Now let's take the final fitted model and save it to an RDS file that I can package along with my Shiny app. I think my app in this case will be fastest if I just include the model with the app, but I have written code here that explains how you could deploy the model in other ways, such as creating a web API with the vetiver package and Plumber, or pinning it with the pins package, and so on. That won't be necessary today. Today we'll just ship the model with the app.
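Saving the model next to the app is one line; the alternatives mentioned look roughly like this (object and model names are hypothetical):

```r
# Ship the fitted model alongside the app
saveRDS(final_fit, "app/model.rds")

# Or: wrap it with vetiver and either pin it or serve it as an API
library(vetiver)
v <- vetiver_model(final_fit, "lending-rate-model")
# pins::board_connect() |> vetiver_pin_write(v)          # version it on Connect
# plumber::pr() |> vetiver_api(v) |> plumber::pr_run()   # serve it as a web API
```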

Building and publishing the Shiny app

The app that I built is in the GitHub repo under the app directory, and it contains the app.R file along with the model RDS file I created and some helper files. You can see that at the start of the app, I source the three helper files. Two of them just define functions and UI elements I need to build the app, but the first one is where all the action happens in terms of authenticating to Databricks, because this app will need to use data in addition to the model. If we look at that authenticate file, we see that it uses exactly the same code we used in the script to connect to Databricks, and just like in the script, it relies on those environment variables, which I already have loaded in my environment.
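A skeleton of how such an app.R might be laid out; the file names, inputs, and prediction call are illustrative, not the repo's actual code.

```r
library(shiny)

source("authenticate.R")   # builds `con` with dbConnect(odbc::databricks(), ...)
source("helpers.R")        # helper functions
source("ui-elements.R")    # input widgets

model <- readRDS("model.rds")

ui <- fluidPage(
  numericInput("annual_inc", "Annual income", value = 60000),
  actionButton("go", "Predict my rate"),
  textOutput("rate")
)

server <- function(input, output, session) {
  pred <- eventReactive(input$go, {
    # A real app would gather all five inputs the model expects
    predict(model, tibble::tibble(annual_inc = input$annual_inc))$.pred
  })
  output$rate <- renderText(paste0("Estimated rate: ", round(pred(), 2), "%"))
}

shinyApp(ui, server)
```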

So if I've done everything right, this app should run. I might need to load a few packages. Here we go; here's the app. I'm running it locally, it's connecting to Databricks, and if I ask it to predict the rate, it works. That's fine, but I'm running it locally with my own local environment variables. Ultimately, we want to take this app and publish it somewhere else, where other people can use it and it can connect to Databricks for them without me being there to supply my environment variables. Now, I'm going to show you one way to do that.

The first thing we're going to do is publish the app, so I click the publish button up here; it looks like one little blue hand passing a ball to another little blue hand. I have some accounts here, but let's add a new account just to see what that looks like. I'm going to add an account from Posit Connect. This one will be called Pub.DemoPositTeam; it's an account I have access to, but it hasn't been connected to this Workbench. So I'll click Next. It says: let's pass along the credentials you used to sign into Workbench to this Posit Connect account and see if you have access there. And I do; it recognizes me based on those credentials and asks whether I want to connect from this Workbench instance. Yes, I do, so I click Connect. It says successful. So it's really easy to connect at this point.

Now I can publish to this Pub.DemoPositTeam account. I should pick a name for the app that's a little more descriptive than "app", and then I'll publish it. Now it's publishing the app over here. What it publishes is just the scripts and files needed to make this app; once the app is loaded, I'll go and provide the environment variables the app needs to connect to Databricks. So here's the app. It's trying to load, and it's not going to succeed, because it will hit an error when it tries to connect to Databricks; I need to supply those remaining environment variables. The way I can do that with Posit Connect, if I publish there, is to go to the Vars tab for my app, where I can add environment variables. These are variables that no one using my app will have access to; they won't be able to see them, but Connect will use them to serve my app and connect to Databricks. The variables I want to add are the same ones I had over here. So it's a very simple process: I take the name of the variable and put it in the name slot, fetch the value of the variable, without quotation marks, and put it in the value slot, and then I add the variable. When I'm all done, I can click Save.

Now, I happen to know that this isn't going to work exactly. I'll add all these, but sometimes I just have to wave my hands here. Adding these variables isn't going to be enough to save the day for this app, because, unfortunately, this instance of Connect uses a different version of R than the one I built the app with. That doesn't need to be a problem, but for one of the packages, unfortunately, it is. That's okay, though, because I have a workaround we can use. I'm copying and pasting, so I'm reasonably certain there will be no typos.

All right. Now I save these and click OK, so they are part of the app, and the app is going to update. I suspect there's going to be a version problem. Yes, there's some sort of problem. I can look at the logs and see what's happening, and it says: aha, this R version is 4.3, but the MASS package requires a version greater than 4.4.0. That's okay; I have another version of this app running that I'm going to show you. Let's narrow this down to the things made by me (I clicked into something by accident there). There's the app we just made, but the one I want to show you, the one that's running, is this one. If we look at the Vars tab of its controls, you can see I created the three variables it needed to make this work. And it does indeed work.

If you want to take the time, you could add a UI element to your app, maybe a password widget, that asks users to enter their own credentials. Once you have those, you can use the user's credentials to connect to Databricks, or a database, or what have you. And if you want to take even more time, you can create a reactive UI such that what the user sees in the app is keyed to the permissions their credentials carry. A manager might see quite a bit in the app, because they can fetch all of that with their credentials, whereas an employee might not see everything, because they don't have access to all of it. All the basics are there; it's just for you to elaborate on them. The easiest way, though, is what I did: provide a global set of credentials that everyone will use in the background. Unfortunately, that means everyone who visits this app will be using my credentials, so it might not stay up for too long. But now you know how to do it. And with that, I'll thank you for your attention. Bye.
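As a starting point, a per-user credential sketch might look like the following. The driver name and the uid/pwd convention are assumptions about the Databricks ODBC driver's configuration, so check its documentation before relying on them.

```r
library(shiny)
library(DBI)

ui <- fluidPage(
  passwordInput("pat", "Your Databricks personal access token"),
  actionButton("connect", "Connect")
)

server <- function(input, output, session) {
  con <- eventReactive(input$connect, {
    # The Databricks ODBC driver commonly authenticates a personal access
    # token as uid = "token", pwd = <the token itself> (an assumption here)
    dbConnect(
      odbc::odbc(),
      driver   = "Databricks ODBC Driver",  # assumed driver name
      host     = Sys.getenv("DATABRICKS_HOST"),
      httpPath = Sys.getenv("HTTP_PATH"),
      uid      = "token",
      pwd      = input$pat
    )
  })
  # Downstream queries would call con() instead of using a global connection
}

shinyApp(ui, server)
```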
