Resources

Predicting Lending Rates with Databricks, tidymodels, and Posit Team

video
Jun 26, 2024
34:05

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

I'm Garrett, and I'm going to share an example with you today about how you could do analysis with R on data that lives in Databricks. So we're going to start with a dataset that lives on Databricks. I'm going to show you how I ran some tidymodels code on it without going too much into the weeds of tidymodels. And then I'm going to show you how I used the model that resulted in a Shiny app that also continued to talk to the data that lived back in Databricks so a user could gain information about that data that I wanted to put at their fingertips.

Now, everything we're going to do today is in this GitHub repo: github.com/posit-dev/databricks-finance-app. I'll save this URL myself; I'll use it in a few minutes.

And then if you're following along later, or would prefer to go into the modeling in more detail, or just read at your own pace, there is a guide in the readme of this repo that will take you to this page here, which maybe I'll actually open up in its own tab, and we'll take a look at later.

Because this will tell us what we're doing when we get through all the modeling. All I want us to know right now is what our end goal is, and that is we are trying to build a Shiny app that looks something like this app, where a user can enter five pieces of information, which will take them less than a minute, and then ask the app to predict what sort of rate they would get if they applied for a loan right now. The app gives them a rate based on a model that we're going to build together that uses data from lendingclub.com that is on Databricks. The data is all about loan applicants, their loans, the rates they got, and other information, so we're going to use that to predict rates. But since it is a prediction, I'm also going to show everyone who puts their information in the distribution of rates that people with similar information got, so there won't be any surprises about the ultimate rate should they decide to go ahead and do the full application for their loan.

Overview of the workflow

All right, so that's sort of how we're thinking about using data today, but now I want to look at the details. First of all, I want to show you that this is in fact a GitHub repo, and we're going to mostly focus on the script.R file until we get to the app. The place I'm going to run that code is Posit Workbench. That's Posit's professional server product: it lets you use a server version of the RStudio IDE, VS Code, and so on, with some collaboration features built in. When you log in, it looks like this. I'm going to start a new session, and I'll select the RStudio IDE. Since it's a demo, I think I'll splurge and use a large instance. I'll start that session. It'll take a few minutes to launch, but when it opens, you'll see the familiar RStudio IDE. When I run code there, though, it's going to be running on a Posit Workbench server, not on my desktop, and there are advantages to that.

But that's just one of two places we'll be running code today, so keep an eye out; I'll tell you later where else we're running code. Now, this is the four-pane RStudio IDE, and I've made one little change to it. Normally I use it with the console down here, the way it comes, but if you go to this icon here, you can move the console from the left to the right, or you can go to Pane Layout and make all sorts of changes. I think if I put the console, where I run code, over here on the right, I can put a script on the left, and that'll make it easier for us to go through this code together.

Now, the code I want to go through and run lives on GitHub in that repo, so the way I'm going to get it is to start an RStudio project. This is something you can do in any IDE. Go to Project and start a new project. My project is going to be based on a version control repo, so I'll click Version Control. It's a Git repo, and I have that URL I copied earlier, so I'll paste it in there. Then I choose a directory name, and I'll create that project.

RStudio is going to download, or clone, all the files in that repo, put them where I asked in my file browser, and start a new project, which makes it easy for me to come back to this project and get right into this flow of work after I've worked on several other things. So let's give that a minute to restart the session. And here we go: we have all the files that were in that GitHub repo, and I'm going to open script.R. As you can see, this is a heavily annotated script, so if I go too fast today, or you want to spend your own time on this, you should be able to do everything I'm doing between this script and the guide that shows the final product. You'll be in good hands.

Loading packages and connecting to Databricks

I have some packages that I want to load, plus a couple of others that I realize I didn't put on this list: DBI and dbplyr. The reason I want to make sure to load these packages is that they are the packages you would use in R to work with data that lives in a database, and when you work with data that lives in Databricks, you can take advantage of the same things. The workflow I'm going to describe today is the workflow we've designed to help our users work with data that lives elsewhere, wherever that may be. And as with R generally, we try to use the same syntax across different use cases. So let me load these packages with library().

The first thing I want to do, if I'm going to do any analysis, is connect to my data, and today that data lives on Databricks. I have an account on Databricks where that data is, so I'll sign into my account. Here's Databricks. I'll show you in a minute where that data is, but we need to tell RStudio that I have an account on Databricks, which account that is, and where my data is, so RStudio can help R find that data and we'll be off and running. The way we're ultimately going to connect to that data is with this first function here, dbConnect(). It's a function that comes from the DBI package, used together with the databricks() function from odbc, which is built to help R connect to Databricks. The databricks() function looks for environment variables that allow you to connect to your account on Databricks. You could pass those values as arguments to the function, but that's not a good idea if you're saving your code on GitHub like this: I wouldn't want to write my token or API key into a script and then put it up on GitHub where it could be scraped. Instead, I'm going to use environment variables. The databricks() function searches for specific environment variables that contain information related to Databricks, and if it finds them, it will use them to build that connection.
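Pieced together, that connection call looks something like this. This is a sketch: HTTP_PATH is the custom variable name chosen in this video, and httpPath is the argument name used by recent versions of odbc, so check the version you have installed.

```r
library(DBI)

# odbc::databricks() picks up the host and token from environment
# variables on its own; the warehouse path is passed explicitly.
con <- dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("HTTP_PATH")
)
```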

So that sounds kind of abstract, but let's make it real. If you want to make an environment variable with R, you can do that by creating a .Renviron file. That's a hidden file, so sometimes it's tricky to find and open, but there's a function in the usethis package called edit_r_environ(). If you run it, it will either open your .Renviron file or, if you don't have one already, make an empty one and open that for you. So I'm going to run that. And here I have my file. It's already pre-populated, but let's pretend it's not: I'm going to erase some of these entries and then rebuild them with you so you'll know what I'm up to.

There are four environment variables that this file defines: the Databricks host, the Databricks cluster ID, the Databricks token, and HTTP_PATH. The first three use the names that the databricks() function is going to look for; if I try to run that function without them defined, it will tell me it couldn't find this name and that name. And if you want to know what those names are, I also put them up here in the script so you'll be able to find them. The last variable, HTTP_PATH, is a name I'm making up myself; I'll ask R to go fetch it for the httpPath argument of the function. Its value is a bit sensitive (not as sensitive as the token), but I decided to keep it out of the script anyway.
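For reference, the finished .Renviron file looks roughly like this, with placeholder values standing in for the real ones. The exact spellings of the first three names are whatever the odbc documentation specifies; these are illustrative.

```
DATABRICKS_HOST=example-workspace.cloud.databricks.com
DATABRICKS_CLUSTER_ID=0123-456789-abcdefgh
DATABRICKS_TOKEN=dapi0123456789abcdef
HTTP_PATH=/sql/1.0/warehouses/0123456789abcdef
```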

All right. So now I need to define those variables. Let's talk about what they are; I'll give you a sense for why you might need them to connect to Databricks and where you find their values. The Databricks host, the first variable, is the server you use to get out to Databricks. If you go to your Databricks account, it will be not this whole URL here, but everything up to the .com. For me, that's rstudio-partner-posit.cloud.databricks.com, or what have you. So I'm just going to paste that in there, and now R will know what my host is. The cluster is my compute. If you use Databricks, you know that you have to arrange for compute to run the code you send into Databricks. I have a cluster set up for myself; that's where I run the code. It looks like it's running, because I have the option to terminate it; if it weren't running, I'd have the option to start it. It's important to have it running, because I'm going to use it in a second from R. If I want the name, or ID, of this cluster, it's kind of weird, but I get it from the URL: after "clusters", there is an ID, a random string of numbers and letters that runs from the slash to the question mark. Let's copy and paste that.

And I'll put that into the cluster variable. Now, you could pass this information to Databricks, but that alone wouldn't let you run compute on my cluster; you have to pay for the compute you run. I need some way of telling Databricks that I am me and that I'm authorized to run compute on this cluster, and I can do that with a token. I thought about how to show this and keep things secure, and then I realized it's actually going to work out: I can show you the whole process so you can follow along, and it will still be secure. The way you get your token is you go to Databricks, go to your profile, and go to Settings.

Then, on the second sidebar, go to Developer (making tokens is a developer sort of thing to do), and we want to manage our access tokens. I have a variety of tokens, since I've been doing a variety of things, and I want to generate a new token to use today for this exploration. So I'll click Generate token, and I'm going to make it last for one day. Since I'm pre-recording this video a day ahead, by the time you see this, this token won't work for you, and that's why I'm able to show it to you right now. But this is what the process will look like for you: you'll have to use your own server and generate your own token. So let's pick a name for this token; I'll call it "workflow demo". Then I'll generate it and copy it, and, as Databricks says, I won't be able to see it again, so make sure you copy your token while you have it. Now I'm going to save it to this token variable. So there I have the token that should authenticate me to the server. The last thing I need to supply is the HTTP path. When you use dbConnect() with Databricks, that is the path to where your data actually lives on Databricks; the cluster ID doesn't say where the data is, so you pass the path as a separate argument.

All right. The data I'm using is in a SQL warehouse called Posit Warehouse. If I navigate to it (SQL Warehouses, then Posit Warehouse) and go to Connection Details, it will tell me the HTTP path to this warehouse. So I'm going to copy that, and I'll paste it here in my .Renviron file. Think about it: this is the information R needs to follow the map to the data I actually want to use, plus all the passwords it needs to get through to the final location. I'm going to save the file, and then I'm going to use it. The way R uses the .Renviron file is that when you start a new R session, it loads this file and creates all the environment variables defined in it. That means if you just created this file right now, in the middle of an R session, these variables won't exist for you yet; you have to start a new session, which will load the file at the beginning and create those variables. So after you make this file, you'll need to restart R by going to Session > Restart R. That's what I'm going to do. So I'm restarting R, and in the process it's scooping up this .Renviron file and defining these variables globally in the background.

If I ever want to access one of those variables, I can use the Sys.getenv() function with the name of the variable. Like I said, the databricks() function accesses three of those by default if it finds them, and then for HTTP_PATH, I'm manually telling R to find that variable. All right. This is a new session, so those variables should exist, which means I should be able to run this code and connect to the data in Databricks. So I'm going to run that, and it says: could not find function "dbConnect". That's a little awkward. It's from the DBI package; maybe I didn't load that.

Let's give it another try. Okay. Oh, yes: I restarted my R session, so I should reload my packages. So it ran that time, and we can tell it ran because it finished with no error, but also because this Connections pane opened up. This shows all of the data in that SQL warehouse, and I can drill down into it. In fact, down here, this lending_club table is the table we're going to use today, and the pane tells me all the variables in the table, their types, and so on. So it's a good way to explore the data on Databricks from R. Now, all that data is still on Databricks. I haven't imported any of it into R yet, but I can explore what's on Databricks through this pane.

Working with data using dbplyr

All right. Now let's work with that lending_club table. You'll notice it's in the hive_metastore catalog, in the default schema (I apologize if I'm not getting my Databricks terminology exactly right), and the table itself is called lending_club. I'm going to need those pieces of information to access the table. Now that I've built the connection object, I can use it to refer to specific tables in the database I'm connecting to. What this line of code tells R is: build a tbl using this connection, and find, within that connection, the table in the hive_metastore catalog, in the default schema, named lending_club. I'm going to save that as lending_club_dat, and now I can use lending_club_dat as if it were that table.
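In code, that reference looks something like this, assuming the catalog, schema, and table names shown in the Connections pane:

```r
library(dplyr)
library(dbplyr)

# A lazy reference to the remote table; no rows are imported yet.
# in_catalog() spells out Databricks' catalog.schema.table namespace.
lending_club_dat <- tbl(
  con,
  in_catalog("hive_metastore", "default", "lending_club")
)
```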

If I look at it, I'll see the top of that table in a tbl printout; essentially, I can see some of the values in there. If I wanted to do something to it, like taking every column that starts with the word "annual" and making it numeric, I'd do that with the mutate() function from dplyr. Then maybe after that I just want to select the interest rate and the columns that start with "annual", and then look at the first six rows, or what have you. That's what this R code would do, and I can run it. Oh, yes, it's going to tell me I still haven't loaded all of my packages, so let's get that out of the way. Now I'm running code on that table, and I get a result back. It's taken that large table, found the two columns that begin with "annual", made them numeric (there are a lot of missing values in one of them), and given me back just the three columns I asked for.

But the interesting thing is that I have yet to import the data I'm working on into R. lending_club_dat is not a data frame full of data; it's actually a SQL query that refers to a table of data, and I can see that by running show_query() on lending_club_dat. This is the SQL that returns the full table. When I run code on lending_club_dat, an R package in the background called dbplyr, which I have loaded, takes that R code, translates it to SQL, and appends it to the SQL stored in lending_club_dat. I can see the result of that by adding show_query() to the end of the pipeline: what I just did was basically this SQL. Now, I'm an R user, not a SQL user, so it's handy for me to be able to generate SQL like this without writing it. But there's also another reason we're doing this. Representing a table (an imaginary table, we might call it) with the SQL you would use to build it is called a view, and lending_club_dat is sort of like a view for us; we're using it that way. We can treat it like a real table, but to get its result, we have to run the SQL that makes the table and then apply all our extra code to what comes back.
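The pipeline just described, with show_query() appended to peek at the generated SQL, is roughly the following; int_rate is my guess at the interest-rate column's name.

```r
lending_club_dat |>
  mutate(across(starts_with("annual"), as.numeric)) |>
  select(int_rate, starts_with("annual")) |>
  head(6) |>
  show_query()  # prints the SQL; remove this line to see the rows instead
```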

This is how we work with data, in a database or in Databricks, that we don't want to bring into R: we use these view-like objects, we write R code against them, and we let the dbplyr package translate that code to SQL in the background, where we don't really have to worry about it. When we execute that code, R sends the SQL through the connection to the data warehouse, which runs it there. So this code we're running, and this is the important part, is not being run in R; it's being sent to Databricks and run in Databricks. And Databricks, depending on the resources you put there, can run code a lot faster than R. That's the first advantage of this odbc-plus-dbplyr setup: the code runs in Databricks, so it runs quite fast. The second advantage is that you can keep your data in Databricks. Databricks is made to store very, very large amounts of data; R isn't. Pulling all that data into R would be painful, so being able to keep it in Databricks, run the code there, and pull in just the results, the tidbits we want, is really efficient. We can use bigger data than we normally could in R, and our code runs faster than it normally would in R. And dbplyr will translate almost all of the tidyverse transformations, and much of base R and its math functions, into SQL for you.

But there are some R functions dbplyr won't know how to translate. You could write an R function today that no one's ever heard of and use it, but dbplyr won't be able to translate it. And there are some more sophisticated functions, like the modeling we'll do with tidymodels, that work in R but can't easily be translated to SQL. So every once in a while when we're working like this, we'll want to pull part of the data, or the results we've created, into R as a traditional data frame and use it as if it were an R data frame. There's one very important function that lets us do that: the collect() function.

So we can write as much code as we want against one of these views, and when we want to pull the results into R, we append collect() to the pipeline. R will send the code over, the code will be executed, and then R will pull the result back and either print it on the screen or save it to an object if you arrange for that, which I'm going to do. So when I run this, the code is sent over, it's run in Databricks, the result comes back, and it's saved in lending_club_subset.

Now, if I look at lending_club_subset, we can see the results of that code. It's subtle, but this is a real R table, as we can see from the metadata up here: it's a tibble, six by three. Whereas when we looked before, that was not a table; it said we were looking at a Spark SQL table that came from this database. If we don't collect and we ask to look at the result, R brings back about the first 10 lines of the table that stayed in Databricks; if our result is only six rows long, we get to see the whole thing, but it's still not in R. If we want to bring it into R, we use the collect() function, and it will bring every row, not just the first 10. So, best practice here, if you're planning to use data in Databricks this way, is to do as much computation as you can before you run collect(). Normally that means we bring smaller amounts of data back with collect(), and everything runs faster because we're using smaller data inside of R. Not only that, everything you write before collect() should run faster because it's running in Databricks. collect() is sort of the pivot point.
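As a sketch of that pattern, here is a summary computed entirely in Databricks, with only the small result collected into R; grade and int_rate are assumed column names, not necessarily the repo's.

```r
# Group and summarise remotely, then pull back just the tiny summary table
rate_by_grade <- lending_club_dat |>
  group_by(grade) |>
  summarise(mean_rate = mean(as.numeric(int_rate), na.rm = TRUE)) |>
  collect()  # only the per-grade summary crosses the wire into R
```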

Cleaning data and fitting the model

All right, so that is how we're going to use data in Databricks. We've built this connection, I've used the dplyr tbl() function to refer to my dataset in Databricks, and everything downstream from this object just builds SQL code that will get run on Databricks. At some point I'll decide I've run everything I want to run on Databricks, collect what's left over into R, and start modeling with R's tools in R. So let's begin. The first thing I want to do is clean the data. This dataset has a lot of columns I just can't use for prediction; they describe things that happen in the life of a loan after you apply, like whether you defaulted on the loan. But we're trying to predict, from the information available when you apply for a loan, what the bank will say the interest rate is. So first I remove all the columns I want to ignore, to build a training dataset. Then I change the type of a lot of columns to numeric, since they're basically all stored as strings in the database. I create some new variables that I think might be useful predictors. And then at the very end, I take what's left over and collect it into a table in R using the collect() function.
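A condensed sketch of that cleaning pipeline follows; the column names here are hypothetical stand-ins for illustration, not the ones in the repo.

```r
# Hypothetical column lists for illustration
ignore_cols  <- c("loan_status", "total_pymnt", "recoveries")
numeric_cols <- c("annual_inc", "dti", "revol_util")

lending_club_dat <- tbl(con, "lending_club") |>
  select(-any_of(ignore_cols)) |>                       # drop post-application info
  mutate(across(all_of(numeric_cols), as.numeric)) |>   # strings -> numbers
  mutate(income_to_debt = annual_inc / (dti + 1)) |>    # an example derived predictor
  collect()                                             # bring the result into R
```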

So let's run this code. I'll call the result lending_club_dat; that will be the table we use going forward. There's quite a bit more code running here, so it goes a little slower, but it didn't take too long. And now I have a huge table, and this is just a plain table in R, 130 rows long. This is the data I'm going to use for my training and test datasets.

But there are some things I'd like to do to this data that are easier to do in R than in Databricks. So I'm going to start with this table, and now that it all exists in R, I'll run some functions: I'll recode some variables, impute some means, and replace missing values with what I think are reasonable guesses. Then I'll make some lists of column names that I can use later. At this point you've cleaned your data, and you might want to double-check that things make sense. First I can check whether I've handled all the missing values by seeing how many missing values are in each column. It's kind of hard to go through all of these, so instead I'll filter down to just the columns that have more than zero missing values. And I see that there are actually three columns with missing values in them, for example bc_open_to_buy and bc_util down here. Those are columns where, when I fit my model, I'll have to impute some value for the missing values. That's good to know. I also want to make sure there's no factor with fewer than two levels.

All right, now it's time to model. We're not going to spend a lot of time on the modeling, but suffice it to say: you build training and test sets; you come up with a recipe of preprocessing steps to apply with your model; you specify what the model is. In this example, we're going to use a LASSO model to do some variable selection, and we're going to tune its penalty parameter with cross-validation; this code will do that. Finally, we put together our final model with the best penalty parameter cross-validation yielded, and we fit it. If you want to know how to do this and it interests you, go back to that article and read through all the steps to hear why we do each one, see the code, and follow along yourself. But explaining tidymodels would really take us far afield from explaining Databricks, so I'm not going to go into it. I'm going to jump forward to the point where we have a model that we want to use in our Shiny app. So let me just run all of this code: fit a model, make some plots, and whatnot.
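For the curious, here is a hedged outline of what such tidymodels code can look like; the repo's actual recipe has more steps, and the column names and grid sizes here are guesses.

```r
library(tidymodels)

set.seed(123)
split <- initial_split(lending_club_dat, prop = 0.8)
folds <- vfold_cv(training(split), v = 10)

rec <- recipe(int_rate ~ ., data = training(split)) |>
  step_impute_mean(all_numeric_predictors()) |>   # fill the NAs we found earlier
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# mixture = 1 makes glmnet a pure LASSO; the penalty is tuned by cross-validation
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf  <- workflow() |> add_recipe(rec) |> add_model(lasso_spec)
res <- tune_grid(wf, resamples = folds, grid = grid_regular(penalty(), levels = 50))

final_fit <- wf |>
  finalize_workflow(select_best(res, metric = "rmse")) |>
  fit(training(split))
```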

This is the cross-validation step; it's fitting, I think, something like 500 models, so it might take a couple of seconds. All right. Now let's take the final fitted model and save it to an RDS file that I can package along with my Shiny app. I think my app in this case will be fastest if I just include the model with the app, but I have written code here that explains how you could deploy the model in other ways, such as creating a web API with the vetiver package and Plumber, or pinning it with the pins package, and so on. That won't be necessary today. Today we'll just ship the model with the app.
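Saving the model next to the app is one line; the alternatives mentioned look roughly like this (object and model names are hypothetical):

```r
# Ship the fitted model alongside the app
saveRDS(final_fit, "app/model.rds")

# Or: wrap it with vetiver and either pin it or serve it as an API
library(vetiver)
v <- vetiver_model(final_fit, "lending-rate-model")
# pins::board_connect() |> vetiver_pin_write(v)          # version it on Connect
# plumber::pr() |> vetiver_api(v) |> plumber::pr_run()   # serve it as a web API
```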

Building and publishing the Shiny app

The app that I built is in the GitHub repo under the app directory, and it contains the app.R file along with the model RDS file I created and some helper files. You can see that at the start of the app, I source the three helper files. Two of them just define functions and UI elements I need to build the app, but the first one is where all the action happens in terms of authenticating to Databricks, because this app will need to use data in addition to the model. If we look at that authenticate file, we see that it uses exactly the same code we used in the script to connect to Databricks, and just like in the script, it relies on those environment variables, which I already have loaded in my environment.
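A skeleton of how such an app.R might be laid out; the file names, inputs, and prediction call are illustrative, not the repo's actual code.

```r
library(shiny)

source("authenticate.R")   # builds `con` with dbConnect(odbc::databricks(), ...)
source("helpers.R")        # helper functions
source("ui-elements.R")    # input widgets

model <- readRDS("model.rds")

ui <- fluidPage(
  numericInput("annual_inc", "Annual income", value = 60000),
  actionButton("go", "Predict my rate"),
  textOutput("rate")
)

server <- function(input, output, session) {
  pred <- eventReactive(input$go, {
    # A real app would gather all five inputs the model expects
    predict(model, tibble::tibble(annual_inc = input$annual_inc))$.pred
  })
  output$rate <- renderText(paste0("Estimated rate: ", round(pred(), 2), "%"))
}

shinyApp(ui, server)
```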

So if I've done everything right, this app should run. I might need to load a few packages. Here we go; here's the app. I'm running it locally, it's connecting to Databricks, and if I ask it to predict the rate, it works. That's fine, but I'm running it locally with my own local environment variables. Ultimately, we want to take this app and publish it somewhere else, where other people can use it and it can connect to Databricks for them without me being there to supply my environment variables. Now, I'm going to show you one way to do that.

The first thing we're going to do is publish the app, so I click the publish button up here; it looks like one little blue hand passing a ball to another little blue hand. I have some accounts here, but let's add a new account just to see what that looks like. I'm going to add an account from Posit Connect. This one will be called Pub.DemoPositTeam; it's an account I have access to, but it hasn't been connected to this Workbench. So I'll click Next. It says: let's pass along the credentials you used to sign into Workbench to this Posit Connect account and see if you have access there. And I do; it recognizes me based on those credentials and asks whether I want to connect from this Workbench instance. Yes, I do, so I click Connect. It says successful. So it's really easy to connect at this point.

Now I can publish to this Pub.DemoPositTeam account. I should pick a name for the app that's a little more descriptive than "app", and then I'll publish it. Now it's publishing the app over here. What it publishes is just the scripts and files needed to make this app; once the app is loaded, I'll go and provide the environment variables the app needs to connect to Databricks. So here's the app. It's trying to load, and it's not going to succeed, because it will hit an error when it tries to connect to Databricks; I need to supply those remaining environment variables. The way I can do that with Posit Connect, if I publish there, is to go to the Vars tab for my app, where I can add environment variables. These are variables that no one using my app will have access to; they won't be able to see them, but Connect will use them to serve my app and connect to Databricks. The variables I want to add are the same ones I had over here. So it's a very simple process: I take the name of the variable and put it in the name slot, fetch the value of the variable, without quotation marks, and put it in the value slot, and then I add the variable. When I'm all done, I can click Save.

Now, I happen to know that this isn't going to work exactly. I'll add all these, but sometimes I just have to wave my hands here. Adding these variables isn't going to be enough to save the day for this app, because, unfortunately, this instance of Connect uses a different version of R than the one I built the app with. That doesn't need to be a problem, but for one of the packages, unfortunately, it is. That's okay, though, because I have a workaround we can use. I'm copying and pasting, so I'm reasonably certain there will be no typos.

All right. Now I save these and click OK, so they are part of the app, and the app is going to update. I suspect there's going to be a version problem. Yes, there's some sort of problem. I can look at the logs and see what's happening, and it says: aha, this R version is 4.3, but the MASS package requires a version greater than 4.4.0. That's okay; I have another version of this app running that I'm going to show you. Let's narrow this down to the things made by me (I clicked into something by accident there). There's the app we just made, but the one I want to show you, the one that's running, is this one. If we look at the Vars tab of its controls, you can see I created the three variables it needed to make this work. And it does indeed work.

If you want to take the time, you could add a UI element to your app, maybe a password widget, that asks users to enter their own credentials. Once you have those, you can use the user's credentials to connect to Databricks, or a database, or what have you. And if you want to take even more time, you can create a reactive UI such that what the user sees in the app is keyed to the permissions their credentials carry. A manager might see quite a bit in the app, because they can fetch all of that with their credentials, whereas an employee might not see everything, because they don't have access to all of it. All the basics are there; it's just for you to elaborate on them. The easiest way, though, is what I did: provide a global set of credentials that everyone will use in the background. Unfortunately, that means everyone who visits this app will be using my credentials, so it might not stay up for too long. But now you know how to do it. And with that, I'll thank you for your attention. Bye.
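As a starting point, a per-user credential sketch might look like the following. The driver name and the uid/pwd convention are assumptions about the Databricks ODBC driver's configuration, so check its documentation before relying on them.

```r
library(shiny)
library(DBI)

ui <- fluidPage(
  passwordInput("pat", "Your Databricks personal access token"),
  actionButton("connect", "Connect")
)

server <- function(input, output, session) {
  con <- eventReactive(input$connect, {
    # The Databricks ODBC driver commonly authenticates a personal access
    # token as uid = "token", pwd = <the token itself> (an assumption here)
    dbConnect(
      odbc::odbc(),
      driver   = "Databricks ODBC Driver",  # assumed driver name
      host     = Sys.getenv("DATABRICKS_HOST"),
      httpPath = Sys.getenv("HTTP_PATH"),
      uid      = "token",
      pwd      = input$pat
    )
  })
  # Downstream queries would call con() instead of using a global connection
}

shinyApp(ui, server)
```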
