David Robinson | dbcooper: Turn any database into an R or Python package | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
Hi, very excited to be here. So I do a lot of exploratory data analysis, and a lot of that flow is built around reading data into memory. The first step is to import some data before we can transform, visualize, and model it. And if you're in this room, it might be because a lot of the time that's not where the data we need is.
In a lot of organizations, we work with data that's in a database, MySQL, Postgres, Snowflake, SQL Server, and a lot of those exciting data analyses don't start from reading a CSV into memory. They start by hitting data that's in SQL. I have a strong belief that working with a database should be as easy for your team as possible.
In the last couple of years, there's been a lot of attention to how we can build R packages that make working within a company really smooth, that really help people get to the data that matters for the team. That first hit me when I was in my first data scientist role at Stack Overflow. When I got there, I was the first R user, and I ended up building an internal package called sqlstackr that managed connecting to and working with the data in those databases. In every job I've had since, I've built some similar iteration on that same template: creating a package that wraps a database connection.
So now we've open sourced it as dbcooper. dbcooper started as an R package, named after the famous hijacker D.B. Cooper. And I'm excited to share that while it started as an R package, Michael Chow has ported it to Python, so there's now both an R and a Python version of dbcooper. What they do is make it one step to create a database-specific package.
So if you're at a company, Widgets Inc., you can use dbcooper to create an internal package like widgetr or widgetpy to help analysts fluidly access the data in your databases.
The problem with working with databases from R
Mostly today I'm going to be talking from the perspective of the R package, but a lot of the same problems and solutions exist in the Python version. So what's it like to work with a database from R? What might you need to do when you're analyzing data in a database? Some common tasks: you might need to list the tables that are in the database, or you might want to transform tables using dplyr.
dbplyr is a really powerful package that lets you work with a remote table almost as if it were an in-memory table. Who here has used dbplyr? It is a really fun experience, isn't it? But sometimes we can't just use dbplyr; we want to run our own custom SQL queries. So we want to construct SQL SELECT statements ourselves and try running them. And sometimes we need to execute a command, like creating or dropping a table.
So I'll give an example of how we might work with a SQL connection. This is a database based on the Lahman baseball statistics dataset, a really phenomenal collection of tables, and it comes built into dbplyr. So we've got this lahman_db object; it's a SQL connection. Let's start with how we might go about listing tables.
The dbListTables() function from DBI gets us a vector of the tables in the database. If we wanted to work with a remote table as a dbplyr table, we'd use the tbl() function: call tbl() on this database and pull one of the tables out of it. In the case of the Lahman data, I might be interested in its table of batting statistics. If I want to run a SQL query, I'd still use the tbl() function, but now with the sql() function to construct that query. So here I might write a query where I count the batting table grouped by player ID and get back some results. And if I want to execute a command, I'd use the dbExecute() function.
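The talk's examples are in R with DBI and dplyr; as a rough stand-in, here's the same set of four tasks against a raw connection using Python's built-in sqlite3, with an illustrative toy batting table (the column names and data are made up for this sketch). The point is the shape of the workflow: every task threads the connection object through by hand.

```python
import sqlite3

# Stand-in for the Lahman database: a raw connection plus a toy table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT, hits INTEGER)")
con.executemany("INSERT INTO batting VALUES (?, ?)",
                [("ruthba01", 200), ("ruthba01", 180), ("cobbty01", 210)])

# 1. List the tables (DBI::dbListTables in R).
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

# 2. Pull one table (dplyr::tbl in R).
batting = con.execute("SELECT * FROM batting").fetchall()

# 3. Run a custom query (tbl(con, sql("...")) in R).
counts = con.execute(
    "SELECT player_id, COUNT(*) AS n FROM batting GROUP BY player_id"
).fetchall()

# 4. Execute a command (DBI::dbExecute in R).
con.execute("CREATE TABLE notes (txt TEXT)")
```

Notice that `con` appears in every single step; that's the friction the next section digs into.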
These four approaches work; they do what they're supposed to do. But something really bothers me about them. Something is just fundamentally a little harder to use than I'd expect it to be. They're not designed the way I'd want to work with this data. And I've really tried to figure out what is bothering me about these functions.
And I'll say, if I load up this database, what's really bugging me is that lahman_db object. In our environment, we've got this lahman_db object, a SQL connection object, and I don't know what that is. For me, that's the wrong level of abstraction. I want to be working with individual tables. I want to be directly writing SQL queries and immediately executing them. I don't know how I'm going to pass this connection around. It's a clumsy way to engage with a database.
For me, that's the wrong level of abstraction. I want to be working with individual tables. I want to be directly writing SQL queries and immediately executing them.
How DbCooper works
So that's why I think there's a missing puzzle piece between DBI, for creating a connection to a database, and dbplyr, for transforming and exploring those tables. That's where dbcooper comes in. It turns a database into an opinionated collection of functions that let us explore it really fluidly, almost as if it were data in memory.
You only need one function call to set up dbcooper: dbc_init(). You pass it the connection object, and you give it a prefix. If you've worked with a lot of R packages, you know how useful it can be to have a shared prefix. dbc_init() stores and manages the connection globally. It generates a prefixed function for the important tasks you'd do with that database, and, really importantly, it generates autocomplete-friendly accessors for tables.
So I'll show how each of those tasks is achieved in dbcooper. Once you've run dbc_init(), instead of calling dbListTables(), I would just call lahman_list(). The prefix I passed in turned into a prefix for the function, and lahman_list() returns the list of tables that are in the database.
If I want to access just one table, I use lahman_tbl(): I say lahman_tbl(), pass it the name of the table, and get that one back. If I want to run a query, I call lahman_query() and pass it the SQL. And if I want to execute a command, I'd use lahman_execute(). One thing to notice is that this takes the connection object entirely out of the picture, and it focuses on an opinionated set of tasks that someone might want to do with a database.
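To make the idea concrete, here's a minimal sketch in Python of generating prefixed task functions from a single connection. The function name dbc_init_sketch and its internals are purely illustrative; this is not dbcooper's implementation, just the pattern it describes, shown against a stdlib sqlite3 connection.

```python
import sqlite3

def dbc_init_sketch(con, prefix):
    """Return prefixed helpers bound to one connection (illustrative only)."""
    def list_tables():
        rows = con.execute("SELECT name FROM sqlite_master WHERE type='table'")
        return [r[0] for r in rows]

    def query(sql):
        return con.execute(sql).fetchall()

    def execute(sql):
        con.execute(sql)

    # Hand back functions named e.g. lahman_list, lahman_query, lahman_execute,
    # so the caller never touches the connection object again.
    return {f"{prefix}_list": list_tables,
            f"{prefix}_query": query,
            f"{prefix}_execute": execute}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT)")
fns = dbc_init_sketch(con, "lahman")
fns["lahman_execute"]("INSERT INTO batting VALUES ('ruthba01')")
rows = fns["lahman_query"]("SELECT COUNT(*) FROM batting")
```

The design choice worth noticing: the connection lives inside a closure, so every generated function already knows which database it talks to.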
My favorite feature of dbcooper is autocomplete. Whenever I've worked with a database, it can be hard to remember what the tables are, and I get tired of constantly referring back to the documentation, or even to a list function, to see what they are. I'd run into typos, et cetera, and it just doesn't feel anything like working with data that's in memory.
So dbcooper creates one function for each of the tables in your database, so that you can access them with autocomplete. It's a much more fluid experience: if I want a table, I can just start typing its name and see the options. I can't remember exactly what a table in my database is called, but autocomplete helps me out. One thing I like about that is it makes something like a complicated join much more concise and much easier to read. For me, again, it really means I don't have to remember that I'm working with SQL tables.
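In Python, autocomplete usually comes from attributes, so the same per-table-accessor idea might be sketched like this. The class name TableAccessors and the eager fetchall() are assumptions for the sake of a runnable example; the real packages return lazy remote tables rather than pulling everything into memory.

```python
import sqlite3
from functools import partial

class TableAccessors:
    """One attribute per database table, so editors can autocomplete them."""
    def __init__(self, con):
        names = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for name in names:
            # Each accessor fetches its own table; a real implementation
            # would hand back a lazy remote table instead.
            setattr(self, name, partial(self._fetch, con, name))

    @staticmethod
    def _fetch(con, name):
        return con.execute(f"SELECT * FROM {name}").fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT)")
con.execute("CREATE TABLE salaries (player_id TEXT, salary INTEGER)")
con.execute("INSERT INTO salaries VALUES ('ruthba01', 80000)")

db = TableAccessors(con)
# Typing `db.` in an editor now offers `batting` and `salaries` as completions.
rows = db.salaries()
```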
Using dbplyr with remote tables can feel a lot like using dplyr with local tables.
The Python version
So I've been showing examples of the R functions. What Michael Chow has set up is really something phenomenal, which is dbcooper.py. It has the same syntax for creating list, table, query, and execute functions, but it also takes advantage of his siuba package. siuba is a Python port of dplyr that supports both local pandas data frames and databases.
So here you can see that we could take the lahman object, say .tbl for the salaries table, and then pipe it to count, all within Python. One thing I think is great about siuba is that if you're really familiar with dplyr and you're interested in moving over to Python, siuba is absolutely the way to do it. And dbcooper in Python is a way to use it with your databases.
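For readers who haven't seen dplyr-style piping in Python, here is a toy sketch of the `>>` idea that siuba enables. This is emphatically not siuba's implementation, and the Verb class and count helper are invented for illustration; it only shows how `data >> verb` can dispatch to the verb via the right-hand operand.

```python
class Verb:
    """Wraps a function so it can sit on the right of `data >> verb`."""
    def __init__(self, fn):
        self.fn = fn
    def __rrshift__(self, data):  # invoked when the left side lacks __rshift__
        return self.fn(data)

def count(column):
    """Count occurrences of each value in `column` over a list of dicts."""
    def run(rows):
        out = {}
        for row in rows:
            out[row[column]] = out.get(row[column], 0) + 1
        return out
    return Verb(run)

salaries = [{"team": "NYA"}, {"team": "NYA"}, {"team": "BOS"}]
result = salaries >> count("team")
```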
Exploring a database with DbCooper
I'm gonna show an example of exploring a database using dbcooper. This is the New York Times COVID-19 database. Google makes public a large set of databases, let's say 239 of them, in Google BigQuery. This is a phenomenal resource, because it means there are 239 packages you could write, each wrapping one of these data sources. I'm gonna show one, the New York Times COVID-19 database.
So we'd start by creating the connection. dbcooper doesn't help with this; we create it ourselves using dbConnect(), or really whatever approach we would generally use to create a database connection. Then all we have to do with dbcooper is call dbc_init(), and we'll use the prefix covid. So all of the functions we're gonna use for access are going to start with the word covid.
So now that I've run dbc_init(), I can use covid_list() to see that there are four tables in this database. And if I wanna access one of them, I can say covid_us_states(). I've got this autocomplete-friendly way to access one of the tables, and we see data that looks like a time series, separated across each of the US states, with columns for both confirmed cases and deaths. My understanding is it gets updated daily, so we could build something on top of this that, unlike a static dataset, is gonna stay up to date.
And now that I've got it, I can work with this data as if it were local: I can filter, I can arrange, I can do visualizations. Let's say we pipe it through: all of this aggregation, this filter, this arrange, this mutate, was turned into a SQL query, which gets run and brought into memory only when we plot it. And this is something that I really like about dbcooper: I just forget I'm using it. As soon as you've got these tables, you go right back into your normal flow.
I never liked that when I worked with SQL databases, I'd have to take an extra step to do the select and the collect before I was back in my traditional exploratory data analysis flow. Here you can barely tell where the database ends and where your exploratory data analysis begins.
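The trick that makes this possible is lazy evaluation: verbs like filter and arrange only build up SQL, and nothing runs until the result is collected (for instance, when plotting). A minimal sketch of that idea, with an invented LazyTable class over sqlite3 and a toy us_states table standing in for the COVID data:

```python
import sqlite3

class LazyTable:
    """Accumulates operations as SQL and only runs the query on collect()."""
    def __init__(self, con, table):
        self.con, self.table = con, table
        self.wheres, self.order = [], None

    def filter(self, clause):
        self.wheres.append(clause)
        return self

    def arrange(self, column):
        self.order = column
        return self

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        if self.order:
            sql += f" ORDER BY {self.order}"
        return sql

    def collect(self):
        # Nothing touches the database until this point.
        return self.con.execute(self.to_sql()).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE us_states (state TEXT, deaths INTEGER)")
con.executemany("INSERT INTO us_states VALUES (?, ?)",
                [("NY", 9), ("WA", 3), ("VT", 1)])

pipeline = LazyTable(con, "us_states").filter("deaths > 2").arrange("deaths")
rows = pipeline.collect()
```

dbplyr does the same thing at much greater sophistication: the whole chain of verbs becomes one SQL query, executed only when the results are needed.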
Here you can barely tell where the database ends and where your exploratory data analysis begins.
Building a database-specific package
Where dbcooper really shines is in creating an opinionated package around a database. So I'll give an example. As I said, I first developed this kind of approach while I was working at Stack Overflow, so it's appropriate that we set up a package wrapping a public Stack Overflow database: one of the BigQuery public databases is Stack Overflow questions, answers, and tags. Around this database, we set up the stackbigquery package.
This is a package I put up just as an example of what could be created. We pick the database; now let's wrap it with a package. The way we go about doing that is three steps. First, we write database-specific connection code. Anything in terms of setting up credentials, et cetera, is going to live inside this package.
Then, critically, in the .onLoad and .onUnload functions (functions that run every time a package is loaded or unloaded), we're going to create one of those database connections and then call dbc_init(). This means that as soon as somebody loads this package, it populates the package's namespace with functions that start with the prefix stack.
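A rough Python analogue of this load-time population: a helper that injects prefixed functions into a namespace, as a package's import-time code might. The attach_accessors function and the toy questions table are invented for this sketch; it only illustrates the "load the package, get the functions" mechanic.

```python
import sqlite3

def attach_accessors(namespace, con, prefix):
    """Install prefixed helpers into `namespace`, as a package might at load."""
    names = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    namespace[f"{prefix}_list"] = lambda: names
    for name in names:
        # One accessor per table, e.g. stack_questions().
        namespace[f"{prefix}_{name}"] = (
            lambda n=name: con.execute(f"SELECT * FROM {n}").fetchall())

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE questions (title TEXT)")
con.execute("INSERT INTO questions VALUES ('How do I pivot in dplyr?')")

# In a real package this would run at import/load time; here we call it directly.
attach_accessors(globals(), con, "stack")
rows = stack_questions()
```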
Now, every database is different, and the ways that we explore them are different, too. So the next step is adding database-specific verbs. Something that I love to do with Stack Overflow data is look at trends of tags over time: can I aggregate at the tag level? That requires joining together a couple of tables in the database and doing a little custom logic. That kind of database-specific, problem-specific, or company-specific logic is a really good fit for a package that uses dbcooper, because it means that when the person using this package actually does this analysis, everything is already set up for them.
So when you load the package with library(), we've got a stack_-prefixed function for each task: stack_list(), stack_tbl(), stack_query(), stack_execute(), and one function for each of the table accessors. Basically, by loading the library, it's as if every table just got loaded right into your environment, ready to use.
Here's an example of an analysis I would do, for instance, of tags over time. This is an aggregation that compares a couple of technologies, like Python, JavaScript, Java, and R, in terms of the percentage of Stack Overflow questions each month that are tagged with that particular language. Notice there's a use of that tag-summarizing function that's database-specific, and then there's just some dplyr and some ggplot2. Again, really quickly, you can forget that you're working with a database at all.
Takeaways
So what I learned from the dbcooper package really comes down to two takeaways. One is: don't underestimate the value of small usability improvements that affect you every day. And second: a good database package lets you forget about the package entirely. Once we've got this data loaded, instead of staring at that one connection object and always working through it, we're just working with all the tables in the database.
A good database package lets you forget about the package entirely.
So next time you're working with a database connection, try dbcooper; it just takes one line, and I'd love to hear what you think. Thank you.
