David Robinson | dbcooper: Turn any database into an R or Python package | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
Hi, very excited to be here. So I do a lot of exploratory data analysis, and a lot of that flow is built around reading data into memory. The first step is to import some data before we can transform, visualize, and model it. And if you're in this room, it might be because a lot of the time that's not where the data we need is.
In a lot of organizations, we work with data that's in a database, MySQL, Postgres, Snowflake, SQL Server, and a lot of those exciting data analyses don't start from reading a CSV into memory. They start by hitting data that's in SQL. I have a strong belief that working with a database should be as easy for your team as possible.
In the last couple of years, there's been a lot of attention to how we can build R packages that make working within a company really smooth, that really help people get to the data that matters for the team. That first hit me when I was in my first data scientist role at Stack Overflow. When I got there, I was the first R user, and I ended up building an internal package called sqlstackr that managed connecting to and working with the data in those databases. In every job I've had since, I've built some similar iteration on that same template: creating a package that wraps a database connection.
So now we've open sourced it as dbcooper. dbcooper started as an R package, named after the famous hijacker D.B. Cooper. And I'm excited to share that while it started as an R package, Michael Chow has ported it to Python, so there's now both an R and a Python version of dbcooper. What they do is make it one step to create a database-specific package.
So if you're at a company, Widgets Inc., you can use dbcooper to create an internal package like widgetr or widgetpy to help analysts fluidly access the data in your databases.
The problem with working with databases from R
Mostly today I'm going to be talking from the perspective of the R package, but a lot of the same problems and solutions exist in the Python version. So what's it like to work with a database from R? What might you need to do when you're analyzing data in a database? Some common tasks: you might need to list the tables that are in the database, or you might want to transform tables using dplyr.
dbplyr is a really powerful package that lets you work with a remote table almost as if it were an in-memory table. Who here has used dbplyr? It is a really fun experience, isn't it? But sometimes we can't just use dbplyr; we want to run our own custom SQL queries. So we want to construct SQL SELECT statements ourselves and try running them. And sometimes we need to execute a command, like creating or dropping a table.
So I'll give an example of how we might work with a SQL connection. This is a database based on the Lahman baseball statistics dataset, a really phenomenal collection of tables, and it comes built into dbplyr. So we've got this lahman_db object; it's a SQL connection. Let's start with how we might go about listing tables.
The dbListTables() function from DBI gets us a vector of the tables in the database. If we wanted to work with a remote table as a dbplyr table, we'd use the tbl() function: call tbl() on this database and pull one of the tables out of it. In the case of the Lahman data, I might be interested in its table of batting statistics. If I want to run a SQL query, I'd still use the tbl() function, but now with the sql() function to construct that query. So here I might write a query where I count the batting table grouped by player ID and get back some results. And if I want to execute a command, I'd use the dbExecute() function.
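The talk's examples are in R with DBI and dplyr; as a rough stand-in, here's the same set of four tasks against a raw connection using Python's built-in sqlite3, with an illustrative toy batting table (the column names and data are made up for this sketch). The point is the shape of the workflow: every task threads the connection object through by hand.

```python
import sqlite3

# Stand-in for the Lahman database: a raw connection plus a toy table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT, hits INTEGER)")
con.executemany("INSERT INTO batting VALUES (?, ?)",
                [("ruthba01", 200), ("ruthba01", 180), ("cobbty01", 210)])

# 1. List the tables (DBI::dbListTables in R).
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

# 2. Pull one table (dplyr::tbl in R).
batting = con.execute("SELECT * FROM batting").fetchall()

# 3. Run a custom query (tbl(con, sql("...")) in R).
counts = con.execute(
    "SELECT player_id, COUNT(*) AS n FROM batting GROUP BY player_id"
).fetchall()

# 4. Execute a command (DBI::dbExecute in R).
con.execute("CREATE TABLE notes (txt TEXT)")
```

Notice that `con` appears in every single step; that's the friction the next section digs into.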
These four approaches work; they do what they're supposed to do. But something really bothers me about them. Something is just fundamentally a little harder to use than I'd expect it to be. They're not designed the way I'd want to work with this data. And I've really tried to figure out what is bothering me about these functions.
And I'll say, if I load up this database, what's really bugging me is that lahman_db object. In our environment, we've got this lahman_db object, a SQL connection object, and I don't know what that is. For me, that's the wrong level of abstraction. I want to be working with individual tables. I want to be directly writing SQL queries and immediately executing them. I don't know how I'm going to pass this connection around. It's a clumsy way to engage with a database.
For me, that's the wrong level of abstraction. I want to be working with individual tables. I want to be directly writing SQL queries and immediately executing them.
How DbCooper works
So that's why I think there's a missing puzzle piece between DBI, for creating a connection to a database, and dbplyr, for transforming and exploring those tables. That's where dbcooper comes in. It turns a database into an opinionated collection of functions that let us explore it really fluidly, almost as if it were data in memory.
You only need one function call to set up dbcooper: dbc_init(). You pass it the connection object, and you give it a prefix. If you've worked with a lot of R packages, you know how useful it can be to have a shared prefix. dbc_init() stores and manages the connection globally. It generates a prefixed function for the important tasks you'd do with that database, and, really importantly, it generates autocomplete-friendly accessors for tables.
So I'll show how each of those tasks is achieved in dbcooper. Once you've run dbc_init(), instead of calling dbListTables(), I would just call lahman_list(). The prefix I passed in turned into a prefix for the function, and lahman_list() returns the list of tables that are in the database.
If I want to access just one table, I use lahman_tbl(): I say lahman_tbl(), pass it the name of the table, and get that one back. If I want to run a query, I call lahman_query() and pass it the SQL. And if I want to execute a command, I'd use lahman_execute(). One thing to notice is that this takes the connection object entirely out of the picture, and it focuses on an opinionated set of tasks that someone might want to do with a database.
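To make the idea concrete, here's a minimal sketch in Python of generating prefixed task functions from a single connection. The function name dbc_init_sketch and its internals are purely illustrative; this is not dbcooper's implementation, just the pattern it describes, shown against a stdlib sqlite3 connection.

```python
import sqlite3

def dbc_init_sketch(con, prefix):
    """Return prefixed helpers bound to one connection (illustrative only)."""
    def list_tables():
        rows = con.execute("SELECT name FROM sqlite_master WHERE type='table'")
        return [r[0] for r in rows]

    def query(sql):
        return con.execute(sql).fetchall()

    def execute(sql):
        con.execute(sql)

    # Hand back functions named e.g. lahman_list, lahman_query, lahman_execute,
    # so the caller never touches the connection object again.
    return {f"{prefix}_list": list_tables,
            f"{prefix}_query": query,
            f"{prefix}_execute": execute}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT)")
fns = dbc_init_sketch(con, "lahman")
fns["lahman_execute"]("INSERT INTO batting VALUES ('ruthba01')")
rows = fns["lahman_query"]("SELECT COUNT(*) FROM batting")
```

The design choice worth noticing: the connection lives inside a closure, so every generated function already knows which database it talks to.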
My favorite feature of dbcooper is autocomplete. Whenever I've worked with a database, it can be hard to remember what the tables are, and I get tired of constantly referring back to the documentation, or even to a list function, to see what they are. I'd run into typos, et cetera, and it just doesn't feel anything like working with data that's in memory.
So dbcooper creates one function for each of the tables in your database, so that you can access them with autocomplete. It's a much more fluid experience: if I want a table, I can just start typing its name and see the options. I can't remember exactly what a table in my database is called, but autocomplete helps me out. One thing I like about that is it makes something like a complicated join much more concise and much easier to read. For me, again, it really means I don't have to remember that I'm working with SQL tables.
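In Python, autocomplete usually comes from attributes, so the same per-table-accessor idea might be sketched like this. The class name TableAccessors and the eager fetchall() are assumptions for the sake of a runnable example; the real packages return lazy remote tables rather than pulling everything into memory.

```python
import sqlite3
from functools import partial

class TableAccessors:
    """One attribute per database table, so editors can autocomplete them."""
    def __init__(self, con):
        names = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for name in names:
            # Each accessor fetches its own table; a real implementation
            # would hand back a lazy remote table instead.
            setattr(self, name, partial(self._fetch, con, name))

    @staticmethod
    def _fetch(con, name):
        return con.execute(f"SELECT * FROM {name}").fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE batting (player_id TEXT)")
con.execute("CREATE TABLE salaries (player_id TEXT, salary INTEGER)")
con.execute("INSERT INTO salaries VALUES ('ruthba01', 80000)")

db = TableAccessors(con)
# Typing `db.` in an editor now offers `batting` and `salaries` as completions.
rows = db.salaries()
```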
Using dbplyr with remote tables can feel a lot like using dplyr with local tables.
The Python version
So I've been showing examples of the R functions. What Michael Chow has set up is really something phenomenal, which is dbcooper.py. It has the same syntax for creating list, table, query, and execute functions, but it also takes advantage of his siuba package. siuba is a Python port of dplyr that supports both local pandas data frames and databases.
So here you can see that we could take the lahman object, say .tbl for the salaries table, and then pipe it to count, all within Python. One thing I think is great about siuba is that if you're really familiar with dplyr and you're interested in moving over to Python, siuba is absolutely the way to do it. And dbcooper in Python is a way to use it with your databases.
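For readers who haven't seen dplyr-style piping in Python, here is a toy sketch of the `>>` idea that siuba enables. This is emphatically not siuba's implementation, and the Verb class and count helper are invented for illustration; it only shows how `data >> verb` can dispatch to the verb via the right-hand operand.

```python
class Verb:
    """Wraps a function so it can sit on the right of `data >> verb`."""
    def __init__(self, fn):
        self.fn = fn
    def __rrshift__(self, data):  # invoked when the left side lacks __rshift__
        return self.fn(data)

def count(column):
    """Count occurrences of each value in `column` over a list of dicts."""
    def run(rows):
        out = {}
        for row in rows:
            out[row[column]] = out.get(row[column], 0) + 1
        return out
    return Verb(run)

salaries = [{"team": "NYA"}, {"team": "NYA"}, {"team": "BOS"}]
result = salaries >> count("team")
```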
Exploring a database with DbCooper
I'm gonna show an example of exploring a database using dbcooper. This is the New York Times COVID-19 database. Google makes public a large set of databases, let's say 239 of them, in Google BigQuery. This is a phenomenal resource, because it means there are 239 packages you could write, each wrapping one of these data sources. I'm gonna show one, the New York Times COVID-19 database.
So we'd start by creating the connection. dbcooper doesn't help with this; we create it ourselves using dbConnect(), or really whatever approach we would generally use to create a database connection. Then all we have to do with dbcooper is call dbc_init(), and we'll use the prefix covid. So all of the functions we're gonna use for access are going to start with the word covid.
So now that I've run dbc_init(), I can use covid_list() to see that there are four tables in this database. And if I wanna access one of them, I can say covid_us_states(). I've got this autocomplete-friendly way to access one of the tables, and we see data that looks like a time series, separated across each of the US states, with columns for both confirmed cases and deaths. My understanding is it gets updated daily, so we could build something on top of this that, unlike a static dataset, is gonna stay up to date.
And now that I've got it, I can work with this data as if it were local: I can filter, I can arrange, I can do visualizations. Let's say we pipe it through: all of this aggregation, this filter, this arrange, this mutate, was turned into a SQL query, which gets run and brought into memory only when we plot it. And this is something that I really like about dbcooper: I just forget I'm using it. As soon as you've got these tables, you go right back into your normal flow.
I never liked that when I worked with SQL databases, I'd have to take an extra step to do the select and the collect before I was back in my traditional exploratory data analysis flow. Here you can barely tell where the database ends and where your exploratory data analysis begins.
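The trick that makes this possible is lazy evaluation: verbs like filter and arrange only build up SQL, and nothing runs until the result is collected (for instance, when plotting). A minimal sketch of that idea, with an invented LazyTable class over sqlite3 and a toy us_states table standing in for the COVID data:

```python
import sqlite3

class LazyTable:
    """Accumulates operations as SQL and only runs the query on collect()."""
    def __init__(self, con, table):
        self.con, self.table = con, table
        self.wheres, self.order = [], None

    def filter(self, clause):
        self.wheres.append(clause)
        return self

    def arrange(self, column):
        self.order = column
        return self

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        if self.order:
            sql += f" ORDER BY {self.order}"
        return sql

    def collect(self):
        # Nothing touches the database until this point.
        return self.con.execute(self.to_sql()).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE us_states (state TEXT, deaths INTEGER)")
con.executemany("INSERT INTO us_states VALUES (?, ?)",
                [("NY", 9), ("WA", 3), ("VT", 1)])

pipeline = LazyTable(con, "us_states").filter("deaths > 2").arrange("deaths")
rows = pipeline.collect()
```

dbplyr does the same thing at much greater sophistication: the whole chain of verbs becomes one SQL query, executed only when the results are needed.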
Here you can barely tell where the database ends and where your exploratory data analysis begins.
Building a database-specific package
Where dbcooper really shines is in creating an opinionated package around a database. So I'll give an example. As I said, I first developed this kind of approach while I was working at Stack Overflow, so it's appropriate that we set up a package wrapping a public Stack Overflow database: one of the BigQuery public databases is Stack Overflow questions, answers, and tags. Around this database, we set up the stackbigquery package.
This is a package I put up just as an example of what could be created. We pick the database; now let's wrap it with a package. The way we go about doing that is three steps. First, we write database-specific connection code. Anything in terms of setting up credentials, et cetera, is going to live inside this package.
Then, critically, in the .onLoad and .onUnload functions (functions that run every time a package is loaded or unloaded), we're going to create one of those database connections and then call dbc_init(). This means that as soon as somebody loads this package, it populates the package's namespace with functions that start with the prefix stack.
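A rough Python analogue of this load-time population: a helper that injects prefixed functions into a namespace, as a package's import-time code might. The attach_accessors function and the toy questions table are invented for this sketch; it only illustrates the "load the package, get the functions" mechanic.

```python
import sqlite3

def attach_accessors(namespace, con, prefix):
    """Install prefixed helpers into `namespace`, as a package might at load."""
    names = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    namespace[f"{prefix}_list"] = lambda: names
    for name in names:
        # One accessor per table, e.g. stack_questions().
        namespace[f"{prefix}_{name}"] = (
            lambda n=name: con.execute(f"SELECT * FROM {n}").fetchall())

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE questions (title TEXT)")
con.execute("INSERT INTO questions VALUES ('How do I pivot in dplyr?')")

# In a real package this would run at import/load time; here we call it directly.
attach_accessors(globals(), con, "stack")
rows = stack_questions()
```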
Now, every database is different, and the ways that we explore them are different, too. So the next step is adding database-specific verbs. Something that I love to do with Stack Overflow data is look at trends of tags over time: can I aggregate at the tag level? That requires joining together a couple of tables in the database and doing a little custom logic. That kind of database-specific, problem-specific, or company-specific logic is a really good fit for a package that uses dbcooper, because it means that when the person using this package actually does this analysis, everything is already set up for them.
So when you load the package with library(), we've got a stack_-prefixed function for each task: stack_list(), stack_tbl(), stack_query(), stack_execute(), and one function for each of the table accessors. Basically, by loading the library, it's as if every table just got loaded right into your environment, ready to use.
Here's an example of an analysis I would do, for instance, of tags over time. This is an aggregation that compares a couple of technologies, like Python, JavaScript, Java, and R, in terms of the percentage of Stack Overflow questions each month that are tagged with that particular language. Notice there's a use of that tag-summarizing function that's database-specific, and then there's just some dplyr and some ggplot2. Again, really quickly, you can forget that you're working with a database at all.
Takeaways
So what I learned from the dbcooper package really comes down to two takeaways. One is: don't underestimate the value of small usability improvements that affect you every day. And second: a good database package lets you forget about the package entirely. Once we've got this data loaded, instead of staring at that one connection object and always working through it, we're just working with all the tables in the database.
A good database package lets you forget about the package entirely.
So next time you're working with a database connection, try dbcooper; it just takes one line, and I'd love to hear what you think. Thank you.
