Resources

Hannes Mühleisen - Data Wrangling [for Python or R] Like a Boss With DuckDB

video
Oct 31, 2024
58:29

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, hello everybody. Thanks, Hadley, for the introduction. And, yeah, glad to see so many of you are sticking around for the last talk. I know it's a lot to ask.

Yeah, my name is Hannes, and I'm going to talk about data wrangling like a boss with DuckDB. Very briefly about myself, if the clicker works. As Hadley has said, I am the co-founder and the CEO of a company called DuckDB Labs. But I'm also a professor of data engineering, as a sort of side gig, at Radboud University in Nijmegen, which, in case you have never heard of it, that's okay, it's very small.

Tables through history

But I wanted to start with something that really inspires me, and I don't know how many of you have seen this thing. Anybody has seen this? Raise your hand. Nobody. Mark has seen it. Okay. This is one of the first tables that we know of. It is a proto-cuneiform tablet excavated in Uruk, which is, I think, in today's Iraq. And it's dated to 3200 BC.

It's in the Metropolitan Museum of Art in New York. And this is interesting, because if you look at it and squint a little bit, you can see an Excel table, right? And in fact, it is exactly that. It has two columns, it's item and amount, and two entries. One is hop and one is barley. One can only guess what they were making.

And the cool thing about it is that it predates this: record-keeping predates literature by about a thousand years. So we've been making tables for a thousand years before we had the first text. And I really liked it when I found this. I was really happy, because I was always like, what is it about my brain that's so excited about tables? Something must be wrong with me. But now I'm content thinking that I'm just an older version of a human, a Stone Age person, and I'm very happy with that.

Motivating the talk: databases are frustrating and slow?

I want to motivate my talk with a Hadley quote, as is tradition. This is in the dbplyr vignette, and I will repeat it until it's removed: if your data fits in memory, there is no advantage to putting it in a database. It will only be slower and more frustrating.

If your data fits in memory, there is no advantage to putting it in a database. It will only be slower and more frustrating.

But I really like tables and databases, and I kind of want to explore this a bit together with you. And I want to say that this quote, and others by people like Thomas Lumley and Anthony D'Amico, actually inspired the work on DuckDB: this notion that data management systems are frustrating and slow. We set out to change that.

The CSV problem with Postgres

So I went on the internet and I googled "big CSV file". This was the first thing that came up. It's the New Zealand stats agency, Stats NZ. They have a census data set, multiple CSV files for download on their website. Really, I promise you, this was the first thing that I found; good on them that they have all this stuff for download there. And then I just went through the list and grabbed the biggest CSV file I could find. It's six files in a 100 megabyte zip file or something like this. The biggest one is 857 megabytes large. It has 34 million rows and six columns. You would agree with me that that fits in memory, yes?

Let's unpack the statement that we've seen earlier. Say I want to look at this file from Stats NZ with a database, let's say Postgres, the world's most overhyped database system, right? I'm going to ignore some of the pain that is required to set Postgres up in the first place. No, I'm not going to talk about credentials, ports, users, all that. It's a pain in itself. Hopefully you believe me on that.

But, you know, in principle, what we're going to do in Postgres is: we have to create a table, and we have to use this COPY command to get the CSV file into the database, and then we can finally do the analysis. OK, so we look at the CSV file first, just on the command line. We go like head data.csv. Very good name. And we see these columns there: it's year, age, ethnicity and so on. And by squinting, and with a lot of experience (after all, I'm a professor of data engineering), I can see that, OK, this probably is a table with these columns.

And this looks like these are numbers, although the leading zeros somewhere around there seem a bit worrying, right? Leading zeros. Are these really numbers? We shall see. So then, after figuring all this out, I type the copy command into Postgres, and it goes: invalid input syntax for type integer: "..C", line 18. Like, OK, fine. Column count is clearly not an integer, which is weird, because that's the only one where I was quite sure it was a count. But it isn't. OK. So then we drop the table again. Fine. We create the table again. We make this count a varchar, because, you know, OK, clearly we have to figure this out later on. And then we call the copy command again. And it goes: invalid input syntax for type integer: "CMB07601", line 431,741. And from your laughter, I know exactly what you're feeling right now.

So that's not great. OK, fine. Drop the table again. Go for broke. Go for broke. And then just copy the data. And lo and behold, we've copied the 34 million rows in 16 seconds. This is all on this MacBook, by the way; I'm running all of these experiments on this MacBook. Now I still have five other files to do this with, right? I'm going to spare you that. And then, after I've done all of this, I finally get to use this SQL thing. At this point, I've spent about an hour on this. And indeed, this is more frustrating than I thought it would be.

Speed of CSV readers

Let's just look at the speed of a couple of off-the-shelf solutions. And again, it's not only about performance. And this is very weird coming from a database person, I know. But if we use readr, for example, as you know, one of the many CSV readers that we have in R: you point it at the data, and it says, hey, I've read this file for you. And you think, hang on, weren't you only going to tell me that you don't like the data types? And no, no, no. It says, no, I figured it out. There are three character columns in here, and three numeric columns. And here is, by the way, what this thing looks like. And all that in ten seconds again.

The execution performance of ten seconds is not so relevant for an interactive use case, because this took, you know, twenty seconds overall; I had to type the command and press enter. But it is really interesting to see how the data science tools, like R and Python and their ecosystems, have solved this frustrating thing, actually in a pretty good way: you just do it. And comparing that to Postgres, the shining light of the database world, there is really a huge, huge difference.

And why is that? I think it's an interesting question. I think the data management community in the past has just never cared about reading CSV files, or about people. And this is not an exaggeration. I mean, if you think about it, how do you sell Oracle to somebody? Well, it's very easy: you meet the other CEO on the golf course. Right? Done. You don't have to talk to the people that actually have to use Oracle, or Postgres, or anything, because they would tell you. They would rip you apart. But since this is not relevant in the sales process, they don't care. In data science tools, we really care, which is why they are so much better.

Introducing DuckDB

Having said all that, we can of course do a formal benchmark here and say, let's take a file and throw it into various readers. And we see that there is a spread of time between 22 seconds and 2.9 seconds for data.table, which traditionally had a very good CSV reader. Polars doesn't really manage this, because they are, I think, too eager in their optimizations here. But basically, again, the performance of the reader is only so important. And it's interesting how that ties these two statements together, because "frustrating" and "slower" are kind of similar in that sense. Right? Because "slower" actually doesn't refer to performance here. It refers to the end-to-end time.

So, because we realized that this was true for most data management systems, but we really like data management systems, we thought: how can we maybe build a data management system that is not slower and more frustrating? So Mark, who is here, and I got together, and we started thinking about what an analytical database system that's not slower and frustrating could look like. And we did this research at CWI in Amsterdam, which is the Dutch national research lab for computer science and mathematics, which is famous because it is where Python was invented. But I'm at the wrong conference.

The result of this effort is DuckDB. I want to do a quick poll. Who here has used DuckDB? Oh, that's excellent. It's excellent. But there are a lot of people that haven't used it yet, so you will forgive me for trying to explain why it's maybe a good idea to use DuckDB. And I think the user experience is really core to the philosophy of DuckDB. People have focused too long on the little middle thing of the burger: the DBMS engine, the join algorithms, the scan performance, all these things. And nobody had really thought about the end-to-end user experience. And we really set out to change that, to make a data management system that is nice to use.

And sort of by accident, we also invented a new category of data management system, which is an in-process analytical system. I will say in a second why it's in process, but basically it means that there is no setup required of a server or files or anything like that, but it just runs wherever you are. And it's not a transactional system like Postgres or SQLite or actually most other data management systems, but it's analytical, meaning that it's really focused on doing analysis and not on sort of tracking orders in a web shop. And it didn't exist yet. So that's pretty good. It's always good to invent a new category, right?

Let's go a little bit in more detail, like the in-process architecture. I think this is the biggest sort of departure from traditional data management systems architecture that we did in DuckDB is that it runs within another process. It's a bit hard to imagine what that means. But for example, if you're in R or in Python or in Java or whatever, if you're pulling up a DuckDB connection, it means that the entire code of the entire database management system actually runs within that process. And this sounds very weird, but we're actually quite used to it because the world's most widely deployed data management system, not Postgres, but SQLite, actually does the same thing.

And it has really great effects on usability, on setup simplicity, all these things. It also makes it really easy to install. So you can type install.packages("duckdb") in R. You can also do pip install duckdb. We support a bunch of other stuff. We have zero dependencies, so you don't have to first install, I don't know, OpenMP or something else like that. It's all wrapped into one package. There's no apt-get. There's no Docker. There are no funky drivers, no funky GPU drivers, nothing like that. It runs really anywhere: Mac, Linux, Windows, ARM, Intel, doesn't really matter. As I said, for users it's just installing a package. So that's nice.

And it's also batteries included. I know, that's a term from yet another system, but it's a fully featured data management system. We have transactions, we have persistence: we can write tables to disk, and we can make transactionally safe changes to these files, even if the changes are quite massive. We have a crazy amount of operators, with high-performance implementations for operators like joins, aggregates, window functions. DuckDB can, out of the box, read CSV files, Parquet files, JSON files, a bunch of other stuff. It's all built in, right? We can connect to cloud storage like S3. We can talk to Postgres, we can talk to MySQL, we can do all these things. And it's integrated, as I mentioned, into R, Python, Java, and a zoo of other languages. So we really try to have all the things there.

And it's fast, in the sense of efficiency. And I realize I just said that speed doesn't matter, but that was for the interactive use case. Speed does actually matter in two ways. First, if you're running something again and again, say every ten seconds, then the speed does matter. And the other thing is, and I will come back to that later, that by making something faster, you actually expand the capabilities of what you can do on a single node. So this is great.

So how is DuckDB fast? Well, it's hard to explain; it's easier to show an effect: if you run DuckDB, your laptop can kind of get warm. So a word of warning here. We automatically parallelize the analysis tasks that you give us across all the available CPU cores. But blind parallelization is actually not very meaningful without good per-core efficiency, and we achieve that by having highly efficient C++ implementations for all the various operators. We can also use the hard disk if the data set does not fit in memory. So if we have this case where we want to run a query on a terabyte file, and we actually have to sort the whole thing, you can imagine that you might not have a terabyte of RAM. And sorting everything is a quite IO-intensive operation. So DuckDB can transparently and gracefully use the available disk space. That's really cool. And all this happens completely automatically, which is why it feels like magic to many people: it just does things quickly, and they don't understand why.

And of course, we know that any piece of sufficiently advanced technology is indistinguishable from magic. And I think what's fair to say is that DuckDB is this culmination of decades of research in execution engines, but then really wrapped in a user-friendly package.

DuckDB is this culmination of decades of research in execution engines, but then really wrapped in a user-friendly package.

You will forgive me some more words about speed. So this is a benchmark that originated from the data.table folks, which we've kind of resurrected and keep updated now. There are two tasks: a group-by task and a join task. And this is a 50 gigabyte data set here. And you can see that, lo and behold, we have also managed to be faster than everything else here. The same holds on the join task. And this is not even the latest of what we have; I think the next version will be even faster than that.

But coming back to the golf course: I don't really go to the golf course. So DuckDB is free. It's free and open source under the MIT license. So you can build a company on it, and you don't have to talk to us at all, which I know sounds a bit strange from the traditional database way of thinking. But we really think that in order to make an impact in how people think about data, which is kind of what we're trying to do here, we really need to give people the state-of-the-art tools to work with big data sets. And we can't make this happen just for one company, or for the customers of one company. We have to make this work for everyone.

The world is kind of liking DuckDB. So here's a plot of the GitHub stars. It's a crap metric, mostly VCs care about it, but it's interesting to see how this is going. So we really have exponential growth in the adoption of DuckDB, and we're really, really grateful for it. We eventually spun off from the research institute into a company called DuckDB Labs, because every open source project needs to fund itself somehow. You can fund an open source project by working for another company and doing nights and weekends, but I can't really do that. Or you can sell your soul to VCs and then, basically, you know, continue until the party stops and somebody has to change a license.

But DuckDB itself is, yeah, backed by DuckDB Labs, which is a company that provides support and consulting contracts around DuckDB, but is actually not the owner of the software itself, which sits in a non-profit foundation. So this is really a unique setup. The company has an interest, but it's a long-term effort backed by an independent foundation, just to make sure that this thing keeps being available. So even we would have trouble rug-pulling this.

DuckDB loves R and Python

And DuckDB loves R. Hey! As I said, the motivation to build DuckDB came from interactions with the R community. I actually spoke about DuckDB for the first time at Kirill's meetup (where's Kirill?) in Zurich, in the summer of 2019. That was a long time ago. And because it's in-process, we can do some pretty amazing things within R. For example, DuckDB running inside the R process can directly read any data frame that you have inside the R process, as if it were a table. We don't have to first convert it to anything. No, no, no: if a data frame exists inside your R process, we can run queries on it. And we can also take the query results that we produce in DuckDB and convert them directly into data frames. Some systems out there also have functionality like this, but it tends to be insanely slow. We can do this very quickly because we're sitting in the same process space; we don't have to do any serialization or anything of that sort.

We also love Python. So it's the biggest API actually by downloads per month. It's like six million downloads per month. It's pretty ridiculous. And we can do also the same trick with pandas data frames. So if you run DuckDB in a Python process and you have a pandas data frame sitting there, we can also treat it as a table and read it. We can also produce results again as pandas data frames because that allows us to integrate with downstream libraries. And because many data sets or many tools are starting to support Apache Arrow as intermediates, we can also do the same trick for R and Python. If the data happens to be Arrow, we can read and write from it directly. Even streaming. So we don't actually have to copy things. It's just streaming into Arrow, streaming from Arrow. That works all really well. So that's DuckDB in a nutshell.

DuckDB's CSV reader

Let's revisit the statement: slower and more frustrating. I'm going to go back to the running example in a bit, our wonderful CSV file. First, I want to say we actually wrote a research paper on CSV parsing, because it turns out that this is such an annoying issue that you can actually write a computer science research paper on it. I think we may have been the first ones to have done this; there have been some that came afterwards. But as a result, we actually have one of the best CSV readers out there in DuckDB, because we realized that your analysis starts with getting data from somewhere else, right? You're not going to sit down and type everything into DuckDB; you're going to use something else. So we wrote this research paper, and we took the ideas from it and put them into DuckDB. A shout-out to Pedro from our team, who worked incredibly hard on this. And we can actually read most files out there without human interaction.

So, for example, in DuckDB, you can just type FROM 'data.csv', right? And this is technically a SQL query. We treat the file name as if it were a table name: we have this logic that says, if the table isn't there, but there's a file with the same name, well, probably we should read that file. And if it ends in .csv, it's probably a CSV file. So we do a lot of guessing, just to reduce the frustration. And you will see that it has indeed read this file correctly, with all the column types and so on and so forth.

We also have some cool features in DuckDB, like SUMMARIZE. So we have features that try to make your life as an analyst easier, even if the data set is huge. SUMMARIZE is pretty cool, actually, because you just say SUMMARIZE and then any query, really, and we will compute a ton of summary statistics very efficiently. So this is a file with a couple of million rows, and in a very short amount of time, we will give you the minimum, the maximum, the approximate unique count, the average, the standard deviation, the quantiles, the null percentage, and a lot of other stuff, for all the columns in the file, right? And this is a single-pass thing. So we read through the file once, using our efficient implementation, and we give you a good overview of what's in the file. And I said that we're not going to care too much about performance, but I couldn't resist saying that despite this being one of the most flexible CSV readers out there, it is also the fastest, okay? So this is cool.

Complex queries and duckplyr

But let's do some more complex queries, okay? I didn't get to this part with Postgres, because I was too frustrated to continue at some point. But we're not actually here to read CSV files; we're here to answer analysis questions. So here I have a totally made-up analysis question: given this data set, give me the number of non-Europeans between 20 and 40 that live in the Auckland area, from the 2018 census, grouped by sex. Okay. Now somebody has to translate that into a SQL query. If you have generative AI, maybe you can try, but I have just written down the SQL query. And you can see it's nothing dramatic. It does a bunch of things: we join all these data sets together, because the columns are actually codes and not actual values, except for the count, which is sometimes a value, but sometimes also a "C". And there are a bunch of cleanups, because the statistics people in New Zealand somehow include totals in all the dimensions, and there are rows in the main data that are aggregates of the other rows. And we obviously don't want that in a tidy data set. But if you just ignore all this, we can run this fairly complex query, and it completes in one second.

That's kind of weird, right? It's kind of weird because I just told you it took 1.1 seconds to read this file, right? How can you run this complicated query over all the files, joining them all together, doing aggregation along multiple dimensions, doing a bunch of projections, filters and all that, in one second? Well, again: magic. The reason is what DuckDB actually does. It takes the query and turns it into a query plan, which is a high-level concept of how the query should be executed. And this plan is optimized and then executed in parallel. The filters, for example, move all the way down, so we remove data from the path of the query as early as possible. We prune all the columns that we don't need, as early as possible. And it's a streaming engine, which means we don't actually have to materialize any of this in memory. So it's really efficient.

But then I realized that not everybody loves SQL. So let's explore alternatives, shall we? Obviously we can also do this in dplyr, which I think is much nicer to write and to reason about. So I've written down the exact same thing in dplyr, and I've actually checked whether the results are equivalent, and they are. It's the same kind of thing, where we basically make this joined-together data set with all these joins and the mutations and filters, and then at the very bottom we have our final analysis on this combined data set, and it takes 15 seconds. It's not so terrible, but this is exactly why, on this reasonably sized data set, we see a significant difference in speed: there is just no holistic optimization happening here. dplyr eagerly executes everything as it goes.

So, enter duckplyr. We've been working with Posit to create a sort of hybrid mutant of dplyr and DuckDB, and the result is duckplyr. You have maybe seen the stickers; I've put a bunch outside. So massive thanks again to Kirill for his hard work on this. It's dplyr, just re-engined with DuckDB, and you have the exact same syntax. And this, from the CSVs, directly finishes in 1.5 seconds. That is because we have done this high-level reasoning about how this query should be executed, directly on the CSV files, and we can do this ten times faster. From Parquet files it's actually half that; more on that later. If you run the same query from data frames that are already loaded into the R session, it's roughly a third of this time. So if you're not doing the CSV parsing, it gets even faster.

So if we just run this in standard dplyr, this is the sequence of events: we're reading 35 million rows, which is what the data set contains. And then we run this filter. You can't really see it in the script, but the filter only matches 14,000 rows, because that's just the way the data is distributed. So we're throwing away all this other stuff, except for these 14,000 rows. And then it gets worse, because in the summarization we only ever look at one of the columns, so we read the five other columns for no reason at all. And what we actually need to make this better is, as I mentioned, holistic optimization, where we can derive which columns are going to be used, which rows can be pruned earlier, and all these kinds of things.

But now we have this problem: if we start making these objects, data2 and result, somehow lazy, we would break the integration with other tools. Say we make this a lazy sort of placeholder object, and we want to pass it into ggplot. ggplot would say, I don't know what this lazy thing is. So we somehow have to make the dplyr objects both data frames and lazy placeholders. And we can, thanks to this ALTREP framework, which is really exciting. In R, you can make objects that are sort of both things at the same time, both lazy and non-lazy. So if you use duckplyr, these objects data2 and result will actually be data frames, and other tools cannot really see a difference, but we internally know that they are lazy representations, and we can do the holistic optimization.

So this is really cool. If you want to try it out, it's on CRAN. There is a way of enabling telemetry if something goes wrong, and I encourage you to do that if you want to help us improve duckplyr. But we've also been super sensible, in the sense that it will fall back to the dplyr implementation if there's something we don't understand, so you should actually not see a big difference when you just start using it. So I encourage you to do that.

I also want to show Ibis briefly, because the cool thing about DuckDB is that it's an engine that is kind of agnostic and multi-platform. So DuckDB also runs in Python, and I don't know if Phillip is here, but he made this cool integration with Ibis. It basically means you can write the same query again in Ibis syntax, which is a Python syntax that looks a bit like dplyr, and it will produce the exact same results. And it will also take, in this case, 1.3 seconds, because it also ends up just being a DuckDB query plan. So it's really cool.

Atomic datasets with DuckDB

So now we've done complex queries with DuckDB, and I've shown you that we can ask very, very complicated questions of fairly annoying datasets very quickly, and we don't even have to fall back to my favorite programming language, SQL. Okay, but so far this has been a bit of a defense, right? I've been defending against the accusation that databases are frustrating and slow. I don't think DuckDB is frustrating and slow, but the thing is, there's more to using data management systems than reading CSV files and running queries on them. So I want to talk a bit about atomic datasets with DuckDB. And atomic is a positive thing here.

People say, yeah, okay, let's not use CSV. And I agree. I'm a professor of data engineering; I agree, we should not use CSV files, we should use Parquet files. That's true. And DuckDB can actually help with this. In case you don't know what Parquet is, it's a columnar data format. It's really cool. DuckDB can read and write Parquet files very efficiently. So tip one, step one before using DuckDB: consider using Parquet, if you don't fully trust me yet. That one is a no-brainer.

Yeah, and so we have native support for it in DuckDB. There are other options: Arrow has support for Parquet files, or there's the new nanoparquet package from Posit. The interesting, cool thing about Parquet for this dataset is that if you take this file, which was an 800 megabyte CSV file, and turn it into a Parquet file, it's only 71 megabytes, which is smaller than even the zip file that we had before. And it's actually much more efficient to read. So there's nothing not to like about it.

But then, what happens, and I'm sure you've all seen this: I've done a bunch of stuff in this folder with all these files. I've done some conversions, I've done some scripting, I've done all these things. And now I end up with this ungodly mess of files, right? And there's no more tracking of what came from where, what version was used to create this, and what these scripts are doing. And some people use makefiles there, and then the makefile essentially interprets the timestamps of the files. But I don't think that's a great thing either.

But we get this ungodly mess of all these files. And we could be doing better, and actually DuckDB can help here, because in a DuckDB file we can store many tables, we can store views, we can store functions, all of that in a single file. So what we can do, in this example, is create a single DuckDB file that contains all the CSVs in a parsed, columnar, typed, compressed binary format. The ingestion can be atomic: if I wrap this import statement in a transaction, either it all goes through, or nothing goes through. The resulting file is also around 100 megabytes, so it's about the same size as the zip file, just much better.

And a couple of months ago, we finalized our storage format. So the DuckDB file format is now stable and will remain stable for the foreseeable future. And there are actually some organizations, like Hugging Face, that are already shipping all the data sets that they have as DuckDB files, which is really crazy. So you go to a Hugging Face data set, you click a button, and you get the DuckDB file. It's really cool. They can even be read remotely, via HTTP and S3.

So it's really cool. We put the data in this file. But that's not all. We can also do things like this, where we create something called views, which are like a virtual table: you define a transformation using SQL, or you can also do this using some other tool, it doesn't really matter. And these views are actually stored in the file, and they travel with the file. That makes it much easier to write high-level queries. You can stack views in order to describe multi-step derivations. And that means that whenever you change something in the base table, the views will actually update. They don't cost any storage, because they are just symbolic descriptions. And it's possible to hide a lot of complexity using these kinds of tricks. And you can share the file with somebody else, and they can look at the file, see the view, and directly use it. It's still a single file.

Another really cool thing in DuckDB is that you can create macros, which are shortcuts for things you use a lot. At the top here we're repeating ourselves a lot, because the data has these total columns in all these different dimensions, and we have to filter them out so they don't muck up our summaries. But you don't want to repeat these ugly expressions 15 times. So you can create a macro, say not_total, that holds an arbitrarily complex expression, and then just use it. And again, the cool thing is that the macro lives in the file. That means you can send the file to somebody else and they can use it, together with all the data, together with all the views, together with everything.

And finally, I want to talk about transactions. Transactions may be something you remember from a database course, where we move some money from A to B or something like that. But Mark has actually implemented the first really great transaction management for analytical use cases, where we're changing data a lot: ingesting a lot of data, updating a lot of data, adding columns, removing columns, that sort of stuff. That means we can make atomic sequences of changes to our data (this is why I said atomic) that either all go through or not. And we can even do a manual rollback if we did something by accident, something we didn't intend. So if I go back to this example where I couldn't read the file because it has some dot dot C in it, you can just say, no, this was actually not what we wanted. You can roll back, and your file and your data will be unaffected.

Is big data dead? The single-node future

But I think it's an interesting question, and I would like to take the last couple of minutes of the talk to discuss it: what does this actually mean for data engineering and data infrastructure? What does it mean if we have tools that can comfortably run through gigabytes of data on your laptop? There's been some discussion about whether big data is dead or not.

So let's do a thought experiment on data creation. The human typing speed is roughly 200 characters per minute. The working population of the Netherlands is about 10 million people. I will assume I have infinite money, because I'm a VC or something, and I've employed all of them. All 10 million. A Dutch work year has about 100,000 minutes, give or take. So if I employ the entire population for a whole year to type important data, I'm sure, into a computer, we end up with roughly 200 terabytes of data. This is, of course, ASCII text, which is compressible, so if you compress it somewhat reasonably, we end up with something like 60 terabytes.
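The arithmetic from the talk, written out (the ~3x compression ratio is an assumption to match the 60-terabyte figure):

```python
# Back-of-the-envelope: how much data can a whole country type in a year?
chars_per_minute = 200           # rough human typing speed
minutes_per_year = 100_000       # one work year, give or take
working_population = 10_000_000  # the Netherlands, roughly

raw_bytes = chars_per_minute * minutes_per_year * working_population
raw_terabytes = raw_bytes / 10**12            # 200.0 TB of raw ASCII
compressed_terabytes = raw_terabytes * 0.3    # assumed ~3x compression -> ~60 TB
```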

And that sounds like a lot, right? 60 terabytes. But it's actually not. You can buy these hard disks; they cost around $400. There's a banana for scale. I mean, we have seen IBM hard disks from the 50s, right? The ones you lift into an airplane. No, no, no. These are tiny. So the maximum output in usable data of the entire Dutch working population would easily fit on these two hard disks. There's no scale-up required; I don't have to buy a Hadoop cluster for this. In two years, it will be one disk. Will we have twice the number of people in the Netherlands? I sincerely doubt it. In ten years, this will fit on a laptop.

The number of people who can type anything or produce any meaningful data is growing much slower than our capability to crunch through data. Look at compute. The Apple M3, a pretty standard computer, I'm sure some of you have one, has 16 cores. They are very fast cores. They have 400 megabytes