Resources

Hannes Mühleisen - Data Wrangling [for Python or R] Like a Boss With DuckDB

video
Oct 31, 2024
58:29

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, hello everybody. Thanks, Hadley, for the introduction. And, yeah, glad to see so many of you are sticking around for the last talk. I know it's a lot to ask.

Yeah, my name is Hannes, and I'm going to talk about data wrangling like a boss with DuckDB. Very briefly about myself, if the clicker works. As Hadley has said, I am the co-founder and the CEO of a company called DuckDB Labs. But I'm also a professor of data engineering, as a sort of side gig, at Radboud University in Nijmegen, which, in case you have never heard of it, that's okay, it's very small.

Tables through history

But I wanted to start with something that really inspires me, and I don't know how many of you have seen this thing. Anybody has seen this? Raise your hand. Nobody. Mark has seen it. Okay. This is one of the first tables that we know of. It is a proto-cuneiform tablet excavated in Uruk, which is, I think, in today's Iraq. And it's dated to 3200 BC.

It's in the Metropolitan Museum of Art in New York. And this is interesting, because if you look at it and squint a little bit, you can see an Excel table, right? And in fact, it is exactly that. It has two columns, it's item and amount, and two entries. One is hop and one is barley. One can only guess what they were making.

And the cool thing about it is that it predates this: record-keeping predates literature by about a thousand years. So we've been making tables for a thousand years before we had the first text. And I really liked it when I found this. I was really happy, because I was always like, what is it about my brain that's so excited about tables? Something must be wrong with me. But now I'm content thinking that I'm just an older version of a human, a Stone Age person, and I'm very happy with that.

Motivating the talk: databases are frustrating and slow?

I want to motivate my talk with a Hadley quote, as is tradition. This is in the dbplyr vignette, and I will repeat it until it's removed: if your data fits in memory, there is no advantage to putting it in a database. It will only be slower and more frustrating.

If your data fits in memory, there is no advantage to putting it in a database. It will only be slower and more frustrating.

But I really like tables and databases, and I kind of want to explore this a bit together with you. And I want to say that this quote, and others by people like Thomas Lumley and Anthony D'Amico, actually inspired the work on DuckDB: this notion that data management systems are frustrating and slow. We set out to change that.

The CSV problem with Postgres

So I went on the internet and I googled "big CSV file". This was the first thing that came up. It's the New Zealand stats agency, Stats NZ. They have a census data set, multiple CSV files for download on their website. Really, I promise you, this was the first thing that I found; good on them that they have all this stuff for download there. And then I just went through the list and grabbed the biggest CSV file I could find. It's six files in a 100 megabyte zip file or something like this. The biggest one is 857 megabytes large. It has 34 million rows and six columns. You would agree with me that that fits in memory, yes?

Let's unpack the statement that we've seen earlier. Say I want to look at this file from Stats NZ with a database, let's say Postgres, the world's most overhyped database system, right? I'm going to ignore some of the pain that is required to set Postgres up in the first place. No, I'm not going to talk about credentials, ports, users, all that. It's a pain in itself. Hopefully you believe me on that.

But, you know, in principle, what we're going to do in Postgres is: we have to create a table, and we have to use this COPY command to get the CSV file into the database, and then we can finally do the analysis. OK, so we look at the CSV file first, just on the command line. We go like head data.csv. Very good name. And we see these columns there: it's year, age, ethnicity and so on. And by squinting, and with a lot of experience (after all, I'm a professor of data engineering), I can see that, OK, this probably is a table with these columns.

And this looks like these are numbers, although the leading zeros somewhere around there seem a bit worrying, right? Leading zeros. Are these really numbers? We shall see. So then, after figuring all this out, I type the copy command into Postgres, and it goes: invalid input syntax for type integer: "..C", line 18. Like, OK, fine. Column count is clearly not an integer, which is weird, because that's the only one where I was quite sure it was a count. But it isn't. OK. So then we drop the table again. Fine. We create the table again. We make this count a varchar, because, you know, OK, clearly we have to figure this out later on. And then we call the copy command again. And it goes: invalid input syntax for type integer: "CMB07601", line 431,741. And from your laughter, I know exactly what you're feeling right now.

So that's not great. OK, fine. Drop the table again. Go for broke. Go for broke. And then just copy the data. And lo and behold, we've copied the 34 million rows in 16 seconds. This is all on this MacBook, by the way; I'm running all of these experiments on this MacBook. Now I still have five other files to do this with, right? I'm going to spare you that. And then, after I've done all of this, I finally get to use this SQL thing. At this point, I've spent about an hour on this. And indeed, this is more frustrating than I thought it would be.

Speed of CSV readers

Let's just look at the speed of a couple of off-the-shelf solutions. And again, it's not only about performance. And this is very weird coming from a database person, I know. But if we use readr, for example, as you know, one of the many CSV readers that we have in R: you point it at the data, and it says, hey, I've read this file for you. And you think, hang on, weren't you only going to tell me that you don't like the data types? And no, no, no. It says, no, I figured it out. There are three character columns in here, and three numeric columns. And here is, by the way, what this thing looks like. And all that in ten seconds again.

The execution performance of ten seconds is not so relevant for an interactive use case, because this took, you know, twenty seconds overall; I had to type the command and press enter. But it is really interesting to see how the data science tools, like R and Python and their ecosystems, have solved this frustrating thing, actually in a pretty good way: you just do it. And comparing that to Postgres, the shining light of the database world, there is really a huge, huge difference.

And why is that? I think it's an interesting question. I think the data management community in the past has just never cared about reading CSV files, or about people. And this is not an exaggeration. I mean, if you think about it, how do you sell Oracle to somebody? Well, it's very easy: you meet the other CEO on the golf course. Right? Done. You don't have to talk to the people that actually have to use Oracle, or Postgres, or anything, because they would tell you. They would rip you apart. But since this is not relevant in the sales process, they don't care. In data science tools, we really care, which is why they are so much better.

Introducing DuckDB

Having said all that, we can of course do a formal benchmark here and say, let's take a file and throw it into various readers. And we see that there is a spread of time between 22 seconds and 2.9 seconds for data.table, which traditionally had a very good CSV reader. Polars doesn't really manage this, because they are, I think, too eager in their optimizations here. But basically, again, the performance of the reader is only so important. And it's interesting how that ties these two statements together, because "frustrating" and "slower" are kind of similar in that sense. Right? Because "slower" actually doesn't refer to performance here. It refers to the end-to-end time.

So, because we realized that this was true for most data management systems, but we really like data management systems, we thought: how can we maybe build a data management system that is not slower and more frustrating? So Mark, who is here, and I got together, and we started thinking about what an analytical database system that's not slower and frustrating could look like. And we did this research at CWI in Amsterdam, which is the Dutch national research lab for computer science and mathematics, which is famous because it is where Python was invented. But I'm at the wrong conference.

The result of this effort is DuckDB. I want to do a quick poll. Who here has used DuckDB? Oh, that's excellent. It's excellent. But there are a lot of people that haven't used it yet, so you will forgive me for trying to explain why it's maybe a good idea to use DuckDB. And I think the user experience is really core to the philosophy of DuckDB. People have focused too long on the little middle thing of the burger: the DBMS engine, the join algorithms, the scan performance, all these things. And nobody had really thought about the end-to-end user experience. And we really set out to change that, to make a data management system that is nice to use.

And sort of by accident, we also invented a new category of data management system, which is an in-process analytical system. I will say in a second why it's in process, but basically it means that there is no setup required of a server or files or anything like that, but it just runs wherever you are. And it's not a transactional system like Postgres or SQLite or actually most other data management systems, but it's analytical, meaning that it's really focused on doing analysis and not on sort of tracking orders in a web shop. And it didn't exist yet. So that's pretty good. It's always good to invent a new category, right?

Let's go a little bit in more detail, like the in-process architecture. I think this is the biggest sort of departure from traditional data management systems architecture that we did in DuckDB is that it runs within another process. It's a bit hard to imagine what that means. But for example, if you're in R or in Python or in Java or whatever, if you're pulling up a DuckDB connection, it means that the entire code of the entire database management system actually runs within that process. And this sounds very weird, but we're actually quite used to it because the world's most widely deployed data management system, not Postgres, but SQLite, actually does the same thing.

And it has really great effects on usability, on setup simplicity, all these things. It also makes it really easy to install. So you can type install.packages("duckdb") in R. You can also do pip install duckdb. We support a bunch of other stuff. We have zero dependencies, so you don't have to first install, I don't know, OpenMP or something else like that. It's all wrapped into one package. There's no apt-get. There's no Docker. There are no funky drivers, no funky GPU drivers, nothing like that. It runs really anywhere: Mac, Linux, Windows, ARM, Intel, doesn't really matter. As I said, for users it's just installing a package. So that's nice.

And it's also batteries included. I know, that's a term from yet another system, but it's a fully featured data management system. We have transactions, we have persistence: we can write tables to disk, and we can make transactionally safe changes to these files, even if the changes are quite massive. We have a crazy amount of operators, with high-performance implementations for operators like joins, aggregates, window functions. DuckDB can, out of the box, read CSV files, Parquet files, JSON files, a bunch of other stuff. It's all built in, right? We can connect to cloud storage like S3. We can talk to Postgres, we can talk to MySQL, we can do all these things. And it's integrated, as I mentioned, into R, Python, Java, and a zoo of other languages. So we really try to have all the things there.

And it's fast, in the sense of efficiency. And I realize I just said that speed doesn't matter, but that was for the interactive use case. Speed does actually matter in two ways. First, if you're running something again and again, say every ten seconds, then the speed does matter. And the other thing is, and I will come back to that later, that by making something faster, you actually expand the capabilities of what you can do on a single node. So this is great.

So how is DuckDB fast? Well, it's hard to explain; it's easier to show an effect: if you run DuckDB, your laptop can kind of get warm. So a word of warning here. We automatically parallelize the analysis tasks that you give us across all the available CPU cores. But blind parallelization is actually not very meaningful without good per-core efficiency, and we achieve that by having highly efficient C++ implementations for all the various operators. We can also use the hard disk if the data set does not fit in memory. So if we have this case where we want to run a query on a terabyte file, and we actually have to sort the whole thing, you can imagine that you might not have a terabyte of RAM. And sorting everything is a quite IO-intensive operation. So DuckDB can transparently and gracefully use the available disk space. That's really cool. And all this happens completely automatically, which is why it feels like magic to many people: it just does things quickly, and they don't understand why.

And of course, we know that any piece of sufficiently advanced technology is indistinguishable from magic. And I think what's fair to say is that DuckDB is this culmination of decades of research in execution engines, but then really wrapped in a user-friendly package.

DuckDB is this culmination of decades of research in execution engines, but then really wrapped in a user-friendly package.

You will forgive me some more words about speed. So this is a benchmark that originated from the data.table folks, which we've kind of resurrected and keep updated now. There are two tasks: a group-by task and a join task. And this is a 50 gigabyte data set here. And you can see that, lo and behold, we have also managed to be faster than everything else here. The same holds on the join task. And this is not even the latest of what we have; I think the next version will be even faster than that.

But coming back to the golf course: I don't really go to the golf course. So DuckDB is free. It's free and open source under the MIT license. So you can build a company on it, and you don't have to talk to us at all, which I know sounds a bit strange from the traditional database way of thinking. But we really think that in order to make an impact in how people think about data, which is kind of what we're trying to do here, we really need to give people the state-of-the-art tools to work with big data sets. And we can't make this happen just for one company, or for the customers of one company. We have to make this work for everyone.

The world is kind of liking DuckDB. So here's a plot of the GitHub stars. It's a crap metric, mostly VCs care about it, but it's interesting to see how this is going. So we really have exponential growth in the adoption of DuckDB, and we're really, really grateful for it. We eventually spun off from the research institute into a company called DuckDB Labs, because every open source project needs to fund itself somehow. You can fund an open source project by working for another company and doing nights and weekends, but I can't really do that. Or you can sell your soul to VCs and then, basically, you know, continue until the party stops and somebody has to change a license.

But DuckDB itself is, yeah, backed by DuckDB Labs, which is a company that provides support and consulting contracts around DuckDB, but is actually not the owner of the software itself, which sits in a non-profit foundation. So this is really a unique setup. The company has an interest, but it's a long-term effort backed by an independent foundation, just to make sure that this thing keeps being available. So even we would have trouble rug-pulling this.

DuckDB loves R and Python

And DuckDB loves R. Hey! As I said, the motivation to build DuckDB came from interactions with the R community. I actually spoke about DuckDB for the first time at Kirill's meetup (where's Kirill?) in Zurich, in the summer of 2019. That was a long time ago. And because it's in-process, we can do some pretty amazing things within R. For example, DuckDB running inside the R process can directly read any data frame that you have inside the R process, as if it were a table. We don't have to first convert it to anything. No, no, no: if a data frame exists inside your R process, we can run queries on it. And we can also take the query results that we produce in DuckDB and convert them directly into data frames. Some systems out there also have functionality like this, but it tends to be insanely slow. We can do this very quickly because we're sitting in the same process space; we don't have to do any serialization or anything of that sort.

We also love Python. So it's the biggest API actually by downloads per month. It's like six million downloads per month. It's pretty ridiculous. And we can do also the same trick with pandas data frames. So if you run DuckDB in a Python process and you have a pandas data frame sitting there, we can also treat it as a table and read it. We can also produce results again as pandas data frames because that allows us to integrate with downstream libraries. And because many data sets or many tools are starting to support Apache Arrow as intermediates, we can also do the same trick for R and Python. If the data happens to be Arrow, we can read and write from it directly. Even streaming. So we don't actually have to copy things. It's just streaming into Arrow, streaming from Arrow. That works all really well. So that's DuckDB in a nutshell.

DuckDB's CSV reader

Let's revisit the statement: slower and more frustrating. I'm going to go back to the running example in a bit, our wonderful CSV file. First, I want to say we actually wrote a research paper on CSV parsing, because it turns out that this is such an annoying issue that you can actually write a computer science research paper on it. I think we may have been the first ones to have done this; there have been some that came afterwards. But as a result, we actually have one of the best CSV readers out there in DuckDB, because we realized that your analysis starts with getting data from somewhere else, right? You're not going to sit down and type everything into DuckDB; you're going to use something else. So we wrote this research paper, and we took the ideas from it and put them into DuckDB. A shout-out to Pedro from our team, who worked incredibly hard on this. And we can actually read most files out there without human interaction.

So, for example, in DuckDB, you can just type FROM 'data.csv', right? And this is technically a SQL query. We treat the file name as if it were a table name: we have this logic that says, if the table isn't there, but there's a file with the same name, well, probably we should read that file. And if it ends in .csv, it's probably a CSV file. So we do a lot of guessing, just to reduce the frustration. And you will see that it has indeed read this file correctly, with all the column types and so on and so forth.

We also have some cool features in DuckDB, like SUMMARIZE. So we have features that try to make your life as an analyst easier, even if the data set is huge. SUMMARIZE is pretty cool, actually, because you just say SUMMARIZE and then any query, really, and we will compute a ton of summary statistics very efficiently. So this is a file with a couple of million rows, and in a very short amount of time, we will give you the minimum, the maximum, the approximate unique count, the average, the standard deviation, the quantiles, the null percentage, and a lot of other stuff, for all the columns in the file, right? And this is a single-pass thing. So we read through the file once, using our efficient implementation, and we give you a good overview of what's in the file. And I said that we're not going to care too much about performance, but I couldn't resist saying that despite this being one of the most flexible CSV readers out there, it is also the fastest, okay? So this is cool.

Complex queries and duckplyr

But let's do some more complex queries, okay? I didn't get to this part with Postgres, because I was too frustrated to continue at some point. But we're not actually here to read CSV files; we're here to answer analysis questions. So here I have a totally made-up analysis question: given this data set, give me the number of non-Europeans between 20 and 40 that live in the Auckland area, from the 2018 census, grouped by sex. Okay. Now somebody has to translate that into a SQL query. If you have generative AI, maybe you can try, but I have just written down the SQL query. And you can see it's nothing dramatic. It does a bunch of things: we join all these data sets together, because the columns are actually codes and not actual values, except for the count, which is sometimes a value, but sometimes also a "C". And there are a bunch of cleanups, because the statistics people in New Zealand somehow include totals in all the dimensions, and there are rows in the main data that are aggregates of the other rows. And we obviously don't want that in a tidy data set. But if you just ignore all this, we can run this fairly complex query, and it completes in one second.

That's kind of weird, right? It's kind of weird because I just told you it took 1.1 seconds to read this file, right? How can you run this complicated query over all the files, joining them all together, doing aggregation along multiple dimensions, doing a bunch of projections, filters and all that, in one second? Well, again: magic. The reason is what DuckDB actually does. It takes the query and turns it into a query plan, which is a high-level concept of how the query should be executed. And this plan is optimized and then executed in parallel. The filters, for example, move all the way down, so we remove data from the path of the query as early as possible. We prune all the columns that we don't need, as early as possible. And it's a streaming engine, which means we don't actually have to materialize any of this in memory. So it's really efficient.

But then I realized that not everybody loves SQL. So let's explore alternatives, shall we? Obviously we can also do this in dplyr, which I think is much nicer to write and to reason about. So I've written down the exact same thing in dplyr, and I've actually checked whether the results are equivalent, and they are. It's the same kind of thing, where we basically make this joined-together data set with all these joins and the mutations and filters, and then at the very bottom we have our final analysis on this combined data set, and it takes 15 seconds. It's not so terrible, but this is exactly why, on this reasonably sized data set, we see a significant difference in speed: there is just no holistic optimization happening here. dplyr eagerly executes everything as it goes.

So, enter duckplyr. We've been working with Posit to create a sort of hybrid mutant of dplyr and DuckDB, and the result is duckplyr. You have maybe seen the stickers; I've put a bunch outside. So massive thanks again to Kirill for his hard work on this. It's dplyr, just re-engined with DuckDB, and you have the exact same syntax. And this, from the CSVs, directly finishes in 1.5 seconds. That is because we have done this high-level reasoning about how this query should be executed, directly on the CSV files, and we can do this ten times faster. From Parquet files it's actually half that; more on that later. If you run the same query from data frames that are already loaded into the R session, it's roughly a third of this time. So if you're not doing the CSV parsing, it gets even faster.

So if we just run this in standard dplyr, this is the sequence of events: we're reading 35 million rows, which is what the data set contains. And then we run this filter. You can't really see it in the script, but the filter only matches 14,000 rows, because that's just the way the data is distributed. So we're throwing away all this other stuff, except for these 14,000 rows. And then it gets worse, because in the summarization we only ever look at one of the columns, so we read the five other columns for no reason at all. And what we actually need to make this better is, as I mentioned, holistic optimization, where we can derive which columns are going to be used, which rows can be pruned earlier, and all these kinds of things.

But now we have this problem: if we start making these objects, data2 and result, somehow lazy, we would break the integration with other tools. Say we make this a lazy sort of placeholder object, and we want to pass it into ggplot. ggplot would say, I don't know what this lazy thing is. So we somehow have to make the dplyr objects both data frames and lazy placeholders. And we can, thanks to this ALTREP framework, which is really exciting. In R, you can make objects that are sort of both things at the same time, both lazy and non-lazy. So if you use duckplyr, these objects data2 and result will actually be data frames, and other tools cannot really see a difference, but we internally know that they are lazy representations, and we can do the holistic optimization.

So this is really cool. If you want to try it out, it's on CRAN. There is a way of enabling telemetry if something goes wrong, and I encourage you to do that if you want to help us improve duckplyr. But we've also been super sensible, in the sense that it will fall back to the dplyr implementation if there's something we don't understand, so you should actually not see a big difference when you just start using it. So I encourage you to do that.

I also want to show Ibis briefly, because the cool thing about DuckDB is that it's an engine that is kind of agnostic and multi-platform. So DuckDB also runs in Python, and I don't know if Phillip is here, but he made this cool integration with Ibis. It basically means you can write the same query again in Ibis syntax, which is a Python syntax that looks a bit like dplyr, and it will produce the exact same results. And it will also take, in this case, 1.3 seconds, because it also ends up just being a DuckDB query plan. So it's really cool.

Atomic datasets with DuckDB

So now we've done complex queries with DuckDB, and I've shown you that we can ask very, very complicated questions of fairly annoying datasets very quickly, and we don't even have to fall back to my favorite programming language, SQL. Okay, but so far this has been a bit of a defense, right? I've been defending against the accusation that databases are frustrating and slow. I don't think DuckDB is frustrating and slow, but the thing is, there's more to using data management systems than reading CSV files and running queries on them. So I want to talk a bit about atomic datasets with DuckDB. And atomic is a positive thing here.

People say, yeah, okay, let's not use CSV. And I agree. I'm a professor of data engineering; I agree, we should not use CSV files, we should use Parquet files. That's true. And DuckDB can actually help with this. In case you don't know what Parquet is, it's a columnar data format. It's really cool. DuckDB can read and write Parquet files very efficiently. So tip one, step one before using DuckDB: consider using Parquet, if you don't fully trust me yet. That one is a no-brainer.

Yeah, and so we have native support for it in DuckDB. There are other options: Arrow has support for Parquet files, or there's the new nanoparquet package from Posit. The interesting, cool thing about Parquet for this dataset is that if you take this file, which was an 800 megabyte CSV file, and turn it into a Parquet file, it's only 71 megabytes, which is smaller than even the zip file that we had before. And it's actually much more efficient to read. So there's nothing not to like about it.

But then, what happens, and I'm sure you've all seen this: I've done a bunch of stuff in this folder with all these files. I've done some conversions, I've done some scripting, I've done all these things. And now I end up with this ungodly mess of files, right? And there's no more tracking of what came from where, what version was used to create this, and what these scripts are doing. And some people use makefiles there, and then the makefile essentially interprets the timestamps of the files. But I don't think that's a great thing either.

But we get this ungodly mess of all these files. And we could be doing better, and actually DuckDB can help here, because in a DuckDB file we can store many tables, we can store views, we can store functions, all of that in a single file. So what we can do, in this example, is create a single DuckDB file that contains all the CSVs in a parsed, columnar, typed, compressed binary format. The ingestion can be atomic: if I wrap this import statement in a transaction, either it all goes through, or nothing goes through. The resulting file is also around 100 megabytes, so it's about the same size as the zip file, just much better.

And a couple of months ago, we finalized our storage format. So the DuckDB file format is now stable and will remain stable for the foreseeable future. And there are actually some organizations, like Hugging Face, that are already shipping all the data sets that they have as DuckDB files, which is really crazy. So you go to a Hugging Face data set, you click a button, and you get the DuckDB file. It's really cool. They can even be read remotely, via HTTP and S3.

So it's really cool. We put the data in this file. But that's not all. We can also do things like this, where we create something called views, which are like a virtual table: you define a transformation using SQL, or you can also do this using some other tool, it doesn't really matter. And these views are actually stored in the file, and they travel with the file. That makes it much easier to write high-level queries. You can stack views in order to describe multi-step derivations. And that means that whenever you change something in the base table, the views will actually update. They don't cost any storage, because they are just symbolic descriptions. And it's possible to hide a lot of complexity using these kinds of tricks. And you can share the file with somebody else, and they can look at the file, see the view, and directly use it. It's still a single file.

Another really cool thing in DuckDB is that you can create macros, which are shortcuts for things you use a lot. At the top here we're repeating ourselves a lot, because the data has these total columns in all these different dimensions, and we have to filter them out so they don't muck up our summaries. But you don't want to repeat these ugly expressions 15 times. So you can create a macro, say not_total, that holds an arbitrarily complex expression, and then just use it. And again, the cool thing is that the macro lives in the file. That means you can send the file to somebody else and they can use it, together with all the data, together with all the views, together with everything.

And finally, I want to talk about transactions. Transactions may be something you remember from a database course, where we move some money from A to B or something like that. But Mark has actually implemented the first really great transaction management for analytical use cases, where we're changing data a lot: ingesting a lot of data, updating a lot of data, adding columns, removing columns, that sort of stuff. That means we can make atomic sequences of changes to our data (this is why I said atomic) that either all go through or not. And we can even do a manual rollback if we did something by accident, something we didn't intend. So if I go back to this example where I couldn't read the file because it has some dot dot C in it, you can just say, no, this was actually not what we wanted. You can roll back, and your file and your data will be unaffected.

Is big data dead? The single-node future

But I think it's an interesting question, and I would like to take the last couple of minutes of the talk to discuss it: what does this actually mean for data engineering and data infrastructure? What does it mean if we have tools that can comfortably run through gigabytes of data on your laptop? There's been some discussion about whether big data is dead or not.

So let's do a thought experiment on data creation. The human typing speed is roughly 200 characters per minute. The working population of the Netherlands is about 10 million people. I will assume I have infinite money, because I'm a VC or something, and I've employed all of them. All 10 million. A Dutch work year has about 100,000 minutes, give or take. So if I employ the entire population for a whole year to type important data, I'm sure, into a computer, we end up with roughly 200 terabytes of data. This is, of course, ASCII text, which is compressible, so if you compress it somewhat reasonably, we end up with something like 60 terabytes.
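The arithmetic from the talk, written out (the ~3x compression ratio is an assumption to match the 60-terabyte figure):

```python
# Back-of-the-envelope: how much data can a whole country type in a year?
chars_per_minute = 200           # rough human typing speed
minutes_per_year = 100_000       # one work year, give or take
working_population = 10_000_000  # the Netherlands, roughly

raw_bytes = chars_per_minute * minutes_per_year * working_population
raw_terabytes = raw_bytes / 10**12            # 200.0 TB of raw ASCII
compressed_terabytes = raw_terabytes * 0.3    # assumed ~3x compression -> ~60 TB
```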

And that sounds like a lot, right? 60 terabytes. But it's actually not. You can buy these hard disks; they cost around $400. There's a banana for scale. I mean, we have seen IBM hard disks from the 50s, right? The ones you lift into an airplane. No, no, no. These are tiny. So the maximum output in usable data of the entire Dutch working population would easily fit on these two hard disks. There's no scale-up required; I don't have to buy a Hadoop cluster for this. In two years, it will be one disk. Will we have twice the number of people in the Netherlands? I sincerely doubt it. In ten years, this will fit on a laptop.

The number of people who can type anything or produce any meaningful data is growing much slower than our capability to crunch through data. Look at compute. The Apple M3, a pretty standard computer, I'm sure some of you have one, has 16 cores. They are very fast cores. They have 400 megabytes