Colin Gillespie - Give me the Data

video
Oct 31, 2024
19:30

Transcript

This transcript was generated automatically and may contain errors.

Thank you very much. So, as just introduced, my name is Colin Gillespie. I'm almost an ex-academic; anyone who's heard me give talks knows the academic part has decreased by 10% every year for the last five to ten years, so I'm now down to one day a week. I'm one of the co-founders of Jumping Rivers, so we've got a wonderful booth. We've just been rebranded, so feel free to come and see how wonderful it looks. Much appreciated. And I've been reading data into R for 20-odd years, which makes me feel old. I started using R in 1999.

As with all good things, it started during a PhD, and I've still got all my PhD files. It's always helpful to look back at what you were doing and realise how stupid you were. All my R scripts had the file extension .com, because they were commands, and that made perfect sense in my PhD brain. And all my CSVs were called .dat, because they were data files, and that made perfect sense too. So I've been using R for a long time, and using it badly for quite a bit of that. I even used the underscore for assignment. Now, has anyone in the room used the underscore for assignment? Two people. There you go.

That is ageing you. So in the dark and distant past, not only did you have the equals and the arrow and the double arrow, left and right, and all that stuff, you could also use the underscore for assignment, which would have played havoc with the tidyverse. And it wasn't always plain sailing. I abandoned R for Python, don't judge, round about this time. This was a postdoc, and they were paying me money, which is why I did it; that seems like a reasonable thing to do. They wanted to give me money for a postdoc, they were using Python, so I learned Python, and then I came back, I returned. But I always felt that something was a bit off with R.

Right? There was always just something in the back of my mind that was a bit off, and I was never sure what it was. I'd been thinking about it for quite a while: what was it I felt? And then I had an epiphany. I think it's just too big, too much choice. In Python you've got one way of doing things, whereas in R you've got so many different ways. And then I remembered something I'd seen in the newspaper not that long ago, something everyone in Britain faces: the hellish reality of living in a castle. And I realised people in the US don't have this problem. So I have at least three castles, and it's really hard just maintaining the grounds and the staff and everything else. But this is R: there's just too much choice. We've not got one way, we've got lots of different ways. Sometimes that's wonderful, and sometimes it's not so great, because we just want to get something done. So that's my introduction.

So I'm going to make a caveat or two. There's lots of simplification in this talk, basically so I could get nice diagrams. If I'm completely honest with you, the academic part of me was saying there should be more arrows, but I just went for simplification. The goal is to make you curious about what's out there, to give you a feeling of: oh, I could try this in these circumstances, or I could try that where I am. So if you have detailed knowledge of these tools, don't ask horrible questions, please. Right, so my entire life, or my entire talk, is basically going from the left-hand side to the right-hand side. We've got some data, and then I want to filter, select, summarise, do modelling, or something like that. Data on the left, result on the right. And this is basically my entire life over the last 25 years, from my PhD, postdoc, then academia and Jumping Rivers: getting data from the left-hand side over to the right-hand side.

Starting with CSV

So when I was thinking about this, I started by asking: well, where did I begin? I started with the CSV file format, which should be familiar, at round about zero to 10 meg. Don't quote me on the exact numbers, they're very hand-wavy, but that seems like a reasonable range.

During my PhD, for some reason, I used the scan function to read in data, because the person next to me used the scan function, so I copied and pasted from them. Don't use the scan function for reading a CSV file, it's just bonkers. But I read in CSVs, and the classic tools would be read.table or read.csv, and then you get to your result, right? I've tried to follow this structure throughout the talk: data on the left, then a little bit about how we store our data, so here I'm storing it as a CSV, and then the engine is my way of saying how I'm getting it into R, okay? Now you could be a bit more fancy and use the readr package, or data.table, and I'm sure there are another three or four packages out there, but we're all going to get to the same place.
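The engines mentioned above can be sketched in a few lines. This is a minimal example, assuming readr and data.table are installed; a throwaway temp file stands in for real data.

```r
# Write a small example CSV to a temporary file (a stand-in for real data)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, age = c(20, 35, 62)), path, row.names = FALSE)

# Engine 1: base R
df_base <- read.csv(path)

# Engine 2: readr (tidyverse); returns a tibble, with better type guessing
df_readr <- readr::read_csv(path, show_col_types = FALSE)

# Engine 3: data.table; fread() is typically the fastest of the three
df_dt <- data.table::fread(path)

# All three engines end up in the same place
stopifnot(nrow(df_base) == 3, all(df_base$age == df_dt$age))
```

Whichever engine you pick, the result is the same rectangle of data; the differences are speed, type handling, and which ecosystem the output slots into.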

Okay? And I'm going to defend CSV. It's no longer quite such a popular thing, but the good thing about CSV is that it just works, right? And as you get older and grumpier, you just want things to work. So without going deep into reproducibility pipelines, it's quite nice just having something that works. You've got a CSV file, you can look at it, you've got read.csv or readr, and it just works. Even better, it just works next year as well. Even if you don't intend to reuse it, you probably will at some point, or you'll probably give someone your code, and you just want it to work. And it'll probably just work in five years' time as well. Probably.

There was something I saw on social media that I tried to track down again a few days later and couldn't find. So this is either completely true or a figment of my imagination, but it sounds sort of true. The claim was that most of the data stored in Amazon S3 buckets, so cloud storage where you can put data, is JSON or CSV files; the stat quoted was around 90%. That's not to say it's a good thing, or what people should have been doing. It just works. Now, the big problem with this approach is that it's in-memory only. What that means is, when we're reading this data in, essentially we're limited by the size of RAM, okay? So that's why I'm saying round about zero to 10 meg, or maybe zero to one or two gigabytes; that's the rough size. Beyond that, you probably want to do something else.

Parquet and column storage

So now I'm thinking about larger files, okay? Here I've got 10 megabytes up to, waving my hands a bit, round about 100 gig. Sadly, we still want to do exactly the same things. Even though the data's got bigger, we still want to filter, we still want to select, we still want to summarise. And crucially, and this is the important part, in-memory isn't possible, right? Unless you've got a very posh laptop, if you try to read this amount of data into memory, the laptop will just sit there, burning a little bit, before waving a white flag and dying. So we can't just do it in memory any more; we have to find some other way. This is where parquet comes in. With parquet I'm changing the file format: rather than CSV, I'm using the parquet file format. Why would you want to use it? First of all, it's cross-platform. I use Linux, which means I was quite smug a few weeks ago when CrowdStrike hit the world and I was just sitting there with a Linux laptop.

And everyone at Jumping Rivers uses Linux, so we were all quite smug as a group. But parquet is cross-platform, so it works on Linux, Mac, and Windows, and that's important. And it's got a column layout. So why would you care about a column layout? Let's suppose we've got an example data set. We've got an ID column: 1, 2, 3, a bunch of integers. We've got some names: n1, n2, long name. And we've got a bunch of doubles; a double is just a posh way of saying a number with some decimal places, okay? If we were to do row storage, which is what CSV does, we would store the data basically exactly as you'd expect: ID, name, age. ID, name, age. ID, name, age.

And there's nothing inherently wrong with that; it's fine, right? One thing to note is that some of these values are the same size: the numbers 1, 2, and 3 are all integers, so each takes the same little block of memory. The ages 20, 35, and 62 are all doubles, so they've got the same size in memory as well. Whereas the names have different sizes, right? Here I've got two characters, whereas "long name" has got nine characters, including the space in the middle. And you can imagine you might actually have something that's tens or hundreds of characters long, so you've got different sizes, and that has implications.

So what that means is, row storage is very good for this sort of query. I've written a little bit of pseudo-SQL, but essentially here we're saying: give me everything in row 2. You can see intuitively that if you've got this data structure on your hard disk, all we need to do is find where row 2 starts, then go along a little bit, and we've got the data. That's quite good; it's simple. But this query, give me all the ages, is bad, or at least harder, because to reach the ages 20, 35, and 62, the strides, that is the jumps between the numbers, are different because of the names in between: we've got "long name", we've got "n2", we've got "n1". So that retrieval is going to be slightly more complex.

For this data set, you wouldn't notice, right? We're talking micro-nanoseconds, a made-up SI unit. It's nothing. But for big data sets, that stuff has implications. Column storage stores the data like this instead: all the IDs in a block, then all the names, then all the ages. It's all flipped around. So the row query is now bad; there's no such thing as a free lunch. Not awful, but not great, because to get row 2 we need to find ID number 2, then find the second name, then find the second age, so there's a little bit of jumping about. But the age query is now dead easy, because all we need to do is find the start of the ages, and then we've got all that data in one place and we can do the calculation. So you win some, you lose some.

The parquet file also does something clever with metadata when storing the data, right? I'll give you a few examples; there are lots of little tricks it uses, "tricks" being a poor way of saying really clever stuff under the hood. If I said, I've got this sequence of numbers, 4, 4, 4, 4, 4, 1, 2, 2, 2, 2, would you mind memorising it? Not many people would memorise it digit by digit. Instead, you'd say: I've got 4 repeated 5 times, then a 1, then 2 repeated 4 times. That's easier to memorise, whereas trying to remember the actual sequence is quite hard. And you can imagine that if parquet stores the information in the same way, it's a lot shorter: rather than storing all the numbers, we can just store the value and the run length; this is run-length encoding. If you've got some data, say patient data with a city column, you can see how the city might repeat itself over and over and over again, or names, or surnames, whatever.

So here's a little data frame. I've got the letter A repeated, what's that, 10 to the power of 6 times. So a fair few times. The CSV file was round about 2 meg, and the parquet file was 1 kilobyte, right? So it's basically nothing, a few pixels in terms of size. And again, it's not magic: we're just storing "the letter A, this many times", and that's all we record. That's got massive implications for storage.
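That size comparison is easy to try for yourself. A rough sketch, assuming the arrow package is installed; the exact byte counts will vary a little by version, but the gap is the point.

```r
library(arrow)

# One column: the letter "A" repeated a million times
df <- data.frame(x = rep("A", 1e6))

csv_path <- tempfile(fileext = ".csv")
pq_path  <- tempfile(fileext = ".parquet")

write.csv(df, csv_path, row.names = FALSE)   # stores every single "A"
write_parquet(df, pq_path)                   # encodings collapse the run

# The CSV is a couple of megabytes; the parquet file is around a kilobyte
file.size(csv_path)
file.size(pq_path)
```

The CSV has to spell out each row, so its size grows linearly with the number of repeats; the parquet file stays tiny no matter how long the run gets.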

We can also do dictionary encoding. Here I've got a data frame with the words "Jumping Rivers" repeated 10 to the 6 times. Rather than storing "Jumping Rivers" over and over and over again, what we can do is replace it with, say, the integer 0, and keep a little lookup table saying that 0 means "Jumping Rivers". With that, the CSV file is 14 megabytes and the parquet file is 1 kilobyte. That's quite a saving. And parquet has lots of other little tricks it does in the background, and when you start combining all these tricks, you get massive savings. So your data's smaller.
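The lookup-table idea should already feel familiar to R users: a factor is exactly this, integer codes plus a table of levels. A tiny sketch:

```r
# A factor is R's native dictionary encoding: integer codes + a lookup table
x <- rep("Jumping Rivers", 10)
f <- factor(x)

unclass(f)   # the stored integer codes: all 1
levels(f)    # the lookup table: "Jumping Rivers"
```

Parquet applies the same trick automatically when it writes a string column, which is why a million copies of one phrase compress down to almost nothing.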

So now we have a way of storing these larger files, and the engine we'd use could be the arrow package, so Apache Arrow, or the nanoparquet package. We've also got a world-famous author, Nic, who's just published a book, so please talk to Nic afterwards about her book on parquet and Arrow and everything else. She's right there, sitting in the front row. But more seriously, she's got a book, it's online, it's really good, you should read it. So now we've got a way of storing data and a way of reading data into R, and then we can do all this filtering and sorting. And the arrow package does a whole bunch of magic, right? It doesn't read all the data into R. So even if your parquet files are 100 meg, or a gigabyte, or five gigabytes, it doesn't do the calculation in memory; it tries as much as possible to do the calculation on the file, and only then reads the result in.
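A minimal sketch of that lazy behaviour with arrow and dplyr; a tiny temp file stands in for a multi-gigabyte dataset, but the pattern is the same.

```r
library(arrow)
library(dplyr)

# A small parquet file standing in for a multi-gigabyte dataset
pq <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:3, age = c(20, 35, 62)), pq)

# open_dataset() only reads metadata; the rows stay on disk
ds <- open_dataset(pq)

# The dplyr verbs are translated into a query pushed down to the file;
# data only enters R's memory when collect() is called
res <- ds |>
  filter(age > 30) |>
  summarise(mean_age = mean(age)) |>
  collect()

res$mean_age  # (35 + 62) / 2 = 48.5
```

Until `collect()`, nothing heavy happens: you can pipe together filters and summaries on a dataset far bigger than RAM, and only the (small) result ever gets read into R.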

Comparing RDS and Parquet performance

So I thought about the R binary data format, which I'm quite a fan of, you know, the RDS file: saveRDS, readRDS, all that. Here I've got a little comparison. We've got RDS, our native binary format, and then we've got parquet twice. The reason parquet appears twice is that we've got two different compression algorithms, okay? RDS uses gzip: when you save an RDS file, it's doing gzip under the hood, though there's an argument you can change if you want. Parquet's default is snappy, which, you might imagine, is going to be a bit quicker, unless we've been mis-sold; and then we've got parquet with gzip as well. In terms of size, I can't remember the exact numbers, but it was a big data frame, lots of columns, lots of rows: RDS was 115 meg, parquet was 100 meg, so that's roughly a 10% saving. That's not bad just in terms of size.

It gets more important when you start to think about uploading and downloading datasets. If you've got a shiny app that has to download some data right at the start, and it's a big dataset, that little tweak saves you a lot of time. Then the write speed: what I'm talking about here is writing that big data frame, lots of rows, lots of columns. For this example, RDS took about 27 seconds to write, parquet with snappy took four seconds, and parquet with gzip took 12 seconds.

Typically, for the sort of applications I work with, we don't care so much about writing; it's the reading part that's important. For reading, say you've got an RDS file on your laptop and a parquet file on your laptop, and you read both into memory: how long does that take? We saw roughly an order of magnitude difference. What that means in practice is going from a shiny app that takes five seconds just to read in its data, which is painful because everybody now expects everything to be, you know, Google-fast, down to around 0.3 or 0.4 seconds, and things start getting a bit faster.
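A rough sketch of how you might run this comparison yourself, assuming arrow is installed. The absolute timings will differ from the talk's (they depend entirely on your machine and the data), but the shape of the result should be similar.

```r
library(arrow)

# A reasonably large data frame; exact timings depend on your machine
df <- data.frame(x = rnorm(1e6), g = sample(letters, 1e6, replace = TRUE))

rds <- tempfile(fileext = ".rds")
pq  <- tempfile(fileext = ".parquet")

# RDS compresses with gzip by default; parquet defaults to snappy
system.time(saveRDS(df, rds))
system.time(write_parquet(df, pq))

# Reading is where the gap tends to matter for apps and reports
system.time(df_rds <- readRDS(rds))
system.time(df_pq  <- read_parquet(pq))
```

Both round-trips give you the same data back; the question is only how long you and your users sit waiting for it.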

DuckDB for larger datasets

For larger datasets, here I'm thinking about 100 gigabytes plus, we've still got the parquet file format, we've still got CSV (I'm sure there's someone out there with CSV files hundreds of megabytes long), and we might also have a database, right? Those are the three formats I'm thinking of. And the way I'd suggest handling this, with the usual caveats, is to think about DuckDB.

And here I've slightly waved my hands and called DuckDB a format. What I'm thinking of is: I've got DuckDB, so a database, and we read those three file types, and others, into it. Then we can access DuckDB using a combination of dplyr and duckplyr. That would be the way in. duckplyr allows you to write native dplyr-style code without having to understand what's going on underneath. So all that's happening here is: we've got a massive amount of data stored in these formats, database, parquet, CSV; we read it into this DuckDB intermediate stage, perhaps a caching stage; and now we can do really fast queries on DuckDB.
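A minimal sketch of DuckDB querying a parquet file in place, assuming the DBI, duckdb, and arrow packages are installed; a tiny temp file stands in for the big cached dataset. The duckplyr/dplyr route mentioned above layers the familiar verbs on top of the same engine.

```r
library(DBI)
library(duckdb)

# A small parquet file standing in for the cached intermediate stage
pq <- tempfile(fileext = ".parquet")
arrow::write_parquet(data.frame(id = 1:3, age = c(20, 35, 62)), pq)

con <- dbConnect(duckdb())

# DuckDB queries the parquet file directly: no import step, no full read
res <- dbGetQuery(con, sprintf("SELECT AVG(age) AS mean_age FROM '%s'", pq))
res$mean_age  # (20 + 35 + 62) / 3 = 39

dbDisconnect(con, shutdown = TRUE)
```

Because DuckDB scans the file itself, repeated queries against the same parquet or CSV cache stay fast without ever pulling the whole dataset into R's memory.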

Why load data into DuckDB? There's a whole keynote on it tomorrow, so I'm definitely not going down that route; you should go to it, Hannes will give a brilliant talk. But, roughly and hand-wavily: it's faster in some cases, particularly repeated queries, where you're running the same query over and over again. And for joining data, DuckDB has some cool built-in stuff that makes joins a bit quicker than, say, nanoparquet, so it's quite nice.

Summary

To summarise: there's not a single solution. As I've got older, I've come to value stability quite a lot. I still enjoy messing about with new stuff, but I also like things that just work. And I've tried to build this talk out of stable tools: CSV files, parquet files, DuckDB. They're not going anywhere, they solve your problem, and depending on your current problem, you should choose one of them. If you've got 10 minutes, you can just mess about with them and get a rough idea of what works.

There are lots of blog posts on the Jumping Rivers website; we've done three or four posts on this stuff with more details. But thank you very much for listening, and I'm happy to take any questions.

Well, there seems to be an issue with our Slido right now, but I do have a question for you. You talked about different file types and how to read them in. What are you using to transform that data once it's read in? I know Spark has become more popular as data sizes grow, for the speed of running those queries and doing data manipulation on them.

So, I think it depends on your use case and what you care about. The main thing I've been thinking about for this talk is: you've got an app, or a Quarto or R Markdown document, and you've got a dataset you're pulling in that's quite big. Maybe it's on Posit Connect, or maybe it's somewhere else, and you're trying to optimise the building of the document. A lot of the clients we work with don't need the data in real time. Things might have changed in the last three hours, but who really cares? So you can do a little run at the stroke of midnight, convert to whatever format you want, and that's what you use for the rest of the day. At the stroke of midnight you regenerate that cache file, whether it's a parquet file or something in DuckDB. So it's done not quite in real time, and then you've got a very nice, quick way of generating Quarto documents, shiny apps, that sort of stuff. Yes, definitely.
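That midnight-cache pattern can be sketched in a few lines. All the names here are hypothetical (your real source might be a database or an API rather than a CSV), and the scheduling line is just an illustrative cron entry.

```r
library(arrow)

# Hypothetical nightly job: pull from the slow source once, cache as parquet
refresh_cache <- function(source_csv, cache_parquet) {
  df <- read.csv(source_csv)        # the slow, once-a-day read
  write_parquet(df, cache_parquet)  # the fast cache everything else uses
  invisible(cache_parquet)
}

# The app or Quarto document then reads the cache all day:
#   df <- read_parquet("cache.parquet")
#
# Scheduled at the stroke of midnight, e.g. with cron:
#   0 0 * * * Rscript refresh_cache.R
```

The app never talks to the slow source directly; it only ever sees the parquet cache, and staleness is bounded by how often the job runs.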