Wes McKinney - The Future Roadmap for the Composable Data Stack
Transcript
This transcript was generated automatically and may contain errors.
Okay, our next speaker is representing the year 2016, but he has spoken in '15, '16, obviously, '17, '18, '19, '20, '21, '22, '23, '24, all 10 years. Yeah, that's freaking awesome, all 10 years.
All 10 years of the R Conference for Mr. Python, right? As we all know, this is the exemplar of the communities getting along. He's actually, I want to say, been a pillar of this, of saying, why is everyone arguing about languages? Let's build tools that all the languages can use. And I think he's been incredibly successful at that. So yes, I think, you know, he has built so many tools that have changed the world. Yeah, he's changed the world. Sorry. Okay, but I'm starting to embarrass him. So I'll just tell you that he really loves doing yoga. Please welcome Wes.
All right, let's see if my audio is working. It is. Yeah, cool. Okay, cool. Awesome. And my eyes are not as good as they used to be.
So yeah, thanks. This has been 10 years in a row. Thanks to everyone for bearing with me, and to the people, the little people at home, watching the video online. So I've given some variant of this talk a few times in the last year. So I'm going to talk about some of the ideas from it, but I'm also going to give a bit of a retrospective on the last 10 years of the New York R Conference, or at least my involvement in it.
So yeah, I guess you folks mostly know me from the pandas project, and, you know, more recently I've been pretty involved in the Arrow and Parquet communities. My book, Python for Data Analysis, is in its third edition. I'm back working at Posit after a multi-year detour to build Voltron Data, which is still working very actively on GPU acceleration for analytics with Arrow. I also, you know, am now investing a lot more. So I have a small part-time venture fund called Composed Ventures to invest in companies that are building in and around this ecosystem of technologies that I've been working in for the last decade.
A 10-year retrospective
So what I tried to do for this talk: it occurred to me that maybe I've been giving the same talk here for the last 10 years, and so I should actually try to figure out whether that's true or not. So I took screen grabs from YouTube; I was able to track down the videos for every talk I've given at this conference. Firstly, I will say, like, I'm very nearly wearing the same thing, and it's, like, not on purpose either. I think I've lost weight. I have a lot more gray hair, but, yeah, it's, you know, very interesting.
So anyway, the title in 2015 was Data Frames: the Good, the Bad, and the Ugly. It was basically about data frame APIs: what are the commonalities, and what are the things that differentiate our data frame tools, when basically we're building more or less the same things in R and Python and all of these other programming languages. At that time, I was working at Cloudera, so here's a slide from that talk in 2015. Working with the big data community, I said, wouldn't it be great if we could start to think about decoupling the API layer of the data frame expressions, basically what Hadley has done with dplyr and the tidyverse, creating this really nice composable, pipeable expression layer for describing what you want to do with the data. Then we could enable other people to build really fast execution engines and scalable storage layers, and we could just focus on usability and user experience in the programming languages that we love, like Python and R, and build less of the systems, where historically we had to build all of these full-stack systems from scratch.
And so as a result, Hadley has implemented CSV readers and all the weird edge cases of reading Excel files and CSV files, and I've done the same thing, and as have people in the Julia community, and so all this duplicated effort seems like it's taking effort away from the really valuable work, which is improving holistically the developer experience in making, you know, doing data science more productively. So my prediction was I think we could get there by 2025, and so the rest of this talk is maybe a little bit the strides that we've taken toward this goal.
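That decoupling idea, an expression layer up front with swappable execution underneath, can be made concrete with a toy sketch. Everything below is invented for illustration (the `Table` class, its method names, and the SQL-only backend); real projects in this space, like dplyr/dbplyr, do vastly more.

```python
# A toy "decoupled" data frame API: the user composes pipeable, lazy
# expressions, and a separate backend decides how to execute them -- here
# by compiling the recorded operations to a SQL string.

class Table:
    def __init__(self, name, ops=()):
        self.name, self.ops = name, tuple(ops)

    def filter(self, predicate):           # lazily record a WHERE clause
        return Table(self.name, self.ops + (("filter", predicate),))

    def select(self, *cols):               # lazily record the projection
        return Table(self.name, self.ops + (("select", cols),))

    def to_sql(self):                      # one possible backend: SQL text
        cols, preds = ["*"], []
        for op, arg in self.ops:
            if op == "select":
                cols = list(arg)
            elif op == "filter":
                preds.append(arg)
        sql = f"SELECT {', '.join(cols)} FROM {self.name}"
        if preds:
            sql += " WHERE " + " AND ".join(preds)
        return sql

expr = Table("flights").filter("dep_delay > 60").select("carrier", "dep_delay")
print(expr.to_sql())
# -> SELECT carrier, dep_delay FROM flights WHERE dep_delay > 60
```

The point of the sketch is that nothing in the user-facing verbs commits to an engine: the same recorded operations could just as well be handed to a local engine, a distributed one, or a GPU backend.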
And in the backdrop of all of this, it's pretty incredible how much computing hardware has changed in the last 20 years. Just think about the cell phone, and the fact that the cell phone is basically more powerful than the laptop that I was giving that presentation on 10 years ago. And if you look at servers in data centers in the 2010 to 2015 era, core counts were relatively modest, but they've spiraled up to the point where you can get a server in AWS with 192 physical cores, 384 concurrent threads with hyper-threading, and tons of RAM. The same kind of exponential increase has happened in disk performance: we were just starting to get solid-state drives in the mid-2010s, and now we've got ultra-fast non-volatile memory with really low latency that can do hundreds of thousands of I/O operations per second.
The exact same thing has happened in networking performance: it used to be that the first generation of InfiniBand was 10 gigabit, and now we're starting to talk about terabit Ethernet and terabit networking in data centers, which is just completely mind-blowing. I'm sure you've heard about GPUs and, you know, LLMs and that sort of thing, so the advances in computing hardware, in accelerators and advanced silicon, are also super impressive. So there's all this progress in hardware, and we'd like to have systems that can take advantage of all of this stuff and use it seamlessly within our data science environments, without having someone come and say, hey, if you want to use all this fancy fast networking and fast disk and get the most out of the CPUs and GPUs, you've got to use a different programming language or a different data analysis framework. I think that kind of stinks if that is what ends up happening. So we've tried to build things that would enable that to not happen.
Arrow, feather, and the path to interoperability
So in 2016 we started the Apache Arrow project, and I got together with Hadley to ask, you know, what could we build with this? The idea of Apache Arrow was a universal cross-language memory format for data frames and tabular data. So we made the Feather file format, which was, like, the first (well, there were probably some other efforts) proof of concept showing what it would look like to be able to really efficiently share data frames between R and Python.
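Why a shared memory format enables that kind of efficient sharing can be sketched with nothing but the Python standard library. This is only a loose analogy: real Arrow defines schemas, null bitmaps, and a rich type system, while the snippet below just shows two "sides" agreeing on the byte layout of a float64 column so no copy or per-value conversion is needed.

```python
from array import array

# "Producer" side: a column of float64 values in one contiguous buffer,
# exposed as raw bytes without copying.
prices = array("d", [9.99, 14.50, 3.25])
raw = memoryview(prices).cast("B")     # zero-copy view of the underlying bytes

# "Consumer" side: reinterpret the same bytes as float64. No copy, no
# row-by-row conversion -- just agreement on the memory format.
consumed = raw.cast("d")
print(consumed.tolist())               # -> [9.99, 14.5, 3.25]
print(raw.nbytes)                      # -> 24 (3 values * 8 bytes each)
```

In the real thing, the "producer" and "consumer" can be libraries in different languages (R, Python, C++), which is exactly what Feather demonstrated.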
So now, year three: toward interoperable data frames. By then I'd moved from Cloudera to Two Sigma, and this is a talk that I gave at JupyterCon in 2017, representing Two Sigma. We were talking about this idea of, like, what if we can go beyond just sharing data and start building portable computational runtimes, where we can take advantage of all that fast disk and fast networking and fast CPUs and GPUs, and then expose it in a uniform way in all of these programming languages, so that when we make an improvement in one of these layers, we get all the benefits in the programming languages that we use every day.
And so we continued to drive this forward in the Arrow project in 2018. Here I am in 2019, having partnered with RStudio to build Ursa Labs, pulling in more funding from hardware companies and financial firms and, you know, kind of making the fireball bigger and bigger. In 2020 the hair got longer, because COVID and no haircuts, and Neal and I gave a talk here at the conference about the progress that we had made toward creating this uniform high-speed data access and computational layer for interacting with large Parquet datasets in the cloud with Arrow. So, you know, already four years into Arrow, we had made a lot of progress in starting to make this a reality.
So, 2021: still COVID, still virtual. We had started building Voltron Data. 2022: back in person, here on this stage with Jon Keane, talking about the maturation of the Arrow stack and how it was building out for the R ecosystem. Last year: a retrospective, basically a one-year-older version of this talk.
Big data is dead (sort of)
But the general idea here is that what you can get on a single node has gotten so big and so powerful that it led Jordan Tigani of MotherDuck, who used to be at BigQuery and SingleStore, to declare that big data is dead. And the reality is, for a lot of people, it is true that the definition of big data has changed. What used to be big data is now medium data, and can be worked with very effectively on a single machine with tools like one of our new favorite projects, DuckDB. Who's heard of DuckDB or used DuckDB? It's almost everybody here. Two years ago, I think almost nobody had heard of this project.
And so I'm such a huge fan, and it's great to see the convergence of the data science ecosystem and the database ecosystem. You know, my joke now is, you know, DuckDB, our lord and savior. The fact that you can put a state-of-the-art analytic database system on your phone, in your web browser, really everywhere, and do efficient computing both centralized on large instances as well as out at the edge, has created a lot of new interesting possibilities for the ecosystem.
So if you're an R user, thanks to the partnership between DuckDB Labs and Posit, you can now use dplyr powered by DuckDB. That's duckplyr; Hannes Mühleisen and Kirill Müller worked on this project. And so this is now something you can install and use as a more or less drop-in replacement that executes dplyr operations with DuckDB, very nearly, subject to some semantic differences which will hopefully be ironed out in the fullness of time.
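The "drop-in engine swap" idea is easy to demonstrate in miniature: keep the user-facing operations, run them on an embedded SQL engine underneath. In the sketch below, sqlite3 stands in for DuckDB purely because it ships with Python, and the table, column names, and query are invented for illustration; the real duckplyr translates dplyr verbs to DuckDB rather than to handwritten SQL.

```python
import sqlite3

# An in-process analytic engine (sqlite3 here as a stand-in for DuckDB).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, mass REAL)")
con.executemany("INSERT INTO penguins VALUES (?, ?)",
                [("adelie", 3700.0), ("adelie", 3800.0), ("gentoo", 5000.0)])

# What dplyr's  penguins |> group_by(species) |> summarize(avg = mean(mass))
# would compile down to when an SQL engine executes it:
rows = con.execute(
    "SELECT species, AVG(mass) FROM penguins GROUP BY species ORDER BY species"
).fetchall()
print(rows)   # -> [('adelie', 3750.0), ('gentoo', 5000.0)]
```

The user-visible API stays the same; only the execution layer changes, which is the whole point of the decoupling.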
The composable data stack
And so at a higher level, while we've been working on accelerating the data science stack, there's been this broader trend at the large tech company level of thinking about how companies like Meta and Google and Microsoft and Amazon build large-scale data management systems. So alongside the work that we've done in Arrow, there's been similar thinking around all the different pieces that go into building these large-scale data warehouses. Last year, working with the folks at Meta: they had built many query processing engines within Meta's data platform. And they said, well, what if we could create a reusable execution engine (this became Velox), so that we can consolidate the query processing in all of these different systems onto a common layer and make it interoperable with Arrow. It could be used for stream processing, batch processing, machine learning, and all the things that go into building state-of-the-art systems at Meta.
So the core idea here is to approach the problem with the Arrow ethos: we want to build systems that use open standards and open protocols, and that are designed around the ideas of modularity, reuse, and interoperability, so that we can have off-the-shelf components like DuckDB or DataFusion that provide accelerated execution. But we also have some of the other layers, like the storage systems and the file formats, which are also open standards and available off the shelf, where we can mix and match the components and build new custom data processing systems in much less time than we could in the past. And at the same time, when you're building a new system, you're not starting from scratch and trying to harness all of the cutting-edge capability of modern hardware on your own: you can leverage the advances that are being developed within each of these component systems.
The hope is that this is kind of like a reverse Tower of Babel effect that, you know, by working together and collaborating on these shared components that we can reach a level of performance and interoperability and efficiency that we would never be able to achieve if we were working in isolation.
So we hope, and we foresee, that this will yield a disruptive effect, broadly, on how data management systems are built. For the purposes of the data science ecosystem, it's great that we now have nice tools: you can run dplyr expressions that execute against multi-file Parquet datasets, or large Parquet or ORC or CSV datasets that live in S3 or Google Cloud, without having to think so much about the mechanics of how that works. We can do the same thing in Python, and really in any language that builds bindings to these libraries.
But it's interesting to think about the implications as this approach to building systems becomes more ubiquitous. And so all around the data stack from execution engines to interchange and interface protocols to the storage ecosystem. So we've got many open source projects that are being built to achieve this type of composability and modularity.
Query interfaces and ADBC
Another interesting area, and I'll talk about it for a short minute at the end of the talk, is query interfaces. We love SQL, but we also hate SQL. And so I think it's good to have many query interfaces, things like dplyr and Ibis; Ibis is kind of dplyr for Python. You shouldn't have to use one query interface all the time, and different interfaces work better for different types of applications, so I think diversity in this domain is good. I'm also excited to see even new query languages being built, like Malloy, which came out of the Looker team at Google, sort of reimagining a query language for BI and analytics.
A cool thing that's happening actively right now is that we're working to retrofit databases with Arrow-native connectivity. We've developed an API standard for getting Arrow tables out of a database driver, called ADBC. So if you've ever used ODBC or JDBC and been frustrated at how long it takes to get the results of select star out of a database, we're working with database vendors to implement ADBC in their drivers, so getting data out will be faster and less memory-hungry. You can think of it as building a fast path that bypasses those legacy interfaces: in principle, the database itself can produce Arrow on the server side and pipe that to the client side. And if you're using dplyr or pandas or Polars or DuckDB, Arrow is a native interface for all of those tools, so you can pipe that data right in and be off to the races building your data science application.
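To make the row-versus-columnar point concrete, here is a toy sketch of the pivot that a traditional row-at-a-time driver forces on data frame clients, which is the work an Arrow-native driver can skip by returning columnar data directly. sqlite3 and the tiny table are stand-ins chosen only because they ship with Python; this is not the ADBC API itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER, y REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1.5), (2, 2.5), (3, 3.5)])

# Legacy-style result: a list of row tuples, one Python object per value.
rows = con.execute("SELECT x, y FROM t ORDER BY x").fetchall()

# Data frame tools are columnar, so the client must pivot rows into columns.
# This per-value transpose is exactly the overhead an Arrow-native driver
# avoids by shipping columnar batches from the server in the first place.
columns = dict(zip(("x", "y"), map(list, zip(*rows))))
print(columns)   # -> {'x': [1, 2, 3], 'y': [1.5, 2.5, 3.5]}
```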
Accelerating legacy systems
Another interesting thing I just wanted to make you aware of is that these modular execution engines are being used to accelerate legacy systems like Apache Spark. Spark has enormous adoption and isn't going anywhere. There are also open-source database systems like Presto and Trino; Trino is a fork of Presto, and the original is sometimes called PrestoDB. The Meta folks have been working on accelerating Presto with their Velox project; that's called Prestissimo. DataFusion recently spun out from Arrow, and Apple and others are using DataFusion to accelerate Spark; that's called the Comet project. Intel is working on accelerating Spark with Velox; that's called Gluten. And I expect that we'll see more and more of these modular acceleration projects that take existing systems, keep the API the same, and swap out the execution engine to make things faster, more interoperable, and more efficient.
But back to Spark: Spark is probably the most successful distributed big data system, and just a few minutes ago I told you that big data was dead. It turns out that actual big data does still exist. And while you can do a lot with DuckDB, you want to be able to essentially right-size the framework that you're using to the size of your data set. So the ideal thing would be that you can use things like DuckDB up to a certain data scale, call it the one-terabyte scale, but as soon as you have truly big data, it's not a hardship for you to switch over to using Spark or some other type of distributed execution system.
It's traditionally hard to do this with SQL, because while SQL is advertised as being a standard, I call it Gaslight SQL, because it's actually not standardized: dialects have lots of small differences between different database engines. In Python, we have this really cool tool called SQLGlot, which helps with transpiling SQL from one dialect to another, and it's getting very sophisticated in its ability to translate queries. But it's also difficult to decide which engine to use in each context, and it might be unintuitive which engine will yield the best performance or the best cost per gigabyte processed.
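What "transpiling a dialect" means can be shown with a deliberately tiny sketch. SQLGlot actually parses queries into an AST and rewrites them; the name-for-name substitution below, along with the one-entry mapping, is invented for illustration only.

```python
import re

# One illustrative dialect difference: SQLite spells coalesce-with-default
# as IFNULL, Oracle as NVL. Real dialects differ in far more than names.
SQLITE_TO_ORACLE = {"IFNULL": "NVL"}

def transpile(sql, mapping):
    # Rewrite function names only where followed by "(", so ordinary
    # identifiers containing the same letters are left alone.
    def repl(match):
        name = match.group(1)
        return mapping.get(name.upper(), name) + "("
    return re.sub(r"\b(\w+)\(", repl, sql)

print(transpile("SELECT IFNULL(name, 'n/a') FROM users", SQLITE_TO_ORACLE))
# -> SELECT NVL(name, 'n/a') FROM users
```

A string-level substitution like this breaks down quickly (quoted strings, nested expressions, clause-level differences), which is exactly why SQLGlot works on a parsed representation instead.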
So at Microsoft a few years ago, they confronted exactly this problem: how can we create a standardized Python API, where under the hood we decide which cloud SQL backend within the Azure umbrella to use on behalf of the user, to deliver either good performance or good cost? They built a research project called Magpie to do exactly that, and to kind of automatically offload workloads onto the right engine, based on a machine learning model that they built to intelligently decide which execution engine to use. So it's great to see large tech companies thinking about ways that we can achieve that decoupled API layer, where you have something that's pandas-like or data-frame-like or dplyr-like, and then under the hood you just get fast and scalable performance.
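Stripped of the machine learning, Magpie-style engine routing boils down to something like the sketch below. The function name, thresholds, and engine labels are all invented; the real system learned its routing from workload features rather than hard-coding two rules.

```python
# A deliberately simple stand-in for learned engine routing: pick an
# execution backend from workload characteristics, so the user-facing API
# never has to change.
def choose_engine(data_size_gb, needs_cluster=False):
    if needs_cluster or data_size_gb > 1000:   # "truly big data"
        return "spark"     # distributed execution for terabyte-plus workloads
    return "duckdb"        # a single node handles everything else

print(choose_engine(50))      # -> duckdb
print(choose_engine(5000))    # -> spark
```

The interesting engineering is in making the decision boundary smart (cost, data locality, query shape), but the architectural payoff is the same either way: the user writes one expression and the system right-sizes the engine.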
Ibis and the multi-engine future
So in the Python ecosystem we've been building this project called Ibis for the last nine years. We also started that project in 2015. It's now grown into a fairly mature project that supports something like, you know, 20 different backends, and there's a number of people working on building a, you know, multi-engine data stack using that tool for Python. And so I'm very interested in this, and so this is, you know, in the next few years this is one area that I'm looking to help people succeed in building tools to fill the missing gaps that will enable us to achieve that multi-engine data stack. And the hope is that we make it accessible, you know, through all the programming languages that we use, including R and Python, and we work to continue to work together to build shared infrastructure that we can all use.
So 41 slides. Thank you for listening, and I will probably yield my slot to somebody else next year because I've spoken enough at this conference, but in any case I look forward to seeing you again in the future. Thank you.
