Wes McKinney - The Future Roadmap for the Composable Data Stack
Transcript
This transcript was generated automatically and may contain errors.
Okay, our next speaker is representing the year 2016, but he has spoken in '15, '16, obviously, '17, '18, '19, '20, '21, '22, '23, '24, all 10 years. Yeah, that's freaking awesome, all 10 years.
All 10 years of the R Conference for Mr. Python, right? As we all know, this is the exemplar of the communities getting along. He's actually, I want to say, been a pillar of this, of saying, why is everyone arguing about languages? Let's build tools that all the languages can use. And I think he's been incredibly successful at that. So yes, I think, you know, he has built so many tools that have changed the world. Yeah, he's changed the world. Sorry. Okay, but I'm starting to embarrass him. So I'll just tell you that he really loves doing yoga. Please welcome Wes.
All right, let's see if my audio is working. It is. Yeah, cool. Okay, cool. Awesome. And my eyes are not as good as they used to be.
So yeah, thanks. This has been 10 years in a row. Thanks to everyone for bearing with me, and to the people, the little people at home, watching the video online. So I've given some variant of this talk a few times in the last year. So I'm going to talk about some of the ideas from it, but I'm also going to give a bit of a retrospective on the last 10 years of the New York R Conference, or at least my involvement in it.
So yeah, I guess you folks mostly know me from the pandas project, and, you know, more recently I've been pretty involved in the Arrow and Parquet communities. My book, Python for Data Analysis, is in its third edition. I'm back working at Posit after a multi-year detour to build Voltron Data, which is still working very actively on GPU acceleration for analytics with Arrow. I also, you know, am now investing a lot more. So I have a small part-time venture fund called Composed Ventures to invest in companies that are building in and around this ecosystem of technologies that I've been working in for the last decade.
A 10-year retrospective
So what I tried to do for this talk: it occurred to me that maybe I've been giving the same talk here for the last 10 years, and so I should actually try to figure out whether that's true or not. So I took screen grabs from YouTube; I was able to track down the videos for every talk I've given at this conference. Firstly, I will say, like, I'm very nearly wearing the same thing, and it's, like, not on purpose either. I think I've lost weight. I have a lot more gray hair, but, yeah, it's, you know, very interesting.
So anyway, the title in 2015 was Data Frames: the Good, the Bad, and the Ugly. It was basically about data frame APIs: what are the commonalities, and what are the things that differentiate our data frame tools, when basically we're building more or less the same things in R and Python and all of these other programming languages. At that time, I was working at Cloudera, so here's a slide from that talk in 2015. Working with the big data community, I said, wouldn't it be great if we could start to think about decoupling the API layer of the data frame expressions, basically what Hadley has done with dplyr and the tidyverse, creating this really nice composable, pipeable expression layer for describing what you want to do with the data. Then we could enable other people to build really fast execution engines and scalable storage layers, and we could just focus on usability and user experience in the programming languages that we love, like Python and R, and build less of the systems, where historically we had to build all of these full-stack systems from scratch.
And so as a result, Hadley has implemented CSV readers and all the weird edge cases of reading Excel files and CSV files, and I've done the same thing, and as have people in the Julia community, and so all this duplicated effort seems like it's taking effort away from the really valuable work, which is improving holistically the developer experience in making, you know, doing data science more productively. So my prediction was I think we could get there by 2025, and so the rest of this talk is maybe a little bit the strides that we've taken toward this goal.
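That decoupling idea, an expression layer up front with swappable execution underneath, can be made concrete with a toy sketch. Everything below is invented for illustration (the `Table` class, its method names, and the SQL-only backend); real projects in this space, like dplyr/dbplyr, do vastly more.

```python
# A toy "decoupled" data frame API: the user composes pipeable, lazy
# expressions, and a separate backend decides how to execute them -- here
# by compiling the recorded operations to a SQL string.

class Table:
    def __init__(self, name, ops=()):
        self.name, self.ops = name, tuple(ops)

    def filter(self, predicate):           # lazily record a WHERE clause
        return Table(self.name, self.ops + (("filter", predicate),))

    def select(self, *cols):               # lazily record the projection
        return Table(self.name, self.ops + (("select", cols),))

    def to_sql(self):                      # one possible backend: SQL text
        cols, preds = ["*"], []
        for op, arg in self.ops:
            if op == "select":
                cols = list(arg)
            elif op == "filter":
                preds.append(arg)
        sql = f"SELECT {', '.join(cols)} FROM {self.name}"
        if preds:
            sql += " WHERE " + " AND ".join(preds)
        return sql

expr = Table("flights").filter("dep_delay > 60").select("carrier", "dep_delay")
print(expr.to_sql())
# -> SELECT carrier, dep_delay FROM flights WHERE dep_delay > 60
```

The point of the sketch is that nothing in the user-facing verbs commits to an engine: the same recorded operations could just as well be handed to a local engine, a distributed one, or a GPU backend.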
And in the backdrop of all of this, it's pretty incredible how much computing hardware has changed in the last 20 years. Just think about the cell phone, and the fact that the cell phone is basically more powerful than the laptop that I was giving that presentation on 10 years ago. And if you look at servers in data centers in the 2010 to 2015 era, core counts were relatively modest, but they've spiraled up to the point where you can get a server in AWS with 192 physical cores, 384 concurrent threads with hyper-threading, and tons of RAM. The same kind of exponential increase has happened in disk performance: we were just starting to get solid-state drives in the mid-2010s, and now we've got ultra-fast non-volatile memory with really low latency that can do hundreds of thousands of I/O operations per second.
The exact same thing has happened in networking performance: it used to be that the first generation of InfiniBand was 10 gigabit, and now we're starting to talk about terabit Ethernet and terabit networking in data centers, which is just completely mind-blowing. I'm sure you've heard about GPUs and, you know, LLMs and that sort of thing, so the advances in computing hardware, in accelerators and advanced silicon, are also super impressive. So there's all this progress in hardware, and we'd like to have systems that can take advantage of all of this stuff and use it seamlessly within our data science environments, without having someone come and say, hey, if you want to use all this fancy fast networking and fast disk and get the most out of the CPUs and GPUs, you've got to use a different programming language or a different data analysis framework. I think that kind of stinks if that is what ends up happening. So we've tried to build things that would enable that to not happen.
Arrow, feather, and the path to interoperability
So in 2016 we started the Apache Arrow project, and I got together with Hadley to ask, you know, what could we build with this? The idea of Apache Arrow was a universal cross-language memory format for data frames and tabular data. So we made the Feather file format, which was, like, the first (well, there were probably some other efforts) proof of concept showing what it would look like to be able to really efficiently share data frames between R and Python.
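Why a shared memory format enables that kind of efficient sharing can be sketched with nothing but the Python standard library. This is only a loose analogy: real Arrow defines schemas, null bitmaps, and a rich type system, while the snippet below just shows two "sides" agreeing on the byte layout of a float64 column so no copy or per-value conversion is needed.

```python
from array import array

# "Producer" side: a column of float64 values in one contiguous buffer,
# exposed as raw bytes without copying.
prices = array("d", [9.99, 14.50, 3.25])
raw = memoryview(prices).cast("B")     # zero-copy view of the underlying bytes

# "Consumer" side: reinterpret the same bytes as float64. No copy, no
# row-by-row conversion -- just agreement on the memory format.
consumed = raw.cast("d")
print(consumed.tolist())               # -> [9.99, 14.5, 3.25]
print(raw.nbytes)                      # -> 24 (3 values * 8 bytes each)
```

In the real thing, the "producer" and "consumer" can be libraries in different languages (R, Python, C++), which is exactly what Feather demonstrated.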
So now, year three: toward interoperable data frames. By then I'd moved from Cloudera to Two Sigma, and this is a talk that I gave at JupyterCon in 2017, representing Two Sigma. We were talking about this idea of, like, what if we can go beyond just sharing data and start building portable computational runtimes, where we can take advantage of all that fast disk and fast networking and fast CPUs and GPUs, and then expose it in a uniform way in all of these programming languages, so that when we make an improvement in one of these layers, we get all the benefits in the programming languages that we use every day.
And so we continued to drive this forward in the Arrow project in 2018. Here I am in 2019, having partnered with RStudio to build Ursa Labs, pulling in more funding from hardware companies and financial firms and, you know, kind of making the fireball bigger and bigger. In 2020 the hair got longer, because COVID and no haircuts, and Neal and I gave a talk here at the conference about the progress that we had made toward creating this uniform high-speed data access and computational layer for interacting with large Parquet datasets in the cloud with Arrow. So, you know, already four years into Arrow, we had made a lot of progress in starting to make this a reality.
So, 2021: still COVID, still virtual. We had started building Voltron Data. 2022: back in person, here on this stage with Jon Keane, talking about the maturation of the Arrow stack and how it was building out for the R ecosystem. Last year: a retrospective, basically a one-year-older version of this talk.
Big data is dead (sort of)
But the general idea here is that what you can get on a single node has gotten so big and so powerful that it led Jordan Tigani of MotherDuck, who used to be at BigQuery and SingleStore, to declare that big data is dead. And the reality is, for a lot of people, it is true that the definition of big data has changed. What used to be big data is now medium data, and can be worked with very effectively on a single machine with tools like one of our new favorite projects, DuckDB. Who's heard of DuckDB or used DuckDB? It's almost everybody here. Two years ago, I think almost nobody had heard of this project.
And so I'm such a huge fan, and it's great to see the convergence of the data science ecosystem and the database ecosystem. You know, my joke now is, you know, DuckDB, our lord and savior. The fact that you can put a state-of-the-art analytic database system on your phone, in your web browser, really everywhere, and do efficient computing both centralized on large instances as well as out at the edge, has created a lot of new interesting possibilities for the ecosystem.
So if you're an R user, thanks to the partnership between DuckDB Labs and Posit, you can now use dplyr powered by DuckDB. That's duckplyr; Hannes Mühleisen and Kirill Müller worked on this project. And so this is now something you can install and use as a more or less drop-in replacement that executes dplyr operations with DuckDB, very nearly, subject to some semantic differences which will hopefully be ironed out in the fullness of time.
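The "drop-in engine swap" idea is easy to demonstrate in miniature: keep the user-facing operations, run them on an embedded SQL engine underneath. In the sketch below, sqlite3 stands in for DuckDB purely because it ships with Python, and the table, column names, and query are invented for illustration; the real duckplyr translates dplyr verbs to DuckDB rather than to handwritten SQL.

```python
import sqlite3

# An in-process analytic engine (sqlite3 here as a stand-in for DuckDB).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, mass REAL)")
con.executemany("INSERT INTO penguins VALUES (?, ?)",
                [("adelie", 3700.0), ("adelie", 3800.0), ("gentoo", 5000.0)])

# What dplyr's  penguins |> group_by(species) |> summarize(avg = mean(mass))
# would compile down to when an SQL engine executes it:
rows = con.execute(
    "SELECT species, AVG(mass) FROM penguins GROUP BY species ORDER BY species"
).fetchall()
print(rows)   # -> [('adelie', 3750.0), ('gentoo', 5000.0)]
```

The user-visible API stays the same; only the execution layer changes, which is the whole point of the decoupling.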
The composable data stack
And so at a higher level, while we've been working on accelerating the data science stack, there's been this broader trend at the large tech company level of thinking about how companies like Meta and Google and Microsoft and Amazon build large-scale data management systems. So alongside the work that we've done in Arrow, there's been similar thinking around all the different pieces that go into building these large-scale data warehouses. Last year, working with the folks at Meta: they had built many query processing engines within Meta's data platform. And they said, well, what if we could create a reusable execution engine (this became Velox), so that we can consolidate the query processing in all of these different systems onto a common layer and make it interoperable with Arrow. It could be used for stream processing, batch processing, machine learning, and all the things that go into building state-of-the-art systems at Meta.
So the core idea here is to approach the problem with the Arrow ethos: we want to build systems that use open standards and open protocols, and that are designed around the ideas of modularity, reuse, and interoperability, so that we can have off-the-shelf components like DuckDB or DataFusion that provide accelerated execution. But we also have some of the other layers, like the storage systems and the file formats, which are also open standards and available off the shelf, where we can mix and match the components and build new custom data processing systems in much less time than we could in the past. And at the same time, when you're building a new system, you're not starting from scratch and trying to harness all of the cutting-edge capability of modern hardware on your own: you can leverage the advances that are being developed within each of these component systems.
The hope is that this is kind of like a reverse Tower of Babel effect that, you know, by working together and collaborating on these shared components that we can reach a level of performance and interoperability and efficiency that we would never be able to achieve if we were working in isolation.
So we hope, and we foresee, that this will yield a disruptive effect, broadly, on how data management systems are built. For the purposes of the data science ecosystem, it's great that we now have nice tools: you can run dplyr expressions that execute against multi-file Parquet datasets, or large Parquet or ORC or CSV datasets that live in S3 or Google Cloud, without having to think so much about the mechanics of how that works. We can do the same thing in Python, and really in any language that builds bindings to these libraries.
But it's interesting to think about the implications as this approach to building systems becomes more ubiquitous. And so all around the data stack from execution engines to interchange and interface protocols to the storage ecosystem. So we've got many open source projects that are being built to achieve this type of composability and modularity.
Query interfaces and ADBC
Another interesting area, and I'll talk about it for a short minute at the end of the talk, is query interfaces. We love SQL, but we also hate SQL. And so I think it's good to have many query interfaces, things like dplyr and Ibis; Ibis is kind of dplyr for Python. You shouldn't have to use one query interface all the time, and different interfaces work better for different types of applications, so I think diversity in this domain is good. I'm also excited to see even new query languages being built, like Malloy, which came out of the Looker team at Google, sort of reimagining a query language for BI and analytics.
A cool thing that's happening actively right now is that we're working to retrofit databases with Arrow-native connectivity. We've developed an API standard for getting Arrow tables out of a database driver, called ADBC. So if you've ever used ODBC or JDBC and been frustrated at how long it takes to get the results of select star out of a database, we're working with database vendors to implement ADBC in their drivers, so getting data out will be faster and less memory-hungry. You can think of it as building a fast path that bypasses those legacy interfaces: in principle, the database itself can produce Arrow on the server side and pipe that to the client side. And if you're using dplyr or pandas or Polars or DuckDB, Arrow is a native interface for all of those tools, so you can pipe that data right in and be off to the races building your data science application.
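To make the row-versus-columnar point concrete, here is a toy sketch of the pivot that a traditional row-at-a-time driver forces on data frame clients, which is the work an Arrow-native driver can skip by returning columnar data directly. sqlite3 and the tiny table are stand-ins chosen only because they ship with Python; this is not the ADBC API itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER, y REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1.5), (2, 2.5), (3, 3.5)])

# Legacy-style result: a list of row tuples, one Python object per value.
rows = con.execute("SELECT x, y FROM t ORDER BY x").fetchall()

# Data frame tools are columnar, so the client must pivot rows into columns.
# This per-value transpose is exactly the overhead an Arrow-native driver
# avoids by shipping columnar batches from the server in the first place.
columns = dict(zip(("x", "y"), map(list, zip(*rows))))
print(columns)   # -> {'x': [1, 2, 3], 'y': [1.5, 2.5, 3.5]}
```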
Accelerating legacy systems
Another interesting thing I just wanted to make you aware of is that these modular execution engines are being used to accelerate legacy systems like Apache Spark. Spark has enormous adoption and isn't going anywhere. There are also open-source database systems like Presto and Trino; Trino is a fork of Presto, and the original is sometimes called PrestoDB. The Meta folks have been working on accelerating Presto with their Velox project; that's called Prestissimo. DataFusion recently spun out from Arrow, and Apple and others are using DataFusion to accelerate Spark; that's called the Comet project. Intel is working on accelerating Spark with Velox; that's called Gluten. And I expect that we'll see more and more of these modular acceleration projects that take existing systems, keep the API the same, and swap out the execution engine to make things faster, more interoperable, and more efficient.
But back to Spark: Spark is probably the most successful distributed big data system, and just a few minutes ago I told you that big data was dead. It turns out that actual big data does still exist. And while you can do a lot with DuckDB, you want to be able to essentially right-size the framework that you're using to the size of your data set. So the ideal thing would be that you can use things like DuckDB up to a certain data scale, call it the one-terabyte scale, but as soon as you have truly big data, it's not a hardship for you to switch over to using Spark or some other type of distributed execution system.
It's traditionally hard to do this with SQL, because while SQL is advertised as being a standard, I call it Gaslight SQL, because it's actually not standardized: dialects have lots of small differences between different database engines. In Python, we have this really cool tool called SQLGlot, which helps with transpiling SQL from one dialect to another, and it's getting very sophisticated in its ability to translate queries. But it's also difficult to decide which engine to use in each context, and it might be unintuitive which engine will yield the best performance or the best cost per gigabyte processed.
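What "transpiling a dialect" means can be shown with a deliberately tiny sketch. SQLGlot actually parses queries into an AST and rewrites them; the name-for-name substitution below, along with the one-entry mapping, is invented for illustration only.

```python
import re

# One illustrative dialect difference: SQLite spells coalesce-with-default
# as IFNULL, Oracle as NVL. Real dialects differ in far more than names.
SQLITE_TO_ORACLE = {"IFNULL": "NVL"}

def transpile(sql, mapping):
    # Rewrite function names only where followed by "(", so ordinary
    # identifiers containing the same letters are left alone.
    def repl(match):
        name = match.group(1)
        return mapping.get(name.upper(), name) + "("
    return re.sub(r"\b(\w+)\(", repl, sql)

print(transpile("SELECT IFNULL(name, 'n/a') FROM users", SQLITE_TO_ORACLE))
# -> SELECT NVL(name, 'n/a') FROM users
```

A string-level substitution like this breaks down quickly (quoted strings, nested expressions, clause-level differences), which is exactly why SQLGlot works on a parsed representation instead.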
So at Microsoft a few years ago, they confronted exactly this problem: how can we create a standardized Python API, where under the hood we decide which cloud SQL backend within the Azure umbrella to use on behalf of the user, to deliver either good performance or good cost? They built a research project called Magpie to do exactly that, and to kind of automatically offload workloads onto the right engine, based on a machine learning model that they built to intelligently decide which execution engine to use. So it's great to see large tech companies thinking about ways that we can achieve that decoupled API layer, where you have something that's pandas-like or data-frame-like or dplyr-like, and then under the hood you just get fast and scalable performance.
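Stripped of the machine learning, Magpie-style engine routing boils down to something like the sketch below. The function name, thresholds, and engine labels are all invented; the real system learned its routing from workload features rather than hard-coding two rules.

```python
# A deliberately simple stand-in for learned engine routing: pick an
# execution backend from workload characteristics, so the user-facing API
# never has to change.
def choose_engine(data_size_gb, needs_cluster=False):
    if needs_cluster or data_size_gb > 1000:   # "truly big data"
        return "spark"     # distributed execution for terabyte-plus workloads
    return "duckdb"        # a single node handles everything else

print(choose_engine(50))      # -> duckdb
print(choose_engine(5000))    # -> spark
```

The interesting engineering is in making the decision boundary smart (cost, data locality, query shape), but the architectural payoff is the same either way: the user writes one expression and the system right-sizes the engine.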
Ibis and the multi-engine future
So in the Python ecosystem we've been building this project called Ibis for the last nine years. We also started that project in 2015. It's now grown into a fairly mature project that supports something like, you know, 20 different backends, and there's a number of people working on building a, you know, multi-engine data stack using that tool for Python. And so I'm very interested in this, and so this is, you know, in the next few years this is one area that I'm looking to help people succeed in building tools to fill the missing gaps that will enable us to achieve that multi-engine data stack. And the hope is that we make it accessible, you know, through all the programming languages that we use, including R and Python, and we work to continue to work together to build shared infrastructure that we can all use.
So 41 slides. Thank you for listening, and I will probably yield my slot to somebody else next year because I've spoken enough at this conference, but in any case I look forward to seeing you again in the future. Thank you.
