Why he created pandas, the future of data systems - Wes McKinney - The Data Scientist Show #086
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone, welcome to the Data Scientist Show. Today we have Wes McKinney. He's an open source software developer and entrepreneur focusing on data processing tools and systems. He is the co-creator of the Pandas library and the author of Python for Data Analysis, which I have a copy of. And he's the co-founder of Voltron Data. Currently he's the principal architect at Posit. He also just launched his micro-VC fund, investing in data infrastructure, AI, and ML companies.
How pandas started
So I have to ask, what's the backstory of Pandas? What was the motivation?
Yeah, well, my first job out of college was in quant finance. I had a math degree, and it was 2007, and the great financial crisis had just started. And I was growing frustrated because I was under a lot of pressure to do analysis work, and I felt that I wasn't able to do it quickly enough. I had been introduced to Python programming, and I said, hey, this programming language is really great, but it's missing some data analysis tools, like the kinds of things that I had seen some of my colleagues program in R. And so I wanted to have some of the same kinds of tools in Python that I saw in R.
And it was also like a way for me to learn Python and have exposure to building a software project. But it started out as tools for myself. And at a certain point, I started socializing it with my colleagues. And they also really liked using it. And then I convinced the company where I worked, AQR, to let me open source it. And so we open sourced it at the end of 2009. And so I gave my first talk to the Python community at PyCon 2010 about the project. I think that video is still online somewhere.
And yeah, I started grad school in 2010. And then at some point, I realized that there was a huge opportunity to make Python an important language in statistical computing and data science. And so I dropped out of grad school to work full-time on Pandas starting from like May 2011 and spent a little over a year working full-time on the project. I wrote my book, Python for Data Analysis. And yeah, it really helped fill out the features in the library and build the initial open source community.
And sometime in 2013, Chang She and I decided to start a company. He's one of the early developers of Pandas, and he was also at AQR with me. And there were other people that we had gotten involved in Pandas development. So when we got busy with our startup, Datapad, we turned over the Pandas project to the other core developers. So we haven't been so actively involved in Pandas development since 2013, 2014. And I've been working on other projects in the meantime, but obviously Pandas has become extremely successful.
So at the time, I spent a lot of time trying to come up with a name because I was like, this is a Python data analysis toolkit. And I was working with all these econometricians who spent a lot of time talking about panel data. And so I was like, Python data analysis, panel data. I sort of was mashing around syllables and letters. And I was like, oh, there's like a panda found in here. And so initially it was going to be Panda, but then somebody suggested that Pandas was like funnier. So that was kind of the backstory. But it was, yeah, the main origin of the word was panel data.
Initially, I would say the main challenge was the SEO wasn't good. You would search for Pandas and you would not get the thing that you were looking for. And eventually, if you wrote Python Pandas, it would come up. But now when you type Pandas, it comes up in Google.
So when you felt frustrated about data processing, before you created Pandas, what were some data science or analytics projects you were working on?
Yeah, I was working on a small statistical modeling project that involved some financial data sets, and the data was cross-sectional data over time. So panel data. And the data was patchy, so there were some data issues. One of the problems I was really focused on solving was making a tool that made it really easy to work with cross-sectional data over time: time series data, or cross-sectional data with a lot of patchiness.
So really good support for the indexing capabilities in Pandas, and the data realignment logic that's built into the Series and DataFrame data structures. I wanted all of that to work automatically, so that it would automatically deal with the data quality and data alignment issues that I was experiencing in my work. And so it was trying to incorporate ideas from R, like the data frame, the tabular data structure, but also add in the data realignment logic, the data indexing logic.
So you can do arithmetic between different time series where maybe there are missing dates in one time series, or there might be missing stock tickers in one series. And you can do math across these data sets that have data quality issues, and it handles automatically realigning things for you, which is really nice. At the time, that was the thing that I really wanted to work well and be intuitive.
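A quick sketch of what that automatic alignment looks like in practice (the dates and values here are made up for illustration): arithmetic between two pandas Series with different date indexes aligns on the union of the labels, and positions missing from either side come back as NaN instead of raising an error.

```python
import pandas as pd

# Two daily return series with different missing dates (illustrative data)
a = pd.Series([0.01, 0.02, 0.03],
              index=pd.to_datetime(["2009-01-02", "2009-01-05", "2009-01-06"]))
b = pd.Series([0.015, 0.025],
              index=pd.to_datetime(["2009-01-05", "2009-01-06"]))

# Subtraction aligns on the union of the two indexes automatically;
# dates present in only one series produce NaN rather than an error
spread = a - b
print(spread)
```

No manual reindexing or join step is needed; the alignment is built into the arithmetic itself, which is exactly the behavior described above.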
Voltron Data and Apache Arrow
And what made you want to start Voltron Data?
So we had spent several years working on Apache Arrow. Arrow, for people that don't know, is a bunch of things, but it started out because we wanted to create basically a universal table format, a universal data frame format, that could be transported really efficiently between data processing systems and from one programming language to another. So for example, being able to share large datasets efficiently between R and Python.
And we always had the aspiration of building computing engines that were Arrow based. So first we had to get Arrow adopted and successful as a data format, something that's adopted and used by many systems. And then we moved on to building compute engines that are Arrow based. And so now there are several compute engines that support Arrow natively: DuckDB, DataFusion, and Acero, which is part of the Arrow C++ project.
So we've got these reusable, what we call modular computing engines. And so our idea was that we wanted to create a company that could, on one hand, provide a lot of development and commercial support, drive forward the Arrow development roadmap, and provide enterprise support for companies that are building on Arrow, to be a partner to them in helping drive forward the open source project.
But we also saw that there was a lot of opportunity in enabling next generation data systems to be more modular, to be Arrow based, and to be able to take advantage of hardware acceleration. So we created Voltron Data kind of on one hand to be a driving force in Apache Arrow, but also to build some technology to facilitate this transition to these kind of modular computing engines and taking advantage of hardware acceleration more seamlessly.
But Arrow is interesting because it's the kind of project that most data scientists won't come into contact with directly. It's something that just starts getting used: it's getting used in Pandas now, it's being used in a lot of other projects, but it's used internally. So it's something that makes things faster, more efficient, more interoperable. You can now share data very easily between R and Python. You can use Arrow to interact with large Parquet datasets that are stored in the cloud. So there are all kinds of new use cases that have been unlocked through adoption of Arrow. But most data scientists don't need to know about Arrow. They get it indirectly through the tools that they're already using.
The composable data stack
So you mentioned modular data processing tools, and also as an investor you're interested in emerging composable data stacks. So what's the benefit of the tools being modular or composable?
So the benefit of this modularity, or what we call the composability concept, is that it facilitates reuse. Firstly, you make it easier for people to collaborate on shared, reusable software components that many people can use to build many different kinds of data processing systems. DuckDB is a classic example: you have a cutting-edge analytic database system that's available as a single C++ file that you can drop into any project, or you can load it into your web application and have super fast SQL processing basically anywhere: on your phone, in your web browser, really anywhere.
And that reusability is what we call composability. The composability comes from the use of open standards: in order to achieve composability, you need to have a standardized interface between that piece of software and your application. And so part of what we've done in the Arrow project and at Voltron Data is driving these open standards that enable and facilitate that composability and reuse.
And so now there's a growing collection of open source projects that are participating in what we call the composable data stack. I helped write a paper last year with Meta called "The Composable Data Management System Manifesto". Try saying that five times fast. But basically we tried to communicate a vision (maybe we can put a link to that paper in the show notes) of what the future looks like where data warehouses, databases, and data processing engines are built with reusable, modular components.
And the idea is that you want to build a system in such a way that when something new and better comes along in the future, you can change out the old part and put in the new part without disrupting the whole user experience. Things get better, but you don't have to completely throw out the system and use something completely new. You can hot swap components without breaking the whole system, without throwing out the baby with the bathwater.
And yeah, so recently I just launched a venture fund, a micro venture fund called Composed Ventures, specifically to invest in companies that are helping make this happen. And as an entrepreneur, I've started a company to make large contributions toward this effort. But now there's a whole ecosystem of companies that are building technologies that are helping make it easier to build composable data systems.
Why now is the right time
So what do you think is the reason that this composable data stack didn't happen, say 10 years ago? Why now is the right time?
Yeah, I see it as a natural evolution of the way that systems across many domains develop. The best analogy I can give is semiconductor manufacturing. The original model of semiconductor manufacturing is the Intel model, which is vertically integrated: Intel built their own tools, all of their designs are proprietary, all of their chip fabrication is proprietary and in-house. They control everything, top to bottom.
And compare that to the new way of designing and building computer processors. We have open processor architectures and specifications, like the ones you can license from ARM. There's now RISC-V, which is a totally open source, freely available processor architecture that can be used for chip fabrication. There are a number of companies that build the software that assists with chip design. Cadence Design Systems is one company that many people have never heard of, but it's a very valuable company that builds software for chip design.
We have fabless semiconductor companies like NVIDIA: one of the most valuable companies in the world doesn't own any fabs; they have their chips manufactured by TSMC in Taiwan. And TSMC in turn has specialized in taking all of these pieces and being really good at producing chips with high yield, but they're dependent on ASML, which provides the world's most advanced photolithography. And ASML in the Netherlands is in turn dependent on advanced optics from, I think, Zeiss in Germany.
And so basically, if you decompose all of these problems that Intel was responsible for from top to bottom, there's now a specialist building tools around open specifications and reusable systems for each layer of the stack. You've got a specialist in photolithography, a specialist in optics, a specialist in semiconductor manufacturing, and specialists in software for chip design.
And that's very much what's going on in data systems right now. In the past, it was more expedient for somebody building a database or a data processing system to take ownership of all of the pieces in order to ship something more quickly. But now we've gone through that first wave of progress in open source data management systems, so we can start to take a step back and say, okay, we want to make things ten times better in all of these different places.
And so there was a collective recognition in the middle 2010s, I would say, that it wasn't sustainable for us to continue building these vertically integrated systems. That's what led to, firstly, open source file formats that became widely adopted, like Parquet. Then Arrow provided this in-memory data interchange and computing layer, which everyone realized was something that we needed. Then we need reusable execution engines, and that's led to things like DuckDB, DataFusion, and Velox.
We're starting to think more about the user interface and the query optimization layer. So another project that I created at the same time as Arrow is Ibis. It's a Python project which provides a portable data frame query layer. You can use it to write your analytical queries, and then, depending on what backend you're using, whether you're running in memory with DuckDB or with Pandas, or you're running against your data warehouse, like BigQuery or Snowflake or another cloud data warehouse, Ibis knows how to generate the SQL code or the Pandas operations that you need to run that query.
And so that really helps with achieving this decoupling of concerns in systems. Pandas is an example of a vertically integrated project where we were responsible for building everything top to bottom. Having gone through that experience already was what really motivated me, that plus wanting to build Python interfaces to big data systems.
So when I was at Cloudera, one of the things that I wanted to figure out was how to build Python interfaces that can use all of these large scale systems, that can work with Apache Hive or Hadoop or Spark, without having to rewrite your code to go from one system to another. That's what motivated me to create Ibis, which is still going and has been developing rapidly in the last few years, as well as Arrow, which is more of an infrastructure level project. But for Python users, Ibis is a project that you can pick up and use to author complex SQL queries with the comfort of a DataFrame API, which is really nice.
Vertical vs. composable tools
So like you mentioned, Pandas is more vertical, and it also has a huge user base. So there is a benefit to having a vertical data processing tool. When do you think vertical tools are more appropriate, and when is it better to use composable data tools?
So one of the challenges with the vertically integrated strategy, especially when you have a piece of software like Pandas that people love as much as they do, is that they want everything to work just like Pandas. And the trouble is that the Pandas API is pretty large, and many aspects of its API, the code that you write, are coupled to, or contain details of, how Pandas is designed and implemented.
There have been a number of projects, like Modin and Koalas, that have created Pandas emulation layers that translate Pandas operations onto some other execution environment. There's a company, Ponder Data, which used Modin to create a Pandas interface to Snowflake. They were acquired by Snowflake and are working actively on that at Snowflake.
And so the challenge is that you can get 90 or 95% of the way there, but you're never going to be able to totally take existing Pandas code and run it against another processing engine, because there are details of Pandas that surface in the API. That's what's very challenging whenever you want to swap out the execution or the storage. If you want to take Pandas code and run it at a thousand times larger scale on a large data set, and I think Modin and Ponder are designed to help with that problem of running Pandas code at scale, there's always going to be some Pandas code that isn't going to be able to be translated and run, because it relies on internal details, like the fact that a Pandas DataFrame actually contains NumPy arrays internally.
I think for a specific persona, for example data scientists or machine learning engineers, where we have a variety of tools and we're not responsible for a large system, those vertical tools are useful for solving a specific problem. But if you are building a system to support a variety of use cases, I think having portability, particularly at the API level, is especially useful, like for people building systems that need to work in many different environments.
So for example, Ibis was designed to give you the full power and flexibility of everything you can do with SQL. And SQL is a very powerful programming language, which is why it's 50 years old and people are still building databases based on SQL. So SQL is really powerful, but SQL dialects are also very different from database to database. Even though there's the quote unquote SQL standard, in practice SQL dialects differ from database to database.
And so the idea of Ibis is that you have a single, standardized Python API. It's a data frame API in a real programming language, Python. You get tab completion, you get type checking, all the nice things that you get through Python: the ability to write unit tests, the ability to write functions and reuse code in a modular way. But then at the end of the day, you can write code once and run it on any of the 20 different backends. Or if you're using multiple types of SQL engines in your work, maybe you're using DuckDB, but also Databricks SQL or Spark SQL or Snowflake or BigQuery, from the same Python code you can emit SQL strings for all of those backends and not have to rewrite your code at all. And that is really powerful.
Multi-modal data and new tools
So I think early on, a lot of data processing tools were focused on structured data. And now, with large language models, we have more unstructured data and multi-modal data. How do you think that will change data processing tools?
So multi-modal data is something that we didn't really tackle initially in Arrow. Chang She, my co-founder from Datapad and longtime collaborator on Pandas and other projects, recently founded a company called LanceDB. They are creating the Lance file format, which is an Arrow compatible file format, but designed for multi-modal AI data. It has support for vector embeddings, images, and the kind of data that you find in LLM applications, as well as support for building the secondary indexes that you need for LLM, vector database type workloads.
And we've also seen a whole ecosystem of new vector database products emerge, as well as vector database plugins for existing databases. There are, I think, a couple of different projects for Postgres, for example. It sort of remains to be seen whether the support for multi-modal data will happen mainly through specialized tools and systems, as opposed to extensions to general purpose databases.
But I think that's an interesting area to explore, and projects like Lance, which are starting with Arrow and layering on these multi-modal data management capabilities, are super interesting because they're definitely building for the composable data stack by being based on Arrow and using Arrow as much as possible.
So we see a lot of data tools coming and going. So based on your experience, what are some tools you think are going to be obsolete? What are some emerging tools you think that are going to be more important?
I wonder if I might become obsolete. A lot of us are writing code, and I think the more optimistic way to look at it is that we'll get to spend more time doing the fun stuff as Copilot and generative AI automate a lot of the boring stuff that we have to do while we're doing data analysis or doing CRUD tasks: writing code that's repetitive, or transforming data from one format to another, things like that.
I am very optimistic about having better logical separation between the APIs and user interfaces that we use, with projects like Ibis, and the backend. That's not to say that the Pandas API isn't great. I think it is great, but it does pose a challenge for being able to run workloads at scale, or to essentially transpile your workloads based on where you need to run the code, the size of the dataset, and many other factors.
I think this is especially important as we see hardware heterogeneity increase. Right now, we already have multiple GPU architectures: Intel, AMD, and NVIDIA have separate GPU computing architectures, and Apple Silicon has its own Metal GPU architecture. And some of the machine learning frameworks have been optimized for all of these architectures.
I think that as time goes on, it's going to get easier and easier, or become something that developers have to think less and less about, where basically we're able to automatically take advantage of hardware acceleration when it's there, without having to explicitly opt into it. I think that's already happened to a great degree in deep learning, where TensorFlow was a pioneer in enabling hardware heterogeneous computing. Try to say that 10 times fast. You can write your workload in TensorFlow or in PyTorch, and if you have TPUs available, it will use them; if you have GPUs available, it will use them. And that enables portability across hardware.
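In PyTorch, for example, that portability shows up as simple runtime device selection: the model code is identical whichever accelerator happens to be present. A minimal sketch (nothing here is specific to any one machine):

```python
import torch

# Pick the best available accelerator at runtime
if torch.cuda.is_available():
    device = torch.device("cuda")   # NVIDIA (or ROCm-built) GPUs
elif torch.backends.mps.is_available():
    device = torch.device("mps")    # Apple Silicon GPUs via Metal
else:
    device = torch.device("cpu")    # portable fallback

# The same model and data code runs unchanged on any of the devices above
x = torch.randn(4, 8, device=device)
layer = torch.nn.Linear(8, 2).to(device)
y = layer(x)
print(y.shape)
```

The framework, not the application developer, is responsible for generating the right kernels for whichever device was selected.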
And so that's, I think, a really positive trend. Increasingly, I think that systems that are not built with this modularity in mind, that can't run on different types of hardware seamlessly, be reusable, or take advantage of these composable ideas, will increasingly be seen as the last generation of technology, best replaced by systems that are built the more modern way, which will be the modular, composable way.
And I've been telling everyone I've been following pretty actively this new company, Modular. It's Chris Lattner's new company. He created the LLVM compiler project, and there's something called MLIR, which is a layer on top of LLVM that's intended to make it easier to write kernels for deep learning or other machine learning workloads, but be able to compile for different hardware targets. And they've built a whole new programming language called Mojo, which is producing really amazing results on LLM workloads.
So I do think that the folks working on compiler technologies are making this a lot easier. The company is called Modular, so it's a similar theme to all the things I'm talking about. I think that is definitely going to be the way of the future in terms of how we build these systems: creating abstractions where developers don't have to really think about the details of the hardware. Nobody will need to be an expert in how NVIDIA GPUs work or how Apple Silicon GPUs work. That's something the compiler will take care of, and we'll have frameworks expressive enough for developing these systems that we can generate optimal code for the type of hardware that's available, and we can upgrade the hardware independent of the software.
Joining Posit
And later you left Voltron Data and became a principal architect at Posit. What made you make this move?
So I'm still an advisor at Voltron Data, helping out the engineering team and obviously continuing to drive the vision around composable data systems and the open source technologies that are the foundation of the company. But between Ursa Labs, Ursa Computing, and Voltron Data, I'd spent more than five years in entrepreneur mode working on Apache Arrow. And I felt that I had achieved a lot of what I personally needed to contribute to that project, and I was opening myself up to explore the ecosystem and look for other projects to make investments in, personal investments of time and effort.
I was also interested in doing more venture investing, and having had a long working relationship with Posit, it was kind of the stars aligning: I can stay involved in Voltron Data and help Voltron Data succeed, but without being in a full-time operator role running the engineering team there. So it was good timing. And in the meantime, while I was working on Voltron Data, RStudio rebranded to Posit, incorporated Python into its enterprise products, and repositioned itself to be a polyglot data science company.
And for me, I really have been a huge fan of JJ Allaire and Hadley Wickham and the leadership at Posit. So to take a larger role in the mission of Posit, I think it was just the opportunity of a lifetime for me. And I'm in a position there where I'm able to make important contributions to Posit's product offerings, to help enhance the experience not only for Python data scientists, but for data scientists more generally, regardless of programming language.
But I also have the freedom there to continue working on critical open source technologies, or, if I want, to write another book. It's a place where tons of Posit employees have written books; there's a whole library of books written by folks at Posit. And so I think it gives me a platform to have a lot of impact in the data science world. So I'm very happy with the transition.
So now, is your role more like an individual contributor at Posit?
I don't manage anyone directly. Technically, I think, as a principal architect I report to JJ, and since I'm not managing anyone directly, that makes me an individual contributor. But I'm doing a mixture of things. I am writing code, a mixture of Python and some TypeScript, so working in multiple programming languages.
And I'm also helping with the product roadmaps around Posit's product offerings: Posit Connect, Posit Workbench, Posit Package Manager, and Quarto. I'm very interested in Quarto as a technology. For the third edition of my book, I migrated to Quarto, and so you can read it for free on my website; it's all powered by Quarto. I think that's a really important piece to help with content creation, building dashboards, building interactive documents, building interactive applications, and publishing and sharing the results of data analysis.
On coding, open source, and what's next
So when you were a full-time co-founder of Voltron Data, did you miss writing code and kind of being a creator?
I would do some coding now and then while I was full-time at Voltron, but you know, often, when you're running a company, software development is not the best use of your time. So if I did write some C++ on Arrow, it would usually be a nights and weekends kind of thing, or maybe I had a long plane flight and I'd be like, oh, okay, what am I going to do for the next five hours? I guess I'll find an issue to hack on and write some code. I did miss coding, but when you're in an operational role like that, I found it's better to not get in the way.
So I'm happy to be spending more of my time coding again. And my plan is to spend the majority of my time doing software development, doing direct hands-on coding work, or writing, or, you know, podcasts.
I know a lot of my engineer friends who later become managers or founders because they want to have more impact, and then they realize, oh, I miss writing code, and now I have to manage people and write performance reviews. And then they really struggle with the change of work. A friend of mine was a director of engineering, and later he went back to being an individual contributor. He really enjoys having more time to write code, but he misses the direct influence you can have on your team and on other people's teams. Do you miss that element?
I mean, I enjoy doing development work, but I'm, I think, by nature a change agent. I like building successful software projects and building productive developer groups. And working in an architect role, even if you aren't directly managing people, your job is to shape the culture and the roadmap of how the software project works: the tools that it uses, how it operates on a day-to-day basis and makes decisions. What is the code review culture? What is the issue management and project planning culture? As well as identifying high leverage projects to help the project move forward.
And I think my most productive periods have been in that principal developer role where you're writing a lot of code and you're kind of the tip of the spear, helping carry the torch and lead the charge in an ambitious software project. I really enjoy doing that, and I really like working on new things. Because people have asked me, Wes, you aren't really working on Pandas very much anymore. And I think partly it's because, while a lot of the work in Pandas is very important, there's a whole large core team. Pandas is just an enormous project; it's had thousands of different people contribute to it. But it's a very different project now than it was 15 years ago when it was brand new.
And so I really enjoy that exploratory, trailblazing kind of work, building something new where you're figuring out something new, building something from scratch. And usually once a project gets more mature, after four or five years, this also happens. It happened with Arrow: I'm not doing as much Arrow development these days, but Arrow has also become a more mature software project. And so it made sense to not be such a looming presence in the project, to make space for other maintainers to grow into leadership roles and to not be dependent on me for driving the project forward. Which is great, because it frees me up to think about what to build next.
And that's part of what I'm doing right now: thinking about what's next. Arrow has had a huge impact in the data world, and pandas has had a big impact. I think Ibis is going to have an even bigger impact in the coming years. So I'm looking for other things we could be doing to help transform the ecosystem and continue to make progress.
And you're in a very interesting position, as an advisor to a startup you founded, as a principal architect at Posit, and as an investor in a variety of startups. What does your week look like? Do you feel you have to constantly switch context?
Well, fortunately, I try to structure my days so that I have a block of meetings and then a block of deep work, and I try to protect as many deep work blocks in my schedule as possible. Maybe some of you have read Paul Graham's essay "Maker's Schedule, Manager's Schedule," which discusses the friction, the push and pull, between having meetings, networking, and influencing other people, and making time for doing deep technical work.
So it's a lot of calendar management. I mentioned I'm doing the venture investing, but my goal is not to be a full-time VC; it's really a part-time thing, so I'm not out there hustling for deals. If somebody introduces me to a founder or a group of founders working on something I know about, where I can make a judgment about their tech stack, or something that appeals to me, then I'll take a look. But yeah, my goal is for venture investing to take as little time as possible.
On the advising front, I think doing podcasts and networking with other software developers helps me identify opportunities and create interesting connections, kind of new synapses, between these different commercial and open source projects. So yeah, it's definitely a lot different than it was 10 or 12 years ago, when I just had pandas and didn't have a lot going on outside of that. But I've realized that influencing other people and helping align disparate efforts in the open source community, solving the social challenges in open source, is as important or more important than the contributions of any one person.
And so even when sometimes I'm like, "Oh, it's annoying, I have so many meetings, I'd really rather be coding," I have to recognize that those meetings and conversations, like recording podcasts, really matter, because getting people thinking on the same wavelength, or thinking about things in compatible ways, is really important.
Arrow is a good example of a project where it took a lot of collaboration, a lot of people-wrangling, to make it happen. That work of getting people marching to the beat of the same drum was essential; the project wouldn't have happened otherwise. And there's that social labor of understanding what other people's motivations are, whether there's an opportunity to collaborate, and whether there's something we could build that would make collaboration possible.
For a long time, one of my questions was how we could enable the Python and R communities to collaborate, and before Arrow we didn't really have a technology that facilitated that collaboration. Now we have a computing engine, a whole compute and data access framework, that ships in both Python and R. So we can make an improvement to Arrow and it's instantly available in both languages, and developers in both ecosystems can benefit.
Because on social media, people always pit Python and R against each other, like you have to pick a side. It's great that Posit now supports both Python and R.
Yeah, Hadley Wickham and I got together back when Arrow was starting in 2016 and talked several times, and we said that we thought the quote-unquote language wars were kind of dumb and we just wanted them to go away. One of the best things we could do was to start working more actively together and building things that could help end the language wars. And I think Arrow is one technology that's really helped a lot with that.
And obviously Posit, with its products for data science teams, is about making it easy for multi-language teams to use common technologies, for example for report deployment or application deployment. Being able to deploy an R-based Shiny application and a Python-based Streamlit application on the same platform is super useful. So is being able to deploy an R Markdown or Quarto document or a Jupyter notebook and set it up to rerun and send the report to my boss once a week, all in the same place. So yeah, the polyglot model feels to me like the model of the future, and I think anybody who's not building for a multilingual environment will find themselves disfavored amongst users.
Open source funding and what to work on next
And if there's one thing you could change or improve for the open source community, whether cultural or technical, what would that be?
I mean, I think overall we still need more structured avenues for open source developers to get financial support and direct funding for their work. Things are definitely a lot better now than they used to be. There's GitHub Sponsors. There's a new platform for open source funding called Polar that I'm pretty excited about; that's polar.sh, not to be confused with Polars, the Python library. There's also Tidelift and Patreon.
So there are new things that we didn't have years ago, and there are organizations like NumFOCUS that you can donate to, which help get money into the hands of small projects and fund infrastructure improvements across open source. But I remember a decade ago we had the aspiration that ideally there would be a hundred million dollars a year of government grants toward open source development, and we're a long way from that. And even a hundred million dollars a year wouldn't be enough, I think, to support the kind of maintenance, feature development, and innovation that's needed in this ecosystem.
And so I think any project that has become dependent on corporate support has an inherent vulnerability, because you're potentially subject to the capriciousness of a single company and its ability to sustain its investment in that project going forward. I've heard no project say "we have more funding than we know what to do with," which I guess means there's not enough funding. So I just hope that more corporations can allocate part of their R&D budget toward making targeted donations or investments in the open source projects they depend on.
And you said you like to work on new things. When you face a lot of new options, what's your philosophy, or do you have a set of criteria, for deciding what to work on next?
I try to learn from the users. I listen, or if I'm building something myself and I find that something doesn't work quite the way I think it should, then I'll try to talk to other people and see: do you also think this doesn't seem great? Do you also think this is a problem? And I mean, that's basically what happened with,
