Why he created pandas, the future of data systems - Wes McKinney - The Data Scientist Show #086
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone, welcome to the Data Scientist Show. Today we have Wes McKinney. He's an open source software developer and entrepreneur focusing on data processing tools and systems. He is the co-creator of the Pandas library and the author of Python for Data Analysis, which I have a copy of. And he's the co-founder of Voltron Data. Currently he's the principal architect at Posit. He also just launched his micro-VC fund, investing in data infrastructure, AI, and ML companies.
How pandas started
So I have to ask, what's the backstory of Pandas? What was the motivation?
Yeah, well, my first job out of college was in quant finance. I had a math degree, and it was 2007, and the great financial crisis had just started. And I was growing frustrated because I was under a lot of pressure to do analysis work, and I felt that I wasn't able to do it quickly enough. I had been introduced to Python programming, and I said, hey, this programming language is really great, but it's missing some data analysis tools, like the kinds of things that I had seen some of my colleagues program in R. And so I wanted to have some of the same kinds of tools in Python that I saw in R.
And it was also like a way for me to learn Python and have exposure to building a software project. But it started out as tools for myself. And at a certain point, I started socializing it with my colleagues. And they also really liked using it. And then I convinced the company where I worked, AQR, to let me open source it. And so we open sourced it at the end of 2009. And so I gave my first talk to the Python community at PyCon 2010 about the project. I think that video is still online somewhere.
And yeah, I started grad school in 2010. And then at some point, I realized that there was a huge opportunity to make Python an important language in statistical computing and data science. And so I dropped out of grad school to work full-time on Pandas starting from like May 2011 and spent a little over a year working full-time on the project. I wrote my book, Python for Data Analysis. And yeah, it really helped fill out the features in the library and build the initial open source community.
And sometime in 2013, Chang She and I decided to start a company. He's one of the early developers of Pandas, and he was also at AQR with me. And there were other people that we had gotten involved in Pandas development. So when we got busy with our startup, Datapad, we turned over the Pandas project to the other core developers. So we haven't been so actively involved in Pandas development since 2013, 2014. And I've been working on other projects in the meantime, but obviously Pandas has become extremely successful.
So at the time, I spent a lot of time trying to come up with a name because I was like, this is a Python data analysis toolkit. And I was working with all these econometricians who spent a lot of time talking about panel data. And so I was like, Python data analysis, panel data. I sort of was mashing around syllables and letters. And I was like, oh, there's like a panda found in here. And so initially it was going to be Panda, but then somebody suggested that Pandas was like funnier. So that was kind of the backstory. But it was, yeah, the main origin of the word was panel data.
Initially, I would say the main challenge was the SEO wasn't good. You would search for Pandas and you would not get the thing that you were looking for. And eventually, if you wrote Python Pandas, it would come up. But now when you type Pandas, it comes up in Google.
So when you felt frustrated about data processing, before you created Pandas, what were some data science or analytics projects you were working on?
Yeah, I was working on a small statistical modeling project that involved some financial data sets, and the data was cross-sectional data over time. So panel data. And the data was patchy, so there were some data issues. One of the problems I was really focused on solving was making a tool that made it really easy to work with cross-sectional data over time: time series data, or cross-sectional data with a lot of patchiness.
So really good support for the indexing capabilities in Pandas, and the data realignment logic that's built into the Series and DataFrame data structures. I wanted all of that to work automatically, so that it would automatically deal with the data quality and data alignment issues that I was experiencing in my work. And so it was trying to incorporate ideas from R, like the data frame, the tabular data structure, but also add in the data realignment logic, the data indexing logic.
So you can do arithmetic between different time series where maybe there are missing dates in one time series, or there might be missing stock tickers in one series. And you can do math across these data sets that have data quality issues, and it handles automatically realigning things for you, which is really nice. At the time, that was the thing that I really wanted to work well and be intuitive.
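A quick sketch of what that automatic alignment looks like in practice (the dates and values here are made up for illustration): arithmetic between two pandas Series with different date indexes aligns on the union of the labels, and positions missing from either side come back as NaN instead of raising an error.

```python
import pandas as pd

# Two daily return series with different missing dates (illustrative data)
a = pd.Series([0.01, 0.02, 0.03],
              index=pd.to_datetime(["2009-01-02", "2009-01-05", "2009-01-06"]))
b = pd.Series([0.015, 0.025],
              index=pd.to_datetime(["2009-01-05", "2009-01-06"]))

# Subtraction aligns on the union of the two indexes automatically;
# dates present in only one series produce NaN rather than an error
spread = a - b
print(spread)
```

No manual reindexing or join step is needed; the alignment is built into the arithmetic itself, which is exactly the behavior described above.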
Voltron Data and Apache Arrow
And what made you want to start Voltron Data?
So we had spent several years working on Apache Arrow. Arrow, for people that don't know, is a bunch of things, but it started out because we wanted to create basically a universal table format, a universal data frame format, that could be transported really efficiently between data processing systems and from one programming language to another. So for example, being able to share large datasets efficiently between R and Python.
And we always had the aspiration of building computing engines that were Arrow based. So first we had to get Arrow adopted and successful as a data format, something that's adopted and used by many systems. And then we moved on to building compute engines that are Arrow based. And so now there are several compute engines that support Arrow natively: DuckDB, DataFusion, and Acero, which is part of the Arrow C++ project.
So we've got these reusable, what we call modular computing engines. And so our idea was that we wanted to create a company that could, on one hand, provide a lot of development and commercial support, drive forward the Arrow development roadmap, and provide enterprise support for companies that are building on Arrow, to be a partner to them in helping drive forward the open source project.
But we also saw that there was a lot of opportunity in enabling next generation data systems to be more modular, to be Arrow based, and to be able to take advantage of hardware acceleration. So we created Voltron Data kind of on one hand to be a driving force in Apache Arrow, but also to build some technology to facilitate this transition to these kind of modular computing engines and taking advantage of hardware acceleration more seamlessly.
But Arrow is interesting because it's the kind of project that most data scientists won't come into contact with directly. It's something that just starts getting used: it's getting used in Pandas now, it's being used in a lot of other projects, but it's used internally. So it's something that makes things faster, more efficient, more interoperable. You can now share data very easily between R and Python. You can use Arrow to interact with large Parquet datasets that are stored in the cloud. So there are all kinds of new use cases that have been unlocked through adoption of Arrow. But most data scientists don't need to know about Arrow. They get it indirectly through the tools that they're already using.
The composable data stack
So you mentioned modular data processing tools, and also as an investor you're interested in emerging composable data stacks. So what's the benefit of the tools being modular or composable?
So the benefit of this modularity, or what we call the composability concept, is that it facilitates reuse. Firstly, you make it easier for people to collaborate on shared, reusable software components that many people can use to build many different kinds of data processing systems. DuckDB is a classic example: you have a cutting-edge analytic database system that's available as a single C++ file that you can drop into any project, or you can load it into your web application and have super fast SQL processing basically anywhere: on your phone, in your web browser, really anywhere.
And that reusability is what we call composability. The composability comes from the use of open standards: in order to achieve composability, you need to have a standardized interface between that piece of software and your application. And so part of what we've done in the Arrow project and at Voltron Data is driving these open standards that enable and facilitate that composability and reuse.
And so now there's a growing collection of open source projects that are participating in what we call the composable data stack. I helped write a paper last year with Meta called "The Composable Data Management System Manifesto". Try saying that five times fast. But basically we tried to communicate a vision (maybe we can put a link to that paper in the show notes) of what the future looks like where data warehouses, databases, and data processing engines are built with reusable, modular components.
And the idea is that you want to build a system in such a way that when something new and better comes along in the future, you can change out the old part and put in the new part without disrupting the whole user experience. Things get better, but you don't have to completely throw out the system and use something completely new. You can hot swap components without breaking the whole system, without throwing out the baby with the bathwater.
And yeah, so recently I just launched a venture fund, a micro venture fund called Composed Ventures, specifically to invest in companies that are helping make this happen. And as an entrepreneur, I've started a company to make large contributions toward this effort. But now there's a whole ecosystem of companies that are building technologies that are helping make it easier to build composable data systems.
Why now is the right time
So what do you think is the reason that this composable data stack didn't happen, say 10 years ago? Why now is the right time?
Yeah, I see it as a natural evolution of the way that systems across many domains develop. The best analogy I can give is semiconductor manufacturing. The original model of semiconductor manufacturing is the Intel model, which is vertically integrated: Intel built their own tools, all of their designs are proprietary, all of their chip fabrication is proprietary and in-house. They control everything, top to bottom.
And compare that to the new way of designing and building computer processors. We have open processor architectures and specifications, like the ones you can license from ARM. There's now RISC-V, which is a totally open source, freely available processor architecture that can be used for chip fabrication. There are a number of companies that build the software that assists with chip design. Cadence Design Systems is one company that many people have never heard of, but it's a very valuable company that builds software for chip design.
We have fabless semiconductor companies like NVIDIA: one of the most valuable companies in the world doesn't own any fabs; they have their chips manufactured by TSMC in Taiwan. And TSMC in turn has specialized in taking all of these pieces and being really good at producing chips with high yield, but they're dependent on ASML, which provides the world's most advanced photolithography. And ASML in the Netherlands is in turn dependent on advanced optics from, I think, Zeiss in Germany.
And so basically, if you decompose all of these problems that Intel was responsible for from top to bottom, there's now a specialist building tools around open specifications and reusable systems for each layer of the stack. You've got a specialist in photolithography, a specialist in optics, a specialist in semiconductor manufacturing, and specialists in software for chip design.
And that's very much what's going on in data systems right now. In the past, it was more expedient for somebody building a database or a data processing system to take ownership of all of the pieces in order to ship something more quickly. But now we've gone through that first wave of progress in open source data management systems, so we can start to take a step back and say, okay, we want to make things ten times better in all of these different places.
And so there was a collective recognition in the middle 2010s, I would say, that it wasn't sustainable for us to continue building these vertically integrated systems. That's what led to, firstly, open source file formats that became widely adopted, like Parquet. Then Arrow provided this in-memory data interchange and computing layer, which everyone realized was something that we needed. Then we need reusable execution engines, and that's led to things like DuckDB, DataFusion, and Velox.
We're starting to think more about the user interface and the query optimization layer. So another project that I created at the same time as Arrow is Ibis. It's a Python project which provides a portable data frame query layer. You can use it to write your analytical queries, and then, depending on what backend you're using, whether you're running in memory with DuckDB or with Pandas, or you're running against your data warehouse, like BigQuery or Snowflake or another cloud data warehouse, Ibis knows how to generate the SQL code or the Pandas operations that you need to run that query.
And so that really helps with achieving this decoupling of concerns in systems. Pandas is an example of a vertically integrated project where we were responsible for building everything top to bottom. Having gone through that experience already was what really motivated me, that plus wanting to build Python interfaces to big data systems.
So when I was at Cloudera, one of the things that I wanted to figure out was how to build Python interfaces that can use all of these large scale systems, that can work with Apache Hive or Hadoop or Spark, without having to rewrite your code to go from one system to another. That's what motivated me to create Ibis, which is still going and has been developing rapidly in the last few years, as well as Arrow, which is more of an infrastructure level project. But for Python users, Ibis is a project that you can pick up and use to author complex SQL queries with the comfort of a DataFrame API, which is really nice.
Vertical vs. composable tools
So like you mentioned, Pandas is more vertical, and it also has a huge user base. So there is a benefit to having a vertical data processing tool. When do you think vertical tools are more appropriate, and when is it better to use composable data tools?
So one of the challenges with the vertically integrated strategy, especially when you have a piece of software like Pandas that people love as much as they do, is that they want everything to work just like Pandas. And the trouble is that the Pandas API is pretty large, and many aspects of its API, the code that you write, are coupled to, or contain details of, how Pandas is designed and implemented.
There have been a number of projects, like Modin and Koalas, that have created Pandas emulation layers that translate Pandas operations onto some other execution environment. There's a company, Ponder Data, which used Modin to create a Pandas interface to Snowflake. They were acquired by Snowflake and are working actively on that at Snowflake.
And so the challenge is that you can get 90 or 95% of the way there, but you're never going to be able to totally take existing Pandas code and run it against another processing engine, because there are details of Pandas that surface in the API. That's what's very challenging whenever you want to swap out the execution or the storage. If you want to take Pandas code and run it at a thousand times larger scale on a large data set, and I think Modin and Ponder are designed to help with that problem of running Pandas code at scale, there's always going to be some Pandas code that isn't going to be able to be translated and run, because it relies on internal details, like the fact that a Pandas DataFrame actually contains NumPy arrays internally.
I think for a specific persona, for example data scientists or machine learning engineers, where we have a variety of tools and we're not responsible for a large system, those vertical tools are useful for solving a specific problem. But if you are building a system to support a variety of use cases, I think having portability, particularly at the API level, is especially useful, like for people building systems that need to work in many different environments.
So for example, Ibis was designed to give you the full power and flexibility of everything you can do with SQL. And SQL is a very powerful programming language, which is why it's 50 years old and people are still building databases based on SQL. So SQL is really powerful, but SQL dialects are also very different from database to database. Even though there's the quote unquote SQL standard, in practice SQL dialects differ from database to database.
And so the idea of Ibis is that you have a single, standardized Python API. It's a data frame API in a real programming language, Python. You get tab completion, you get type checking, all the nice things that you get through Python: the ability to write unit tests, the ability to write functions and reuse code in a modular way. But then at the end of the day, you can write code once and run it on any of the 20 different backends. Or if you're using multiple types of SQL engines in your work, maybe you're using DuckDB, but also Databricks SQL or Spark SQL or Snowflake or BigQuery, from the same Python code you can emit SQL strings for all of those backends and not have to rewrite your code at all. And that is really powerful.
Multi-modal data and new tools
So I think early on, a lot of data processing tools were focused on structured data. And now, with large language models, we have more unstructured data and multi-modal data. How do you think that will change data processing tools?
So multi-modal data is something that we didn't really tackle initially in Arrow. Chang She, my co-founder from Datapad and longtime collaborator on Pandas and other projects, recently founded a company called LanceDB. They are creating the Lance file format, which is an Arrow compatible file format, but designed for multi-modal AI data. It has support for vector embeddings, images, and the kind of data that you find in LLM applications, as well as support for building the secondary indexes that you need for LLM, vector database type workloads.
And we've also seen a whole ecosystem of new vector database products emerge, as well as vector database plugins for existing databases. There are, I think, a couple of different projects for Postgres, for example. It sort of remains to be seen whether the support for multi-modal data will happen mainly through specialized tools and systems, as opposed to extensions to general purpose databases.
But I think that's an interesting area to explore, and projects like Lance, which are starting with Arrow and layering on these multi-modal data management capabilities, are super interesting because they're definitely building for the composable data stack by being based on Arrow and using Arrow as much as possible.
So we see a lot of data tools coming and going. So based on your experience, what are some tools you think are going to be obsolete? What are some emerging tools you think that are going to be more important?
I wonder if I might become obsolete. A lot of us are writing code, and I think the more optimistic way to look at it is that we'll get to spend more time doing the fun stuff as Copilot and generative AI automate a lot of the boring stuff that we have to do while we're doing data analysis or doing CRUD tasks: writing code that's repetitive, or transforming data from one format to another, things like that.
I am very optimistic about having better logical separation between the APIs and user interfaces that we use, with projects like Ibis, and the backend. That's not to say that the Pandas API isn't great. I think it is great, but it does pose a challenge for being able to run workloads at scale, or to essentially transpile your workloads based on where you need to run the code, the size of the dataset, and many other factors.
I think this is especially important as we see hardware heterogeneity increase. Right now, we already have multiple GPU architectures: Intel, AMD, and NVIDIA have separate GPU computing architectures, and Apple Silicon has its own Metal GPU architecture. And some of the machine learning frameworks have been optimized for all of these architectures.
I think that as time goes on, it's going to get easier and easier, or become something that developers have to think less and less about, where basically we're able to automatically take advantage of hardware acceleration when it's there, without having to explicitly opt into it. I think that's already happened to a great degree in deep learning, where TensorFlow was a pioneer in enabling hardware heterogeneous computing. Try to say that 10 times fast. You can write your workload in TensorFlow or in PyTorch, and if you have TPUs available, it will use them; if you have GPUs available, it will use them. And that enables portability across hardware.
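In PyTorch, for example, that portability shows up as simple runtime device selection: the model code is identical whichever accelerator happens to be present. A minimal sketch (nothing here is specific to any one machine):

```python
import torch

# Pick the best available accelerator at runtime
if torch.cuda.is_available():
    device = torch.device("cuda")   # NVIDIA (or ROCm-built) GPUs
elif torch.backends.mps.is_available():
    device = torch.device("mps")    # Apple Silicon GPUs via Metal
else:
    device = torch.device("cpu")    # portable fallback

# The same model and data code runs unchanged on any of the devices above
x = torch.randn(4, 8, device=device)
layer = torch.nn.Linear(8, 2).to(device)
y = layer(x)
print(y.shape)
```

The framework, not the application developer, is responsible for generating the right kernels for whichever device was selected.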
And so that's, I think, a really positive trend. Increasingly, I think that systems that are not built with this modularity in mind, that can't run on different types of hardware seamlessly, be reusable, or take advantage of these composable ideas, will increasingly be seen as the last generation of technology, best replaced by systems that are built the more modern way, which will be the modular, composable way.
And I've been telling everyone I've been following pretty actively this new company, Modular. It's Chris Lattner's new company. He created the LLVM compiler project, and there's something called MLIR, which is a layer on top of LLVM that's intended to make it easier to write kernels for deep learning or other machine learning workloads, but be able to compile for different hardware targets. And they've built a whole new programming language called Mojo, which is producing really amazing results on LLM workloads.
So I do think that the folks working on compiler technologies are making this a lot easier. The company is called Modular, so it's a similar theme to all the things I'm talking about. I think that is definitely going to be the way of the future in terms of how we build these systems: creating abstractions where developers don't have to really think about the details of the hardware. Nobody will need to be an expert in how NVIDIA GPUs work or how Apple Silicon GPUs work. That's something the compiler will take care of, and we'll have frameworks expressive enough for developing these systems that we can generate optimal code for the type of hardware that's available, and we can upgrade the hardware independent of the software.
Joining Posit
And later you left Voltron Data and became a principal architect at Posit. What made you make this move?
So I'm still an advisor at Voltron Data, helping out the engineering team and obviously continuing to drive the vision around composable data systems and the open source technologies that are the foundation of the company. But between Ursa Labs, Ursa Computing, and Voltron Data, I'd spent more than five years in entrepreneur mode working on Apache Arrow. And I felt that I had achieved a lot of what I personally needed to contribute to that project, and I was opening myself up to explore the ecosystem and look for other projects to make investments in, personal investments of time and effort.
I was also interested in doing more venture investing, and having had a long working relationship with Posit, it was kind of the stars aligning: I can stay involved in Voltron Data and help Voltron Data succeed, but without being in a full-time operator role running the engineering team there. So it was good timing. And in the meantime, while I was working on Voltron Data, RStudio rebranded to Posit, incorporated Python into its enterprise products, and repositioned itself to be a polyglot data science company.
And for me, I really have been a huge fan of JJ Allaire and Hadley Wickham and the leadership at Posit. So to take a larger role in the mission of Posit, I think it was just the opportunity of a lifetime for me. And I'm in a position there where I'm able to make important contributions to Posit's product offerings, to help enhance the experience not only for Python data scientists, but for data scientists more generally, regardless of programming language.
But I also have the freedom there to continue working on critical open source technologies, or, if I want, to write another book. It's a place where tons of Posit employees have written books; there's a whole library of books written by folks at Posit. And so I think it gives me a platform to have a lot of impact in the data science world. So I'm very happy with the transition.
So now, is your role more like an individual contributor at Posit?
I don't manage anyone directly. Technically, I think, as a principal architect I report to JJ, and since I'm not managing anyone directly, that makes me an individual contributor. But I'm doing a mixture of things. I am writing code, a mixture of Python and some TypeScript, so working in multiple programming languages.
And I'm also helping with the product roadmaps around Posit's product offerings: Posit Connect, Posit Workbench, Posit Package Manager, and Quarto. I'm very interested in Quarto as a technology. For the third edition of my book, I migrated to Quarto, and so you can read it for free on my website; it's all powered by Quarto. I think that's a really important piece to help with content creation, building dashboards, building interactive documents, building interactive applications, and publishing and sharing the results of data analysis.
On coding, open source, and what's next
So when you were a full-time co-founder of Voltron Data, did you miss writing code and kind of being a creator?
I would do some coding now and then while I was full-time at Voltron, but you know, often, when you're running a company, software development is not the best use of your time. So if I did write some C++ on Arrow, it would usually be a nights and weekends kind of thing, or maybe I had a long plane flight and I'd be like, oh, okay, what am I going to do for the next five hours? I guess I'll find an issue to hack on and write some code. I did miss coding, but when you're in an operational role like that, I found it's better to not get in the way.
So I'm happy to be spending more of my time coding again. And my plan is to spend the majority of my time doing software development, doing direct hands-on coding work, or writing, or, you know, podcasts.
I know a lot of my engineer friends who later become managers or founders because they want to have more impact, and then they realize, oh, I miss writing code, and now I have to manage people and write performance reviews. And then they really struggle with the change of work. A friend of mine was a director of engineering, and later he went back to being an individual contributor. He really enjoys having more time to write code, but he misses the direct influence you can have on your team and on other people's teams. Do you miss that element?
I mean, I enjoy doing development work, but I'm, I think, by nature a change agent. I like building successful software projects and building productive developer groups. And working in an architect role, even if you aren't directly managing people, your job is to shape the culture and the roadmap of how the software project works: the tools that it uses, how it operates on a day-to-day basis and makes decisions. What is the code review culture? What is the issue management and project planning culture? As well as identifying high leverage projects to help the project move forward.
And I think my most productive periods have been in that principal developer role where you're writing a lot of code and you're kind of the tip of the spear, helping carry the torch and lead the charge in an ambitious software project. I really enjoy doing that, and I really like working on new things. Because people have asked me, Wes, you aren't really working on Pandas very much anymore. And I think partly it's because, while a lot of the work in Pandas is very important, there's a whole large core team. Pandas is just an enormous project; it's had thousands of different people contribute to it. But it's a very different project now than it was 15 years ago when it was brand new.
And so I really enjoy that exploratory, trailblazing kind of work, building something new where you're figuring out something new, building something from scratch. And usually once a project gets more mature, after four or five years, this also happens. It happened with Arrow: I'm not doing as much Arrow development these days, but Arrow has also become a more mature software project. And so it made sense to not be such a looming presence in the project, to make space for other maintainers to grow into leadership roles and to not be dependent on me for driving the project forward. Which is great, because it frees me up to think about what to build next.
And that's part of what I'm doing right now: thinking about what's next. Arrow has had a huge impact in the data world, and pandas has had a big impact. I think Ibis is going to have an even bigger impact in the coming years. So I'm looking for other things we could be doing to help transform the ecosystem and continue to make progress.
And you're in a very interesting position, as an advisor to a startup you founded, as a principal architect at Posit, and as an investor in a variety of startups. What does your week look like? Do you feel you have to constantly switch context?
Well, fortunately, I try to structure my days so that I have a block of meetings and then a block of deep work, and I try to protect as many deep work blocks in my schedule as possible. Maybe some of you have read Paul Graham's essay "Maker's Schedule, Manager's Schedule," which discusses the friction, the push and pull, between having meetings, networking, and influencing other people, and making time for doing deep technical work.
So it's a lot of calendar management. I mentioned I'm doing the venture investing, but my goal is not to be a full-time VC; it's really a part-time thing, so I'm not out there hustling for deals. If somebody introduces me to a founder or a group of founders working on something I know about, where I can make a judgment about their tech stack, or something that appeals to me, then I'll take a look. But yeah, my goal is for venture investing to take as little time as possible.
On the advising front, I think doing podcasts and networking with other software developers helps me identify opportunities and create interesting connections, kind of new synapses, between these different commercial and open source projects. So yeah, it's definitely a lot different than it was 10 or 12 years ago, when I just had pandas and didn't have a lot going on outside of that. But I've realized that influencing other people and helping align disparate efforts in the open source community, solving the social challenges in open source, is as important or more important than the contributions of any one person.
And so even when sometimes I'm like, "Oh, it's annoying, I have so many meetings, I'd really rather be coding," I have to recognize that those meetings and conversations, like recording podcasts, really matter, because getting people thinking on the same wavelength, or thinking about things in compatible ways, is really important.
Arrow is a good example of a project where it took a lot of collaboration, a lot of people-wrangling, to make it happen. That work of getting people marching to the beat of the same drum was essential; the project wouldn't have happened otherwise. And there's that social labor of understanding what other people's motivations are, whether there's an opportunity to collaborate, and whether there's something we could build that would make collaboration possible.
For a long time, one of my questions was how we could enable the Python and R communities to collaborate, and before Arrow we didn't really have a technology that facilitated that collaboration. Now we have a computing engine, a whole compute and data access framework, that ships in both Python and R. So we can make an improvement to Arrow and it's instantly available in both languages, and developers in both ecosystems can benefit.
Because on social media, people always pit Python and R against each other, like you have to pick a side. It's great that Posit now supports both Python and R.
Yeah, Hadley Wickham and I got together back when Arrow was starting in 2016 and talked several times, and we said that we thought the quote-unquote language wars were kind of dumb and we just wanted them to go away. One of the best things we could do was to start working more actively together and building things that could help end the language wars. And I think Arrow is one technology that's really helped a lot with that.
And obviously Posit, with its products for data science teams, is about making it easy for multi-language teams to use common technologies, for example for report deployment or application deployment. Being able to deploy an R-based Shiny application and a Python-based Streamlit application on the same platform is super useful. So is being able to deploy an R Markdown or Quarto document or a Jupyter notebook and set it up to rerun and send the report to my boss once a week, all in the same place. So yeah, the polyglot model feels to me like the model of the future, and I think anybody who's not building for a multilingual environment will find themselves disfavored amongst users.
Open source funding and what to work on next
And if there's one thing you could change or improve for the open source community, whether cultural or technical, what would that be?
I mean, I think overall we still need more structured avenues for open source developers to get financial support and direct funding for their work. Things are definitely a lot better now than they used to be. There's GitHub Sponsors. There's a new platform for open source funding called Polar that I'm pretty excited about; that's polar.sh, not to be confused with Polars, the Python library. There's also Tidelift and Patreon.
So there are new things that we didn't have years ago, and there are organizations like NumFOCUS that you can donate to, which help get money into the hands of small projects and fund infrastructure improvements across open source. But I remember a decade ago we had the aspiration that ideally there would be a hundred million dollars a year of government grants toward open source development, and we're a long way from that. And even a hundred million dollars a year wouldn't be enough, I think, to support the kind of maintenance, feature development, and innovation that's needed in this ecosystem.
And so I think any project that has become dependent on corporate support has an inherent vulnerability, because you're potentially subject to the capriciousness of a single company and its ability to sustain its investment in that project going forward. I've heard no project say "we have more funding than we know what to do with," which I guess means there's not enough funding. So I just hope that more corporations can allocate part of their R&D budget toward making targeted donations or investments in the open source projects they depend on.
And you said you like to work on new things. When you face a lot of new options, what's your philosophy, or do you have a set of criteria, for deciding what to work on next?
I try to learn from the users. I listen, or if I'm building something myself and I find that something doesn't work quite the way I think it should, then I'll try to talk to other people and see: do you also think this doesn't seem great? Do you also think this is a problem? And I mean, that's basically what happened with,
