Resources

Marco Gorelli: Narwhals, ecosystem glue, and the value of boring work

video
Dec 16, 2025
55:00

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we sit down with Marco Gorelli, Celtic folk shredder, Narwhals mastermind, and software engineer at Quansight Labs. Welcome to The Test Set, where we explore the people behind the data. I'm Michael Chow, a principal software engineer at Posit, and I'm joined by my co-host, Wes McKinney, creator of Pandas, co-creator of Apache Arrow, and principal architect at Posit. And I think we're super pumped to have Marco with us, who is a software engineer at Quansight, creator of Narwhals, and also a maintainer or former maintainer of Pandas, so doing a lot of DataFrame stuff. Marco, thanks so much for coming on to The Test Set. I'm a giant fan of your work and just the rowdy party that is the Narwhals Discord. So, so happy to have you on.

Yeah, thank you so much for having me. It's a pleasure to be here. In fact, Michael, I think you might have been the first person, you and Rich, I think you might have been the first people who I ever demoed Narwhals to.

Oh, wow. Yeah, I remember that, it's almost like a piece of open source lore, I remember having an early chat about it. But that's actually crazy, I didn't know that it was that early in the Narwhals process.

We met up to talk about something else, which was about Great Tables. And then I think I just ended the call by saying, hey, check out this thing that I just started over the weekend. And sometimes your little weekend projects end up running away from you a bit and spiraling out of control. And that's what's happened here.

What is Narwhals?

Yeah, I love that. I mean, it's, it's been really cool to see and maybe for some background context, I guess you're also a contributor to Polars. So you've kind of spanned a lot of these data frame libraries, Polars, Pandas, Narwhals.

I mean, I'm embarrassed to say that I've actually never used Narwhals myself. I know what it is, but there are probably plenty of people listening who've never heard of Narwhals. They're like, Narwhals? What is that? So maybe you can explain the project and how it came about, and we can talk more about it.

Yeah, yeah, sure. I mean, I think it's very likely that you have used it, but accidentally, in the sense that it's, it's intended as a compatibility layer between data frame libraries. And it's not something that end users tend to use directly. Rather, it's something that people tend to use as a transitive dependency. So they tend to use it because some library that they're using is actually using Narwhals under the hood in order to be able to handle multiple kinds of data frame inputs.

As an example, if you've been using Plotly since version six, then Narwhals is a required dependency, and it's used to do all of the data frame operations. This allows Plotly to accept, say, a Polars data frame and keep all the computation native to Polars until it has to serialize, without having to convert to Pandas or depend on Pandas. And conversely, Pandas users can keep passing Pandas data frames to Plotly without needing to take on Polars as a dependency. You can meet users where they are, and the Pandas, Polars, and PyArrow user bases can all just enjoy using their native tools with Plotly and other data science tools, without even necessarily having to know that Narwhals exists.

I was just looking and I saw that it has 43 million monthly downloads, so it's definitely gotten some uptake. But to your point, I guess it's being picked up as a transitive dependency in a number of projects. I'm familiar with the problem because in the early days of Pandas, people would want to accept Pandas data frames in different libraries. I know that Scikit-learn was one of those projects where people were like, I really just want to pass a Pandas data frame into this project. But there was this challenge of, well, we can't require Pandas as a hard dependency of Scikit-learn. But then what if people want to pass other types of tabular data structures? And so, yeah, it seems like it's solving that problem and the evidence is in the uptake of the project, which is great.

Cheers, thank you. It's interesting that you mentioned Scikit-learn, actually. I just got pinged today on an issue in the repository where they're talking about using Narwhals because they do currently accept both Pandas and Polars, but they've got their own hand-rolled compatibility code, which I think they're getting a little bit tired of maintaining. There's been some issues reported, and it's the kind of situation where if they can outsource the work to a project where the only thing that we're concerned about is handling compatibility between different data frame libraries, then that can leave them to focus on their own competitive advantage.

If I'm understanding, you're saying a lot of libraries, they need to do a little bit of data frame-like stuff. They want to take a tabular structure, they might want to choose some columns or filter a little bit of data. So before Narwhals, or a lot of these kind of compatibility tools, they were sort of stuck with being like, oh, well, I want to do a little bit of data frame stuff, so I sort of need to choose a data frame to make a dependency and kind of build on top of that. Yes, that's exactly right. Whereas Narwhals allows you to write your data frame operations against the abstract idea of a data frame, and then whatever the user passes in will use the user's library.

Data types and the pain of interoperability

Now, my expectation when starting it was that the pain points we'd be addressing would be things like, I don't know, does group by maintain order? Or in the unique function, if you've got null values, do they count as one unique value, do you get one unique value per null, or do you not count null values at all?

And in the early days, I realized that actually the biggest pain points had to do with data types. In Pandas, if you want an int64, there are currently three different ways of expressing an int64 column. If you want a string column, there are even more. And unless you're actively, very closely following developments in the Pandas GitHub repository, it's very difficult to stay on top of it all.
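A rough sketch of the int64 situation in recent Pandas versions (the PyArrow-backed variant is left commented out, since it additionally requires pyarrow to be installed):

```python
import pandas as pd

# Three ways of expressing "an int64 column" in recent Pandas versions:
numpy_backed = pd.Series([1, 2], dtype="int64")  # classic NumPy-backed dtype
nullable = pd.Series([1, None], dtype="Int64")   # nullable extension dtype
# arrow_backed = pd.Series([1, 2], dtype="int64[pyarrow]")  # PyArrow-backed

print(numpy_backed.dtype, nullable.dtype)  # int64 Int64
```

Three spellings, three different null-handling and casting behaviors, yet all "an int64 column" from the end user's point of view.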

I think we might have had a discussion in Great Tables about this as well, where one of Pandas' string data types was displayed in one way, but if you used the PyArrow-backed string type, it was displayed in a different way. But arguably, from an end-user perspective, it's just a string column.

Yeah. Yeah, totally. And I should say, Great Tables is a tool I maintain that helps display tables. We don't use Narwhals yet, because I think we released before Narwhals did. But, just to maybe hype you up, we're ready to switch to Narwhals for a lot of the reasons you said, to take some of the burden off of wrangling different types of data frame inputs, so people can just bring whatever data frame they want and get really consistent output.

Exactly. I mean, in the Narwhals documentation, there's a "why" section where I try to answer: why would we need something like this anyway? And I give some examples of simple data frame operations which superficially look like they should be the same between libraries, but actually behave wildly differently. Now, I don't know if we'll get later to the topic of using AI tools for development, and I don't, for a second, want to discount the value that AI tools can bring in certain contexts. But for the development of Narwhals, they've been incredibly frustrating, because they always spit out things that look plausible but are actually missing some very key details. For example, if you ask an AI, what's the SQL equivalent of the Polars n_unique function? It'll tell you count distinct, except it's not actually equivalent, because Polars counts null values in n_unique, but count distinct doesn't. So you'll need to do count distinct and then compensate for whether null values were present or not.
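To illustrate that n_unique example with a toy table (using SQLite here just as a stand-in SQL engine, the same null semantics apply in standard SQL):

```python
import sqlite3

# Column values: 1, 2, 2, NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,), (None,)])

# COUNT(DISTINCT ...) ignores NULLs entirely:
(count_distinct,) = con.execute("SELECT COUNT(DISTINCT x) FROM t").fetchone()
print(count_distinct)  # 2

# Polars' n_unique counts null as one unique value, so to match its
# semantics we compensate for whether any nulls were present:
(has_null,) = con.execute("SELECT COUNT(*) > 0 FROM t WHERE x IS NULL").fetchone()
n_unique = count_distinct + (1 if has_null else 0)
print(n_unique)  # 3 -- what Polars would report
```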

I mean, the irony is that LLMs are notoriously bad at counting.

LLMs, training data, and keeping up with fast-moving APIs

Yeah, well, there was a period of time, I think it's gotten better now, but there was a moment where there was a gap between present reality and the training cutoff date, especially for the earlier versions of ChatGPT or Claude. I think it's gotten a lot better, firstly because the AI labs have done a better job of keeping their models closer to the present day, slurping up all of the data on GitHub and across the internet so the training data stays current whenever they release a model. Plus, libraries like Polars have stabilized a little more; they aren't changing and refining nearly as quickly. So it's a combination of the models getting better, the training data getting better, and Polars stabilizing. Now I'm using coding agents with Polars, like I've been using Polars to build a personal finance side project. And yeah, it stubs its toes and does things wrong maybe 10 or 20% of the time, but it's a lot more effective. I feel a lot more comfortable using Polars with Claude than I did a year ago.

Funnily enough, the first time I met Ritchie Vink, the creator of Polars, was at EuroSciPy, I think in 2023. And I remember at the time we had a little discussion and we were saying, ah, should group by actually be group underscore by? Because that's generally the philosophy that Polars follows in its API, but group by was just a single word, I think just for compatibility, because he'd taken it from what Pandas was doing. So he said, yeah, sure. Let's change it. Seems like a minor thing. What's the worst that can happen? And then GPT got super popular, and its cutoff date was just before that deprecation was introduced. It was phenomenally bad timing. I'm relieved that this didn't kill the project. But I think it was a good idea to change it anyway.

Expression systems and the pandas API

We did the same thing with group by in Ibis, which maybe has a slightly similar type of vibe to Narwhals, but different goals. But partially when designing the API of Ibis, there was a goal of deliberately placing distance between the perceived sins of the Pandas API, making things a little more normalized and consistent. But it's interesting because it's nice that now we have this whole idea of having an expression system and being able to express complex aggregations or complex data frame operations with a lazy expression system. Going back in time, we really didn't have that. But now it's nice that we have the type system of Arrow and modern databases with nested types to provide some structure for like, okay, these are all the things that expression systems need to support.

And then Polars designed its expression system to be able to do all of the fancy things you can do inside Polars aggregations, and Narwhals has adopted that. It seems like there was an effort to maybe push some of that upstream into Pandas. I heard rumblings of pd.col. I always wanted to have an expression system within Pandas, but it was one of those things that even today, my eyes bleed when I look at people trying to do complex aggregations in Pandas, and they're passing lambda functions. I'm like, no, that destroys the performance. The nature of open source, we're never quite where we want to be, but we're always working and trying to make progress.

Yeah, and I think pd.col is out. I think they released pd.col, is that right? Marco, were you responsible for that? Was that your doing?

I opened the pull request, yeah. Maybe just to clarify, it's not as nice as I would like it to be. It can only be used in places where Pandas already accepts callables. For example, loc, getitem, and assign, those are the big ones. Using it in group by would require some other upstream changes in Pandas, which I haven't yet had the capacity to make. But it's interesting that you bring up how central the concept of expressions became, and I totally agree. I think it's just so important to be able to express the abstract idea of an operation that you want to do without any of the objects in the expression having to be tied to any particular data frame.
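A toy sketch (hypothetical, not pandas' actual implementation) of why a column expression slots naturally into APIs that already accept callables: `col("x")` evaluates lazily, it's just a function of whatever frame it's eventually applied to.

```python
class Col:
    """A minimal lazy column expression: holds a name, evaluates on demand."""

    def __init__(self, name):
        self.name = name

    def __call__(self, df):
        # Called later by any API that accepts callables (loc, assign, ...).
        return df[self.name]


def col(name):
    return Col(name)


# A plain dict-of-lists "frame" stands in for a real DataFrame here.
frame = {"x": [1, 2, 3]}
expr = col("x")        # no frame involved yet -- the expression is abstract
print(expr(frame))     # [1, 2, 3] -- evaluated only when a frame is supplied
```

This is the core of why expressions decouple "what operation to do" from "which data frame to do it on".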

When Narwhals started, it was somewhat in response to a previous effort at trying to standardize data frame APIs, where one of the major points of contention that I had with the other participants was on the concept of expressions. Other people wanted to keep the API much more tied to data frames and series, those were the two main objects they wanted, whereas I wanted expressions. That was one of the major points that we just were not able to get agreement on, unfortunately. But I'm glad that in the years that have since passed, the concept of expressions has become really popular, and users certainly seem to enjoy it. You mentioned Ibis having expressions, and in fact, I think Ibis took expressions from siuba, which was a project started by Michael. So we've really come full circle here in this chat.

It is funny, because I do think that Ibis adopting lazy expressions is actually what really made me feel good about Ibis. Ibis is so powerful. I'll just summarize again, I know it's come up: it's basically a data frame API that can fire off to, say, a SQL backend or a Pandas data frame. I think they have support for Polars and things like DuckDB, so it's a nice way to use a single API against a lot of backends. These types of tools are so critical, I think. Just to be able to think in one data frame API and fire off to a lot of places is such a relief.

I started on Ibis in 2015, and initially, around that time, there was another project that was being developed at Anaconda, formerly Continuum Analytics, called Blaze. It was a similar expression system trying to tie together NumPy and Pandas-like operations, as well as potentially... This is the same era where Dask was created by Matthew Rocklin. I was working with a lot of SQL engines, and so Ibis was essentially an effort to try to reconcile the relational algebra of SQL databases with DataFrame operations.

So this is where I say the vibe is similar to Narwhals, but the objective, the end goal, is different. Rather than being a portability layer that enables libraries to write code once and target different DataFrame libraries, Ibis' goal was to have a DataFrame-like API that could express all of the concepts that are found in modern SQL. So if you're using Postgres, it has these fairly normal concepts from relational algebra and databases that don't really exist in Pandas, for example, ideas like correlated and uncorrelated subqueries, anti-joins, and semi-joins. Things that do exist in dplyr, for the record, but that were normal in the database world, where not that much database technology had trickled through to the Python ecosystem.

So it's interesting, but it's nice that these projects have influenced each other and have... Again, that's the magic of open source that people have good ideas and every project should curate everybody's best ideas and use them to be better and to make improvements and not be stuck in the way that things are and be willing to give credit where credit's due. And so I'm very excited about where things are and the general trajectory of things. So compared with 10 years ago, it feels like a world of difference. So I have to give credit to everyone and all the work that they've done to get to where we are today.

Marco's origin story and motivation

Yeah, totally. Yeah, sure. So how did I get interested in this space? What made me want to spend a big chunk of my time and now, fortunately, some part of my job tackling this problem? Perhaps just quickly before I answer this, since Ibis has come up a few times, just to clarify, it's not in competition with Narwhals at all. In fact, Narwhals supports Ibis as an input. So you can choose to use Ibis with a Narwhals front end if you wish to.

What got me interested in the space originally: I was a huge fan of Polars. I really liked using Polars. I really liked contributing to Polars. But one of the most frequent complaints I heard from prospective users was that when they went to use their favorite data science tools, they either had to convert everything to Pandas, or the tool would claim to support Polars but under the hood was just converting to Pandas. There are some cases where that's okay and you really don't see the effect of it, but there are cases where the performance penalty can be quite large, like converting a string column to object dtype in Pandas can be quite expensive. And there are some data types that don't exactly match.

And perhaps the one I felt most strongly about in terms of missed opportunities was that Polars has both a lazy and an eager API, whereas Pandas is purely eager. So having to convert everything to Pandas, when some operations could in principle have been done lazily, really felt like a missed opportunity. It felt like it should have been possible to do much better. And I could see that some libraries were doing what you were doing in Great Tables, Michael, where you had your own hand-rolled compatibility layer. Scikit-learn was doing something similar. hvPlot had its compatibility layers for cuDF and Dask and others.

There were some other libraries doing similar things. So I just figured, well, let's try to make something reusable. And we needed to choose a common API. Originally, I thought about using the Pandas API, but I found it a bit frustrating to try to transpile Pandas to Polars, whereas doing it the other way around just worked beautifully. The Polars API is stricter than Pandas'. And arguably, when it comes to transpilation, it's easier to start from users writing something in a stricter API and then transpile that to the less strict one. Also, Pandas has the concept of an index that Polars doesn't have. So if the user never has to write index code directly, that's one less thing to worry about.

So I just tried this. And you can see in the early commit messages in Narwhals, before I released it, when I was just writing it by myself as a little one-person experiment, the commit messages start saying things like, ooh, it works! Getting there! Three exclamation marks. I was really excited at first. At some point, I published it online and started requiring pull requests and reviews and all of that. But for better or for worse, those early commit messages that I was just writing to myself are still there in the project history. And it's actually quite nice to see the excitement I was feeling at the beginning, when I realized that translating Polars syntax to Pandas was fairly satisfying.

So really, that was the original motivation, just trying to solve this ecosystem problem where we had all of these really nice data frames out there, but frustratingly, the data science ecosystem was locked into Pandas. The first time I presented Narwhals, after having shown it on a call to Michael, was at PyCon Lithuania in 2023. And I ended the presentation by saying that my hope for the next year was that we would see the data science ecosystem become less locked into Pandas, and that maybe we would also see Duolingo add a Lithuanian course. And only one of these things happened. I think there's still no way to learn Lithuanian on Duolingo, but hopefully they'll address that as well.

Building the Narwhals community

I'm super curious to hear about, to go back to Wes's question, how you got into some of this. And also, I know Narwhals has this really active Discord community. I'd be really curious to hear some of the outreach you did. How did you get involved, and what did you do early on to build out the Narwhals community? Because it's a happening place. It's a pretty sweet crew you've got.

Sure, yeah. So you're asking how I got started with the project. I've covered the motivation, which was about not having data science tools all locked into Pandas, so let me cover how it physically started. I just put it out there. I remember sending it to some newsletters, and to my surprise, people started checking out the repository, giving it stars. And I got a bit of a signal that, okay, maybe it can be useful. Maybe people like the idea. And the original name for the project was Polars API Compat. The idea was that it would just make other APIs compatible with the Polars API. And then I figured, wait a second, I really want this to become popular. That's not going to work in the present day and age, is it? We need an entertaining name. People want to be entertained. And all popular data science projects seem to be named after animals.

So I looked on the Wikipedia list of animals for names that had not yet been taken, and I found Puffin. Seemed like the perfect name, but unfortunately, there was already a Python package called Puffin. And then I found Narwhals, which immediately reminded me of that viral Mr Weebl song about narwhals. And I thought, I can't call it this. People will just think of the meme. But actually, that ended up being its biggest strength. You see the project, you already feel entertained, and you already get a bit of a sense of fun.

And that's something we've tried to keep throughout the contributing process. When people submit pull requests, I don't just approve them. I give them a little narwhal gif. And I love how other people who I then added as maintainers also took up the practice and give each other celebratory narwhal gifs on their successful pull request reviews.

It's honestly really sweet. Yeah, super sweet to see gifs flying, I feel like, in a repo.

Yeah. We were thinking about making an official list of Narwhals-approved gifs to use in pull request reviews. On the rest of the community, I feel like I need to give a shout-out to Inessa Pawson, who is from OpenTeams, which is like a sister company of Quansight Labs. She was one of the first people to really believe in the Narwhals project, and she encouraged me to start a community call and a Discord, a way for contributors to interact in a low-pressure environment. Because if you've got a GitHub page, in theory, anyone can communicate with the project. In practice, a lot of people don't really feel comfortable just asking a casual question by opening an issue. They feel like it's a very serious place, and that GitHub issues aren't really an appropriate place to banter or to just have a casual chat about what you're doing with the project. Nor is it an appropriate place to showcase things that you might have built using the project. But if you've got a Discord with several channels, then people are more than welcome to do that. So I really need to give her credit for having encouraged me to do this early on. I did not think that there would be any interest. I thought it would just be a dead Discord server, but no. Contributors started hanging out there, helping each other out, sharing things.

The best thing that's happened, as far as I'm concerned, is the study group. So I think weekly or bi-weekly, some regular contributors meet for what they call the Narwhals study group, in which they just share learnings that they've made while contributing to Narwhals and they help each other out with making sense of lots of topics. They've had a lot of discussions around the type system in Narwhals. In Narwhals, we take typing very seriously. There's a lot of typing shenanigans going on, but if you want compatibility between APIs, then I think that's a fairly important thing to do.

The group really has gone beyond Narwhals in the sense that it usually starts off with some topic that they might have encountered while contributing to Narwhals. For example, what's a protocol? What does it mean for a type var to be covariant? And then they explore this deeply, try to understand this, try to make analogies.
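For anyone curious, here's a minimal sketch of those two study-group topics; the class names are invented for illustration. A Protocol is structural typing (any class with the matching methods conforms, no inheritance needed), and a covariant TypeVar means a `Box[Dog]` may be used where a `Box[Animal]` is expected, which is safe because the box only produces its contents.

```python
from typing import Generic, Protocol, TypeVar


class SupportsToNative(Protocol):
    """Anything with a to_native() method conforms -- no subclassing needed."""

    def to_native(self) -> object: ...


T_co = TypeVar("T_co", covariant=True)


class Box(Generic[T_co]):
    """Read-only container: only produces T_co, which is why covariance is safe."""

    def __init__(self, item: T_co) -> None:
        self._item = item

    def get(self) -> T_co:
        return self._item


class Wrapper:
    def to_native(self) -> object:
        return {"a": [1, 2]}


w: SupportsToNative = Wrapper()  # type-checks structurally, no inheritance
print(w.to_native())
```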

Proactive outreach and dogfooding

I feel like there's a period of time where you were almost showing up in any place that made any mention of Narwhals. It was like a way to summon you. Is that true?

Yeah, there is some truth to this. It goes back to when I published my first open source project during the initial COVID outbreak. It was called nbQA, a little quality assurance tool for Jupyter Notebooks. At the time, my assumption was that if you published a tool and it had a bug, somebody would open an issue and tell you about it, and then you'd fix the bug, and then it would be better for everyone else. And that's when I realized, no, most people, if they try a tool and find a bug, they uninstall it and move on to something else without even reporting the bug. But GitHub makes it easy to do a site-wide search for mentions of keywords. So just by searching around for mentions of the name, I started seeing issues that people were running into that they weren't even reporting. I just figured, ah, this is so valuable. This is so useful. I can address these things. And so when I started Narwhals, I was already used to proactively looking for issues that people might be encountering, rather than relying on them to report things.

So yes, I would just do GitHub searches for mentions of Narwhals. Fortunately, Narwhals is a fairly unique name. So most mentions that came up were exactly of this project. And if I saw projects saying, oh, yeah, we should maybe consider looking into this Narwhals project, I was just so keen on getting the ball rolling that I would show them how it could be done, or in some cases, even open a pull request to them showing, hey, here's a proof of concept of how you can Narwhalify your code base. And that was so valuable. You just realize so many things about your tool once you actually start trying to use it to solve a problem. The dogfooding process is just so valuable. For anyone wanting to start a tool, that's probably the best thing you can do to make it usable. Just try using it yourself for an extended period of time.

Social capital and trust in open source

I think people, sometimes people underestimate how important the social part of open source is, as opposed to purely the technical part. Maybe a nice example of why this is so important. The first major library to start using Narwhals was the Altair visualization library. And originally, I would have thought there's no way that they would trust a small project like Narwhals or consider taking it on as a required dependency. But I remember in the GitHub discussion about it, that one of the Altair maintainers had said that he'd seen a talk that I'd done in person at PyCon Germany. He liked the talk I'd given. He liked how it would come across to the audience and all of that. And so, because he already knew me, because I'd already earned a bit of social capital in their world, they felt that they could trust this project that I'd put out. So, talking at conferences, putting yourself out there, generally trying to be a good member of the community, it can really pay off later if you want people to trust something that you've built.

AI tools and the challenge of discovering new projects

Well, one question I have for Marco: I'd be interested in what it's been like, because Narwhals is a newer project, and many people are hearing about it for the first time in the last year or two. I feel like AI is definitely shaping the way that people learn about new open source projects and how they interact with them. We spoke earlier in the podcast about the challenges of LLMs keeping up with the latest API changes in projects, but how people will discover and start using a totally new piece of software is a bit of an open question. I'm concerned that maybe people won't be motivated to use anything new that their LLM assistants don't already know about. You're kind of hopeful that eventually the AI labs will index your project's documentation, figure out how to use the project on their own, and then start offering suggestions to users. But essentially it creates this chicken-and-egg problem: the models are dependent on training data to understand how to best use these projects, but if people won't use the projects and create the training data, then how will the LLMs ever bootstrap to a point where they become experts in how to use your new open source project?

Yeah, that's absolutely been at the top of my mind. When I started the project, I thought it was fairly important to make something that felt familiar to people. That's why I chose the top-level API I did: I decided it would just be a strict subset of the Polars API. So whatever indexing work LLMs have to do to familiarize themselves with the Polars API should automatically transfer to Narwhals. They only need to learn a few extra things, like from_native and to_native, for how to get in and out of the library.

Narwhals plugins and a composable ecosystem

Seeing as composability has been brought up, as well as DataFusion: something that we've just released this week is Narwhals Daft. This is a plugin for Narwhals, and it's the first plugin that we've released. It's intended to be a reference for how to write Narwhals plugins. And what I'm envisioning here is really something like a composable ecosystem. In Narwhals, we have some protocols which define what methods some expression classes, some data frame classes, and some namespaces need to implement. As long as libraries implement those and hook things up correctly using the plugin architecture, they can become Narwhals-compatible for free. So for any code base written using Narwhals, like Plotly, Altair, scikit-lego, et cetera, it means you can pass in a Daft data frame as long as you've got the plugin installed. And it also means that people can make plugins for their own systems. I'm hoping that either we or somebody else can make a DataFusion plugin. Somebody could make a pure Python dictionary plugin. Somebody could make a plugin for Bodo. Lots of other tools are coming out, and it's not really feasible for us to maintain all of them in a library that is meant to stay lightweight. But what we can do is define a plugin mechanism, define some protocols that people can follow, and then hopefully the ecosystem can become composable from people following them.
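The general shape of a protocol-plus-registration plugin mechanism can be sketched in a few lines. This is a hypothetical illustration of the idea, not Narwhals' actual internals; all names here are invented.

```python
from typing import Protocol


class CompliantFrame(Protocol):
    """What the core library requires of any frame a plugin provides."""

    def num_rows(self) -> int: ...


_PLUGINS = []  # (matches, adapt) pairs registered by installed plugins


def register_plugin(matches, adapt):
    _PLUGINS.append((matches, adapt))


def from_native(native) -> CompliantFrame:
    # Dispatch to whichever installed plugin recognizes the input type.
    for matches, adapt in _PLUGINS:
        if matches(native):
            return adapt(native)
    raise TypeError(f"no plugin recognizes {type(native).__name__}")


# A tiny "plugin" for plain dict-of-lists frames:
class DictFrame:
    def __init__(self, data):
        self._data = data

    def num_rows(self) -> int:
        cols = list(self._data.values())
        return len(cols[0]) if cols else 0


register_plugin(lambda obj: isinstance(obj, dict), DictFrame)

print(from_native({"x": [1, 2, 3]}).num_rows())  # 3
```

The core library never imports any backend; each plugin brings its own recognizer and adapter, which is what keeps the core lightweight while letting the ecosystem grow around it.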

Just for people listening, how would you define a plugin? Like what is a plugin?

By this I mean an extra library that you can install, an optional add-on. By default, if you install Narwhals, it'll support Pandas, Polars, PyArrow, Modin, cuDF, Dask, Ibis, SQLFrame, PySpark, DuckDB. Hope I've not forgotten anyone. It'll support those inputs. And if you install a plugin, like the Narwhals Daft plugin, it'll also accept Daft. When we have the Narwhals DataFusion plugin, if you install that, it'll also support DataFusion, and so on. So I think like this, we should be able to really empower the community to make their own little plugins that compose with everything else.

The value of boring problems

Overall, I'm really excited about people solving quote-unquote boring problems, be it enabling tools to compose better with each other, or static typing in Python, which just keeps getting better with each new Python release. When it first started to come out, people were often skeptical and feelings were mixed, but with recent Python releases it's getting really solid, much better, so I'm very excited about that. I also like the advent of tools like DuckDB, which even its creators describe as a quote-unquote boring project. And I hope we can have more of this. Even, as Wes was talking about, file formats. I think there's just so much we can solve to make the experience of working with technology more efficient and more pleasurable.

So there's one piece of advice somebody once gave me. Unfortunately, I can't remember who, but it was that if there's something you find interesting that other people find boring, then you've just found your competitive advantage. So to anyone listening, if you happen to stumble on a problem that you find interesting and other people just don't seem very excited about it, well, maybe don't give up on it. Maybe try to go really deep on it and you might just discover something that you can make an impact on and that can be very rewarding.

If there's something you find interesting that other people find boring, then you've just found your competitive advantage.

Celtic folk and closing thoughts

Yeah, that's incredible. I feel like that's such helpful advice. Maybe one thing I want to close out on, which I feel we can't leave out, is a personal tidbit: I think you mentioned that for fun, as a Marco fact, you like to jam out in little Celtic folk sessions. Could you take us on a tour of that, Marco?

Sure. Well, in fact, just after this call, I'm going to head to the Irish pub in Cardiff; there's an Irish-themed session this evening. So we're going to play a mix of tunes, some Irish traditional music, some pub crowd-pleasers. And yeah, that's my favourite thing to do outside of work, just learning songs, mostly on guitar, which I've played for the last 20 years or so. I think it's a good way to disconnect from work and escape into a different world. And it's a very fun way to meet people who you otherwise wouldn't get the chance to meet through your work.

Oh yeah, yeah. So at Quansight, the company I work for, there are quite a few people who play instruments. The company founder, Travis Oliphant, is an incredible singer. So when we have our meetups, we always make sure to reserve some time in the agenda for jam sessions, and for me, those are the highlights of the trips. I know a lot of other people feel similarly. Music really has this power to bring people together and connect them.

Usually I'm just so focused and excited by the music that I'm too busy playing it. I'm also not particularly good at dancing, so I wouldn't be the person to lead that. Well, Marco, I feel so honored to have you, and to have Wes, on a podcast. I feel like life with data frames and analyzing data in Python has been made so much better by you focusing on this problem of how to connect things and how to interchange. And I do think your emphasis on boring problems pays off: we're all better off today because you're thinking about how every data frame thing can talk to every other data frame thing. So, yeah, so excited to have you on today and to hear a bit about your process and the things you're working on.

It's been an absolute pleasure. Thank you for being encouraging about the project when I first showed it to you. I think it makes such a difference when the first person you show something to is encouraging. And for anyone listening, if people show you something, maybe it doesn't hurt to say something positive and encourage them to take it further.

Yeah. Yeah, that's huge. Well, thanks. Yeah. Thanks so much for coming on and really excited to see what you do with Narwhals over the next year. Thanks. It's been an absolute blast. And now, if you'll excuse me, I need to go and tune up. This is it. Celtic folk. Let's go. Thank you, Marco. All right. Thanks. Thank you.

The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with creative studio AGI. For more episodes, visit thetestset.co or find us on your favorite podcast platform.