Alenka Frim: What yoga teaches us about discipline and collaboration in data science

video
Feb 25, 2026
1:04:03

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the test set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we jam with Alenka Frim, mathematician-turned-yogi-turned-all-star-Apache-Arrow-committer, who notes that open source and yoga's revolved side-angle pose have more in common than you think.

Hey, everybody. Welcome to the test set. We're joined today by Alenka Frim, who's a data engineer and software developer and a committer to Apache Arrow and on the Project Management Committee, or PMC, and an Iyengar yoga teacher. And I'm joined by my two co-hosts, Hadley Wickham, who's chief scientist at Posit, and Wes McKinney, who's principal architect at Posit.

And we're so excited to talk today. I feel like maybe one theme in hearing and in reading a little bit about you, Alenka, is it seems like the theme of sort of how do open-source systems stay healthy? And sometimes they're big systems like Arrow, which sort of undergird a lot of data work and tools that people use. So thanks so much for coming on the test set.

Signs of a healthy open-source community

Yeah, I thought maybe we could start today with just a little icebreaker, which might be, maybe we could just go around. And I'd be curious to hear from people: what's a sign, to each of you, of a healthy open-source community, would you say?

I'm happy to go first. You know, I'm very interested in Alenka's take, especially since, I mean, I've also been very involved in Arrow since the beginning, but I've been less actively involved in the last five years. And so I think Alenka's understanding and knowledge of the Arrow community is actually really interesting for this conversation, because I have a lot of experience with, like, Arrow the first five years, and Alenka, Arrow the last five years. And so together we can really speak accurately to what a whole 10 years of the Arrow project has looked like.

But for me, like, I think the big signs are whether the community is adding people, like people are coming and staying. And also, are members of the community comfortable raising concerns about what's not working and having constructive dialogue about how to fix it without blaming or pointing fingers?

Yeah, I would definitely agree. I think what I see on Arrow, which I feel is a sign of a healthy community, is the way we interact on GitHub, which is the main point of interaction, I would say, not only with PRs, but there are discussions, there are issues. Together with the mailing lists, one for developers and one for users, I feel there's a lot of motion, a lot of people raising questions, giving PRs, even reviews from a broader community, not just people you know. And I think that really shows, and the communication is really so respectful.

I'm not sure about, you know, open source in general, but my experience with Arrow is really, really positive in that sense. So I think that shows of health. I'm not sure how much Apache here influences. I think it has a strong like it stands on a strong foundation because it's under the Apache umbrella, I would say, because there's already a structure that people know and they follow it. You know what, how to communicate, what's expected, what's not.

So I think that's really helpful. The thing that like first came to my mind, like kind of building on what Wes says, but the first thing that came to my mind was like turnover, not just like adding new people. But I think it's also really healthy to have like people leave the community because that's like the sign that you've built something that's like more than just like one or two people. And I do think like one of the things we don't kind of talk about enough in general is like like kind of death as the part of like the life cycle, like not, you know, not death as in like the ending of things, but like people, you know, moving on to new things, changing roles, putting down things that they've had for that they've looked after over a long time. I think that's like super, super important.

What is Apache Arrow?

Yeah, it's so interesting that that aspect of people leaving. And I know, Wes, you mentioned like you were with Arrow for like you have some perspective on the first five years and Alenka came in for the last five years. Maybe to kick things off, Alenka, I'd be really curious to hear maybe a little bit about like what is Arrow and how did you get involved with it?

Well, Arrow in the core, the way I see it, the way I understand it now is a format that specifies an optimal way of saving data in memory so it can be analyzed the best, like the optimal way of analyzing versus having row by row as it's in conventional databases. But then this is just kind of an idea. I mean, the format, but then you have to build on it. So it's useful. And then there's so many languages connected to it. So it's not just a format. It's like a specification for languages, for applications. I mean, it's so, so broad.

And then every implementation, every language has kind of a structure. So you can actually use data in memory or data on disk, you know, with this kind of format. So it's a really broad thing. I hope I kind of helped understand this because it's not something people use. It's something that, you know, it's already there used by applications. So they're faster and more optimal, but in the essence is just making things easier for a computer to, you know, calculate and analyze, you know, making it faster and more efficient.
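To make the row-by-row versus columnar contrast concrete, here is a minimal sketch in plain Python. It does not use Arrow itself: the table, the column names, and the values are made up for illustration, and the standard-library `array` type only stands in for the kind of contiguous, typed buffer a columnar format is built around.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
# The city names and temperatures are made-up sample values.
rows = [
    {"city": "Ljubljana", "temp_c": 21.5},
    {"city": "Paris", "temp_c": 18.0},
    {"city": "Nashville", "temp_c": 24.5},
]

# Column-oriented: each field is stored as its own contiguous sequence.
# array("d") holds the numbers as packed 64-bit floats, not Python objects.
columns = {
    "city": [r["city"] for r in rows],
    "temp_c": array("d", (r["temp_c"] for r in rows)),
}


def mean_temp_from_rows(records):
    """Visit every record and pick one field out of each."""
    return sum(r["temp_c"] for r in records) / len(records)


def mean_temp_from_columns(cols):
    """Scan a single contiguous column and ignore the other fields."""
    temps = cols["temp_c"]
    return sum(temps) / len(temps)
```

Both functions return the same mean. The difference is only in how much unrelated data an analytical scan has to step over, which is the property the columnar layout is designed to improve.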

It sounds like you're saying like a lot of people, even if they don't realize it, are probably using Arrow when they're doing like data analysis or maybe like querying a database or loading a file of data.

The project has quietly crept into all corners of the data world. I actually just read a blog post on a data engineering Substack titled "Apache Arrow is eating the world," basically channeling, you know, Marc Andreessen's famous "software is eating the world." And it's kind of true. Like, I think, you know, if you look at the data ecosystem, increasingly, you look at the Venn diagram of projects that support or use Arrow in some fashion and the ones that don't. And the bubble that supports Arrow or uses it in some fashion has continued to grow. And the other bubble has shrunk as time has gone on.

And so which has been, you know, I think very interesting to watch, but also I think it speaks to the credibility that the community has built. And part of the credibility of an open source project like Arrow that you want everybody, all these hundreds of projects in open source and commercial projects to depend on is that they have to believe in the Arrow community to keep being healthy, to keep delivering good software because it has become this piece of critical infrastructure.

And, you know, I think people like Alenka have been responsible for, you know, building and developing the community to create that credibility that has led to the project's success. I admit that I've forgotten the details of how, you know, how we came to be connected and how you got involved in the project's beginning. But I think all that is, you know, very interesting to me because you started out as a contributor and have grown into, you know, a leading member of the community. And I think that's really interesting, you know, arc and your kind of your growth as a, you know, open source community leader. Can you paint the picture for us? Like, what were you up to when you started contributing to Arrow?

Alenka's path into open source

Well, the interesting thing is that I wasn't up to anything at that time. I was solely teaching yoga for five years and doing nothing with computers. So, yeah, before that I was working in the central bank, you know, in a statistical department, just helping with anything they needed. And I graduated and I'm a mathematician. So when the pandemic hit, I had, of course, fewer classes. They were mostly online. I had a bit more time and I realized I missed something. So I started doing some research. What could be a nice way for me to get back into it? I didn't even know what space, like is it data science, data analytics, data engineering? I had no idea, but I wanted to try to see if, you know, it still intrigued me.

So when I started contributing to Arrow, I had no idea that in the meantime dplyr had happened, because I was using R in the central bank. You know, pandas was big, but I hadn't really used it before. So I was really kind of new. I had a friend that was working on Arrow at that time. So we were brainstorming what I could try. And open source sounded like a great place because I wasn't looking for a job. I was looking for an opportunity to try things out and to learn and to have fun.

So I'm not sure why I picked Arrow. We talked about different things. I think I like hard problems. So I was really interested in if I can, you know, figure things out. The bad thing that I — I mean, the bad thing, the thing I struggled with a bit was that I wasn't a user. So I wasn't using Arrow in any sense. So it's kind of harder to understand it then because it's it will be much easier. But at the same time, I think the perspective I got from just trying to contribute was a different one. And it's also it has other positive things.

So, yeah, when I started, I did some contributions to R because that was the language that was that I was most comfortable with. And I was just lucky that right after my fourth or fifth contribution, we had a nice communication, nice collaboration with the R maintainers. There was a position open because there were grants. I think Ursa Labs got a grant from CZI. So it was just I think I was just at the right place at the right time, really.

Just to give a little context to people who might not know about like what an R contribution to Arrow is, like what is a R contribution to Arrow?

It can be different things. What I did was something that I thought would be the easiest, which is a binding. What that means is that the R package, the R arrow package, is connected to the C++ implementation. So you have a C++ code base that defines the format, the zero copying. It defines how to read Parquet files, how to save them, all of that in C++. And then you don't have to do all of this magic again in R, you just kind of connect it. Right. So there are different layers and you have to just figure out what are the pieces of the puzzle that you need. So my strategy was just to look at, you know, PRs that were done before on a similar thing and just try to figure out what's needed.
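The idea of a binding, a thin high-level function that hands the real work to compiled code, can be sketched with Python's standard `ctypes` module. This is only an analogy and not how the Arrow R package is wired up: it calls `strlen` from the C standard library, assumes a POSIX system, and `byte_length` is a hypothetical wrapper name chosen for illustration.

```python
import ctypes
import ctypes.util

# Load the C standard library. On POSIX, if no library name is found,
# CDLL(None) falls back to symbols already loaded into this process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature once so ctypes converts arguments correctly:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def byte_length(text: str) -> int:
    """Hypothetical binding: a friendly wrapper around the compiled strlen."""
    return libc.strlen(text.encode("utf-8"))
```

Calling `byte_length("arrow")` runs the C implementation rather than re-implementing it in the high-level language, which is the same division of labor described above for R and C++.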

You kind of mentioned earlier you weren't a user of Arrow, which they kind of made me wonder, like, who do you think is a user of Arrow? Like, I think of like Arrow is mostly this kind of like invisible technology, right? Like it succeeds because things just work and you don't, like ideally, I think you don't know that Arrow is being involved. It's just like, oh, I can use my data from wherever I'm working and wherever it lives. Like, who do you think of as the user, the users of Arrow?

It also depends on the implementation, I think. But mostly for all, most of the Arrow implementations, the users are actually developers that build on, for example, you know, Query Engine uses this format and you just kind of, you know, build things on. Then you have Pandas, for example, trying to use PyArrow, so Arrow part of the kernels to do some of the things faster, better. So, yeah, it's not an end user, I would say. Maybe in R package, that's a bit more usual to have an end user using it.

But mostly, I mean, even in PyArrow, it's really interesting that a lot of PRs, a lot of pull requests, a lot of contributions are done by developers that have so much knowledge and they already know exactly what they need. They just, you know, make a PR and you're like, OK, I need a week to understand. So, yeah, these are experienced developers of applications that use, that it doesn't necessarily the application will be for the end user, but it can be. Sometimes it's already, it's also an application for an application, if that makes sense.

So, yeah. I do remember one of the first meetings I had with Arrow project. I still remember what they said and they were saying that we are doing a good job if nobody knows about Arrow. Right, so it totally makes sense, but I still feel that people should know about Arrow. If not for general knowledge, for support, because, you know, if you don't know about Arrow, then how will people get support for maintenance and, you know, working on that?

I mean, the way I've looked at it is I've told end users to, they don't need to know necessarily details of how to use Arrow or how to implement it, but rather when they're evaluating their other technology choices, that's something they should ask as part of their due diligence checklist is, does this system support Arrow and or if it doesn't, is Arrow support on the roadmap? Because essentially that shows that a project is looking at the landscape of the open source ecosystem through like a progressive lens of like, this is something that it need not, it's not something that constrains systems necessarily, but it makes them more interoperable and more and more efficient. And so if there's a downstream project that doesn't see that kind of interoperability or that kind of performance improvement that Arrow provides, that is, you know, for me, that's like a bit of a strike against the project.

That's not necessarily like fully disqualifying, but definitely like, you know, it would make me second guess like a choice of a critical piece of data technology. It is an interesting challenge, though, like when you're, when an open source product like primarily exists to kind of remove pain, like people, like as soon as it kind of like percolate, like people kind of forget, oh, this is actually really annoying and frustrating to do before. And that just becomes normalized so quickly. And then like, how do you tell that story? Like, OK, you should support this project because it's actually making your life much like, like people don't really appreciate that story.

That's like when I was a, when I used to walk to school, it was, you know, it snowed every day and I had to walk uphill both ways. It's an interesting challenge.

I'm not sure if necessarily developers or users need to know that or need to feel that. I think it's more on the managers and more on the leaders of this, you know, applications or companies. I think it's more on them to know the how serious this is.

Arrow, DuckDB, and the ecosystem

Yeah. And I guess maybe to make it concrete, too, for folks, there's that projects like like Polars uses Arrow and I think Pandas has Arrow support. And I think DuckDB as well is is built on or largely as far as I understand, they are influenced. So they use the same format structure, but it's not something they they don't depend on Arrow per se.

Yeah, DuckDB's internal data representation that it uses for all of its query processing is very Arrow-like, and actually, early on, we started working with Hannes and Mark, prior to even the existence of DuckDB Labs and the DuckDB Foundation, to essentially figure out how to better align DuckDB with the Arrow ecosystem. And they did make changes in DuckDB to make the interface between Arrow and DuckDB work better, but they did it in such a way that it didn't sacrifice performance, you know, when only using DuckDB. I think they had the requirement. They're like, well, we don't want to make DuckDB slower in order to reduce the impedance mismatch with Arrow. And so they were able to find a path where, you know, you can think of it as like a bridge between Arrow and DuckDB. It's not a totally free bridge, but, you know, they wanted the toll on the bridge to be as low as possible. And so the toll is pretty low. You can, you know, use DuckDB as an engine to process streams of Arrow data. And it's, you know, one of the most effective ways to build an application nowadays.

It may be it's easier for folks to understand these differences in if you think of Arrow specification, which is same or shared between, you know, Polars, DuckDB and Arrow project. So the idea is the same of how you store data and then just implementations themselves. You know, Arrow provides an implementation you can use or DuckDB has its own or, you know, Polars has its own. It's a Rust implementation, but because it's all the same spec, it's kind of — some people just say it's built on Arrow, but it's just compatible.
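The point about one shared specification with several independent implementations can be illustrated with a small sketch. It is heavily simplified and is not Arrow's actual API: it keeps only two ideas from the Arrow columnar layout for a nullable 32-bit integer column (a validity bitmap with the least-significant bit first, and a little-endian values buffer) and leaves out padding, alignment, and metadata. The function names are hypothetical.

```python
import struct


def encode_int32_column(values):
    """'Producer': pack optional ints into a validity bitmap and a values buffer."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # bit set means the slot is valid
    # Null slots still occupy space in the values buffer; store 0 there.
    data = struct.pack(f"<{len(values)}i", *(0 if v is None else v for v in values))
    return bytes(bitmap), data


def decode_int32_column(bitmap, data, length):
    """'Consumer': written separately, relying only on the agreed layout."""
    raw = struct.unpack(f"<{length}i", data)
    return [raw[i] if (bitmap[i // 8] >> (i % 8)) & 1 else None for i in range(length)]
```

The two functions share no code, only the layout they agree on, so bytes written by one can be read by the other. That is the sense in which separate engines can be compatible with a format without depending on the same library.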

How do you think about that? Like, it's always difficult to like specifications are hard to like narrow, like get every single like everyone agreeing on all these things that you don't like, think about when you're writing it. Like, like, how is that kind of specification and implementations of Arrow like co-evolved over time?

I think that's a good question for Wes. What I can say from how I'm observing it is you need to have a really good idea of how to make it in a way that people will be using it, you know, so it's kind of user friendly in a way. So they're like, OK, yes, and we can use that. It makes sense. And then also, I think what I see when working with Arrow is especially in PyArrow, because then separately, Python has a lot of specifications for interchange. You know, you have DLPack, you have data interchange. I mean, there's a lot of them. So what's the difference? I think it's just when there's a need, something arises. And then with Arrow, I think the way the community feels is that we're building something solid. And then if people need something different, that's fine. But kind of the main thing is solid and we're going to work on it. And then more people adopt it, and then it just kind of grows by itself. It's like spontaneous.

Managing 4,000 open issues

I feel like we've hit so many interesting things, which is like Arrow is largely kind of like behind the scenes, driving a lot of stuff. It's hard for that to surface, like people might enjoy Arrow without even knowing it. And there's the question of like, how do you kind of support that kind of stuff? And then there's it's interesting to hear even like DuckDB, there's like the spec and people who might not just use Arrow, but maybe it might like rhyme with Arrow or they might keep Arrow in mind. And how do you kind of like write a good specification?

I one thing I thought I'd flag, too, is that I saw in a talk you gave before that Arrow has like thousands of issues open. Like there's tons of issues.

We came under 4000 now with Nic Crane's help.

Are you using some AI tools to help with like grouping and, you know, summarization, linking, categorization and things like that?

I haven't dared to use AI yet. We're going through issues and, like, helping with backlog, you know, curation. One thing Nic and I did was create a dashboard which surfaces, for example, issues that were not answered, issues that were opened by a new contributor, because those are things you want to kind of jump on first, same for PRs, etc. So that's what we use, which I think is useful, because otherwise what I tend to do is just answer things where I'm pinged, because there's so much, you know, stuff going on.
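A dashboard like the one described boils down to a couple of filters over the issue list. The sketch below is only a guess at the shape of that logic, not the actual Arrow dashboard: the `Issue` fields, the function names, and the priority order are all assumptions made for illustration, and a real version would pull these fields from the issue tracker.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # All field names here are hypothetical.
    number: int
    author: str
    comment_count: int
    author_prior_contributions: int


def needs_first_response(issue: Issue) -> bool:
    """An issue nobody has answered yet."""
    return issue.comment_count == 0


def from_new_contributor(issue: Issue) -> bool:
    """An issue opened by someone with no earlier contributions."""
    return issue.author_prior_contributions == 0


def triage(issues):
    """Return the issues worth jumping on first, newcomers at the top."""
    flagged = [i for i in issues if needs_first_response(i) or from_new_contributor(i)]
    return sorted(
        flagged,
        key=lambda i: (not from_new_contributor(i), not needs_first_response(i), i.number),
    )
```

Sorting on a tuple of booleans keeps the ranking rule in one place: unanswered issues from newcomers come first, then other newcomer issues, then unanswered issues from regulars.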

So I mean, here's like a — this is I don't really mean this, but like like how can I trust a project that has like 4000 open issues? Like how do you think how do you think about that?

Like it's the closed issues, Hadley. Look at how many closed issues there are. Yeah, it's probably like 20000 closed issues.

I mean, like what how do you think? Because I think I've like evolved this like pretty unusual way of dealing with issues just because, you know, I think fewer users are comfortable on GitHub. So the number of issues you get for an R package is much, much smaller. So I like this like the kind of inbox zero approach of issues. Like, oh, to me, the ideal number of issues in a repository is like under 25 because that's like one page on GitHub. But that's like so like my I just can't even imagine like how do you think about that? Like when you think about those 4000 issues, like you just don't care that there's all these open issues. Like what like where's the like do you see them as being valuable or useful? Like what's like what's the point?

Like I see the point of an issue is to be closed, I guess, but that's clearly not.

If you look at the issues as bug reports, then yes. But there's a lot of feature requests in Arrow. There's a lot of documentation, you know, issues. There's user questions. I mean, it would be really nice to have zero. But I think it's important to understand the scope of Arrow. It's just it's not one project that we normally think of. It's so broad.

Just maybe to put it in perspective, in the repo, we have multiple languages. And every language is for a couple of projects, right? It's, you know, in Python, it's just one part. And you have C++, you have R, you have, I'm not really sure how many languages are still there, because some of them are separate to make it easier. And then per language, you have so many parts of the implementation. You have, for example, in Python, in PyArrow, you have a part of the code base that deals with pandas compatibility. So you have issues for that. Then you have issues for file systems. And it's just such a broad thing that we're not stressing about it. We would like to have them all, not all closed, because there's always going to be something that will need to be done in Arrow, I think. Because it's an evolving thing. It's not a project that's just, you know, there and it will be used. It's more like it's developing. There are new data types, you know, being included. There are some new things like Parquet. You know, you have a specific code part where you deal with Parquet reading and writing. So Parquet is evolving. And, you know, you have to match that. So I think, yeah, you have to have a different mindset for the Arrow project versus other packages.

I mean, the way I look at it is every like each language implementation of Arrow is aside from the ones that are a little more like wrapper layers like R and Python. But you can think of them as almost being like little operating system kernels for low level, low level data operations. And so, like, I mean, it's not exactly, you know, as grandiose as like, oh, a Linux kernel for data like that. That would be a silly way to market it. But it's kind of true in a way.

And I think like the set of like that toolbox is like that. I think we always envision the project is like building on the specification. But like the specification alone is only so useful without like the toolbox and the framework of things that you need to build real world systems. And as time has gone on, I think that that toolbox has grown and sometimes has expanded in ways where we've been like, oh, maybe that was a mistake to like take on that piece. And like we've pruned, I think the project has pruned things that were like not getting maintained or that there just wasn't that much community interest in continuing to keep around or maybe spinning off into a side project.

But it's been a little bit of like, I think, Neal, we had an Arrow summit in Paris at PyData Paris. And I think Neal Richardson described Arrow as a bit of a radical social experiment. And so I do feel that way. And some of that, like the anarchy of the sprawling, federated nature of the project, you know, certainly makes things more complicated, but also, I think, has helped keep the community together and keep it thriving and productive.

Getting started as a contributor

I mean, it seems like you've really latched onto Arrow as a contributor and done so much for the ecosystem and in open source. I'm really curious, like, what do you think are the ingredients to really get started with something like that? Like how does someone get a foot into contributing and succeed at doing that kind of thing?

I think to get started, you have to — I don't know what would be the motivation. I think everybody has a different reason. But I think more I think of it, the more I talk to people, it's mainly experience or to grow or to learn or to just, you know, interact with, you know, a new community. I think I think mainly that is that.

I mean, there's for sure a lot of contributors, especially in Arrow, where, you know, they use the tool and they need some feature or some bug fixed, which is so awesome in open source that you can actually have the possibility to do that. It's not like I'm waiting for somebody to fix my problem. It's like you have the option to do that and not just for you, but for somebody else, for sure.

So how to succeed, I think. Well, I'm not sure if I would succeed without people around me that helped. I mean, I did have to do the work and do the research and just stick with it. But just building Arrow from source locally, it's not an easy thing to do. And if I wouldn't have somebody helping me, I — it's quite possible that I wouldn't continue.

So having this group and just having, you know, people encourage you in — I think it was my third or second contribution that I was already being pinged like, oh, are you interested in doing this, too? And this means so much because you're like, oh, my God, they noticed me. It's in a project like this. It's amazing feeling. So small things like that when somebody would think it's not — it's maybe even rude to ping people. It's not. It's like you show somebody that they're worth, you know. To collaborate with and to put some trust in them.

Imposter syndrome in open source

Yeah, I could see, like, if you get pinged, it really shows, like, just realizing that people appreciated it and noticed it really makes it nice. And related to that, I saw you gave a keynote in Paris. Yeah, which is awesome. Yeah, it was fun. And I saw one thing you talked about, too, was imposter syndrome, like how to sort of approach that as a person in open source. I'm curious about your thoughts. Yeah, what would you say is kind of the takeaway for people experiencing imposter syndrome and looking to contribute in open source?

Yeah, one thing worth mentioning maybe is that you're not the only one feeling imposter syndrome. I think a lot of people feel that. When I talk to people after the keynote, when I talk to people I work with, who I feel are the most brilliant people I've ever met, they are saying, I feel I don't understand, I feel I'm not capable.

I would say a lot of people feel that. And how to overcome it? Well, again, maybe I'm repeating myself, but having somebody you could talk to. I talked to Raul a lot. He's a friend from PyArrow, a contributor and a PMC member. He's involved in the project a lot. I'm talking to Nic on a regular basis, also a contributor, and we see that it's a common thing, which maybe it shouldn't be. But I guess I don't know if it's open source related or just technology related, or whether it happens in general with people.

Because you're you're working with a lot of folks, you see what they do. You see it's there is something amazing and you always kind of want to match. But I think the real thing you need to do is find something you like and you feel comfortable and you feel motivated to do and you feel, you know, you have fun. And then after a while, you have to you have to look back at what all of the things you did, not just like looking and comparing to others, but saying to yourself and maybe having a friend so you can talk to and they will say, look, this is amazing what you did.

That's also something I experience in Arrow is people stop, they talk to you and they say, Alenka, this is amazing work you did. And it really kind of all the bad emotions you feel, they tend to go away.

Yeah, I don't know. I was really intrigued by Dr. Cat Hicks. She was giving a keynote at the Posit conference. So that was a lot of the research I did for my keynote. I wasn't talking a lot about imposter syndrome. I was just mentioning it and mentioning that we need to talk about it and do some research because I think it's really important for us to talk about that. And, yeah, Dr. Cat Hicks has a lot of good ideas on how to tackle that.

So one thing I want to mention before I stop with this part is that in the Arrow community now we're talking about it. I'm still reading about what they're researching, and we're talking about what we could do as a process in the community that would help kind of remove these feelings. So one thing we are trying to do now is to make a blog post highlighting contributions, you know, saying, OK, this was a new contributor and they did amazing, or this is such an amazing job done by somebody. I think that would help.

Yeah, one of the things we've like in the when we do the release announcement for tidyverse packages, we do just like a roundup, like, you know, very kind of low effort. But we're at least like, hey, thanks. So like all of these people who have interacted with us on GitHub and even like, you know, we hear every now and then people are like, oh, that's really cool to see my name or like when people have done like pull requests for the R for Data Science book, they get mentioned in the intro like that. It's small and obviously you can do much more. But even like little gestures like that, I think people appreciate being recognized and acknowledged. Very much.

And maybe also, you know, we don't all have to be, you know, Hadley and Wes, you know, it's fine if you're not the core developer. It's fine if you don't feel comfortable in doing something, but you feel comfortable in doing something else. I think we have to realize that we don't have to know everything. And then you start learning. And then if you have fun, you kind of see that you actually know a lot.

Yeah, it seems really neat, like. Acknowledging contributors and also like emphasizing to people like you don't have to be the best, like figure out what about this you like and, you know, like why is this fun or interesting to you?

It is interesting. Like, I think when you teach a workshop or you TA a workshop, often, if you're TAing your first workshop, you feel kind of insecure in your knowledge. But then you realize, when you go and help the people taking the workshop, that they know, like, literally nothing. And it's so reinforcing to you, like, how valuable that is. Yeah, it shows how much you've grown and how much you've learned, actually.

Learning styles and the changing landscape of coding

Yeah, and I kind of experienced this afresh recently. I taught my first yoga class. I subbed for my teacher, and she prepared the class; I just gave it. But one thing that I found fascinating was, I've always wondered, you know, how does a yoga teacher know how to do adjustments? How do they know all this? And obviously some of it is years of experience and knowledge of anatomy and training. But some of it is also that you just look at people from outside their body and you're like, oh my God, you look so wonky and off. And just having that realization, oh, actually, this is not as complicated as I thought it was. It's such an empowering experience to be like, OK, just being outside of a person can give you that fresh perspective. You don't have to be that much more knowledgeable than them to be helpful and useful and to make their journey a little nicer.

That's an interesting thought: if you're the person doing yoga, you might think, how did the teacher know to adjust me? But once you're the teacher, you're like, well, I'm seeing 20 people and it stands out.

Let's see. I'm trying to think, oh, this is our chance to turn this into a podcast solely about yoga.

But I guess one interesting thing we could maybe turn to the group is, I know you mentioned a couple of questions that might be interesting to ask folks here. Maybe this is kind of related to yoga and learning. I think one thing you mentioned is, for learning, do people prefer to learn as they go or to stop and take a course to focus on things? Did I get that question right? Yes, yes, correct.

Yeah, I'd be curious about folks' take on that. When you want to learn something, do you prefer to stop and take a course? Or kind of learn as you go?

I teach a lot of workshops, but I have to say, I hate taking workshops myself because the pace is set by someone else. First of all, you know, now I'm so used to watching videos on YouTube at 1.5x or 2x speed. I'm like, why is this person talking so slowly? This is so inefficient. And then there's that inability to go off on your own little learning tangent. So in some ways, classes feel, I don't know, yeah, I don't enjoy that. But on the other hand, I take yoga classes, and that is something where I'm like, OK, it's also nice just to be told what to do every now and then. Here is a path; I don't have to forge my own path. I'm just going to follow someone else's path. And that is really nice sometimes too.

Yeah, for me, I definitely prefer hands-on learning where possible. And so whenever I dive into something new, I will try to learn just enough to, you know, be able to start doing, and then try to create a virtuous cycle of doing, running into the things that I don't know how to do, trying to figure them out, and then, you know, kind of looping. And over the course of time, I guess that develops mastery.

It turns out that when it comes to programming, that's all being broken apart and destroyed, maybe forever, right now. Which is kind of interesting. Coding used to be the perfect example of the ultimate in hands-on learning: you learn how to do hello world. And then you learn how to assign variables and write a loop and then write a function. And then you learn how to think about, you know, organizing groups of functions and modules, and thinking about object-oriented programming and state management and multithreading and distributed systems and data storage and file systems. And, you know, not to hijack the topic, but yeah, I feel like the whole domain of learning is being turned on its head right now.

And so, yeah, a part of me is like, ah, the quaint old days of reading a book and learning how to do things that way. You know, I'm actually ideating what the fourth edition of Python for Data Analysis looks like. And part of me is like, do people even still read this book? And I look at the book sales, and the book sales have dropped substantially because people aren't buying programming books as much anymore. And you know why. So, you know, I think for book authors or people creating learning materials and tutorials, it's a whole new world. And I feel like the whole domain of pedagogy is going to have to remake itself in order to remain relevant and to keep up with the way that things are changing. Otherwise, essentially, why would people go to a class or read a book when they can sit with their chatbot and self-tutor, you know, at their own pace, and explore topics of their own choosing in whatever order they wish? Because all the chatbots have already ingested all of our books.

And so, I mean, the other interesting connection, I think, is, I guess on the whole I'm still optimistic about software engineering as a career. I think, you know, software engineers are so valuable. And at least right now, it feels like AI is kind of a force multiplier. But we are also at this sort of interesting point where you're like, well, you know, being a yoga teacher, AI is not going to take that over anytime soon. It's just sort of interesting, where you're like, well, it's actually kind of nice to have that in your, you know, back pocket, these things that are so specifically human, where human connection is so important. Those sorts of careers feel a lot, yeah, safer than software engineering right now.

That's true. But I think still, for, you know, really diving into a harder subject that's not documented yet, you still need to have people that are intrigued by discovery and learning. So I don't think it's going to, you know, yeah. But for general programming, I would say, yeah, it's a strange time.

Yeah, certainly developers of, you know, data entry CRUD applications are, I think, the first on the chopping block. I think in a talk that I gave recently, and then maybe in another podcast, I described the Arrow project as being, I felt, an AI-resistant project or AI-resistant technology. And I wanted to see what you thought about that, you know, both in the sense that AI doesn't make Arrow irrelevant, in some ways it makes it even more important, as a grand unifier. But also, you think about 10 years of development on this project. And, you know, I'm using coding agents every day to build little applications to make my life a little more streamlined, a little more productive. And even there you see how difficult it is to guide chatbots to create correct and useful software; it requires so much human-in-the-loop attention and intervention, and guidance and review and feedback.

You know, there's this whole narrative around artificial general intelligence and superintelligence, but personally, I don't buy it, because I'm using these agents every day. And, yeah, the more you're using these things, yes, they're tremendously useful. I'm building more software and am more productive than I ever have been in my life, which is, you know, insane. But at the same time, the human element of judgment feels more essential than ever. And so I'm curious how much you've thought about that and how you're seeing that play out in the Arrow ecosystem.

Yeah, yeah. I mean, you could use agents to help you, you know, when you need to debug something in Arrow; it can be quite complicated to find what's wrong. So there's some help there that I use, especially since Paris, when me and Wes talked, and I was like, oh, Claude sounds really cool. So I'm actually a late user. Pretty cool.

But yeah, you still need to have, you know, agents there for when you get stuck or for repetitive things. For example, there's a lot of documentation in Arrow, especially in Python. And in some places we have IPython code blocks, in others Python code blocks. So it's kind of a mixture, because different people contribute. And this is a perfect thing for an agent because, you know, you just need to unify it, so, do that for me. But yeah, for doing some real work, or even for usage of Arrow, I think you still need to depend on yourself more and just do some digging and research.

I actually started using agents more when I started doing data engineering work that's not related to Arrow. It's totally on the other side, which is, you know, having to ingest a lot of data from different sources and using so many tools I never used before. This is a local telco company that I help. And, you know, I was an R user, a Python user. So when I started to work with them, they were like, OK, which tools do you use? And I'm like, Python. And they're like, OK, we use Kafka, Airflow, Meltano, dbt, Snowflake. I mean, it's so many tools. And there I really, really had, you know, a lot of help from the agents just to see, OK, what's the structure? What's the syntax that this tool uses, or that one? And it really saved me a lot of time.

Yeah, coming back to Wes's point about pedagogy, I think one of the things that's really interesting is that it feels like now, if you know how to program in one language, that skill is so transferable to other languages, because you're not constantly stuck on, oh, how do I write a for loop? But how are people going to get to that? How do you become a competent programmer, a really good programmer, when for maybe the first two or three years of your career a chatbot is going to do everything better than you?

When I was in college and we were learning to program, what I heard a lot was that mathematicians are good programmers, because the way you need to think as a programmer comes easier thanks to the way you have to think in mathematics to solve problems. So, you know, what do you think about this idea that you still need to have the mindset to program well? It's not just practice. You need to think in a different way. So that still needs to be learned, right?

I guess the way I look at it is, I think the field of software engineering is, you know, essentially, maybe like the difference between before the printing press, or before books. I'm trying to come up with the right analogy, right? So plenty of people read books, they read literature, they go to school to study literature, they do analysis of writing, and they can understand what good writing looks like. And there are whole fields of liberal arts scholars that study and scrutinize and learn about, you know, what constitutes good literature.

And so I think the field of software engineering is going to go to a place where, you know, essentially, the agent is doing all of the typing, but you're still doing most of the critical thinking about what the proper shape and structure of the system is, essentially creating the algebraic structure of the layers that you're building. And I do think that new tools will emerge to help visualize and orchestrate the different layers of the system that need to be compartmentalized. I think the problem right now with agents is that they create code bases that are so entangled that the agent starts getting confused about all of the commingled responsibilities and things.

And so I'm hopeful that some new, you know, software engineering practices emerge that enable, to your point, that more mathematical mindset. I was also a mathematician. And so I approached software engineering from a very mathematician-centric viewpoint, thinking about things in terms of stacks of lemmas and theorems and trying to prove things by layering things on top of each other. And I think maybe that drove part of what was so intellectually appealing about Arrow, and maybe, since you're a mathematician, something that appealed to you too is that notion of composability and interoperability and the layering of software almost as a mathematical concept, in a way. And so I found that idea, you know, really intellectually appealing in a sense.

And, you know, not to stereotype, but I do find that a lot of folks who have computer science degrees approach software engineering from a more pragmatic mindset and less from a theoretical or design perspective. That's not to say it holds in general, but if you took a room full of 100 mathematicians turned software engineers and 100 computer science graduates turned software engineers, I do think that different thought patterns would emerge in terms of, you know, the inherent approach to building software. Of course, I'd be interested in actually doing some studies and learning more about that. This is just my anecdotal experience, which is, you know, probably BS. So if you're listening, feel free to call BS. It's just my gut feeling.

Your comment about literature reminded me, and I should try and find this again, that a few years ago I read this, I don't know if it was a proposal or an actual thing, but it was an MFA, a master's in fine arts, in programming or software engineering. Attacking it from the angle of, you know, one of the ways you learn how to do painting is you go and look at the great works of the past and you recreate them. And it does feel like we've also got a lot we can learn from the humanities in terms of now studying and thinking about these volumes of text we're producing. I mean, truly, we can begin to think about and actually dedicate energy to the aesthetic qualities of a code base. And, you know, that used to be the stuff of, oh, you're wasting time beautifying the code base. But when beautifying the code base has one one-thousandth of the cost that it used to have, of course the code base should be beautiful. Of course it should be well structured and, you know, easy to reason about.

So I'm optimistic. You know, I think that maybe the field of software engineering will shrink, but I think maybe it will stay the same size and we'll all be building 10 to 100 times more software. That's my hope, actually: that we're building the software tools that would never have been justifiable to build in the past, like personalized apps that only serve 100 people or fewer, you know, really hyper-specialized tools that deliver incredible value to small groups of people. And in the past, maybe it would cost a million dollars to build that tool or that product. And so you think, well, there's 100 people. So, of course, we're not going to spend $10,000 a person to build that thing. But if you can build the same thing now in a week or two with a $200 Claude Max plan, well, of course, somebody has to babysit the agent and their time is not free. So you could back out the cost, but maybe a $1 million bespoke software product now requires, you know, somebody who's an expert in agentic programming and two weeks of time,