Nic Crane | What they forgot to teach you about becoming an open source contributor | RStudio (2022)

video
Oct 24, 2022
16:03

Transcript

This transcript was generated automatically and may contain errors.

Hello, my name is Nic Crane, I'm a software engineer at Voltron Data and I'm also one of the maintainers on the Apache Arrow project. I work mainly on the R package. I started learning R about 12 years ago and I just got so excited about the idea of open source. I went from being this passive consumer of software to now having this thing that I could pull apart. It was like seeing the Matrix, seeing inside all these functions and things like that.

But despite that being quite a while ago, it took me about six more years, I'd say, to really start to connect with the R communities and another few years on top of that to make my first contribution. So what I want to talk to you about today is the things that got in my way and stopped me getting involved with contributing to open source a bit sooner when I probably actually could have and actually it's been so beneficial to me, I almost wish I had sooner.

But first, in the way that everybody does science, I wanted to do a Twitter poll. So, a question for my RStats folks who have contributed to open source: how many years had you been using open source libraries before you made your first contribution? And actually it was pretty nice to see that for most people it was pretty soon, two to three years of using open source libraries before they really got involved. But actually, you know, there are quite a lot of people in a similar situation to the one I was in, where it was six or more years before they really made that first leap.

Why contribute to open source

So I'm going to start off just by talking about kind of why I contribute to open source and the benefits it's had for me. But before I go into that, I just want to start with a bit of disclaimer. I'm not saying that contributing to open source is necessary to be a good software engineer or a good data scientist. It's absolutely not. I think there's a certain amount of privilege being involved with even having the time to do so alongside everything else. So that's not what I'm going for here.

But going back to kind of like what I've gotten out of contributing to open source, I feel like it's been the best learning experience of my career. Definitely kind of, you know, contributing kind of code and non-code things and getting feedback from people has been the biggest acceleration of my development as a software engineer. Working in public is great. You know, I figure out something that's taken me ages and I blog about it and then somebody's like, okay, I get it now, really rewarding. And again, it's great to be part of a community. I think as well, it's great to have that proof of your skills. Like, I'm not the most necessarily confident person as a software engineer, but having those sort of bragging rights, you know, like I've kind of made those contributions to that project, okay, well, I guess I must know what I'm doing some of the time then. And also just like giving back to projects that have helped me in my work before.

Common barriers to getting involved

Okay, so those are the benefits. But what are the things that have maybe got in my way, or some quite common complaints? I think one of the flip sides of open source being this massive network of self-governing communities is that it has been kind of okay, in some parts anyway, to have quite strong personalities and people not behaving amazingly. So I definitely had that question, you know, are people going to be friendly or mean?

There's also the question of not knowing how. So I was at a point where I could write functions and even write R packages, but I've not made a pull request for a project yet. Other questions are things like, what if I don't have the skills to do this? What if I don't know where to start? And even just that fear that like working in public is scary. I get to share my successes, but I also kind of share my failures a bit.

So just to get one barrier out of the way: if it's literally just the process of how to make a pull request to a project that is stopping you from contributing to open source, I'd say please do check out the first contributions repo that's looked after by Forwards. I was involved in making the R version of this. Basically, it takes you through the process of making a pull request. So if you already know a bit of Git and know how to use GitHub, that is all you need to walk through the mechanical steps for it.

But actually I think the things that really hold people back from getting involved aren't kind of the actual process things, and it's more these kind of concerns or worries. So like the main bulk of this talk is going to be basically about things that I now know that I wish I'd known before.

How to pick where to get involved

Okay, so the first big question I had is how to pick where to get involved and what to work on. And I think this is just really complicated by maybe the view that I had of open source. And I really had this kind of slightly misguided view at first that open source is one big community. And that's just not true. There are lots of open source sub-communities. And even within languages, within R, we see this, right?

But I think even stepping back from that, my next view of open source wasn't quite right either. So to illustrate it, this is a screenshot from the movie Mean Girls. Now, if you've not seen that, I'm sure everybody at some point has seen some TV show or movie where you've got this American high school cafeteria and the new kid's walking through, and it's like, are these people going to let me sit at their table or are they not? And that feeling of, you know, these people are kind of mean and these people are kind of nice and things like that. And this metaphor completely breaks down because everybody's nerds here.

But the point was like kind of viewing it as these people are mean and awful and these people are lovely over here isn't quite right. I think with a lot of projects, it's easy to forget, like a lot of projects are kind of staffed by volunteers. There's a lot of reasons why kind of some projects just don't want new contributors. So they might be going through a design phase. They might just be really small. And it does take effort to onboard a new contributor. So I kind of came to realize that is okay.

But then, of course, like the next obvious question is, so if some projects don't want new contributors, then how do I find the ones that do? And actually that was way easier to answer than I thought. So they'll tell you. So, for example, this is an issue from the Arrow project. This is one that we've labeled good first issue to say, you know, this is a good place to get started, get involved. And again, similarly, here's, I think this is an old one, but this is from dplyr. And it's been labeled both good first issue and help wanted, like clear signaling, like we want people to get involved here. I will say, like projects that don't have these labels, it doesn't mean they don't want new contributors. But if this is a concern of yours, this is kind of an easy way to kind of figure it out.

What to contribute

So the next question I guess then that I had is what to contribute. And I think commonly given advice, which I think is good advice, is all about starting small: build up trust, you know, show people you've read the guidelines, and you can go from small things to bigger from there. I think one thing that was really important for me, though, is that it is way easier to contribute to packages that you use yourself. So my first contribution was to the broom package, and that went well and everything, but there were points during that code contribution where I was like, I don't know if I can change this, because I wasn't a regular user of the package.

And it would have been much easier if it was something that I had a bit more day-to-day use with. And of course, if you've found problems in packages or bugs that you've experienced yourself, that's a really great place to get involved: figuring out what's going on, fixing it, making a pull request. Equally with documentation you find unclear, I feel like if you're newer to something, you're actually at an advantage here. You're coming to this problem with a fresh mindset, and therefore if the documentation doesn't make sense to you, it's probably not going to make sense to somebody else, and that's a really nice early contribution to make.

So I think for me the really, really, really big question about how to get involved was how to navigate social dynamics. And this is a really tricky one, because as I said before, every project is different and made up of quite different people, even within the same project. Now there's this quote that I really like actually. So there's this book called Working in Public by Nadia Eghbal, and if you're interested in open source governance and how these things work, I would definitely recommend it. In the book, Nadia says, "open source is complicated because it contains a messy mix of both technical and social norms, most of which play out in public."

And I think this really gets to it for me. It's figuring out how to kind of change something, make a function, and that's one thing. But just figuring out how to be this new person in this group and kind of go in and not annoy people and kind of just get things done, that can be really tricky.

So one of the ways of overcoming this, I think, is that a lot of projects have contributing guidelines and documentation. So this is a screenshot of a page that I really, really like actually. It's the Apache Software Foundation member participation guidelines. And it goes into detail, even the minute detail of how to phrase requests to make significant changes so that it goes, I guess, as smoothly as possible. They say this doesn't just apply to Apache Software Foundation projects. And it's definitely worth skimming through these if that's a concern that you have yourself.

But then the other thing here, I think, is like sometimes just embracing imposter syndrome. Like you are new to this thing, and that's great. But like if you were pretending to be somebody that was there already, what would you do? So the answer could be you can go and read previous pull requests and previous issues and look at the kind of things that people do. So like every different package is going to have something slightly different. It's always good just to see what the conventions are and to follow them. I will say the one thing to kind of add to that is as a new contributor, it's sometimes better to err on the side of over-communicating rather than under-communicating. Because an established kind of contributor might kind of need to say less to get something accepted. But still, I found reading those previous PRs and issues has been like super, super useful.

Now, this is a really weird one for me. Something I've noticed, and again, this is very different between different projects, but definitely on some, there's this real difference in social norms and quite stark differences between what would come across a certain way in real life versus kind of how things look in a pull request. So to step back a second, we've got these projects that have maybe got busy maintainers, lots of different issues, and there's a certain level of bluntness that can appear rude, but it's people just trying to get through issues and just trying to get as much done as possible. And I've definitely had to kind of come to terms with this real duality of kind of maybe seeing that kind of feedback and getting it myself and not taking it as rude or personal or anything like that and just seeing it for what it is. But at the same time, actually not necessarily engaging in that communication style too much myself because I don't like it. And just because it's a normal thing, it doesn't mean I have to do it. But again, I can still exist in that world, and that is fine.

What the pull request process really looks like

So my third question then was what does the process of getting a pull request accepted actually look like? And this is another one where I've really had to change, I guess, the way that I view these kinds of things. So previously, I viewed it a bit like an essay. So you're in school, you do an essay, you do your work, you check it, you hand it in to somebody, and then you wait a bit and you get it back, and you've got some kind of mark that's an indication of your skill or competence. That is really not a good way to look at a pull request. You will have a bad time if you do that.

And the reason for that is a pull request is more like a conversation than an essay. The back and forth and suggestions for changing things and taking a different approach, that is part of one of the best things about getting involved in open source. That is when you are really learning. That is when you're getting feedback from people that are probably quite experienced at this to improve things. So yeah, definitely not treating it like an essay.

And I think another thing kind of around this kind of pull request kind of approach as well is I've found that coming from a background where I've been working in kind of consultancy and a bit of software engineering on proprietary projects, working on open source libraries is a completely different thing in some really subtle ways. And this is something that I summarize as a task-oriented versus code-oriented approach. So when I was a consultant, I would be working building things with libraries. So the focus there is getting tasks done. You might be building an app for a user that needs to do a certain thing as part of their day job. You might have a known set of users or at least some of them you could maybe go sit down with them and see how they'd normally do that task. And it makes it a bit easier really to kind of see what they're likely to do and what they're not likely to do. And you can kind of get away with only testing kind of like certain paths.

In these circumstances, dependencies can be a lot less important. You can add in new packages and it's not a big deal. And you might be the only maintainer or part of a small team. So you can do things however you want as long as it works. But this is completely different now, working on open source libraries. Your users and your tasks are mostly unknown, which means you have to be really rigorous with your testing, and even preventing paths that you think no one will ever take or testing things that you think no one will ever use actually does become quite important. There's often a need to really avoid adding extra dependencies here. But that can lead to really good things.

So I think sometimes it can be easy to say, right, let's just use this other one or a bit from both. But when you're working on kind of open source libraries, you kind of can't do that. But that forces you then to learn to work with the tools that you've got in front of you and learn that kind of last 10% of a library to get tasks done. And I think, again, that's been something that's been really valuable for me. And also just this idea that you're writing code that isn't kind of your code. This is to be maintained by kind of maybe yourself but other people in the community. And that means you can't have these maybe weird idiosyncrasies or kind of code smells. You've got to be a bit more strict about these kind of things. And I think just kind of realizing that it's a really different kind of mindset between these two things kind of makes it a bit easier to accept what kind of might seem like pedantry sometimes but it's actually for a really good purpose.

Making meaningful contributions

But actually, this isn't even the right question. You could submit a pull request to a project and it might not get accepted. And that could be a good thing. You could try to implement something that doesn't quite work but it exposes something else in the package that actually needs changing and updating. And that's a helpful contribution. You've figured something out that can make the software better. And actually, I think the question should be how do I make a meaningful contribution?

Because actually, when I think about it, I feel really good about the code that I've contributed, but I think some of my most meaningful open source contributions have nothing to do with code. I love writing blog posts when I've learned something that I found difficult and then somebody else is like, yeah, no, that makes sense to me now, finally. I love improving documentation and taking that beginner's mindset. Even tweeting about a bit of functionality that is not well advertised, and somebody's suddenly like, I can now achieve something with that. I feel like that's kind of meaningful.

So to summarize then, it's about people. There isn't a single open source community. There are a lot of different sub-communities within things. And there's no one-size-fits-all approach to saying you should do this here or you should behave like that there. It's about just figuring it out for the thing that you're actually looking at at the time. I think, especially maybe as a new and more nervous person, just find communities that you want to be part of and that want you to be part of them. Also, just remember everybody is new at some point, and once you have done the work of reading the guidelines and a few previous pull requests and approaches, after that there's a point at which you just have to jump right in and go for it and then welcome that code review conversation. But as much as I'm talking about code now, open source contributions are about way more than just code, and that is definitely something to keep in mind. Okay, thank you.