Kelly Bodwin | Translating from {tidymodels} and scikit-learn: Lessons from a 'bilingual' course
Transcript
This transcript was generated automatically and may contain errors.
Thank you all for being here, my name is Kelly, I teach statistics and data science, and on this first slide you are seeing the beautiful city of Barcelona, Spain, and I put that up there because a week from today I will hopefully be there to meet my boyfriend's parents for the first time.
He is fully Spanish, I very wisely took 10 years of French, so I don't really speak Spanish, and I have been practicing a bit, and so that has gotten me thinking a little about translating between ordinary languages and translating between programming languages.
And so this whole hook was of course just an excuse for me to be like, hey everyone, I have a boyfriend, that's him.
And so I have been listening to this podcast that teaches Spanish, and there was this phrase that they were teaching that kind of struck me, and it looked like this: me gusta ver a mis amigos, sorry for the pronunciation. And that a, the one in ver a mis amigos, was the point of the episode. It's kind of a weird construction, because what they are trying to say is, I like to see my friends.
But of course if you were to translate word for word or literally to English, we would be saying, it is pleasing to me to see to my friends, which of course is not a thing that we say in English, and so this got me thinking about that idea of a one-to-one direct translation and how that's not really how languages actually work.
So can we apply this sort of theory of language learning to programming languages instead?
The pop quiz
So I'd like to start off with a pop quiz, it's the first talk of this year's conf besides the keynote, and I'm hitting you with a quiz, sorry not sorry, here's your pop quiz.
This code right here, is the object returned by that code going to be two-dimensional, like a matrix or a data frame or something with rows and columns, or is it going to be one-dimensional, like a vector or an array?
Kudos to those of you who raised your hands for two-dimensional after seeing hands go up for one-dimensional. The answer, as you might expect, is that it's a trick question, because, as we know, Posit is no longer an R-only company.
In R, this exact code right here, taking a column of the penguins data set, is going to be one-dimensional, right? That's yanking out a column. In Python, that thing's two-dimensional; that is a data frame with one column.
This is an example of, there is not a direct translation between these two rather similar, rather close languages. That exact code will run in both languages, and it will produce a different structure of results.
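To make the pop quiz concrete: the slide's exact code isn't reproduced in this transcript, but one construction that runs in both languages and shows exactly this mismatch is double-bracket indexing like `penguins[["species"]]`. A minimal pandas sketch of the Python side, using hypothetical toy values standing in for the penguins data:

```python
import pandas as pd

# A stand-in for the penguins data set (hypothetical toy values).
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, 47.5, 49.6],
})

# In pandas, double brackets return a one-column DataFrame (two-dimensional) ...
two_d = penguins[["species"]]
print(two_d.shape)   # (3, 1)

# ... while single brackets return a Series (one-dimensional).
one_d = penguins["species"]
print(one_d.shape)   # (3,)
```

In R, the same `penguins[["species"]]` extracts the column as a plain vector, so identical code yields a different structure of result in each language.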
The bilingual course
So what motivated this? There's a class that I teach every year at Cal Poly called Statistical Learning. We're upgrading it to Statistical Learning with R, but then I've added in the Python. This class is an upper-level class; it's mostly, but not entirely, stat majors in the final years of their undergrad degree.
There's no computing prerequisite, though I think we're going to change that, so any R or Python that I use, I have to kind of spoon-feed a little bit.
Back when this class was first offered, 2020, I didn't teach it, it used kind of a hodgepodge, as many of us have done with modeling, it was using caret, and base R, and tidyverse for the data wrangling, and just kind of all over the place. So when I took it over, I changed it to tidymodels, and that was really fun, tidymodels was kind of new, and it made it so much easier to teach it, I could go on, but that's not the point of this talk.
And then this year, for some reason, instead of doing the smart thing and just reusing my materials with little tweaks that had gone very well, I decided to translate everything also into scikit-learn in Python.
So I want to tell you what this talk isn't. I am not encouraging you to do this. This is extra work for me, without a huge amount of benefit in the class. I am not trying to encourage you to teach a bilingual class, although if you're interested, I would love to talk to you about it.
I did it kind of because there are these two funnels into this course, and one side of that funnel is the computer science and data science minors, so I wanted it to be accessible to them. I wanted it to fit in our eventual data science bachelor's program. And I really wanted to practice Python, and to pick up Quarto, which is nice for using both. It's one of those things where you commit when you're feeling enthusiastic and then you're forced into it.
Lesson 1: Grammar and sentence structure
So thinking about how you learn a new language or how you translate between languages. And now I'm thinking back to Spanish and French. Probably the first thing that comes to mind that's hard that was in my first example is grammar. And specifically sentence structure. Why do these same words go in a totally different order in different languages?
And so how I might kind of parallel that in programming languages is this idea of what's the structure of your analysis. You have to do these multiple steps to do an analysis. But in different implementations, those steps might happen at different times.
So here's what I mean. Here's two code chunks. The top one is scikit-learn, is Python. The bottom one down there is tidymodels, is R, of course. And my question to you is when in these two chunks do we decide which predictors are going to go into our model? It's just a linear model with a couple of predictors.
And that's happening right here. So it feels kind of the same in both languages right now. At the moment where you're fitting the data after you've specified your model, that's when you're declaring which variables are part of this model.
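A minimal sketch of that scikit-learn pattern, with hypothetical penguins-style values (the slide's actual columns may differ): the predictors are only named at the moment of `.fit()`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical penguins-style data.
df = pd.DataFrame({
    "bill_length_mm":    [39.1, 39.5, 46.5, 50.0, 48.7],
    "flipper_length_mm": [181, 186, 217, 220, 210],
    "body_mass_g":       [3750, 3800, 4850, 5550, 5200],
})

# Step 1: specify the model (no predictors named yet).
model = LinearRegression()

# Step 2: the predictors are declared only at fit time,
# by choosing which columns go into X.
X = df[["bill_length_mm", "flipper_length_mm"]]
y = df["body_mass_g"]
model.fit(X, y)
print(model.coef_)
```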
But then you start thinking about preprocessing. So here's some Python code that includes a step where we are log transforming both of our quantitative predictors. And so we've added some stuff to the process that you can see in Python. The preprocessing decision happens early. And then the decision about which predictors we're actually doing that to happens at the fifth step.
But in tidymodels here, we now have this idea of recipes. So the recipe then contains both the decision about your predictors and the decision about how you're going to transform them. And even though it's all together, it's actually in the opposite order. You choose your predictors and then your transformations.
And so here's what happened with my students. The ones that were using tidymodels kind of focused really on feature selection. They really focused on which predictors are going to be in my model. And then once they decided that, they were like, eh, maybe we could log transform them. And the ones who used scikit-learn really focused on how do we adjust the data. They kind of thought of this transformation as part of the data cleaning. And they did that first and then allowed some predictors or others to be in their model.
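A minimal scikit-learn sketch of that ordering (hypothetical data; the class's actual pipeline may differ): the log transformation is declared up front, and the columns it applies to are named later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    "bill_length_mm":    [39.1, 39.5, 46.5, 50.0, 48.7],
    "flipper_length_mm": [181, 186, 217, 220, 210],
    "body_mass_g":       [3750, 3800, 4850, 5200, 5100],
})

# The transformation is declared first ...
log_step = FunctionTransformer(np.log)

# ... and only later do we say WHICH columns it applies to.
preprocess = ColumnTransformer(
    [("log", log_step, ["bill_length_mm", "flipper_length_mm"])]
)

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(df[["bill_length_mm", "flipper_length_mm"]], df["body_mass_g"])
```

In a tidymodels recipe, the same two decisions appear in the opposite order: the predictors come first, in the recipe formula, and the transformation step is added afterward.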
So what's the takeaway from that observation? Well, developers, I just want you to be aware of how, like, your API design is going to influence how people think about the actual concepts.
And then, when possible, be kind of agnostic about order. I'm thinking of how, with the recipe and the model specification in tidymodels, it doesn't really matter which order you declare them in. They're separate entities that you put together later. I think that's a really good design choice, personally.
And then my last suggestion is just keeping kind of related decisions together. Like, I don't necessarily think scikit-learn is wrong, but I don't like this thing that you go through the whole setup and then you specify your predictors.
My advice for users here, try to match your code flow with your narrative, where you have choices. You know, if you are thinking through a problem in a certain way, try to match your code. And then, of course, like, intersperse your code with text. This is why that discussion and documentation is so important.
And then as much as you can kind of compartmentalize, make your steps modular, here's where we choose the model, here's where we choose the predictors, that also kind of lets you swap between languages.
And then my advice for educators here, you know, enforce a workflow on your students that matches the way that you teach it. Break your analyses up. Same idea as the practitioners. If you are isolating these decisions, then when they see them in different orders, they recognize that decision.
And then, you know, really push students to document, to write what they're doing, to state their decisions outside the context of the code.
Lesson 2: False friends
Next thing that I find difficult in learning languages is false friends, or faux amis, when I learned French. Words that sound really similar and have different meanings.
Or another one, if you come up to me and you say your talk was quite good, I'm going to be really happy, unless I find out that you are from the UK, where quite means like meh. So this word quite in English does not mean the same thing in every culture.
And so what does this look like in code? What it looks like to me is similar functions or arguments that behave differently. So by way of example, I want to talk a little about support vector machines. I'm not going to teach you support vector machines; I'm going to do this really quick. We are trying to find a line to separate these two classes. We want that line to sort of push the classes apart as much as possible, so we want those dotted lines to be as far from the solid line as possible. But as we push the classes apart, we end up with, on the right-hand side, points that are misclassified or that fall inside that margin. So when we fit a support vector machine, we're trying to minimize that misclassification and maximize that margin.
And so as you would hope, you know, in the implementation in tidymodels, there is indeed a parameter that is the cost of having points fall inside of a margin. And in scikit-learn, as you would hope, there is a parameter C that has to do with the cost of points falling in the margin.
So let's look at those a little more closely. The tidymodels version is the cost of predicting wrong, so the cost of a sample landing in the margin or on the wrong side. In scikit-learn, notice that it says C: the strength of the regularization is inversely proportional to C, the regularization being that penalization for misclassification. So these work in opposite directions. They're capturing the same thing, the same idea. Again, neither is wrong.
But when my students sitting next to each other using different languages tried the same values for cost and for C, because they were trying to tune their model, they got different results. And that was very confusing to them and to me for quite some time.
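A small sketch of the scikit-learn side (toy data, not the class's example): because the regularization strength is inversely proportional to `C`, a small `C` gives a loose, wide margin, and a large `C` makes margin violations expensive.

```python
from sklearn.svm import SVC

# Tiny linearly separable toy data (hypothetical values).
X = [[0, 0], [1, 0], [0, 1], [1, 1],
     [4, 4], [5, 4], [4, 5], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Small C: strong regularization, wide soft margin,
# many points allowed inside the margin.
loose = SVC(kernel="linear", C=0.01).fit(X, y)

# Large C: weak regularization, margin violations are expensive.
strict = SVC(kernel="linear", C=100.0).fit(X, y)

# The looser model typically keeps more support vectors.
print(int(loose.n_support_.sum()), int(strict.n_support_.sum()))
```

Whether the R implementation's `cost` moves in the same or the opposite direction for a given model spec is exactly the kind of thing worth checking before telling students to try "the same values" in both languages.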
And so what happens here is that then those students, of course, like conceptualize that penalty differently. So the tidymodels students are conceptualizing it as a penalty on misclassification, whereas the scikit-learn students are conceptualizing it as like a balancing parameter, a regularization parameter.
And so what I would remind everyone to do is, you know, be explicit if you're designing tools. I think cost of misclassification helped me track down what was going on. I think amount of regularization did not, because regularization is a little more general.
My advice for educators and practitioners, state everything in words. Say things like when the cost is higher, we expect less misclassification and the cost of that is a smaller margin. If you say that out loud, then you are understanding that parameter.
Lesson 3: Slang and wrapper functions
Lesson three, moving swiftly along. I think that in learning languages, slang terms become hard. My favorite slang term in English is stan. You know, to stan something, or to be a stan. It means you're a big fan, basically. We're all stans of R in here.
So this came from a music video and song by Eminem about a very obsessed fan whose name was literally Stan, like Stanley. And somehow over the years, this word evolved to mean something much broader, something people use without even remembering that it came from a person's name.
And so I think the equivalent of slang in code is maybe these helper wrapper functions that contain a whole lot of meaning, like way more than what that one word might suggest. And where I see this in my class is in tuning parameters.
So when you tune, and we're sticking here with those support vector machines. When you tune, you need to decide which values you're going to try, right? And in both Python on top and R on the bottom, you can just sort of manually create, you know, kind of that list of different values that we're going to try for our model.
tidymodels also comes with a shortcut, where we can access built-in functions with the names of the parameters, and it will automatically choose values for you that it thinks are reasonable. I love this. I'm not saying this is bad. I would much rather have Max Kuhn decide which values I'm tuning than decide myself. That's great, right?
But again, this was my biggest shock in the class. The Python students picked it up better. Everyone understood tuning the first time that I taught it, and then nine weeks of hands-on activities later, if you asked the R students about tuning, they went into detail on the functions. And if you asked the Python students, they got the concept. We just try a bunch of different values and see what sticks. So that was surprising to me.
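For reference, the fully manual Python approach looks something like this (toy data; the class's actual grid and model differ): every candidate value is spelled out by hand, with no shortcut function choosing them for you.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny toy data (hypothetical values).
X = [[0, 0], [1, 1], [0, 1], [1, 0],
     [4, 4], [5, 5], [4, 5], [5, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# No shortcut: we spell out every candidate value ourselves.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Cross-validated grid search over those explicit values.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Writing the grid out yourself keeps the concept front and center: tuning just means trying a bunch of values and seeing what sticks.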
For educators, you know, if you're using a wrapper function, basically assume that what you're doing is glossing over that concept. And that's okay. I do this with step_pca(): I don't necessarily teach all of PCA when I'm preprocessing. But avoid shortcut functions for things that are actually learning goals in your class.
Lesson 4: Meaning in gestures and defaults
So the last challenge I want to point out is, like, meaning that isn't in words, that's in gestures.
So here's your pop quiz in the code context. How does the k-means function choose initial clusters? I asked this on Twitter. This is what people said. This is what I would have said before I dug into this. And it is not correct.
So here's how I used to teach k-means. Choose your initial centroids by choosing three random observations. Assign all the points to the nearest centroid. Recompute the centroids of each cluster. Keep going until the clusters stop changing. I've taught it that way for years. It turns out that nothing about that is correct about how k-means actually works by default.
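The steps just described are the textbook version, usually called Lloyd's algorithm. Here's a minimal numpy sketch of that version, for reference (this is the version I taught, not what the defaults actually do):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Textbook k-means (Lloyd's algorithm): random initial observations,
    assign to nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious clusters (hypothetical values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = lloyd_kmeans(X, k=2)
```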
The default in this function, even though the documentation says it will choose three initial observations, is down there: the Hartigan-Wong method. Hartigan-Wong neither initializes that way, nor does it even update clusters the same way as the other methods. So everything I've just told you is not what the default does.
In scikit-learn, they're a little more upfront, but their default initialization, k-means++, is different again. At least they kind of explain it, but they just say it's smart. And then you can think about using different functions in R; some of those offer k-means++. k-means++ is not an option in the base kmeans() function. But then you don't get an option to choose your iteration process. So there is literally no way, at least with these two implementations, to guarantee that the R and the Python output matches.
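A short sketch of the scikit-learn side (toy data): both initializations are selectable by name, but either way the updates are Lloyd-style iterations, which is not what R's Hartigan-Wong default does.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters (hypothetical values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0],
              [5.2, 4.8], [0.1, 0.4], [5.1, 5.2]])

# scikit-learn's default initialization is "k-means++" ...
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# ... but "random" (pick k observations at random) is also available.
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

print(km_pp.labels_, km_rand.labels_)
```

On clean data like this both settings find the same partition; on messier data, the initialization (and the update rule, across languages) can change the answer.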
And so what was the result? Just frustration. You cannot make that output be the same with the implementations people actually use.
Recap and takeaways
So recap. We have these challenges, and these challenges that you see in language also appear in coding. So the developers in the room, what I want you to do. Don't assume that the choice of the parameter name tells people what that parameter does. Don't assume that these default choices are the obvious choices. They're not universal. You have to document them.
And the order that we code is the order that we think.
For practitioners, again, translation is not going to be a direct word for word. Please discuss what you're deciding. Please document. Please use text.
And for educators, really what you want to do is force your students to be explicit about every decision.
So thank you for sticking with me. The slides are online. Please find me. Please talk to me. I'm going to have to jet to the other talk.
