Kelly Bodwin | Translating from {tidymodels} and scikit-learn: Lessons from a 'bilingual' course
Transcript
This transcript was generated automatically and may contain errors.
Thank you all for being here, my name is Kelly, I teach statistics and data science, and on this first slide you are seeing the beautiful city of Barcelona, Spain, and I put that up there because a week from today I will hopefully be there to meet my boyfriend's parents for the first time.
He is fully Spanish, I very wisely took 10 years of French, so I don't really speak Spanish, and I have been practicing a bit, and so that has gotten me thinking a little about translating between ordinary languages and translating between programming languages.
And so this whole hook was of course just an excuse for me to be like, hey everyone, I have a boyfriend, that's him.
And so I have been listening to this podcast that teaches Spanish, and there was this phrase that they were teaching that kind of struck me, and it looked like this: me gusta ver a mis amigos, sorry for the pronunciation. And that a, the one in ver a mis amigos, was the point of the episode. It's kind of a weird construction, because what they are trying to say is, I like to see my friends.
But of course if you were to translate word for word or literally to English, we would be saying, it is pleasing to me to see to my friends, which of course is not a thing that we say in English, and so this got me thinking about that idea of a one-to-one direct translation and how that's not really how languages actually work.
So can we apply this sort of theory of language learning to programming languages instead?
The pop quiz
So I'd like to start off with a pop quiz, it's the first talk of this year's conf besides the keynote, and I'm hitting you with a quiz, sorry not sorry, here's your pop quiz.
This code right here, is the object returned by that code going to be two-dimensional, like a matrix or a data frame or something with rows and columns, or is it going to be one-dimensional, like a vector or an array?
Kudos to those of you who raised your hands for two-dimensional after seeing hands go up for one-dimensional. The answer, as you might expect, is that it's a trick question, because, as we know, Posit is no longer an R-only company.
In R, this exact code right here, taking a column of the penguins data set, is going to be one-dimensional, right? That's yanking out a column. In Python, that thing's two-dimensional; that is a data frame with one column.
This is an example of, there is not a direct translation between these two rather similar, rather close languages. That exact code will run in both languages, and it will produce a different structure of results.
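To make the pop quiz concrete: the slide's exact code isn't reproduced in this transcript, but one construction that runs in both languages and shows exactly this mismatch is double-bracket indexing like `penguins[["species"]]`. A minimal pandas sketch of the Python side, using hypothetical toy values standing in for the penguins data:

```python
import pandas as pd

# A stand-in for the penguins data set (hypothetical toy values).
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, 47.5, 49.6],
})

# In pandas, double brackets return a one-column DataFrame (two-dimensional) ...
two_d = penguins[["species"]]
print(two_d.shape)   # (3, 1)

# ... while single brackets return a Series (one-dimensional).
one_d = penguins["species"]
print(one_d.shape)   # (3,)
```

In R, the same `penguins[["species"]]` extracts the column as a plain vector, so identical code yields a different structure of result in each language.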
The bilingual course
So what motivated this? There's a class that I teach every year at Cal Poly called Statistical Learning. We're upgrading it to Statistical Learning with R, but then I've added in the Python. This class is an upper-level class; it's mostly, but not entirely, stat majors in the final years of their undergrad degree.
There's no computing prerequisite, though I think we're going to change that, so any R or Python that I use, I have to kind of spoon-feed a little bit.
Back when this class was first offered, 2020, I didn't teach it, it used kind of a hodgepodge, as many of us have done with modeling, it was using caret, and base R, and tidyverse for the data wrangling, and just kind of all over the place. So when I took it over, I changed it to tidymodels, and that was really fun, tidymodels was kind of new, and it made it so much easier to teach it, I could go on, but that's not the point of this talk.
And then this year, for some reason, instead of doing the smart thing and just reusing my materials with little tweaks that had gone very well, I decided to translate everything also into scikit-learn in Python.
So I want to tell you what this talk isn't. I am not encouraging you to do this. This is extra work for me, without a huge amount of benefit in the class. I am not trying to encourage you to teach a bilingual class, although if you're interested, I would love to talk to you about it.
I did it kind of because there are these two funnels into this course, and one side of that funnel is the computer science and data science minors, so I wanted it to be accessible to them. I wanted it to fit in our eventual data science bachelor's program. And I really wanted to practice Python, and to pick up Quarto, which is nice for using both. It's one of those things where you commit when you're feeling enthusiastic and then you're forced into it.
Lesson 1: Grammar and sentence structure
So thinking about how you learn a new language or how you translate between languages. And now I'm thinking back to Spanish and French. Probably the first thing that comes to mind that's hard that was in my first example is grammar. And specifically sentence structure. Why do these same words go in a totally different order in different languages?
And so how I might kind of parallel that in programming languages is this idea of what's the structure of your analysis. You have to do these multiple steps to do an analysis. But in different implementations, those steps might happen at different times.
So here's what I mean. Here's two code chunks. The top one is scikit-learn, is Python. The bottom one down there is tidymodels, is R, of course. And my question to you is when in these two chunks do we decide which predictors are going to go into our model? It's just a linear model with a couple of predictors.
And that's happening right here. So it feels kind of the same in both languages right now. At the moment where you're fitting the data after you've specified your model, that's when you're declaring which variables are part of this model.
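A minimal sketch of that scikit-learn pattern, with hypothetical penguins-style values (the slide's actual columns may differ): the predictors are only named at the moment of `.fit()`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical penguins-style data.
df = pd.DataFrame({
    "bill_length_mm":    [39.1, 39.5, 46.5, 50.0, 48.7],
    "flipper_length_mm": [181, 186, 217, 220, 210],
    "body_mass_g":       [3750, 3800, 4850, 5550, 5200],
})

# Step 1: specify the model (no predictors named yet).
model = LinearRegression()

# Step 2: the predictors are declared only at fit time,
# by choosing which columns go into X.
X = df[["bill_length_mm", "flipper_length_mm"]]
y = df["body_mass_g"]
model.fit(X, y)
print(model.coef_)
```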
But then you start thinking about preprocessing. So here's some Python code that includes a step where we are log transforming both of our quantitative predictors. And so we've added some stuff to the process that you can see in Python. The preprocessing decision happens early. And then the decision about which predictors we're actually doing that to happens at the fifth step.
But in tidymodels here, we now have this idea of recipes. So the recipe then contains both the decision about your predictors and the decision about how you're going to transform them. And even though it's all together, it's actually in the opposite order. You choose your predictors and then your transformations.
And so here's what happened with my students. The ones that were using tidymodels kind of focused really on feature selection. They really focused on which predictors are going to be in my model. And then once they decided that, they were like, eh, maybe we could log transform them. And the ones who used scikit-learn really focused on how do we adjust the data. They kind of thought of this transformation as part of the data cleaning. And they did that first and then allowed some predictors or others to be in their model.
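A minimal scikit-learn sketch of that ordering (hypothetical data; the class's actual pipeline may differ): the log transformation is declared up front, and the columns it applies to are named later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    "bill_length_mm":    [39.1, 39.5, 46.5, 50.0, 48.7],
    "flipper_length_mm": [181, 186, 217, 220, 210],
    "body_mass_g":       [3750, 3800, 4850, 5200, 5100],
})

# The transformation is declared first ...
log_step = FunctionTransformer(np.log)

# ... and only later do we say WHICH columns it applies to.
preprocess = ColumnTransformer(
    [("log", log_step, ["bill_length_mm", "flipper_length_mm"])]
)

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(df[["bill_length_mm", "flipper_length_mm"]], df["body_mass_g"])
```

In a tidymodels recipe, the same two decisions appear in the opposite order: the predictors come first, in the recipe formula, and the transformation step is added afterward.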
So what's the takeaway from that observation? Well, developers, I just want you to be aware of how, like, your API design is going to influence how people think about the actual concepts.
And then, when possible, be kind of agnostic about order. I'm thinking of how, with the recipe and the model specification in tidymodels, it doesn't really matter which order you declare them in. They're separate entities that you put together later. I think that's a really good design choice, personally.
And then my last suggestion is just keeping kind of related decisions together. Like, I don't necessarily think scikit-learn is wrong, but I don't like this thing that you go through the whole setup and then you specify your predictors.
My advice for users here, try to match your code flow with your narrative, where you have choices. You know, if you are thinking through a problem in a certain way, try to match your code. And then, of course, like, intersperse your code with text. This is why that discussion and documentation is so important.
And then as much as you can kind of compartmentalize, make your steps modular, here's where we choose the model, here's where we choose the predictors, that also kind of lets you swap between languages.
And then my advice for educators here, you know, enforce a workflow on your students that matches the way that you teach it. Break your analyses up. Same idea as the practitioners. If you are isolating these decisions, then when they see them in different orders, they recognize that decision.
And then, you know, really push students to document, to write what they're doing, to state their decisions outside the context of the code.
Lesson 2: False friends
Next thing that I find difficult in learning languages is false friends, or faux amis, when I learned French. Words that sound really similar and have different meanings.
Or another one, if you come up to me and you say your talk was quite good, I'm going to be really happy, unless I find out that you are from the UK, where quite means like meh. So this word quite in English does not mean the same thing in every culture.
And so what does this look like in code? What it looks like to me is similar functions or arguments that behave differently. So by way of example, I want to talk a little about support vector machines. I'm not going to teach you support vector machines; I'm going to do this really quick. We are trying to find a line to separate these two classes. We want that line to sort of push the classes apart as much as possible, so we want those dotted lines to be as far from the solid line as possible. But as we push the classes apart, we end up with, on the right-hand side, points that are misclassified or that fall inside that margin. So when we fit a support vector machine, we're trying to minimize that misclassification and maximize that margin.
And so as you would hope, you know, in the implementation in tidymodels, there is indeed a parameter that is the cost of having points fall inside of a margin. And in scikit-learn, as you would hope, there is a parameter C that has to do with the cost of points falling in the margin.
So let's look at those a little more closely. The tidymodels version is the cost of predicting wrong, so the cost of a sample landing in the margin or on the wrong side. In scikit-learn, notice that it says C: the strength of the regularization is inversely proportional to C, the regularization being that penalization for misclassification. So these work in opposite directions. They're capturing the same thing, the same idea. Again, neither is wrong.
But when my students sitting next to each other using different languages tried the same values for cost and for C, because they were trying to tune their model, they got different results. And that was very confusing to them and to me for quite some time.
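A small sketch of the scikit-learn side (toy data, not the class's example): because the regularization strength is inversely proportional to `C`, a small `C` gives a loose, wide margin, and a large `C` makes margin violations expensive.

```python
from sklearn.svm import SVC

# Tiny linearly separable toy data (hypothetical values).
X = [[0, 0], [1, 0], [0, 1], [1, 1],
     [4, 4], [5, 4], [4, 5], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Small C: strong regularization, wide soft margin,
# many points allowed inside the margin.
loose = SVC(kernel="linear", C=0.01).fit(X, y)

# Large C: weak regularization, margin violations are expensive.
strict = SVC(kernel="linear", C=100.0).fit(X, y)

# The looser model typically keeps more support vectors.
print(int(loose.n_support_.sum()), int(strict.n_support_.sum()))
```

Whether the R implementation's `cost` moves in the same or the opposite direction for a given model spec is exactly the kind of thing worth checking before telling students to try "the same values" in both languages.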
And so what happens here is that then those students, of course, like conceptualize that penalty differently. So the tidymodels students are conceptualizing it as a penalty on misclassification, whereas the scikit-learn students are conceptualizing it as like a balancing parameter, a regularization parameter.
And so what I would remind everyone to do is, you know, be explicit if you're designing tools. I think cost of misclassification helped me track down what was going on. I think amount of regularization did not, because regularization is a little more general.
My advice for educators and practitioners, state everything in words. Say things like when the cost is higher, we expect less misclassification and the cost of that is a smaller margin. If you say that out loud, then you are understanding that parameter.
Lesson 3: Slang and wrapper functions
Lesson three, moving swiftly along. I think that in learning languages, slang terms become hard. My favorite slang term in English is stan. You know, to stan something, or to be a stan. It means you're a big fan, basically. We're all stans of R in here.
So this came from a music video and song by Eminem about a very obsessed fan whose name was literally Stan, like Stanley. And somehow over the years, this word evolved to mean something much broader, something people use without even remembering that it came from a person's name.
And so I think the equivalent of slang in code is maybe these helper wrapper functions that contain a whole lot of meaning, like way more than what that one word might suggest. And where I see this in my class is in tuning parameters.
So when you tune, and we're sticking here with those support vector machines. When you tune, you need to decide which values you're going to try, right? And in both Python on top and R on the bottom, you can just sort of manually create, you know, kind of that list of different values that we're going to try for our model.
tidymodels also comes with a shortcut, where we can access built-in functions with the names of the parameters, and it will automatically choose values for you that it thinks are reasonable. I love this. I'm not saying this is bad. I would much rather have Max Kuhn decide which values I'm tuning than decide myself. That's great, right?
But again, this was my biggest shock in the class. The Python students picked it up better. Everyone understood tuning the first time that I taught it, and then nine weeks of hands-on activities later, if you asked the R students about tuning, they went into detail on the functions. And if you asked the Python students, they got the concept. We just try a bunch of different values and see what sticks. So that was surprising to me.
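For reference, the fully manual Python approach looks something like this (toy data; the class's actual grid and model differ): every candidate value is spelled out by hand, with no shortcut function choosing them for you.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny toy data (hypothetical values).
X = [[0, 0], [1, 1], [0, 1], [1, 0],
     [4, 4], [5, 5], [4, 5], [5, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# No shortcut: we spell out every candidate value ourselves.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Cross-validated grid search over those explicit values.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Writing the grid out yourself keeps the concept front and center: tuning just means trying a bunch of values and seeing what sticks.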
For educators, you know, if you're using a wrapper function, basically assume that what you're doing is glossing over that concept. And that's okay. I do this with step_pca(): I don't necessarily teach all of PCA when I'm preprocessing. But avoid shortcut functions for things that are actually learning goals in your class.
Lesson 4: Meaning in gestures and defaults
So the last challenge I want to point out is, like, meaning that isn't in words, that's in gestures.
So here's your pop quiz in the code context. How does the k-means function choose initial clusters? I asked this on Twitter. This is what people said. This is what I would have said before I dug into this. And it is not correct.
So here's how I used to teach k-means. Choose your initial centroids by choosing three random observations. Assign all the points to the nearest centroid. Recompute the centroids of each cluster. Keep going until the clusters stop changing. I've taught it that way for years. It turns out that nothing about that is correct about how k-means actually works by default.
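The steps just described are the textbook version, usually called Lloyd's algorithm. Here's a minimal numpy sketch of that version, for reference (this is the version I taught, not what the defaults actually do):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Textbook k-means (Lloyd's algorithm): random initial observations,
    assign to nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious clusters (hypothetical values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = lloyd_kmeans(X, k=2)
```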
The default in this function, even though the documentation says it will choose three initial observations, is down there: the Hartigan-Wong method. Hartigan-Wong neither initializes that way, nor does it even update clusters the same way as the other methods. So everything I've just told you is not what the default does.
In scikit-learn, they're a little more upfront, but their default initialization, k-means++, is different again. At least they kind of explain it, but they just say it's smart. And then you can think about using different functions in R; some of those offer k-means++. k-means++ is not an option in the base kmeans() function. But then you don't get an option to choose your iteration process. So there is literally no way, at least with these two implementations, to guarantee that the R and the Python output matches.
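A short sketch of the scikit-learn side (toy data): both initializations are selectable by name, but either way the updates are Lloyd-style iterations, which is not what R's Hartigan-Wong default does.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters (hypothetical values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0],
              [5.2, 4.8], [0.1, 0.4], [5.1, 5.2]])

# scikit-learn's default initialization is "k-means++" ...
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# ... but "random" (pick k observations at random) is also available.
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

print(km_pp.labels_, km_rand.labels_)
```

On clean data like this both settings find the same partition; on messier data, the initialization (and the update rule, across languages) can change the answer.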
And so what was the result? Just frustration. You cannot make that output be the same with the implementations people actually use.
Recap and takeaways
So recap. We have these challenges, and these challenges that you see in language also appear in coding. So the developers in the room, what I want you to do. Don't assume that the choice of the parameter name tells people what that parameter does. Don't assume that these default choices are the obvious choices. They're not universal. You have to document them.
And the order that we code is the order that we think.
For practitioners, again, translation is not going to be a direct word for word. Please discuss what you're deciding. Please document. Please use text.
And for educators, really what you want to do is force your students to be explicit about every decision.
So thank you for sticking with me. The slides are online. Please find me. Please talk to me. I'm going to have to jet to the other talk.
