
Jamie Ralph | Developing internal tools for multi-lingual teams | RStudio (2022)

video
Oct 24, 2022
17:09


Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Jamie and I'm going to talk about developing internal tools for multi-lingual teams, specifically teams that use Python and R.

To get started I want to talk about pizza. Imagine you're having a couple of friends over for dinner and you need to prepare them something. The easiest thing to do is to prepare them the same meal. However, they're both pretty fussy eaters. Do you make two entirely different meals, with all the extra ingredients, prep time and aggravation? No, you make pizza, because even if you make a margherita pizza for one friend and a Hawaiian for the other, they still have the same base, the same core ingredients.

What does pizza have to do with R and Python? The prospect of developing a tool in both R and Python can feel like you're about to prepare two entirely different meals. It's twice the effort, twice the cognitive load, and can take a lot more time. How can we take the principle of our pizza party, that is, making two different meals with the same base, and apply it to developing bilingual tools?

This is a question I think about a lot in my job. I work at Bumble. We are the company behind the dating and social networking apps Badoo, Bumble and Fruitz. My job is to develop internal packages for our data analysts. Most of our analysts work in Python. However, some of our projects are done in R. This means some of our internal packages are developed in both languages.

But why do we need internal packages? Well, internal packages can help to automate and standardise procedures that are specific to your organisation. For example, an internal package might automatically apply company styling to a report, or enable a user to connect to an internal database in the correct way. To borrow from an RStudio Global talk, internal packages are kind of like members of your organisation with specific jobs to do.

Why develop in both Python and R? The answer is simple. In a multilingual organisation, we want everyone to follow the same internal procedures regardless of the language that they're using. The question I will try to address today is what strategies can we use to make developing tools simultaneously with Python and R easier? In truth, this is a very big topic with lots of branching ideas. Therefore, today, I will focus on three ideas that have helped me to develop bilingual tools.

Building identical generic functions

The first idea I want to talk about is building identical generic functions. But what is a generic function? Let's look at this from the R perspective. Generic functions are very common in the R language. For example, functions like print, or summary, or tidy from the broom package. The really cool thing about generics is that they behave in different ways depending on the class of an object that is passed to them. Under the hood, this is achieved using a process called method dispatch.

As an internal developer, you might want to build a package that uses custom internal classes, and you may want your users to be able to call methods like print or summary on them. In some cases, you may even want to define your own generic functions. Let's take a look at what generic functions look like in code. To do this, we will create a new package which will provide one function, say_hello, and this will print a nice friendly greeting to an object that we pass to it.

To start, we define our generic function, say_hello, and in the function body, we make a call to UseMethod, passing it the name of our generic as a string. And that's it, that's our generic. But now we need a class-specific implementation of this function, so let's make one for a data frame. To do it, we will use dot notation: say_hello.data.frame. In this new function, we print a nice friendly greeting. And if we want a catch-all implementation of say_hello, we use the default pseudo-class: say_hello.default.

Okay, now we know how our R package is going to work. We need a Python version. Can we write our Python version in the same way as our R version? The answer is yes. To do this, we're going to import singledispatch from the functools module. Unlike in R, we actually start by defining our fallback method for say_hello, and we decorate it with singledispatch. Now, if you're not familiar with decorators, think of them as wrapper functions that change the behavior of a function. Next, we're going to create a data frame implementation of say_hello in Python, using the DataFrame object from the pandas library. We do this with the register function, and here I am passing DataFrame to the register function.
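The Python side of this pattern can be sketched with the standard library alone. This is a minimal illustration, not the code shown in the talk: the Table class is a hypothetical stand-in for pandas.DataFrame (so the sketch runs without pandas), and the greetings are returned rather than printed so they are easy to check.

```python
from dataclasses import dataclass, field
from functools import singledispatch


@dataclass
class Table:
    """Hypothetical stand-in for pandas.DataFrame, used only for illustration."""
    columns: list = field(default_factory=list)


@singledispatch
def say_hello(obj):
    # Fallback implementation, playing the role of say_hello.default in R.
    return f"Hello, {type(obj).__name__}!"


@say_hello.register(Table)
def _(obj):
    # Class-specific implementation, playing the role of say_hello.data.frame in R.
    return f"Hello, table with {len(obj.columns)} columns!"
```

Calling say_hello(Table(columns=["a", "b"])) dispatches to the registered implementation, while any other object falls through to the fallback. With pandas installed, the same register call would take pandas.DataFrame instead of Table.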

And there we have it. But what have we actually achieved here? Well, we have two different tools, one in R, one in Python that have the same foundation, a say hello generic with a data frame implementation. So both of my packages work the same conceptually. This is good for end users, they get a similar API to work with regardless of whether they're using R or Python. But also my code is physically quite similar. So when I go and make a change in the Python version of my code, I know where to go and make that change in the R version of my code. So building identical generic functions is going to save me a lot of time and effort.


Identical error handling with classes

The next idea I will talk about is identical error handling with classes. Let's look at this from the Python perspective. Now, if you're a Python user, you're probably familiar with these kinds of exceptions: FileNotFoundError, ZeroDivisionError, and KeyError. Exceptions in Python are themselves classes that inherit from other exception classes. And you can probably already tell that the names of these exceptions are quite descriptive. They form part of the error message.

Now as an internal developer, you can create your own custom internal errors, and you can give them really descriptive names. The error messages that they give can be as specific as you want, even telling the user where to go for help if that error has been thrown. The powerful thing about exceptions in Python is that they can be handled explicitly. For example, with a try-except statement, I ask Python to try running some code. And in the event of a specific error, in this case a FileNotFoundError, I can run an alternative block of code. This gives me maximum control over how my code handles failure.
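As a rough sketch of that try-except pattern, the function below handles only FileNotFoundError and lets every other error propagate. The names load_settings and DEFAULT_SETTINGS, and the idea of a JSON settings file, are assumptions made for illustration and do not come from the talk.

```python
import json
from pathlib import Path

# Hypothetical defaults used when no settings file is present.
DEFAULT_SETTINGS = {"theme": "company"}


def load_settings(path):
    """Read settings from a JSON file, falling back to defaults if it is missing."""
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        # Only this specific error triggers the alternative block;
        # anything else (for example, malformed JSON) is still raised.
        return dict(DEFAULT_SETTINGS)
```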

Okay, so we're building a Python package, and we are handling errors in this way. But can we handle errors in the same way in our R packages? The answer is yes. But let's look at some basic error handling in R first.

If you're an R user, you may have seen one of these two functions: stop from base R and abort from the rlang package. Both functions signal an error. And in R, if we want to handle errors in our code, we can use something like the tryCatch function. But notice here on the second line, the error handler is a function that will execute in response to any error. It's not designed to handle a specific error. And this is a problem. I now have a lot less control over my code. And more importantly, my Python and R packages handle failure in very different ways.

How can we overcome this? Well, the answer lies within the abort function from rlang. Let me illustrate this with an example. I'm going to create a new function, abort_credentials_missing. And within it, I call rlang's abort and give it a specific class name. Here it's error_credentials_missing. Now the really great thing about error functions like this in R is that you can create an equivalent Python exception.

So as an internal developer, you have your Python exceptions and you have your R error functions. They're one and the same. So when you make a change in one language, you know where to make that change in the other language. But what have we gained by giving our error a class name? Well, let's go back to tryCatch. You can see here on the second line, I can now write a specific error handler for my new error_credentials_missing class. And now my Python package and my R package handle failure in the same way.
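On the Python side, the counterpart to a classed R error could look something like the sketch below. The names CredentialsMissingError, get_credentials and API_TOKEN, along with the message text, are hypothetical and chosen only to mirror the credentials example.

```python
class CredentialsMissingError(Exception):
    """Hypothetical internal exception mirroring a classed R error."""

    def __init__(self, message="Credentials not found. See the internal setup guide."):
        super().__init__(message)


def get_credentials(env):
    """Return the API token from a mapping of settings, or raise the internal error."""
    if "API_TOKEN" not in env:
        raise CredentialsMissingError()
    return env["API_TOKEN"]


def get_credentials_or_none(env):
    """Handle only the internal error; any other exception still propagates."""
    try:
        return get_credentials(env)
    except CredentialsMissingError:
        return None
```

Because the handler names the specific class, it plays the same role as a class-specific handler in tryCatch.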

I want to round off this section on errors by talking about error chaining. In essence, error chaining is throwing an error in response to another error. One use for this is to give a high level context to a lower level error. Doing this in Python is very easy. For example, you can give your original error an alias. Here I'm giving it the alias E. And then I'm raising another exception from that original exception. And this gives me a kind of double error message. I get my original error and then Python tells me that this error caused another error.
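A minimal sketch of error chaining in Python, assuming a hypothetical high-level ReportError and a build_report function invented for illustration:

```python
class ReportError(Exception):
    """Hypothetical high-level error that adds context to a lower-level failure."""


def build_report(data):
    """Compute revenue per user, re-raising low-level failures with more context."""
    try:
        return data["revenue"] / data["users"]
    except ZeroDivisionError as e:
        # "raise ... from e" records the original error as __cause__, so the
        # traceback shows both the original error and the new one.
        raise ReportError("Could not build report: user count is zero") from e
```

The original ZeroDivisionError stays attached to the new exception through its __cause__ attribute, which is what produces the two-part error message.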

Can we emulate this in R? The answer is yes. The way that we're going to do this is to use try_fetch from the rlang package instead of tryCatch. And in my error handler, I'm going to pass my condition object to the abort function, specifically to the parent argument. And here I get a kind of double error message. And there we have it. Our Python and R packages now conduct error chaining in the same way.

Creating internal wrappers

The third idea I want to talk about is creating your own internal wrappers. But what is an internal wrapper? Well, it's a package written in one language that wraps functionality from another language. But why would you want to create your own internal wrapper? Well, imagine a scenario where a piece of functionality is available in one language but not the other. A well-known example is a statistical model that's released as an R package but not a Python package. But you want all of your analysts to be able to access that functionality. Another scenario you might encounter is that both languages have their own implementation of a piece of functionality, like a statistical model, but maybe the underlying computations aren't quite the same. And therefore, your analysts get different results.

Okay, let's kick this section off by looking at how we call Python from R. There are many ways to do this. My method of choice is the reticulate package. Reticulate gives you a number of different ways to call Python code in your R package. Let's look at a couple of examples. Now, the first example is importing Python modules into R. This works really well if the Python code you're trying to access is contained within a Python package. And to do this, I use the import function from reticulate. I give it the name of my module. Here, I'll call it pymodule, and I assign it to a variable. And now, I can access functions from that module using this dollar sign notation.

Now, a really good tip for when you're doing this in an R package is to create a global reference to your Python package. And then you can use it throughout your code. Notice here, I'm not actually importing anything yet. I'm going to specify how my Python module is imported in the .onLoad function. This will execute when my R library is loaded. And here, I'm setting delay_load to true. What this does is give my end user the ability to specify their own custom Python virtual environment to use with my package.

Another way that you can use Python code in your R package is to source Python files. This might work well if the code you're trying to access is contained within a Python script. To do this, I'm going to use a Python file. Here, it's called add.py, which is contained in my R package. And inside that file, I define a new function, add to. And now, I will use source_python from the reticulate package, and then I can access the add to function.

Okay. We've called Python from R. Now, let's call R from Python. Again, there are many ways to do this. My method of choice is the rpy2 package. rpy2 gives you a number of different ways to call R code from your Python package. Let's look at some examples.

The first method is to import R libraries into Python. To do this, we're going to import the importr function from rpy2. And then, in this example, we're actually going to import base R itself, which we can import like any other library. To access a function from base R, we use this dot notation. Here, I'm accessing the sum function.

Now, to use this function, I'm going to have to do a little bit of work. Sometimes, when you're using rpy2, you need to convert objects between R and Python. In this example, I need to convert my Python list into an R vector. I can do that using the IntVector function from rpy2. You can check out the rpy2 documentation for more tips on object conversion.

Another way to access R code from Python using rpy2 is to create new functions and access them from the R environment. This approach works really well when you have multiple computation steps that need to happen in R. Now, to do this, we're going to first import the robjects module from rpy2. We're then going to define our R code as a string. Here, I'm just creating a new function that returns one. I then execute that R code using the r function from robjects. And then, I can access that new function using globalenv and assign it to a variable. And now, pyfunc is actually a function that I can execute.

Now, a really great utility that rpy2 gives you is the ability to define a temporary package structure. We can do this by importing the STAP function from rpy2. And just like before, we store our R code as a string. And then, we use STAP to execute that code. And then, we can access functions from that code using this dot notation. So, this approach works really well when you are concerned about issues like namespace conflicts.

Okay. Now, for the bad news. Cross-language wrappers have their own technical quirks and headaches. For example, you have twice the number of environments to think about. Your users will need to have access to both Python and R. And your packages will need to deal with the relevant dependencies. From personal experience, Reticulate is a lot more explicit about how it handles Python environments. Whereas, with rpy2, you need to do a little bit more of the work yourself.

Now, another issue is object conversion. Sometimes, to work with cross-language wrappers, you need to handle the conversion of an object from one language to another, like you saw with rpy2. And a third issue is error messages. Sometimes, when an error occurs in the language that is being wrapped, the error message doesn't come through in its entirety, which can frustrate your efforts to create nice, helpful error messages. But with all this being said, cross-language wrappers are a really useful tool in your internal toolkit. They can deal with issues of missing functionality or inconsistent outputs between languages.

And there we have it. Three different ways to build bilingual tools without having to design two entirely different solutions. The goal of each of these ideas is to make the development process and the maintenance process much easier for you. If you're an internal tool developer working in a bilingual organization, I would encourage you to seek out these programming concepts and ideas that can help to bridge the gap between Python and R. Ultimately, the goal of this is to make your life as a developer much, much easier. Thank you for listening, and happy developing.
