Resources

June Choe | Cracking open ggplot internals with {ggtrace} | RStudio (2022)

video
Oct 24, 2022
19:10

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hello everyone, thanks for coming to my talk. My name is June. I'm a PhD student in linguistics at the University of Pennsylvania, and today I'll be talking about ggplot internals and my package ggtrace.

So I suspect that most of us here are already pretty familiar with ggplot, but just to get a sense of the room, how many of us have made, like, a box plot in ggplot before? Alright, that's like literally everyone. Cool. And I make them all the time too. It's one of the first plots I make when I get results back from an experiment. So let's say we have behavioral data of button presses, where each row is a response from a participant, and there are columns for experiment condition and response time. I can write a short piece of code where I pass my data to ggplot, map condition to x and fill, and response time to y, and add a geom_boxplot layer. Pretty straightforward. That gives me a box plot.
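In code, that first plot might look something like this sketch, where the data frame and the column names `condition` and `rt` are assumptions for illustration:

```r
library(ggplot2)

# Hypothetical experiment data: one row per response,
# with columns for condition and response time (rt)
df <- data.frame(
  condition = rep(c("A", "B"), each = 50),
  rt = c(rnorm(50, 500, 100), rnorm(50, 650, 150))
)

# Map condition to x and fill, response time to y,
# and add a box plot layer
p <- ggplot(df, aes(x = condition, y = rt, fill = condition)) +
  geom_boxplot()
```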

So that was very simple to do in ggplot. And oftentimes I'll have participants who are behaving, let's say, more interestingly than others, and I care about identifying them for a lot of reasons. So I might also want to annotate the cutoff value for extreme outliers, like this. And of course I should calculate it myself if it's really important, but here I'm just doing a quick exploratory analysis, and I already know how to tell ggplot to calculate this for me. So labeling that same value should be a trivial task, right? Except it turns out this is kind of tricky, because what we're trying to do is something like a box plot layer, except diverting its normal course of action under the hood.

So our new layer should calculate the same variable as our first layer, but then instead of using boxes and whiskers to draw all of them, it should label just one of them. And when it comes to these situations where the user wants to intervene in the process by which a layer comes to life, there's good news and bad news. The good news is, thanks to recent developments, we can write a layer that does all this on the fly, all inside a geom_label layer. So that's great, but the bad news is this is incomprehensible. Like, this is not your typical ggplot code. We have strange functions called inside aes(), like stage() and after_stat(), and we're using the stat and data arguments of the geom_label layer, which we don't often do. So this is still vanilla ggplot, but this kind of code is very difficult to reason about, because it requires a mental model of how ggplot internals work.
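For context, the kind of layer being described here might look something like this sketch (the column names `condition` and `rt` are my assumptions, and the x aesthetic is inherited from the plot):

```r
library(ggplot2)

# A vanilla-ggplot but hard-to-read annotation layer:
# stat_boxplot computes the summary, but geom_label draws it
annotation_layer <- geom_label(
  aes(
    # stage(): map rt to y for the stat, then remap the computed
    # ymax back to y after the stat has run
    y = stage(rt, after_stat = ymax),
    # after_stat(): use the computed ymax as the label text too
    label = after_stat(ymax)
  ),
  stat = "boxplot",
  # a function passed to `data` filters the layer's data first
  data = ~ subset(.x, condition == "B")
)
```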

And here's what I mean by that. So the code for our first layer is easy, because we can get away with thinking about a layer as just doing one big thing. So like, hey geom boxplot, here's all the stuff you need, now go do your thing. But we need to revise this simple model of how layers are built to understand code like the second layer, which reads more like, hey layer, you know that data that I gave you to draw? Apply this function to it first, and then proceed as normal with condition map to X and response times map to Y. Then after the boxplot stat steps in and transforms the data, then use the computed variable Y max to remap to the Y aesthetic and do the same for the label aesthetic. And then lastly, use the label geom to draw what you have.

That was a mouthful, and you get the point. What appeared to be a simple task of annotating a box plot variable turns out to require a deep understanding of what goes on where in the internals. And it's not just that more code is difficult. It's conceptually difficult, because it makes reference to different steps in the internals that we as users don't get to see. And I don't know about you, but I would love to be able to write code like this. So if that's what it takes, if it requires knowledge of the internals, then maybe ggplot internals aren't just for developers anymore.


And so that's what I want to talk about today. I would like to propose a model of ggplot internals for users: one that's immediately practical for users just wanting to get better at ggplot, but also one that offers an accessible starting point for aspiring developers who want to extend ggplot. So here's my proposal in a nutshell: why don't we just pretend that ggplot internals is one big data wrangling function? Because it kind of is. And to do that, I have developed my package ggtrace, which I'll be showcasing today.

Reframing ggplot internals as data wrangling

So here is the outline for the rest of my talk. First I'll introduce this reframing of ggplot internals as data wrangling, which will also provide context and motivation for my package ggtrace. And then I'll demonstrate how we can put this idea into practice using functions from ggtrace, starting with a simple example of a bar plot, and then returning to our box plot annotation problem and building up the solution from scratch.

So our dive into the internals begins with the factual observation that each layer of a ggplot has an underlying data frame representation. In other words, layers are just data frames until they get drawn. We can actually see this using ggplot's own layer_data() function. We pass it our plot and the index of the layer we want to inspect, and it gives us the data underlying the box plot layer of our plot. And this is the actual data frame that gets sent off to the quote-unquote drawing system, which is kind of like graphical object land, with grid functions and gtable functions. But for our purposes, we'll just pretend that all of the internals is a function that takes the raw data that we give it, which is the experiment data we provided, and makes it drawing-ready, like this, and does this for every layer in our plot according to each layer's own specifications.
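A quick sketch of that layer_data() call, here using the built-in mpg data as a stand-in for the experiment data:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# The data frame underlying the first (and only) layer:
# one row per box plot, with columns like ymin, lower,
# middle, upper, and ymax describing each box
layer_data(p, i = 1)
```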

And this doesn't look too bad, right? Our input was a tidy data frame, and the output is just another tidy data frame, where each row represents a box plot and the columns are aesthetics that describe each box plot, including the value of the upper whisker stored in the ymax column. And crucially, as you might suspect, this process of making a layer's data drawing-ready happens in steps, not all at once. So why do we care about that as users? Well, because, again, different pieces of the code that we write also kick in at different steps, step by step, to make the layer's data more and more drawing-ready. So it's literally like data wrangling happening under the hood, except you don't get to see your data again after it goes off your hands.

So I'm simplifying things a bit here, but for our purposes as users, we can think of the internals as divided up, or sliced, into these four big steps, which I'll call the before stat, after stat, before geom, and after scale. And because they happen in order, how a piece of our layer code changes the current state of the data in this pipeline at one step has consequences for what kind of code we can write for that data at another step.

So, so far, so good. Now we just need to see what the data actually looks like at these steps. So we go digging into the internals, and we find that the data passes through these scary looking things called ggproto methods, not the kind of dplyr functions we're familiar with. And the four steps that we care about as users are also the inputs and outputs of these ggproto methods, these three to be exact. So in the internals, the raw data is the input, and the before stat data is the state of the layer's data when it's passed into the Layer ggproto object's compute_statistic method, the after stat data is the output of that same method, the before geom data is the input to the next method, and the after scale data is the output of this other method. And then some more stuff happens, the data becomes drawing-ready, and it gets sent off to be drawn.

So the fact that the internals are written this way is a big reason why they're deliberately kept hidden from us users. ggproto is scary, and we're honestly better off not knowing anything about it. Like this first method here, Layer$compute_statistic: it looks like this, and it's like nothing you've seen before. It's a function, but it's composed of two other functions, and it has this argument called self, which is a big thing from object-oriented programming that's very intimidating to me as an R user, and maybe to you too, although if you are interested in OOP, you were in the right place just 40 minutes earlier. And this method is actually also a nested call to several other ggproto methods, and they're all written in base R, and it can get very overwhelming. And recently they also added in vctrs integration and CLI integration, and it's a lot. Like, okay, I wanted to know what goes on in the internals, but not like this. This is developer territory, right? All I care about is just seeing the data. Like, give me the data frame so that I can anticipate what my code will do in the middle of the pipeline when it kicks in. And so that's the motivation behind ggtrace.

Introducing ggtrace

So ggtrace is a package that allows us, the users, to drop inside, like parachute down, this internal pipeline at any point we want, intercept the data at that step, and interact with it using the kind of data wrangling tools that we're already familiar with. ggtrace comes with a family of workflow functions of the form ggtrace_{action}_{value}, which all take three arguments: the ggplot object, the ggproto method, and when to interact with that method. This talk will showcase just two functions from the inspect workflow, ggtrace_inspect_args() and ggtrace_inspect_return(), which let us take a snapshot of a layer's data as it goes in and out of ggproto methods.

So for example, if we want to look at the state of a layer's data in the before stat stage, we can use ggtrace_inspect_args() to intercept the data that was passed into compute_statistic when it is first called for our plot, which gives us the box plot layer's before stat data. And likewise, if we want to look at the layer's after stat data, we use ggtrace_inspect_return() instead to intercept the data when it is returned by that same method, which gives us the box plot layer's after stat data.
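A sketch of those two inspect calls, assuming my reading of the method and argument naming in the ggtrace docs (again with mpg standing in for the experiment data):

```r
library(ggplot2)
library(ggtrace)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Before stat: the data argument passed into compute_statistic
# the first time it is called while building the plot
before_stat <- ggtrace_inspect_args(
  x = p,
  method = ggplot2:::Layer$compute_statistic
)$data

# After stat: the data frame returned by that same method
after_stat <- ggtrace_inspect_return(
  x = p,
  method = ggplot2:::Layer$compute_statistic
)
```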

Bar plot walkthrough

So now that we're equipped with the tools to peek inside the internals, let's use these inspect functions to walk through how a layer gets built for a bar plot, step by step. And we'll be using the penguins data set. If you're not familiar, this is a data set where each row is a penguin, with columns for species and bill length. And so using geom_bar, we can visualize the count of penguins in each species by mapping the species column to x and fill. Simple, right? Pretty straightforward. We don't really think about how layers are built in steps when we write code like this, but once we spell out the defaults, like stat = "count" and y = after_stat(count), we see clear parallels to our more complex box plot annotation layer, which is why I like this example. So in the internals, the bar layer's data starts off as the penguins data and ends up in its drawing-ready form. And to see how this all happens, we use ggtrace to intercept the data at the four steps that we care about.
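The bar plot, with its implicit defaults spelled out (I'm assuming the penguins data comes from the palmerpenguins package):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# The defaults made explicit: geom_bar uses stat = "count",
# and maps the computed count to y via after_stat()
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(aes(y = after_stat(count)), stat = "count")
```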

So let's look at some data frames. We start by intercepting the data at the before stat stage. We use ggtrace_inspect_args() again to pull out the data argument that was passed into this ggproto method, which gives us a data frame that looks like this. Our initial aesthetic mappings are reflected in the presence of these columns named after the actual aesthetics, like x and fill, which is basically a select and rename. And we see some other things that have happened to the data at this point as well, like x is now numeric, and we have new columns like PANEL and group. If we pass a function to the data argument of the layer, this is also the point by which that would have been applied to our data. But the real significance of the before stat data is that it validates the layer's choice of stat. So our before stat data has a column for x, and that satisfies the layer's count stat, which requires either an x or a y aesthetic, as we see from the documentation.

If we specify both x and y like this, then the plot fails to build, and it errors specifically at stat_count. So we can set error = TRUE in our ggtrace function to debug the layer's data, and here we can see the reason why: both the x and y columns are present when the stat goes in to look at the data. But as long as the stat has what it needs, it will transform the data and spit it back out for us to see in the after stat stage. At this point we see that the data basically underwent a group_by and summarize: we have one row for each bar and new variables like count and prop.

The after stat stage is significant because it's an opportunity for us to declare more aesthetic mappings, except now we can use variables from this after stat data frame. So the fact that we have a column called count in the after stat data is what allows the default mapping of y = after_stat(count), and that's essentially just calling mutate(y = count) on the after stat data. And this isn't just a metaphor: aesthetic mappings are literally powered by tidyeval, so the symbol count is evaluated to the vector count. That means you can do things like y = count / sum(count), which will give you proportions and have consequences down the line for the plot. And any after stat mappings that we declare like this will get applied to the data for us to see by the time the data reaches the before geom stage. So we see that the data now has that y column present when we intercept it here. And just like the before stat data, the before geom data validates the layer's choice of geom. Here we have both x and y columns present, and that satisfies our layer's geom, which is the bar geom, and it requires both x and y aesthetics.
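The proportions example from above might look like this (palmerpenguins again assumed as the data source):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# after_stat() expressions are evaluated against the after-stat
# data frame, much like a mutate() on the stat's output:
# count / sum(count) turns raw counts into proportions
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(aes(y = after_stat(count / sum(count))))
```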

If we instead override this default and not map anything to y at all, by setting it to NULL, then the stat is satisfied, but the geom down the line is not. So it's geom_bar that throws the error. And conceptually, we can see how after stat mappings are sometimes necessary to make a stat and a geom work. Every layer has a stat and a geom, and they need to fit together to make a layer come to life.

Okay. So the next time that we intercept the data is in the after scale stage. By this point, the non-positional scales have stepped in to transform aesthetics like fill, which is now a column of the actual color values instead of the names of the penguin species. We also see that the bar geom has stepped in to add default values for bar-related aesthetics, which are, again, just columns, like colour and size. This stage is significant because, just like the after stat, we can declare more aesthetic mappings one last time before the data is sent off to be drawn. So for example, we have the size column by this point, representing the thickness of the borders around the bars. It's actually kind of hard to tell the default value; it's small. So if we wanted to make it, say, five times thicker than it was originally going to be, you can kind of just do that on the fly: grab the size column when it exists in the after scale data, multiply it by five, and map, or assign, it back to the size aesthetic. So aesthetic mappings are kind of like scheduling a mutate call on your data for later, and this can happen on the fly without us needing to figure out the default value first.
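Both after-scale tricks mentioned here might be sketched like so (palmerpenguins is again an assumed data source, and colorspace::darken() is just one way to darken a color):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(
    aes(
      # after_scale(): one last mutate-like remapping before
      # drawing. Make the borders 5x thicker than the default...
      size = after_scale(size * 5),
      # ...and draw the outline as a darker shade of the
      # already-scaled fill color
      colour = after_scale(colorspace::darken(fill, 0.3))
    )
  )
```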

And then, you know, for the dataviz folks in the room, this is also what allows you to do stuff like make the outline of the bars a little bit darker than the fill of the bars, which can make your plots look prettier. I'll just leave this here for you if you want to look at it later. So that concludes our bar plot walkthrough. To recap, a layer's data becomes drawing-ready in steps in the internals, and using the inspect functions from ggtrace, we were able to see what the data looks like at certain steps and, crucially, how the state of the layer's data can inform what kind of layer code we can write.

Solving the box plot annotation problem

And so with that, we circle back to our box plot annotation layer. Again, we have this box plot of response times by condition that we made at the beginning, and we would like to add a layer labeling the value of the upper whisker in the second box plot. Applying what we just learned, we can write this layer in steps, just like how the layer itself is built up in steps. This is kind of hard to code, so we'll get stuck at some point, but then I'll show you how we can use ggtrace to debug our way out of it.

So the first thing we start with is a high-level description of what we want. Every layer has a stat and a geom, so maybe let's start there. We want to use a label to draw a box plot variable. In code, that means geom_label with stat = "boxplot", or stat_boxplot with geom = "label". Again, every layer needs both, and it's actually your choice of syntax; they're the same thing. Next we can specify the data that we want the layer to plot. We only care about the second box plot, so we filter for condition "B" and let that apply. Then we build up the aesthetic mappings, starting with the ones that the stat needs first, because the stat always receives the data first. So we specify the x and the y, and then additional ones that the geom might need later, like label, which we want to be the same as the ymax value that the stat computes.

And so that gives us this big chunk of code, and we're almost there, except for this error that says geom_label is missing a y aesthetic. And like, what? We do have a y aesthetic here. So maybe we're not sure what's wrong, but we know how to debug this. The geom is complaining about a missing aesthetic, so something must be wrong with the data that it receives. We inspect the before geom snapshot of the layer's data, and we do see that the y column has been dropped. It turns out that the box plot stat consumes the y column and then returns a summary across multiple columns like this, so the y column is missing now. To resolve that, we can use the stage() function, again a ggplot function, to map y at the start to satisfy the stat, and then remap to y in the after stat to satisfy the geom. And that gets us the layer that we want, because this ensures that both the before stat data and the before geom data have a y column present. So that's what finally gets the layer working.
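A sketch of that debugging step, using the same hypothetical data as before; `Layer$compute_geom_1` as the before-geom method, the `cond` counter, and the `error` argument are all my reading of the ggtrace docs:

```r
library(ggplot2)
library(ggtrace)

# Hypothetical experiment data, as before
df <- data.frame(
  condition = rep(c("A", "B"), each = 50),
  rt = c(rnorm(50, 500, 100), rnorm(50, 650, 150))
)

# The not-yet-working version: y is mapped, but the box plot
# stat consumes it, so geom_label errors on a missing y
p <- ggplot(df, aes(condition, rt)) +
  geom_boxplot() +
  geom_label(
    aes(label = after_stat(ymax)),
    stat = "boxplot",
    data = ~ subset(.x, condition == "B")
  )

# Inspect the before geom snapshot of the second layer's data,
# returning the captured value even though the plot errors
before_geom <- ggtrace_inspect_args(
  x = p,
  method = ggplot2:::Layer$compute_geom_1,
  cond = 2,       # second call corresponds to the second layer
  error = TRUE
)$data
names(before_geom)  # summary columns like ymax are there, but no y
```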

Conclusion

So that concludes cracking open ggplot internals with ggtrace. Again, ggtrace helps us learn the internals as users by exploiting the fact that a lot of the internals is just manipulating data frames. And by reframing the internals into the familiar terms of data wrangling and functional programming, we arrive at a conceptual understanding of how the internals work. This lets us write more powerful ggplot code as users while still mostly abstracting away from the scary implementational details like ggproto and base R functions.


But if you are interested in that, ggtrace has other workflow functions that you can use. So just really quickly: there are functions from the capture workflow, like ggtrace_capture_fn(), which essentially records the behavior of a ggproto method when it executes so that you can have it as a standalone function. And there are functions from the highjack workflow, which let you make ggproto methods return arbitrary values, so you can kind of hack the internals and see what you can come up with.

And yeah, that's it. ggtrace is on GitHub. Here's some links to other materials, and I'd love to hear your feedback.