Resources

June Choe | Cracking open ggplot internals with {ggtrace} | RStudio (2022)

video
Oct 24, 2022
19:10

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hello everyone, thanks for coming to my talk. My name is June. I'm a PhD student in linguistics at the University of Pennsylvania, and today I'll be talking about ggplot internals and my package ggtrace.

So I suspect that most of us here are already pretty familiar with ggplot, but just to get a sense of the room, how many of us have made, like, a box plot in ggplot before? Alright, that's like literally everyone. Cool. And I make them all the time too. It's one of the first plots I make when I get results back from an experiment. So let's say we have behavioral data of button presses, where each row is a response from a participant, and there are columns for experiment condition and response time. I can write a short piece of code where I pass my data to ggplot, map condition to x and fill, and response time to y, and add a geom_boxplot layer. Pretty straightforward. That gives me a box plot.
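In code, that first plot might look something like this sketch, where the data frame and the column names `condition` and `rt` are assumptions for illustration:

```r
library(ggplot2)

# Hypothetical experiment data: one row per response,
# with columns for condition and response time (rt)
df <- data.frame(
  condition = rep(c("A", "B"), each = 50),
  rt = c(rnorm(50, 500, 100), rnorm(50, 650, 150))
)

# Map condition to x and fill, response time to y,
# and add a box plot layer
p <- ggplot(df, aes(x = condition, y = rt, fill = condition)) +
  geom_boxplot()
```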

So that was very simple to do in ggplot. And oftentimes I'll have participants who are behaving, let's say, more interestingly than others, and I care about identifying them for a lot of reasons. So I might also want to annotate the cutoff value for extreme outliers, like this. And of course I should calculate it myself if it's really important, but here I'm just doing a quick exploratory analysis, and I already know how to tell ggplot to calculate this for me. So labeling that same value should be a trivial task, right? Except it turns out this is kind of tricky, because what we're trying to do is something like a box plot layer, except diverting its normal course of action under the hood.

So our new layer should calculate the same variable as our first layer, but then instead of using boxes and whiskers to draw all of them, it should label just one of them. And when it comes to these situations where the user wants to intervene in the process by which a layer comes to life, there's good news and bad news. The good news is, thanks to recent developments, we can write a layer that does all this on the fly, all inside a geom_label layer. So that's great, but the bad news is this is incomprehensible. Like, this is not your typical ggplot code. We have strange functions called inside aes(), like stage() and after_stat(), and we're using the stat and data arguments of the geom_label layer, which we don't often do. So this is still vanilla ggplot, but this kind of code is very difficult to reason about, because it requires a mental model of how ggplot internals work.
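For context, the kind of layer being described here might look something like this sketch (the column names `condition` and `rt` are my assumptions, and the x aesthetic is inherited from the plot):

```r
library(ggplot2)

# A vanilla-ggplot but hard-to-read annotation layer:
# stat_boxplot computes the summary, but geom_label draws it
annotation_layer <- geom_label(
  aes(
    # stage(): map rt to y for the stat, then remap the computed
    # ymax back to y after the stat has run
    y = stage(rt, after_stat = ymax),
    # after_stat(): use the computed ymax as the label text too
    label = after_stat(ymax)
  ),
  stat = "boxplot",
  # a function passed to `data` filters the layer's data first
  data = ~ subset(.x, condition == "B")
)
```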

And here's what I mean by that. So the code for our first layer is easy, because we can get away with thinking about a layer as just doing one big thing. So like, hey geom boxplot, here's all the stuff you need, now go do your thing. But we need to revise this simple model of how layers are built to understand code like the second layer, which reads more like, hey layer, you know that data that I gave you to draw? Apply this function to it first, and then proceed as normal with condition map to X and response times map to Y. Then after the boxplot stat steps in and transforms the data, then use the computed variable Y max to remap to the Y aesthetic and do the same for the label aesthetic. And then lastly, use the label geom to draw what you have.

That was a mouthful, and you get the point. What appeared to be a simple task of annotating a box plot variable turns out to require a deep understanding of what goes on where in the internals. And it's not just that more code is difficult. It's conceptually difficult, because it makes reference to different steps in the internals that we as users don't get to see. And I don't know about you, but I would love to be able to write code like this. So if that's what it takes, if it requires knowledge of the internals, then maybe ggplot internals aren't just for developers anymore.


And so that's what I want to talk about today. I would like to propose a model of ggplot internals for users: one that's immediately practical for users just wanting to get better at ggplot, but also one that offers an accessible starting point for aspiring developers who want to extend ggplot. So here's my proposal in a nutshell: why don't we just pretend that ggplot internals is one big data wrangling function? Because it kind of is. And to do that, I have developed my package ggtrace, which I'll be showcasing today.

Reframing ggplot internals as data wrangling

So here is the outline for the rest of my talk. First I'll introduce this reframing of ggplot internals as data wrangling, which will also provide context and motivation for my package ggtrace. And then I'll demonstrate how we can put this idea into practice using functions from ggtrace, starting with a simple example of a bar plot, and then returning to our box plot annotation problem and building up the solution from scratch.

So our dive into the internals begins with the factual observation that each layer of a ggplot has an underlying data frame representation. In other words, layers are just data frames until they get drawn. We can actually see this using ggplot's own layer_data() function. We pass it our plot and the index of the layer we want to inspect, and it gives us the data underlying the box plot layer of our plot. And this is the actual data frame that gets sent off to the quote-unquote drawing system, which is kind of like graphical object land, with grid functions and gtable functions. But for our purposes, we'll just pretend that all of the internals is a function that takes the raw data that we give it, which is the experiment data we provided, and makes it drawing-ready, like this, and does this for every layer in our plot according to each layer's own specifications.
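A quick sketch of that layer_data() call, here using the built-in mpg data as a stand-in for the experiment data:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# The data frame underlying the first (and only) layer:
# one row per box plot, with columns like ymin, lower,
# middle, upper, and ymax describing each box
layer_data(p, i = 1)
```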

And this doesn't look too bad, right? Our input was a tidy data frame, and the output is just another tidy data frame, where each row represents a box plot and the columns are aesthetics that describe each box plot, including the value of the upper whisker stored in the ymax column. And crucially, as you might suspect, this process of making a layer's data drawing-ready happens in steps, not all at once. So why do we care about that as users? Well, because, again, different pieces of the code that we write also kick in at different steps, step by step, to make the layer's data more and more drawing-ready. So it's literally like data wrangling happening under the hood, except you don't get to see your data again after it goes off your hands.

So I'm simplifying things a bit here, but for our purposes as users, we can think of the internals as divided up, or sliced, into these four big steps, which I'll call the before stat, after stat, before geom, and after scale. And because they happen in order, how a piece of our layer code changes the current state of the data in this pipeline at one step has consequences for what kind of code we can write for that data at another step.

So, so far, so good. Now we just need to see what the data actually looks like at these steps. So we go digging into the internals, and we find that the data passes through these scary looking things called ggproto methods, not the kind of dplyr functions we're familiar with. And the four steps that we care about as users are also the inputs and outputs of these ggproto methods, these three to be exact. So in the internals, the raw data is the input, and the before stat data is the state of the layer's data when it's passed into the Layer ggproto object's compute_statistic method, the after stat data is the output of that same method, the before geom data is the input to the next method, and the after scale data is the output of this other method. And then some more stuff happens, the data becomes drawing-ready, and it gets sent off to be drawn.

So the fact that the internals are written this way is a big reason why they're deliberately kept hidden from us users. ggproto is scary, and we're honestly better off not knowing anything about it. Like this first method here, Layer$compute_statistic: it looks like this, and it's like nothing you've seen before. It's a function, but it's composed of two other functions, and it has this argument called self, which is a big thing from object-oriented programming that's very intimidating to me as an R user, and maybe to you too, although if you are interested in OOP, you were in the right place just 40 minutes earlier. And this method is actually also a nested call to several other ggproto methods, and they're all written in base R, and it can get very overwhelming. And recently they also added in vctrs integration and CLI integration, and it's a lot. Like, okay, I wanted to know what goes on in the internals, but not like this. This is developer territory, right? All I care about is just seeing the data. Like, give me the data frame so that I can anticipate what my code will do in the middle of the pipeline when it kicks in. And so that's the motivation behind ggtrace.

Introducing ggtrace

So ggtrace is a package that allows us, the users, to drop inside, like parachute down, this internal pipeline at any point we want, intercept the data at that step, and interact with it using the kind of data wrangling tools that we're already familiar with. ggtrace comes with a family of workflow functions of the form ggtrace_{action}_{value}, which all take three arguments: the ggplot object, the ggproto method, and when to interact with that method. This talk will showcase just two functions from the inspect workflow, ggtrace_inspect_args() and ggtrace_inspect_return(), which let us take a snapshot of a layer's data as it goes in and out of ggproto methods.

So for example, if we want to look at the state of a layer's data in the before stat stage, we can use ggtrace_inspect_args() to intercept the data that was passed into compute_statistic when it is first called for our plot, which gives us the box plot layer's before stat data. And likewise, if we want to look at the layer's after stat data, we use ggtrace_inspect_return() instead to intercept the data when it is returned by that same method, which gives us the box plot layer's after stat data.
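A sketch of those two inspect calls, assuming my reading of the method and argument naming in the ggtrace docs (again with mpg standing in for the experiment data):

```r
library(ggplot2)
library(ggtrace)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Before stat: the data argument passed into compute_statistic
# the first time it is called while building the plot
before_stat <- ggtrace_inspect_args(
  x = p,
  method = ggplot2:::Layer$compute_statistic
)$data

# After stat: the data frame returned by that same method
after_stat <- ggtrace_inspect_return(
  x = p,
  method = ggplot2:::Layer$compute_statistic
)
```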

Bar plot walkthrough

So now that we're equipped with the tools to peek inside the internals, let's use these inspect functions to walk through how a layer gets built for a bar plot, step by step. And we'll be using the penguins data set. If you're not familiar, this is a data set where each row is a penguin, with columns for species and bill length. And so using geom_bar, we can visualize the count of penguins in each species by mapping the species column to x and fill. Simple, right? Pretty straightforward. We don't really think about how layers are built in steps when we write code like this, but once we spell out the defaults, like stat = "count" and y = after_stat(count), we see clear parallels to our more complex box plot annotation layer, which is why I like this example. So in the internals, the bar layer's data starts off as the penguins data and ends up in its drawing-ready form. And to see how this all happens, we use ggtrace to intercept the data at the four steps that we care about.
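The bar plot, with its implicit defaults spelled out (I'm assuming the penguins data comes from the palmerpenguins package):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# The defaults made explicit: geom_bar uses stat = "count",
# and maps the computed count to y via after_stat()
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(aes(y = after_stat(count)), stat = "count")
```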

So let's look at some data frames. We start by intercepting the data at the before stat stage. We use ggtrace_inspect_args() again to pull out the data argument that was passed into this ggproto method, which gives us a data frame that looks like this. Our initial aesthetic mappings are reflected in the presence of these columns named after the actual aesthetics, like x and fill, which is basically a select and rename. And we see some other things that have happened to the data at this point as well, like x is now numeric, and we have new columns like PANEL and group. If we pass a function to the data argument of the layer, this is also the point by which that would have been applied to our data. But the real significance of the before stat data is that it validates the layer's choice of stat. So our before stat data has a column for x, and that satisfies the layer's count stat, which requires either an x or a y aesthetic, as we see from the documentation.

If we specify both x and y like this, then the plot fails to build, and it errors specifically at stat_count. So we can set error = TRUE in our ggtrace function to debug the layer's data, and here we can see the reason why: both the x and y columns are present when the stat goes in to look at the data. But as long as the stat has what it needs, it will transform the data and spit it back out for us to see in the after stat stage. At this point we see that the data basically underwent a group_by and summarize: we have one row for each bar and new variables like count and prop.

The after stat stage is significant because it's an opportunity for us to declare more aesthetic mappings, except now we can use variables from this after stat data frame. So the fact that we have a column called count in the after stat data is what allows the default mapping of y = after_stat(count), and that's essentially just calling mutate(y = count) on the after stat data. And this isn't just a metaphor: aesthetic mappings are literally powered by tidyeval, so the symbol count is evaluated to the vector count. That means you can do things like y = count / sum(count), which will give you proportions and have consequences down the line for the plot. And any after stat mappings that we declare like this will get applied to the data for us to see by the time the data reaches the before geom stage. So we see that the data now has that y column present when we intercept it here. And just like the before stat data, the before geom data validates the layer's choice of geom. Here we have both x and y columns present, and that satisfies our layer's geom, which is the bar geom, and it requires both x and y aesthetics.
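The proportions example from above might look like this (palmerpenguins again assumed as the data source):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# after_stat() expressions are evaluated against the after-stat
# data frame, much like a mutate() on the stat's output:
# count / sum(count) turns raw counts into proportions
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(aes(y = after_stat(count / sum(count))))
```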

If we instead override this default and not map anything to y at all, by setting it to NULL, then the stat is satisfied, but the geom down the line is not. So it's geom_bar that throws the error. And conceptually, we can see how after stat mappings are sometimes necessary to make a stat and a geom work. Every layer has a stat and a geom, and they need to fit together to make a layer come to life.

Okay. So the next time that we intercept the data is in the after scale stage. By this point, the non-positional scales have stepped in to transform aesthetics like fill, which is now a column of the actual color values instead of the names of the penguin species. We also see that the bar geom has stepped in to add default values for bar-related aesthetics, which are, again, just columns, like colour and size. This stage is significant because, just like the after stat, we can declare more aesthetic mappings one last time before the data is sent off to be drawn. So for example, we have the size column by this point, representing the thickness of the borders around the bars. It's actually kind of hard to tell the default value; it's small. So if we wanted to make it, say, five times thicker than it was originally going to be, you can kind of just do that on the fly: grab the size column when it exists in the after scale data, multiply it by five, and map, or assign, it back to the size aesthetic. So aesthetic mappings are kind of like scheduling a mutate call on your data for later, and this can happen on the fly without us needing to figure out the default value first.
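Both after-scale tricks mentioned here might be sketched like so (palmerpenguins is again an assumed data source, and colorspace::darken() is just one way to darken a color):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar(
    aes(
      # after_scale(): one last mutate-like remapping before
      # drawing. Make the borders 5x thicker than the default...
      size = after_scale(size * 5),
      # ...and draw the outline as a darker shade of the
      # already-scaled fill color
      colour = after_scale(colorspace::darken(fill, 0.3))
    )
  )
```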

And then, you know, for the dataviz folks in the room, this is also what allows you to do stuff like make the outline of the bars a little bit darker than the fill of the bars, which can make your plots look prettier. I'll just leave this here for you if you want to look at it later. So that concludes our bar plot walkthrough. To recap, a layer's data becomes drawing-ready in steps in the internals, and using the inspect functions from ggtrace, we were able to see what the data looks like at certain steps and, crucially, how the state of the layer's data can inform what kind of layer code we can write.

Solving the box plot annotation problem

And so with that, we circle back to our box plot annotation layer. Again, we have this box plot of response times by condition that we made at the beginning, and we would like to add a layer labeling the value of the upper whisker in the second box plot. Applying what we just learned, we can write this layer in steps, just like how the layer itself is built up in steps. This is kind of hard to code, so we'll get stuck at some point, but then I'll show you how we can use ggtrace to debug our way out of it.

So the first thing we start with is a high-level description of what we want. Every layer has a stat and a geom, so maybe let's start there. We want to use a label to draw a box plot variable. In code, that means geom_label with stat = "boxplot", or stat_boxplot with geom = "label". Again, every layer needs both, and it's actually your choice of syntax; they're the same thing. Next we can specify the data that we want the layer to plot. We only care about the second box plot, so we filter for condition "B" and let that apply. Then we build up the aesthetic mappings, starting with the ones that the stat needs first, because the stat always receives the data first. So we specify the x and the y, and then additional ones that the geom might need later, like label, which we want to be the same as the ymax value that the stat computes.

And so that gives us this big chunk of code, and we're almost there, except for this error that says geom_label is missing a y aesthetic. And like, what? We do have a y aesthetic here. So maybe we're not sure what's wrong, but we know how to debug this. The geom is complaining about a missing aesthetic, so something must be wrong with the data that it receives. We inspect the before geom snapshot of the layer's data, and we do see that the y column has been dropped. It turns out that the box plot stat consumes the y column and then returns a summary across multiple columns like this, so the y column is missing now. To resolve that, we can use the stage() function, again a ggplot function, to map y at the start to satisfy the stat, and then remap to y in the after stat to satisfy the geom. And that gets us the layer that we want, because this ensures that both the before stat data and the before geom data have a y column present. So that's what finally gets the layer working.
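A sketch of that debugging step, using the same hypothetical data as before; `Layer$compute_geom_1` as the before-geom method, the `cond` counter, and the `error` argument are all my reading of the ggtrace docs:

```r
library(ggplot2)
library(ggtrace)

# Hypothetical experiment data, as before
df <- data.frame(
  condition = rep(c("A", "B"), each = 50),
  rt = c(rnorm(50, 500, 100), rnorm(50, 650, 150))
)

# The not-yet-working version: y is mapped, but the box plot
# stat consumes it, so geom_label errors on a missing y
p <- ggplot(df, aes(condition, rt)) +
  geom_boxplot() +
  geom_label(
    aes(label = after_stat(ymax)),
    stat = "boxplot",
    data = ~ subset(.x, condition == "B")
  )

# Inspect the before geom snapshot of the second layer's data,
# returning the captured value even though the plot errors
before_geom <- ggtrace_inspect_args(
  x = p,
  method = ggplot2:::Layer$compute_geom_1,
  cond = 2,       # second call corresponds to the second layer
  error = TRUE
)$data
names(before_geom)  # summary columns like ymax are there, but no y
```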

Conclusion

So that concludes cracking open ggplot internals with ggtrace. Again, ggtrace helps us learn the internals as users by exploiting the fact that a lot of the internals is just manipulating data frames. And by reframing the internals into the familiar terms of data wrangling and functional programming, we arrive at a conceptual understanding of how the internals work. This lets us write more powerful ggplot code as users while still mostly abstracting away from the scary implementational details like ggproto and base R functions.


But if you are interested in that, ggtrace has other workflow functions that you can use. So just really quickly: there are functions from the capture workflow, like ggtrace_capture_fn(), which essentially records the behavior of a ggproto method when it executes so that you can have it as a standalone function. And there are functions from the highjack workflow, which let you make ggproto methods return arbitrary values, so you can kind of hack the internals and see what you can come up with.

And yeah, that's it. ggtrace is on GitHub. Here's some links to other materials, and I'd love to hear your feedback.