Shannon Pileggi - Context is King
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Shannon Pileggi. I'm a data scientist, I work in clinical trials, and I work with data all day long. However, in the talk I'm going to give you today, I'm not going to show you any clinical trials data, because I think what I'm talking about is applicable to a wide audience and I want this talk to be accessible to all. I'm going to start off by talking to you about a situation that could happen in anyone's workplace, showcase the problems that that situation highlights, some solutions to those problems, and extensions to that solution. And I know that today we are in the tables session, so I just want to get your brain oriented: there's only a little bit of tables in this talk. I'm really going to be talking to you about metadata, so close the door, y'all are stuck with me now.
The setting: an email, a report, and an ambiguous variable
All right, so the setting. You get an email. Hi Shannon, I see Travis is out on vacation. Can you rerun the flight delay report? Please walk us through the numbers at the next NYC Flights project meeting. Thanks. Sarah, Director of Flight Operations. Well, Travis is lead on this project, but we've talked about it a little bit, so I'm sure I can help him out and get this up and running. So I get to work, right? I pull down the code, I get the repo, and I'm so pleased that Travis has left a fully reproducible report, because it runs successfully, right? I get my table and I see my flight delay report. We've got three airports coming out of the New York City area and some sort of timing assessment.
Let's go back to the email and make sure I can fulfill this request. So it says please walk us through the numbers. Sarah doesn't just want me to hand off the report and deliver it to her. She wants me to tell the team what it actually means, like what the numbers mean, all right? So can I do that? Let's look at the first cell. 56% of flights, something to do with JFK, something to do with early. Maybe this means that 56% of flights departing from JFK departed early. I talked to Travis a little bit about this project, and I know there are some nuances, right? JFK is an airport you can depart from, but you can also arrive to it. So maybe this means that 56% of flights arriving at JFK arrived early. I don't know, and I'm pretty sure this is going to be a big deal to the director of flight operations. And the problem here is this variable name, delay category. It's ambiguous. I don't know what it means.
And so I don't want to bother Travis. He's been sending me pictures of his vacation. I know he's having a great time on his Groupon wine country tour, so surely I can figure this out. But in order to figure this out, I'm really going to have to dive into the code, right? I'm going to have to go on some sort of journey to understand the source data that it came from and the downstream variables that were used to create this report. And this comes back down to the idea of data stewardship. As data travels across your organization from person to person, from report to report, from deliverable to deliverable, how do we ensure the integrity of that data persists and that we know what it actually means?
Source data context and variable labels
Like I said, we're going to start by looking at the code. And we have a standard data wrangling script here. I'm just highlighting the lines that I want you to focus on. What I see here are the two sources of data that I'm bringing into this analysis: the flights data and the airports data. This is actually from the nycflights13 package, which is used widely in the R for Data Science textbook. It's an example of a relational database. And like I said, I'm going to have to understand the source data context, right? I can view the data in my RStudio data viewer. And to get that data context, I can look at the help file to see what the variables in that data set actually mean. However, most of us don't have data that live in an R package, right? So we might have context about our data that lives somewhere else. For me, it lives in an Excel file. And so when I'm working, I have to go to that external Excel file, which might say something like: your data frame name, your variable, and some sort of description of what that variable actually means.
And when you're working like this, it feels like you're working in two separate tunnels and you can't see what's happening on the other side, right? You've got your development environment where you're actually programming, and then you've got your metadata about like what that data actually means, and you're not talking to each other simultaneously. So I think the source data context can and should be embedded in your data. And this is what it looks like in R. It means that when you look at your data set in R, you're going to see your variable name, and underneath, you're going to see your variable label.
Have you seen this before? Some of you might not have, right? Some of you might be surprised that this is actually a feature available to you right now in RStudio. And if you haven't seen it before, I can tell you a little bit of the history behind it. About 25 years ago, we started using R. About 13 years ago, we started using the RStudio IDE. And about nine years ago, in 2015, was when labels were actually integrated into the RStudio IDE Data Viewer. So this functionality has been around for nine years. Now, of course, we didn't come up with this idea of variable labels in a vacuum. It exists in other statistical programming software like SAS, SPSS, and Stata. And since we are at an open source conference, some of you might feel like you're marching to the dark side when we talk about this. However, this is a wonderful feature of those programming languages, and I feel like we should be more like this.
And in fact, 2015, when the data labels were introduced to the RStudio IDE Data Viewer, was the same year that the haven package came out, which enables you to import SAS, SPSS, and Stata datasets into R and retain their metadata in the form of variable and value labels. Now, I'm not here to talk to you about what to do if you already have labeled data and want to work with it in R. I'm here to talk to you about: you have data, and you want to label it. So how do we do that? I like to use a function in the labelled package called set_variable_labels(), and you just iterate over the variables with their descriptions in order to do this. If you want to read more about it, you can read my blog post, The Case for Variable Labels in R. You don't actually have to use the labelled package to do this. You can do it entirely in base R, because a label is essentially an attribute on a column in your data frame. It's a way of storing metadata attached to an object. You can read more about attributes in Advanced R.
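As a minimal sketch of both approaches, here is what assigning labels can look like; the toy columns below are my own illustration, not the real nycflights13 schema:

```r
library(labelled)

# Toy stand-in for the flights data (illustrative columns only)
flights <- data.frame(
  origin    = c("JFK", "LGA", "EWR"),
  dep_delay = c(-5, 12, 0)
)

# labelled approach: iterate over variable = "description" pairs
flights <- set_variable_labels(
  flights,
  origin    = "Origin airport code",
  dep_delay = "Departure delay, in minutes"
)

# Base R equivalent: a label is just an attribute on a column
attr(flights$origin, "label") <- "Origin airport code"  # same effect

# Retrieve a label
attr(flights$dep_delay, "label")
#> [1] "Departure delay, in minutes"
```

Either way, the label travels with the column as ordinary object metadata, which is why the RStudio Data Viewer can display it.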
And just so you can see the difference between the two: on the left-hand side we're seeing unlabeled data, where all you see are your variable names. But on the right-hand side (my labeled data is denoted with the suffix _labeled), you're going to see your variable names with the very important context of what that data means embedded in your data. And you can imagine I have feelings about this, right? I find having the data context at my fingertips in my programming environment incredibly empowering. It allows me to get up to speed with my data so much faster.
Downstream variables and the case for labeling
There are other ways you can view the labels of your data. For example, we already mentioned that a label is an attribute of a column on your data frame, so you can see the values of those attributes when you call the str() command as well. So far we've only talked about our source data context, but what we do in data science is create new variables. So we're going to have some downstream context. And like I said, I have to go on some journey to figure this out. Just highlighting some quick points here: we're looking at something to do with an origin airport, and we're looking at some variable called departure delay. So I can deduce that delay category represents departure timing by origin airport. And now that I know that, when I write my data wrangling script, I should set my variable labels so that I know the context of what my new variables actually mean.
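Labeling a derived variable can be done in the same pipeline that creates it. A small sketch, again with made-up columns rather than the talk's actual wrangling script:

```r
library(dplyr)
library(labelled)

# Toy source data (illustrative, not the real nycflights13 schema)
flights <- data.frame(
  origin    = c("JFK", "LGA", "EWR"),
  dep_delay = c(-5, 12, 0)
)

# Derive a new variable and label it immediately, so the downstream
# context travels with the data
flights_derived <- flights |>
  mutate(
    delay_category = if_else(dep_delay <= 0, "Early or on time", "Late")
  ) |>
  set_variable_labels(
    delay_category = "Departure timing category, by origin airport"
  )

# The label shows up as an attribute in str() and in the Data Viewer
str(flights_derived$delay_category)
```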
And I think as people who work with data, we've all felt these competing forces, right? On one hand, we want really short, succinct variable names so that we can write our code fast. On the other hand, we need descriptive text so that we can understand the context of what our data means. And if we label our data, we can have it both ways. So I truly believe that assigning variable labels encourages a disciplined practice of creating explicit and succinct variable descriptions and ensures that data context lives with the data. You should be doing this. It's going to help current you, future you, and your colleagues. It's going to help peer review processes. Can you imagine? You get to read code and actually know what it's supposed to do. And it's going to help with the creation of reusable data assets.
And it comes back down to this idea of data stewardship. When you're pulling data together, it's just the beginning of a journey. You don't know exactly where it's going to go in your organization, who it's going to go to, or what report it's going to land in. You don't know what's around the corner for that data. And to be fair, data stewardship, the concept of it, it's a little bit broader than what I'm talking about today. Really, it's the practice of ensuring that data assets are accessible, secure, trustworthy, and usable. And today, I want you to think about these trustworthy and usable aspects.
Applications: data dictionaries, ggplot, and tables
Some of you might be with me right now. Some of you might be on board. Yes, Shannon, I hear you. Variable labels, it's the way to go. And some of you might be feeling a little bit like this. Shannon, I hear the medicine that you're trying to give me, and I'm just not ready to swallow it. So let me sugarcoat it for you. There's some applications. Because now that we have metadata embedded in our data that we can access in our programming environment, we can do very cool programming things with it at our fingertips.
One example is a data dictionary. Suppose for your entire relational database, you have some sort of schema, maybe it goes into a list. You can iterate over that list to create a dictionary of your entire database right there in your programming environment. And when you look at that dictionary, it's easy to scroll through, it's easy to search, and figure out what is in your data schema. And so, for example, I can search on the word year. There we go. And it's going to pop up everywhere that it applies in my data set. It's a huge way to get up and running faster.
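One way to sketch that dictionary idea: labelled provides generate_dictionary(), and you can iterate over a named list of labeled data frames to stack the per-table dictionaries into one searchable object. The tables and labels below are illustrative stand-ins:

```r
library(labelled)
library(purrr)
library(dplyr)

# Toy "schema": a named list of labeled data frames standing in for a
# relational database (names and labels are illustrative)
db <- list(
  flights = set_variable_labels(
    data.frame(year = 2013L, origin = "JFK"),
    year   = "Year of departure",
    origin = "Origin airport code"
  ),
  weather = set_variable_labels(
    data.frame(year = 2013L, temp = 39.0),
    year = "Year of weather observation",
    temp = "Temperature, degrees F"
  )
)

# Iterate over the list to build one dictionary for the whole database
dictionary <- imap(db, \(df, nm) mutate(generate_dictionary(df), source = nm)) |>
  bind_rows()

# Searching on "year" surfaces every table where it appears
filter(dictionary, grepl("year", variable))
```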
What about ggplot figures? So here, I am taking data that is indeed already labeled, and I'm creating a ggplot. And you can see the default for the ggplot is just to show your variable names, right? On the x-axis, you see the name; on the legend, you see delay category. Instead, we apply the function easy_labs() from ggeasy, and it will automatically substitute those variable labels into your plot for you.
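A minimal sketch of that substitution, with made-up data in place of the flights report:

```r
library(ggplot2)
library(labelled)
library(ggeasy)

# Illustrative labeled data (not the talk's actual report data)
dat <- set_variable_labels(
  data.frame(
    origin    = c("JFK", "LGA", "EWR"),
    pct_early = c(56, 48, 51)
  ),
  origin    = "Origin airport",
  pct_early = "Percent of flights departing early"
)

# easy_labs() swaps the variable labels in for the variable names
# in the axis and legend titles
ggplot(dat, aes(x = origin, y = pct_early)) +
  geom_col() +
  easy_labs()
```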
What about tabling? So here, I'm starting with data that's actually not labeled, and I'm using the gtsummary package with the tbl_summary() function to create a table. You can see that when you have unlabeled data, you're going to see the variable name. However, if instead you had labeled data and used that same tbl_summary() function, gtsummary is going to use your variable label instead of your variable name in the output. And gtsummary does leverage gt, and gt has the same behavior. If Travis had just done that from the beginning, I wouldn't have had to go on my journey, and I wouldn't have had to even think about bothering him while he was on vacation.
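The contrast can be seen in a few lines; the data here is a toy illustration:

```r
library(labelled)
library(gtsummary)

# Unlabeled data: tbl_summary() falls back to the raw variable name
unlabeled <- data.frame(delay_category = c("Early", "Late", "Early", "Late"))
tbl_summary(unlabeled)

# Labeled data: the same call displays the label instead
labeled <- set_variable_labels(
  unlabeled,
  delay_category = "Departure timing category"
)
tbl_summary(labeled)
```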
Scale: clinical trials and bulk label assignment
And what does this look like in practice? I told you, I work on clinical trials, right? And the NYC flights data frame or relational database that we've talked about, it's got five data frames and 53 variables. One clinical trial that I work on has 90 data frames and 1,400 variables. That's one. We have 15 active trials in our portfolio. So I am constantly diving in and out of data all day long, and I have to get up to speed quickly in order to be effective at my job.
And that was just the source data, right? Because then we take that source data, and we tinker with it, and we combine it, and we join it, and we rearrange it, and we create more data. That's our downstream data. And for that same trial, our downstream data looks like an additional 50 data frames and 700 variables for our various reporting needs. And no one wants to write out 700 lines of code to label variables, I promise you that. So how do we do it? Here's our strategy. I'm not saying it's a strategy you should be using or anything like that. I just want to let you know how we do it.
And it's pretty low-tech. We start off by maintaining a CSV with metadata. This was a deliberate decision so that we can look at the diff when we go to GitHub and see what variables were added, deleted, and changed over time. Then we take that CSV, we do our standard data wrangling operation, and we can use an internal function called setDerivedVariableLabels from the croquet package, which is open source if you want to check it out. It's called derived variable labels because in our industry, downstream variables are called derived. And then you have it: a data frame with all of your variable labels applied. What if you want to do it on a bigger scale? Maybe a list of data frames? It's going to be the same strategy. You're still going to do the hard work of maintaining that CSV with your precious metadata, and you're still going to apply that custom function for bulk label assignment. You're just going to do it in an iterative framework, so all 700 of your variables can be labeled at one time.
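The general shape of that strategy can be sketched with a hypothetical helper; this is my own illustration of bulk label assignment from a metadata table, not the actual croquet implementation:

```r
library(labelled)

# Hypothetical helper (a sketch, not the croquet function): apply labels
# from a metadata data frame with `variable` and `label` columns
set_labels_from_metadata <- function(data, metadata) {
  # keep only metadata rows for variables present in this data frame
  metadata <- metadata[metadata$variable %in% names(data), ]
  labels <- as.list(metadata$label)
  names(labels) <- metadata$variable
  set_variable_labels(data, .labels = labels)
}

# Metadata as it might be read in from the maintained CSV
meta <- data.frame(
  variable = c("origin", "dep_delay"),
  label    = c("Origin airport code", "Departure delay, in minutes")
)

flights <- data.frame(origin = "JFK", dep_delay = -5)
flights <- set_labels_from_metadata(flights, meta)

# On a bigger scale, iterate over a whole list of data frames:
# labeled_list <- purrr::map(df_list, set_labels_from_metadata, metadata = meta)
```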
Wrapping up and broader considerations
So let's wrap up. This is the process of data science, right? And at every single place in this process, we have to understand the context of our data in order to be effective data scientists. I think we can make this easier for ourselves if we actually embed our context in our data. And there's a whole suite of packages that help with this process: packages for importing labeled data, for assigning labels, for working with labels, and for doing really cool things with those labels. And this is great. I think there's a lot of opportunity to grow here, and I'd love to see other packages come up with cool ways to leverage that metadata, those variable labels.
And so, so far, I've only talked to you about this narrow world view. What if you're programming in R and the RStudio IDE and you're working with R data? And I realize there's a bigger world out there, a bigger universe, right? Like, we use different open source programming languages. We use different development environments. And we certainly ingest data from and save data to and connect with data in different ways, right? And if you're wondering how this looks for you, for your specific use case, I'm sorry, I probably don't know. But I do have some questions for you.
Do you have sufficient metadata to facilitate reusable data assets? And if you don't, I think you should stop everything you're doing and come up with a plan. And can you access and leverage the metadata in your programming environment? If you don't know, I encourage you to figure it out. And if you know the answer is no, I encourage you to talk to the people who build your tools and make your case for your vision of an awesome future where you can really take advantage of this to be a better data scientist and work more effectively.
Speaking of, Positron has an open issue to display column labels in the Data Explorer. I invite you to follow along or contribute your thoughts and suggestions to the issue. I have some more resources for you. And I want to say thank you to the many individuals who have helped me develop this talk. If you want to get in touch with me, my contact information is on my website, Piping Hot Data. These slides are on my GitHub repo. Context is king. This is my awesome data science team. We love using variable labels, and Travis did consent for me to use his name. Thank you.
Q&A
Thank you so much, Shannon. I have to admit I do not use variable labels, but I will be trying them out next time I have the opportunity to. We have a few questions from our virtual audience.
Are there strategies to handle the way that label attributes are dropped by some data frame operations? Yeah. Relabel. Relabel. Yeah. Okay.
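One sketch of that "relabel" strategy, using labelled::copy_labels_from() (this example is my illustration, not from the talk):

```r
library(dplyr)
library(labelled)

dat <- set_variable_labels(
  data.frame(x = 1:10),
  x = "An example measure"
)

# summarize() builds a new column, which drops the label attribute;
# copy_labels_from() reapplies labels from the original data frame
result <- dat |>
  summarize(x = mean(x)) |>
  copy_labels_from(dat)

attr(result$x, "label")
#> [1] "An example measure"
```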
What kind of change management towards team norms was necessary to make this standard practice? I mean, we really had to think hard about our templating structure for our projects. And once we landed on a solution that worked for everyone, it was pretty easy.
Awesome. Can you speak to the pros of labeling as opposed to more explicit column naming? I don't think you can really cram all the context that you need into a single column name. And even if you were to, the context you cram in there really wouldn't be output-ready for figures and tables anyway. Yeah, definitely. How much does the addition of labels affect the data file sizes, if you know? I have no idea. That's a great question.
Is there a way to attach data types as well? Attach? I mean, yeah, that's a tough one. I feel like you could include it in the label. I mean, the default for Positron right now is to show a data type and not a label, and there's conversation on the issue about whether there are ways to show both. And if you actually do want to swap it out, I've seen people write some slick code just for RStudio to put the data type in the label attribute, and then it'll be there. Amazing. Thank you so much.
