Shannon Pileggi - Context is King
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Shannon Pileggi. I'm a data scientist, I work in clinical trials, and I work with data all day long. However, in the talk I'm going to give you today, I'm not going to show you any clinical trials data, because I think what I'm talking about is applicable to a wide audience and I want this talk to be accessible to all. I'm going to start off by talking to you about a situation that could happen in anyone's workplace, showcase the problems that that situation highlights, some solutions to those problems, and extensions to that solution. And I know that today we are in the tables session, so I just want to get your brain oriented: there's only a little bit of tables in this talk. I'm really going to be talking to you about metadata, so close the door, y'all are stuck with me now.
The setting: an email, a report, and an ambiguous variable
All right, so the setting. You get an email. Hi Shannon, I see Travis is out on vacation. Can you rerun the flight delay report? Please walk us through the numbers at the next NYC Flights project meeting. Thanks. Sarah, Director of Flight Operations. Well, Travis is lead on this project, but we've talked about it a little bit, so I'm sure I can help him out and get this up and running. So I get to work, right? I pull down the code, I get the repo, and I'm so pleased that Travis has left a fully reproducible report, because it runs successfully, right? I get my table and I see my flight delay report. We've got three airports coming out of the New York City area and some sort of timing assessment.
Let's go back to the email and make sure I can fulfill this request. So it says please walk us through the numbers. Sarah doesn't just want me to hand off the report and deliver it to her. She wants me to tell the team what it actually means, like what the numbers mean, all right? So can I do that? Let's look at the first cell. 56% of flights, something to do with JFK, something to do with early. Maybe this means that 56% of flights departing from JFK departed early. I talked to Travis a little bit about this project, and I know there are some nuances, right? JFK is an airport you can depart from, but you can also arrive to it. So maybe this means that 56% of flights arriving at JFK arrived early. I don't know, and I'm pretty sure this is going to be a big deal to the director of flight operations. And the problem here is this variable name, delay category. It's ambiguous. I don't know what it means.
And so I don't want to bother Travis. He's been sending me pictures of his vacation. I know he's having a great time on his Groupon wine country tour, so surely I can figure this out. But in order to figure this out, I'm really going to have to dive into the code, right? I'm going to have to go on some sort of journey to understand the source data that it came from and the downstream variables that were used to create this report. And this comes back down to the idea of data stewardship. As data travels across your organization from person to person, from report to report, from deliverable to deliverable, how do we ensure the integrity of that data persists and that we know what it actually means?
Source data context and variable labels
Like I said, we're going to start by looking at the code. And we have a standard data wrangling script here. I'm just highlighting the lines that I want you to focus on. What I see here are the two sources of data that I'm bringing into this analysis: the flights data and the airports data. This is actually from the nycflights13 package, which is used widely in the R for Data Science textbook. It's an example of a relational database. And like I said, I'm going to have to understand the source data context, right? I can view the data in my RStudio data viewer. And to get that data context, I can look at the help file to see what the variables in that data set actually mean. However, most of us don't have data that live in an R package, right? So we might have context about our data that lives somewhere else. For me, it lives in an Excel file. And so when I'm working, I have to go to that external Excel file, which might say something like: your data frame name, your variable, and some sort of description of what that variable actually means.
And when you're working like this, it feels like you're working in two separate tunnels and you can't see what's happening on the other side, right? You've got your development environment where you're actually programming, and then you've got your metadata about like what that data actually means, and you're not talking to each other simultaneously. So I think the source data context can and should be embedded in your data. And this is what it looks like in R. It means that when you look at your data set in R, you're going to see your variable name, and underneath, you're going to see your variable label.
Have you seen this before? Some of you might not have, right? Some of you might be surprised that this is actually a feature available to you right now in RStudio. And if you haven't seen it before, I can tell you a little bit of the history behind it. About 25 years ago, we started using R. About 13 years ago, we started using the RStudio IDE. And about nine years ago, in 2015, was when labels were actually integrated into the RStudio IDE Data Viewer. So this functionality has been around for nine years. Now, of course, we didn't come up with this idea of variable labels in a vacuum. It exists in other statistical programming software like SAS, SPSS, and Stata. And since we are at an open source conference, some of you might feel like you're marching to the dark side when we talk about this. However, this is a wonderful feature of those programming languages, and I feel like we should be more like this.
And in fact, 2015, when the data labels were introduced to the RStudio IDE Data Viewer, was the same year that the haven package came out, which enables you to import SAS, SPSS, and Stata datasets into R and retain their metadata in the form of variable and value labels. Now, I'm not here to talk to you about what to do if you already have labeled data and want to work with it in R. I'm here to talk to you about: you have data, and you want to label it. So how do we do that? I like to use a function in the labelled package called set_variable_labels(), and you just iterate over the variables with their descriptions in order to do this. If you want to read more about it, you can read my blog post, The Case for Variable Labels in R. You don't actually have to use the labelled package to do this. You can do it entirely in base R, because a label is essentially an attribute on a column in your data frame. It's a way of storing metadata attached to an object. You can read more about attributes in Advanced R.
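As a minimal sketch of both approaches, here is what assigning labels can look like; the toy columns below are my own illustration, not the real nycflights13 schema:

```r
library(labelled)

# Toy stand-in for the flights data (illustrative columns only)
flights <- data.frame(
  origin    = c("JFK", "LGA", "EWR"),
  dep_delay = c(-5, 12, 0)
)

# labelled approach: iterate over variable = "description" pairs
flights <- set_variable_labels(
  flights,
  origin    = "Origin airport code",
  dep_delay = "Departure delay, in minutes"
)

# Base R equivalent: a label is just an attribute on a column
attr(flights$origin, "label") <- "Origin airport code"  # same effect

# Retrieve a label
attr(flights$dep_delay, "label")
#> [1] "Departure delay, in minutes"
```

Either way, the label travels with the column as ordinary object metadata, which is why the RStudio Data Viewer can display it.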
And just so you can see the difference between the two: on the left-hand side we're seeing unlabeled data, where all you see are your variable names. But on the right-hand side (my labeled data is denoted with the suffix _labeled), you're going to see your variable names with the very important context of what that data means embedded in your data. And you can imagine I have feelings about this, right? I find having the data context at my fingertips in my programming environment incredibly empowering. It allows me to get up to speed with my data so much faster.
Downstream variables and the case for labeling
There are other ways you can view the labels of your data. For example, we already mentioned that a label is an attribute of a column on your data frame, so you can see the values of those attributes when you call the str() command as well. So far we've only talked about our source data context, but what we do in data science is create new variables. So we're going to have some downstream context. And like I said, I have to go on some journey to figure this out. Just highlighting some quick points here: we're looking at something to do with an origin airport, and we're looking at some variable called departure delay. So I can deduce that delay category represents departure timing by origin airport. And now that I know that, when I write my data wrangling script, I should set my variable labels so that I know the context of what my new variables actually mean.
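Labeling a derived variable can be done in the same pipeline that creates it. A small sketch, again with made-up columns rather than the talk's actual wrangling script:

```r
library(dplyr)
library(labelled)

# Toy source data (illustrative, not the real nycflights13 schema)
flights <- data.frame(
  origin    = c("JFK", "LGA", "EWR"),
  dep_delay = c(-5, 12, 0)
)

# Derive a new variable and label it immediately, so the downstream
# context travels with the data
flights_derived <- flights |>
  mutate(
    delay_category = if_else(dep_delay <= 0, "Early or on time", "Late")
  ) |>
  set_variable_labels(
    delay_category = "Departure timing category, by origin airport"
  )

# The label shows up as an attribute in str() and in the Data Viewer
str(flights_derived$delay_category)
```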
And I think as people who work with data, we've all felt these competing forces, right? On one hand, we want really short, succinct variable names so that we can write our code fast. On the other hand, we need descriptive text so that we can understand the context of what our data means. And if we label our data, we can have it both ways. So I truly believe that assigning variable labels encourages a disciplined practice of creating explicit and succinct variable descriptions and ensures that data context lives with the data. You should be doing this. It's going to help current you, future you, and your colleagues. It's going to help peer review processes. Can you imagine? You get to read code and actually know what it's supposed to do. And it's going to help with the creation of reusable data assets.
And it comes back down to this idea of data stewardship. When you're pulling data together, it's just the beginning of a journey. You don't know exactly where it's going to go in your organization, who it's going to go to, or what report it's going to land in. You don't know what's around the corner for that data. And to be fair, data stewardship, the concept of it, it's a little bit broader than what I'm talking about today. Really, it's the practice of ensuring that data assets are accessible, secure, trustworthy, and usable. And today, I want you to think about these trustworthy and usable aspects.
Applications: data dictionaries, ggplot, and tables
Some of you might be with me right now. Some of you might be on board. Yes, Shannon, I hear you. Variable labels, it's the way to go. And some of you might be feeling a little bit like this. Shannon, I hear the medicine that you're trying to give me, and I'm just not ready to swallow it. So let me sugarcoat it for you. There's some applications. Because now that we have metadata embedded in our data that we can access in our programming environment, we can do very cool programming things with it at our fingertips.
One example is a data dictionary. Suppose for your entire relational database, you have some sort of schema, maybe it goes into a list. You can iterate over that list to create a dictionary of your entire database right there in your programming environment. And when you look at that dictionary, it's easy to scroll through, it's easy to search, and figure out what is in your data schema. And so, for example, I can search on the word year. There we go. And it's going to pop up everywhere that it applies in my data set. It's a huge way to get up and running faster.
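One way to sketch that dictionary idea: labelled provides generate_dictionary(), and you can iterate over a named list of labeled data frames to stack the per-table dictionaries into one searchable object. The tables and labels below are illustrative stand-ins:

```r
library(labelled)
library(purrr)
library(dplyr)

# Toy "schema": a named list of labeled data frames standing in for a
# relational database (names and labels are illustrative)
db <- list(
  flights = set_variable_labels(
    data.frame(year = 2013L, origin = "JFK"),
    year   = "Year of departure",
    origin = "Origin airport code"
  ),
  weather = set_variable_labels(
    data.frame(year = 2013L, temp = 39.0),
    year = "Year of weather observation",
    temp = "Temperature, degrees F"
  )
)

# Iterate over the list to build one dictionary for the whole database
dictionary <- imap(db, \(df, nm) mutate(generate_dictionary(df), source = nm)) |>
  bind_rows()

# Searching on "year" surfaces every table where it appears
filter(dictionary, grepl("year", variable))
```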
What about ggplot figures? So here, I am taking data that is indeed already labeled, and I'm creating a ggplot. And you can see the default for the ggplot is just to show your variable names, right? On the x-axis, you see the name; on the legend, you see delay category. Instead, we apply the function easy_labs() from ggeasy, and it will automatically substitute those variable labels into your plot for you.
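A minimal sketch of that substitution, with made-up data in place of the flights report:

```r
library(ggplot2)
library(labelled)
library(ggeasy)

# Illustrative labeled data (not the talk's actual report data)
dat <- set_variable_labels(
  data.frame(
    origin    = c("JFK", "LGA", "EWR"),
    pct_early = c(56, 48, 51)
  ),
  origin    = "Origin airport",
  pct_early = "Percent of flights departing early"
)

# easy_labs() swaps the variable labels in for the variable names
# in the axis and legend titles
ggplot(dat, aes(x = origin, y = pct_early)) +
  geom_col() +
  easy_labs()
```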
What about tabling? So here, I'm starting with data that's actually not labeled, and I'm using the gtsummary package with the tbl_summary() function to create a table. You can see that when you have unlabeled data, you're going to see the variable name. However, if instead you had labeled data and used that same tbl_summary() function, gtsummary is going to use your variable label instead of your variable name in the output. And gtsummary does leverage gt, and gt has the same behavior. If Travis had just done that from the beginning, I wouldn't have had to go on my journey, and I wouldn't have had to even think about bothering him while he was on vacation.
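The contrast can be seen in a few lines; the data here is a toy illustration:

```r
library(labelled)
library(gtsummary)

# Unlabeled data: tbl_summary() falls back to the raw variable name
unlabeled <- data.frame(delay_category = c("Early", "Late", "Early", "Late"))
tbl_summary(unlabeled)

# Labeled data: the same call displays the label instead
labeled <- set_variable_labels(
  unlabeled,
  delay_category = "Departure timing category"
)
tbl_summary(labeled)
```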
Scale: clinical trials and bulk label assignment
And what does this look like in practice? I told you, I work on clinical trials, right? And the NYC flights data frame or relational database that we've talked about, it's got five data frames and 53 variables. One clinical trial that I work on has 90 data frames and 1,400 variables. That's one. We have 15 active trials in our portfolio. So I am constantly diving in and out of data all day long, and I have to get up to speed quickly in order to be effective at my job.
And that was just the source data, right? Because then we take that source data, and we tinker with it, and we combine it, and we join it, and we rearrange it, and we create more data. That's our downstream data. And for that same trial, our downstream data looks like an additional 50 data frames and 700 variables for our various reporting needs. And no one wants to write out 700 lines of code to label variables, I promise you that. So how do we do it? Here's our strategy. I'm not saying it's a strategy you should be using or anything like that. I just want to let you know how we do it.
And it's pretty low-tech. We start off by maintaining a CSV with metadata. This was a deliberate decision so that we can look at the diff when we go to GitHub and see what variables were added, deleted, and changed over time. Then we take that CSV, we do our standard data wrangling operation, and we can use an internal function called setDerivedVariableLabels from the croquet package, which is open source if you want to check it out. It's called derived variable labels because in our industry, downstream variables are called derived. And then you have it: a data frame with all of your variable labels applied. What if you want to do it on a bigger scale? Maybe a list of data frames? It's going to be the same strategy. You're still going to do the hard work of maintaining that CSV with your precious metadata, and you're still going to apply that custom function for bulk label assignment. You're just going to do it in an iterative framework, so all 700 of your variables can be labeled at one time.
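The general shape of that strategy can be sketched with a hypothetical helper; this is my own illustration of bulk label assignment from a metadata table, not the actual croquet implementation:

```r
library(labelled)

# Hypothetical helper (a sketch, not the croquet function): apply labels
# from a metadata data frame with `variable` and `label` columns
set_labels_from_metadata <- function(data, metadata) {
  # keep only metadata rows for variables present in this data frame
  metadata <- metadata[metadata$variable %in% names(data), ]
  labels <- as.list(metadata$label)
  names(labels) <- metadata$variable
  set_variable_labels(data, .labels = labels)
}

# Metadata as it might be read in from the maintained CSV
meta <- data.frame(
  variable = c("origin", "dep_delay"),
  label    = c("Origin airport code", "Departure delay, in minutes")
)

flights <- data.frame(origin = "JFK", dep_delay = -5)
flights <- set_labels_from_metadata(flights, meta)

# On a bigger scale, iterate over a whole list of data frames:
# labeled_list <- purrr::map(df_list, set_labels_from_metadata, metadata = meta)
```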
Wrapping up and broader considerations
So let's wrap up. This is the process of data science, right? And at every single place in this process, we have to understand the context of our data in order to be effective data scientists. I think we can make this easier for ourselves if we actually embed our context in our data. And there's a whole suite of packages that help with this process: packages for importing labeled data, for assigning labels, for working with labels, and for doing really cool things with those labels. And this is great. I think there's a lot of opportunity to grow here, and I'd love to see other packages come up with cool ways to leverage that metadata, those variable labels.
And so, so far, I've only talked to you about this narrow world view. What if you're programming in R and the RStudio IDE and you're working with R data? And I realize there's a bigger world out there, a bigger universe, right? Like, we use different open source programming languages. We use different development environments. And we certainly ingest data from and save data to and connect with data in different ways, right? And if you're wondering how this looks for you, for your specific use case, I'm sorry, I probably don't know. But I do have some questions for you.
Do you have sufficient metadata to facilitate reusable data assets? And if you don't, I think you should stop everything you're doing and come up with a plan. And can you access and leverage the metadata in your programming environment? If you don't know, I encourage you to figure it out. And if you know the answer is no, I encourage you to talk to the people who build your tools and make your case for your vision of an awesome future where you can really take advantage of this to be a better data scientist and work more effectively.
Speaking of, Positron has an open issue to display column labels in the Data Explorer. I invite you to follow along or contribute your thoughts and suggestions to the issue. I have some more resources for you. And I want to say thank you to the many individuals who have helped me develop this talk. If you want to get in touch with me, my contact information is on my website, Piping Hot Data. These slides are on my GitHub repo. Context is king. This is my awesome data science team. We love using variable labels, and Travis did consent for me to use his name. Thank you.
Q&A
Thank you so much, Shannon. I have to admit I do not use variable labels, but I will be trying them out next time I have the opportunity to. We have a few questions from our virtual audience.
Are there strategies to handle the way that label attributes are dropped by some data frame operations? Yeah. Relabel. Relabel. Yeah. Okay.
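One sketch of that "relabel" strategy, using labelled::copy_labels_from() (this example is my illustration, not from the talk):

```r
library(dplyr)
library(labelled)

dat <- set_variable_labels(
  data.frame(x = 1:10),
  x = "An example measure"
)

# summarize() builds a new column, which drops the label attribute;
# copy_labels_from() reapplies labels from the original data frame
result <- dat |>
  summarize(x = mean(x)) |>
  copy_labels_from(dat)

attr(result$x, "label")
#> [1] "An example measure"
```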
What kind of change management towards team norms was necessary to make this standard practice? I mean, we really had to think hard about our templating structure for our projects. And once we landed on a solution that worked for everyone, it was pretty easy.
Awesome. Can you speak to the pros of labeling as opposed to more explicit column naming? I don't think you can really cram all the context that you need into a single column name. And even if you were to, the context you cram in there really wouldn't be output-ready for figures and tables anyway. Yeah, definitely. How much does the addition of labels affect the data file sizes, if you know? I have no idea. That's a great question.
Is there a way to attach data types as well? Attach? I mean, yeah, that's a tough one. I feel like you could include it in the label. I mean, the default for Positron right now is to show a data type and not a label, and there's conversation on the issue about whether there are ways to show both. And if you actually do want to swap it out, I've seen people write some slick code just for RStudio to put the data type in the label attribute, and then it'll be there. Amazing. Thank you so much.
