Resources

Aaron R. Williams | The tidysynthesis R package | RStudio (2022)

video
Oct 24, 2022
14:34


Transcript

This transcript was generated automatically and may contain errors.

My name is Aaron Williams. I'm a senior data scientist at the Urban Institute. I make fake data for a living. In some scientific circles this would be a huge scandal, but for me it's an opportunity to further science. Administrative data, which are collected for reasons entirely other than research, like administering the unemployment insurance system or enforcing the tax code, are incredibly useful for research, but understandably raise huge confidentiality concerns. I work at the Urban Institute. We're a collection of economists, sociologists, criminologists, urban planners, and data scientists committed to improving the world through social and economic policy research. We would do a lot to access administrative data, and we do do a lot to access administrative data, but we understand there's a fundamental tension between access to these data and their confidentiality.

That's why at the Urban Institute we've been working with the IRS to create fully synthetic tax data. Synthetic data are fake data that have the statistical properties of confidential data and could be used for certain types of valid analyses, but have much lower confidentiality concerns because the data are fake. You can think of this as deep fakes for tax research.


In this talk, I'm going to try and make the case that administrative data are very useful, but are difficult to access. I'm going to share how we're working with the IRS to safely expand access to administrative data for research, and I'm also going to talk about an R package that we're developing that makes it easier to synthesize data.

The value of administrative data

Administrative data are useful, and I'll give you two examples. The first example comes from March of 2020. It was a scary time when there was a health crisis and an economic crisis unfolding at the same time. One issue, though, is that a lot of our indicators for the labor market are backward-looking, and so there's one indicator, one number that came from administrative data that I think painted an excellent picture of the labor market. Many people had never heard of it, but it ended up in a dramatic data visualization on the cover of the New York Times. This is initial weekly unemployment insurance claims. If you look at the bottom left, that starts in 2000, and it bounces about from 2000 to 2020, and then there's this black swan event that just shows how scary of a time it was in our labor market.

I'll give you another example. In 2013, Raj Chetty, Emmanuel Saez, and co-authors released groundbreaking research about intergenerational mobility in the United States using data from 40 million anonymized tax records. This was really big research that deals with how Americans see themselves. We have that desire for our children to be better off than ourselves. One of the things that they found is that place matters quite a bit. For example, the probability that a child reaches the top quintile of the national income distribution, starting from a family in the bottom quintile, is 4.4% in Charlotte, North Carolina, but 12.9% in San Jose, California.

But some people say that this research never should have happened. In a traditional model with an agency like the IRS, you do correspondence research. You have to get your analysis plan approved. You may never actually see the data. You send off your code. They run it on the data. They review the results and the like. What Saez and Chetty did was radical and different. They actually became employees of the IRS. They had to go through an extensive background check, have their fingerprints taken, go through lots of training, have all their results reviewed. I've been through this process. Let me tell you, it's not pleasant at all.

The question is, how can we get this understanding, this value, this meaning from administrative data without all this inconvenience, all while protecting confidentiality? That's why we're working with synthetic data. If we can create fake data that match the statistical properties of the confidential data, and ways of safely validating results from the synthetic data against the confidential data, then we can conveniently unlock a lot of this potential.

Synthetic tax data with the IRS

We've worked with the IRS to create two data sets, or really one and a half, since the second is still a bit of a work in progress. The first is called the Supplemental Syn-PUF, PUF for Public Use File. It's a collection of information that the IRS has about individuals who have not filed taxes and do not have an obligation to file taxes. It's a novel data set that's going to be very helpful for understanding the very low end of the income distribution. The tax economists I work with are pretty excited about it.

The second file that we're creating is called the Syn-PUF, and it's representative of all taxpayers in the United States. Now, the first file I mentioned had about 20 variables. This file has more than 200 variables, hundreds of thousands of observations, and it comes from a complex survey with 25 different strata.

The sequential synthesis approach

Let's consider a much simpler example for now where we have four variables and two observations. That's our confidential data set. The setup of what we're trying to do is create a new data set that has the exact same record layout. We're going to employ a sequential approach to create this data set. The first thing we'll do is we'll sample sex with replacement. Then we'll create a predictive model to predict age, conditional on sex. Then we'll create a predictive model to predict wages, conditional on sex and age. Finally, a predictive model to predict taxes, conditional on sex, age, and wages.
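The sequential approach above can be sketched in a few lines of pure Python. This is a toy stand-in, not the tidysynthesis implementation: the "model" at each step is just a donor record that matches the values synthesized so far, which takes the place of a real fitted predictive model.

```python
import random

# Toy confidential data with the same four variables as the example.
confidential = [
    {"sex": "F", "age": 34, "wages": 52000, "taxes": 6200},
    {"sex": "M", "age": 61, "wages": 48000, "taxes": 5500},
]

def synthesize_sequentially(records, n, seed=1):
    """Sequential synthesis: sample the first variable with replacement,
    then fill in each later variable conditional on everything synthesized
    so far. The 'model' here is a placeholder: we sample a donor record
    matching the row so far and copy its value. tidysynthesis fits a real
    predictive model at each step instead."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        # Step 1: sample sex with replacement.
        row = {"sex": rng.choice([r["sex"] for r in records])}
        # Steps 2-4: each variable is conditional on all earlier ones.
        for var in ("age", "wages", "taxes"):
            donors = [r for r in records
                      if all(r[k] == row[k] for k in row)]
            if not donors:  # no exact match: fall back to all records
                donors = records
            row[var] = rng.choice(donors)[var]
        synthetic.append(row)
    return synthetic

fake = synthesize_sequentially(confidential, n=5)
print(len(fake), sorted(fake[0]))  # 5 ['age', 'sex', 'taxes', 'wages']
```

The synthetic rows have the exact same record layout as the confidential data, which is the setup the talk describes.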

A lot of predictive models, and this is where I get to express my gratitude for tidymodels and all the people who have worked on it. We heard some great things about it this morning. tidymodels is a comprehensive framework for predictive modeling in R, and we use it extensively. In particular, we use parsnip quite a bit, which is a unified interface to different machine learning and predictive models, and recipes, which is a really powerful package for feature and target engineering.

Now, in the example that we're working on, we need to do three predictive models. I gave you another example where we need to do more than 200 predictive models. So we could imagine having an R script where it's model one, predict, model two, predict, model three, predict. But if you're going to go all the way to 200 variables, that's going to become a real pain really quickly.

The tidysynthesis package

So what tidysynthesis allows us to do is specify a whole sequence of predictive models for generating synthetic data. It also allows us to do a few things very specific to data synthesis. One, it allows us to add additional noise to predicted values for extra confidentiality protection. It allows us to specify constraints in the data, things like individual income can't exceed family income, or interest income can't exceed your personal income. It also allows us to predict from our models in different ways, and I'll talk about that more in a second. Simply put, the goal of tidysynthesis is all the power of tidymodels for data synthesis, concisely, with a few special tools.
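The noise and constraints ideas can be illustrated with a minimal pure-Python sketch. These two helpers are hypothetical stand-ins, not the package's API: tidysynthesis's actual noise and constraint mechanisms may differ, and clamping is only one way to enforce a bound.

```python
import random

rng = random.Random(42)

def add_noise(value, scale):
    """Add Gaussian noise to a predicted value for extra confidentiality
    protection (a stand-in; the package's noise mechanism may differ)."""
    return value + rng.gauss(0, scale)

def apply_constraints(row):
    """Enforce a logical constraint after synthesis, e.g. individual
    income can't exceed family income. Clamping is one simple strategy;
    a synthesizer could also resample until the constraint holds."""
    row["individual_income"] = min(row["individual_income"],
                                   row["family_income"])
    return row

row = {"family_income": 80_000.0,
       "individual_income": add_noise(79_500.0, 2_000.0)}
row = apply_constraints(row)
assert row["individual_income"] <= row["family_income"]
```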


So what does this actually look like? I'm going to show you how we're going to synthesize a data set today. I'm not going to synthesize unemployment insurance data. To the relief of many of you, I'm not going to synthesize tax data. We could imagine synthesizing electronic health records, or maybe we're a business and we're interested in unlocking information from our customer data without sacrificing the confidentiality of any individual in our data. But I'm not going to do that either. No, instead, we're going to synthesize penguins data because penguins deserve privacy, too.

So we're going to synthesize the Palmer penguins data set. We already had a good introduction to it earlier. We have information about species, island, and sex, plus numeric measurements of the different penguins. We're really going to focus on those numeric variables, and our goal is to create synthetic data that match the layout of that confidential data.

So we can start asking ourselves some questions of what this should look like. If you have 200 variables, what order should we use? Well, tidysynthesis has a function that employs different heuristics that you can use to determine synthesis order. In this case, I'm going to use my extensive knowledge of penguins, which is absolutely none, to determine that bill length is a very important variable, and then we're going to synthesize the remaining variables in order from most correlated with bill length to least correlated. Once we create this visit sequence object, it's going to be reused in a lot of other functions.
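The correlation-based ordering heuristic can be sketched in pure Python. This is a conceptual illustration, not the package's function; the column values below are made-up numbers standing in for the penguin measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_visit_sequence(columns, start_var):
    """Start with the chosen variable, then order the rest from most to
    least correlated (in absolute value) with it."""
    rest = [v for v in columns if v != start_var]
    rest.sort(key=lambda v: abs(pearson(columns[start_var], columns[v])),
              reverse=True)
    return [start_var] + rest

# Toy columns standing in for the numeric penguin measurements.
columns = {
    "bill_length":    [39.1, 39.5, 40.3, 46.5, 50.0],
    "bill_depth":     [18.7, 17.4, 18.0, 13.5, 15.2],
    "flipper_length": [181.0, 186.0, 195.0, 210.0, 218.0],
    "body_mass":      [3750.0, 3800.0, 3250.0, 4550.0, 5050.0],
}
order = correlation_visit_sequence(columns, "bill_length")
print(order)  # bill_length first, rest by |correlation| with it
```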

What kind of algorithm should we use for our predictive models? In this case, we're going to use a regression tree, and this is where parsnip really shines in tidysynthesis. We just specify a regression tree implemented with the rpart engine. Now we need to specify that for all of our different models, and this is where we really save on all that R code. In this case, we just specify that we want to use the rpart model for every single variable. There are three ways to specify this. The first is like this, where we use the same model for all variables. Alternatively, we can specify a default and then override it for specific variables, or we can specify a different model for every single variable.
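The three specification styles reduce to one pattern: a default assignment with optional per-variable overrides. Here is a hedged pure-Python sketch of that pattern; `build_model_assignments` is a hypothetical helper, not a tidysynthesis function, and the model names are just labels.

```python
def build_model_assignments(variables, default, overrides=None):
    """Assign a model spec to every variable in the synthesis order:
    pass only a default to use one model everywhere; add overrides for
    specific variables; or put every variable in overrides to specify
    each model individually."""
    overrides = overrides or {}
    return {v: overrides.get(v, default) for v in variables}

assignments = build_model_assignments(
    ["bill_depth", "flipper_length", "body_mass"],
    default="regression tree (rpart)",
    overrides={"body_mass": "linear regression"},
)
print(assignments["bill_depth"])  # regression tree (rpart)
print(assignments["body_mass"])   # linear regression
```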

What types of feature or target engineering should we use? This is where recipes really shines in tidysynthesis. We can specify any set of recipes steps for our synthesis. In this case, we're not going to use any feature or target engineering. We're just going to use a helper function, construct_recipes(), to basically construct a series of formulas.

Finally, how are we actually going to come up with synthetic data from our models? If you just use the predict function, oftentimes you get a conditional mean or a conditional median. This actually makes for really bad synthetic data, because the data sets you're creating don't have adequate sample variance; you don't have enough variation in the resulting data. So in this case, we're going to use a function called sample_rpart(). If you're familiar with regression trees, you navigate to your final node, and oftentimes you might take the mean of that node. Instead, we're going to sample from that final node to get a predictive distribution.
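The difference between taking a leaf's mean and sampling from the leaf can be shown with a one-split tree. `RegressionStump` is a hypothetical toy class, not rpart or sample_rpart: it only illustrates why sampling from the final node preserves variance while the mean collapses it.

```python
import random

class RegressionStump:
    """A one-split regression tree. mean_predict returns the leaf mean
    (which understates variance in synthetic data); sample_predict draws
    an observed value from the leaf instead, like the leaf-sampling idea
    described in the talk."""
    def __init__(self, xs, ys, threshold):
        self.threshold = threshold
        self.left = [y for x, y in zip(xs, ys) if x < threshold]
        self.right = [y for x, y in zip(xs, ys) if x >= threshold]

    def mean_predict(self, x):
        leaf = self.left if x < self.threshold else self.right
        return sum(leaf) / len(leaf)

    def sample_predict(self, x, rng):
        leaf = self.left if x < self.threshold else self.right
        return rng.choice(leaf)

rng = random.Random(0)
stump = RegressionStump(xs=[1, 2, 3, 8, 9, 10],
                        ys=[10.0, 12.0, 11.0, 30.0, 34.0, 32.0],
                        threshold=5)
print(stump.mean_predict(2))         # always 11.0: no variance at all
print(stump.sample_predict(2, rng))  # one of 10.0, 12.0, 11.0
```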

So I just walked you through two of the functions. There's a lot of other functions and objects in tidysynthesis that can control the synthesis process. There's noise for adding additional noise to predicted values. There's constraints for elegantly enforcing constraints in the data. And there's replicates if we actually want to synthesize more than one data set, which can be handled in parallel.

tidymodels is lazy, right? You create a bunch of objects, but a lot of the computation doesn't actually happen until you use a tune function or a fit function. tidysynthesis is the exact same way. Much like you have a workflow in tidymodels, we have a presynth object where we bring everything together, and then the actual hard work starts when you run the synthesize function. This is where you go check Twitter and get your cup of coffee. If you have 200 variables, maybe let it run overnight.

Synthesizing penguin data

So I promised that we were going to synthesize some penguins data, and I have a quick video to show this off. On the right, you'll notice we have about 300 observations. If we wrote all this code out by hand, it might be a few hundred lines of code, but in this case, it's only about 70 lines. I'm going to highlight everything and click run here in a second, and you'll see that it quickly gets down to the synthesize function.

Now if you've worked with the Palmer penguins data set before, I want you to close your eyes and imagine what that data set looks like. This will take about 10 to 12 seconds to run. And now open your eyes. Unless you have it perfectly memorized, hopefully this looks like what you were seeing in your mind, right? This looks like the penguins data set, but it's entirely made up. It's entirely fake.

So we have synthetic data. We're done, right? If this is all I had to do, I'd be very happy. I'd spend lots of time by the pool with a drink and, you know, with the little umbrellas and everything, relaxing. But in reality, I spend most of my day saying, have we done a good enough job?

We can imagine looking at this in a few different ways. I mean, on the surface, it looks like the penguins data set. We can look at the univariate distributions for the four numeric variables. Here I have density plots for them, where the synthetic data are in blue and the confidential data are in yellow. And you see that we pretty closely recreate the univariate distributions for each of them. And in fact, we only have 300 observations. So any differences here could just be because of sampling error.

We can also look at some bivariate relationships. Here I have the relationship between flipper length and bill length, flipper length and bill depth, and flipper length and body mass. Again, the synthetic data are in blue and the confidential data are in yellow. And in all three cases, I think we do a pretty good job of recreating the data. And in fact, if you fit regression lines, which I've done here, they look almost identical.

But maybe we're not satisfied. Maybe we want to specify a new synthesis and start all over again. The good news is we're not starting from the beginning. Just like how in tidymodels you can reuse objects sort of over and over and over again, here we can reuse all of our objects from earlier, add some information about adding noise to our predicted values, add some constraints, and synthesize.

Recap and call to action

So just as a recap, administrative data are useful but difficult to access. And I think there's a huge potential to understand more about our world if we can make it easier to access those data.

As a proof of concept, we've created two fully synthetic data sets with the IRS. And we've built this package called tidysynthesis so that other agencies, nonprofits, and businesses can implement these methods without having to write, you know, thousands of lines of code.

Now, the data privacy community and the synthetic data community are pretty small, and they especially don't have that many statisticians or data scientists. So if this is something that gets you excited, if you're interested in joining the raft of privacy penguins, please don't hesitate to reach out to me after this talk here in person, on Twitter (@awunderground), or on GitHub. Maybe this could be you. Thank you.