Resources

Aaron R. Williams | The tidysynthesis R package | RStudio (2022)

video
Oct 24, 2022
14:34


Transcript

This transcript was generated automatically and may contain errors.

My name is Aaron Williams. I'm a senior data scientist at the Urban Institute. I make fake data for a living. In some scientific circles this would be a huge scandal, but for me it's an opportunity to further science. Administrative data, which are collected for reasons entirely other than research, like administering the unemployment insurance system or enforcing the tax code, are incredibly useful for research, but understandably raise huge confidentiality concerns. I work at the Urban Institute. We're a collection of economists, sociologists, criminologists, urban planners, and data scientists committed to improving the world through social and economic policy research. We would do a lot to access administrative data, and we do do a lot to access administrative data, but we understand there's a fundamental tension between access to these data and their confidentiality.

That's why at the Urban Institute we've been working with the IRS to create fully synthetic tax data. Synthetic data are fake data that have the statistical properties of confidential data and could be used for certain types of valid analyses, but have much lower confidentiality concerns because the data are fake. You can think of this as deep fakes for tax research.


In this talk, I'm going to try and make the case that administrative data are very useful, but are difficult to access. I'm going to share how we're working with the IRS to safely expand access to administrative data for research, and I'm also going to talk about an R package that we're developing that makes it easier to synthesize data.

The value of administrative data

Administrative data are useful, and I'll give you two examples. The first example comes from March of 2020. It was a scary time when there was a health crisis and an economic crisis unfolding at the same time. One issue, though, is that a lot of our indicators for the labor market are backward-looking, and so there's one indicator, one number that came from administrative data that I think painted an excellent picture of the labor market. Many people had never heard of it, but it ended up in a dramatic data visualization on the cover of the New York Times. This is initial weekly unemployment insurance claims. If you look at the bottom left, that starts in 2000, and it bounces about from 2000 to 2020, and then there's this black swan event that just shows how scary of a time it was in our labor market.

I'll give you another example. In 2013, Raj Chetty, Emmanuel Saez, and co-authors released groundbreaking research about intergenerational mobility in the United States using data from 40 million anonymized tax records. This was really big research that deals with how Americans see themselves. We have that desire for our children to be better off than ourselves. One of the things that they found is that place matters quite a bit. For example, the probability that a child reaches the top quintile of the national income distribution, starting from a family in the bottom quintile, is 4.4% in Charlotte, North Carolina, but 12.9% in San Jose, California.

But some people say that this research never should have happened. In a traditional model with an agency like the IRS, you do correspondence research. You have to get your analysis plan approved. You may never actually see the data. You send off your code. They run it on the data. They review the results and the like. What Saez and Chetty did was radical and different. They actually became employees of the IRS. They had to go through an extensive background check, have their fingerprints taken, go through lots of training, have all their results reviewed. I've been through this process. Let me tell you, it's not pleasant at all.

The question is, how can we get this understanding, this value, this meaning from administrative data without all this inconvenience, all while protecting confidentiality? That's why we're working with synthetic data. If we can create fake data that match the statistical properties of the confidential data, and ways of safely validating results from the synthetic data against the confidential data, then we can conveniently unlock a lot of this potential.

Synthetic tax data with the IRS

We've worked with the IRS to create two data sets, or really one and a half, since the second is still a bit of a work in progress. The first is called the Supplemental Syn-PUF, PUF for Public Use File. It's a collection of information that the IRS has about individuals who have not filed taxes and do not have an obligation to file taxes. It's a novel data set that's going to be very helpful for understanding the very low end of the income distribution. The tax economists I work with are pretty excited about it.

The second file that we're creating is called the Syn-PUF, and it's representative of all taxpayers in the United States. Now, the first file I mentioned had about 20 variables. This file has more than 200 variables, hundreds of thousands of observations, and it comes from a complex survey with 25 different strata.

The sequential synthesis approach

Let's consider a much simpler example for now where we have four variables and two observations. That's our confidential data set. The setup of what we're trying to do is create a new data set that has the exact same record layout. We're going to employ a sequential approach to create this data set. The first thing we'll do is we'll sample sex with replacement. Then we'll create a predictive model to predict age, conditional on sex. Then we'll create a predictive model to predict wages, conditional on sex and age. Finally, a predictive model to predict taxes, conditional on sex, age, and wages.
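The sequential approach above can be sketched in a few lines of pure Python. This is a toy stand-in, not the tidysynthesis implementation: the "model" at each step is just a donor record that matches the values synthesized so far, which takes the place of a real fitted predictive model.

```python
import random

# Toy confidential data with the same four variables as the example.
confidential = [
    {"sex": "F", "age": 34, "wages": 52000, "taxes": 6200},
    {"sex": "M", "age": 61, "wages": 48000, "taxes": 5500},
]

def synthesize_sequentially(records, n, seed=1):
    """Sequential synthesis: sample the first variable with replacement,
    then fill in each later variable conditional on everything synthesized
    so far. The 'model' here is a placeholder: we sample a donor record
    matching the row so far and copy its value. tidysynthesis fits a real
    predictive model at each step instead."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        # Step 1: sample sex with replacement.
        row = {"sex": rng.choice([r["sex"] for r in records])}
        # Steps 2-4: each variable is conditional on all earlier ones.
        for var in ("age", "wages", "taxes"):
            donors = [r for r in records
                      if all(r[k] == row[k] for k in row)]
            if not donors:  # no exact match: fall back to all records
                donors = records
            row[var] = rng.choice(donors)[var]
        synthetic.append(row)
    return synthetic

fake = synthesize_sequentially(confidential, n=5)
print(len(fake), sorted(fake[0]))  # 5 ['age', 'sex', 'taxes', 'wages']
```

The synthetic rows have the exact same record layout as the confidential data, which is the setup the talk describes.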

A lot of predictive models, and this is where I get to express my gratitude for tidymodels and all the people who have worked on it. We heard some great things about it this morning. tidymodels is a comprehensive framework for predictive modeling in R, and we use it extensively. In particular, we use parsnip quite a bit, which is a unified interface to different machine learning and predictive models, and recipes, which is a really powerful package for feature and target engineering.

Now, in the example that we're working on, we need to do three predictive models. I gave you another example where we need to do more than 200 predictive models. So we could imagine having an R script where it's model one, predict, model two, predict, model three, predict. But if you're going to go all the way to 200 variables, that's going to become a real pain really quickly.

The tidysynthesis package

So what tidysynthesis allows us to do is specify a whole sequence of predictive models for generating synthetic data. It also allows us to do a few things very specific to data synthesis. One, it allows us to add additional noise to predicted values for extra confidentiality protection. It allows us to specify constraints in the data, things like individual income can't exceed family income, or interest income can't exceed your personal income. It also allows us to predict from our models in different ways, and I'll talk about that more in a second. Simply put, the goal of tidysynthesis is all the power of tidymodels for data synthesis, concisely, with a few special tools.
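The noise and constraints ideas can be illustrated with a minimal pure-Python sketch. These two helpers are hypothetical stand-ins, not the package's API: tidysynthesis's actual noise and constraint mechanisms may differ, and clamping is only one way to enforce a bound.

```python
import random

rng = random.Random(42)

def add_noise(value, scale):
    """Add Gaussian noise to a predicted value for extra confidentiality
    protection (a stand-in; the package's noise mechanism may differ)."""
    return value + rng.gauss(0, scale)

def apply_constraints(row):
    """Enforce a logical constraint after synthesis, e.g. individual
    income can't exceed family income. Clamping is one simple strategy;
    a synthesizer could also resample until the constraint holds."""
    row["individual_income"] = min(row["individual_income"],
                                   row["family_income"])
    return row

row = {"family_income": 80_000.0,
       "individual_income": add_noise(79_500.0, 2_000.0)}
row = apply_constraints(row)
assert row["individual_income"] <= row["family_income"]
```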


So what does this actually look like? I'm going to show you how we're going to synthesize a data set today. I'm not going to synthesize unemployment insurance data. To the relief of many of you, I'm not going to synthesize tax data. We could imagine synthesizing electronic health records, or maybe we're a business and we're interested in unlocking information from our customer data without sacrificing the confidentiality of any individual in our data. But I'm not going to do that either. No, instead, we're going to synthesize penguins data because penguins deserve privacy, too.

So we're going to synthesize the Palmer penguins data set. We already had a good introduction to it earlier. We have information about species, island, and sex, plus numeric measurements of the different penguins. We're really going to focus on those numeric variables, and our goal is to create synthetic data that match the layout of that confidential data.

So we can start asking ourselves some questions of what this should look like. If you have 200 variables, what order should we use? Well, tidysynthesis has a function that employs different heuristics that you can use to determine synthesis order. In this case, I'm going to use my extensive knowledge of penguins, which is absolutely none, to determine that bill length is a very important variable, and then we're going to synthesize the remaining variables in order from most correlated with bill length to least correlated. Once we create this visit sequence object, it's going to be reused in a lot of other functions.
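The correlation-based ordering heuristic can be sketched in pure Python. This is a conceptual illustration, not the package's function; the column values below are made-up numbers standing in for the penguin measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_visit_sequence(columns, start_var):
    """Start with the chosen variable, then order the rest from most to
    least correlated (in absolute value) with it."""
    rest = [v for v in columns if v != start_var]
    rest.sort(key=lambda v: abs(pearson(columns[start_var], columns[v])),
              reverse=True)
    return [start_var] + rest

# Toy columns standing in for the numeric penguin measurements.
columns = {
    "bill_length":    [39.1, 39.5, 40.3, 46.5, 50.0],
    "bill_depth":     [18.7, 17.4, 18.0, 13.5, 15.2],
    "flipper_length": [181.0, 186.0, 195.0, 210.0, 218.0],
    "body_mass":      [3750.0, 3800.0, 3250.0, 4550.0, 5050.0],
}
order = correlation_visit_sequence(columns, "bill_length")
print(order)  # bill_length first, rest by |correlation| with it
```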

What kind of algorithm should we use for our predictive models? In this case, we're going to use a regression tree, and this is where parsnip really shines in tidysynthesis. We just specify a regression tree implemented with the rpart engine. Now we need to specify that for all of our different models, and this is where we really save on all that R code. In this case, we just specify that we want to use the rpart model for every single variable. There are three ways to specify this. The first is like this, where we use the same model for all variables. Alternatively, we can specify a default and then override it for specific variables, or we can specify a different model for every single variable.
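The three specification styles reduce to one pattern: a default assignment with optional per-variable overrides. Here is a hedged pure-Python sketch of that pattern; `build_model_assignments` is a hypothetical helper, not a tidysynthesis function, and the model names are just labels.

```python
def build_model_assignments(variables, default, overrides=None):
    """Assign a model spec to every variable in the synthesis order:
    pass only a default to use one model everywhere; add overrides for
    specific variables; or put every variable in overrides to specify
    each model individually."""
    overrides = overrides or {}
    return {v: overrides.get(v, default) for v in variables}

assignments = build_model_assignments(
    ["bill_depth", "flipper_length", "body_mass"],
    default="regression tree (rpart)",
    overrides={"body_mass": "linear regression"},
)
print(assignments["bill_depth"])  # regression tree (rpart)
print(assignments["body_mass"])   # linear regression
```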

What types of feature or target engineering should we use? This is where recipes really shines in tidysynthesis. We can specify any set of recipes steps for our synthesis. In this case, we're not going to use any feature or target engineering. We're just going to use a helper function, construct_recipes(), to basically construct a series of formulas.

Finally, how are we actually going to come up with synthetic data from our models? If you just use the predict function, oftentimes you get a conditional mean or a conditional median. This actually makes for really bad synthetic data, because the data sets you're creating don't have adequate sample variance; you don't have enough variation in the resulting data. So in this case, we're going to use a function called sample_rpart(). If you're familiar with regression trees, you navigate to your final node, and oftentimes you might take the mean of that node. Instead, we're going to sample from that final node to get a predictive distribution.
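The difference between taking a leaf's mean and sampling from the leaf can be shown with a one-split tree. `RegressionStump` is a hypothetical toy class, not rpart or sample_rpart: it only illustrates why sampling from the final node preserves variance while the mean collapses it.

```python
import random

class RegressionStump:
    """A one-split regression tree. mean_predict returns the leaf mean
    (which understates variance in synthetic data); sample_predict draws
    an observed value from the leaf instead, like the leaf-sampling idea
    described in the talk."""
    def __init__(self, xs, ys, threshold):
        self.threshold = threshold
        self.left = [y for x, y in zip(xs, ys) if x < threshold]
        self.right = [y for x, y in zip(xs, ys) if x >= threshold]

    def mean_predict(self, x):
        leaf = self.left if x < self.threshold else self.right
        return sum(leaf) / len(leaf)

    def sample_predict(self, x, rng):
        leaf = self.left if x < self.threshold else self.right
        return rng.choice(leaf)

rng = random.Random(0)
stump = RegressionStump(xs=[1, 2, 3, 8, 9, 10],
                        ys=[10.0, 12.0, 11.0, 30.0, 34.0, 32.0],
                        threshold=5)
print(stump.mean_predict(2))         # always 11.0: no variance at all
print(stump.sample_predict(2, rng))  # one of 10.0, 12.0, 11.0
```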

So I just walked you through two of the functions. There's a lot of other functions and objects in tidysynthesis that can control the synthesis process. There's noise for adding additional noise to predicted values. There's constraints for elegantly enforcing constraints in the data. And there's replicates if we actually want to synthesize more than one data set, which can be handled in parallel.

tidymodels is lazy, right? You create a bunch of objects, but a lot of the computation doesn't actually happen until you use a tune function or a fit function. tidysynthesis is the exact same way. Much like you have a workflow in tidymodels, we have a presynth object where we bring everything together, and then the actual hard work starts when you run the synthesize function. This is where you go check Twitter and get your cup of coffee. If you have 200 variables, maybe let it run overnight.

Synthesizing penguin data

So I promised that we were going to synthesize some penguins data, and I have a quick video to show this off. On the right, you'll notice we have about 300 observations. If we wrote all this code out by hand, it might be a few hundred lines of code, but in this case, it's only about 70 lines. I'm going to highlight everything and click run here in a second, and you'll see that it quickly gets down to the synthesize function.

Now if you've worked with the Palmer penguins data set before, I want you to close your eyes and imagine what that data set looks like. This will take about 10 to 12 seconds to run. And now open your eyes. Unless you have it perfectly memorized, hopefully this looks like what you were seeing in your mind, right? This looks like the penguins data set, but it's entirely made up. It's entirely fake.

So we have synthetic data. We're done, right? If this is all I had to do, I'd be very happy. I'd spend lots of time by the pool with a drink and, you know, with the little umbrellas and everything, relaxing. But in reality, I spend most of my day saying, have we done a good enough job?

We can imagine looking at this in a few different ways. I mean, on the surface, it looks like the penguins data set. We can look at the univariate distributions for the four numeric variables. Here I have density plots for them, where the synthetic data are in blue and the confidential data are in yellow. And you see that we pretty closely recreate the univariate distributions for each of them. And in fact, we only have 300 observations. So any differences here could just be because of sampling error.

We can also look at some bivariate relationships. Here I have the relationship between flipper length and bill length, flipper length and bill depth, and flipper length and body mass. Again, the synthetic data are in blue and the confidential data are in yellow. And in all three cases, I think we do a pretty good job of recreating the data. And in fact, if you fit regression lines, which I've done here, they look almost identical.

But maybe we're not satisfied. Maybe we want to specify a new synthesis and start all over again. The good news is we're not starting from the beginning. Just like how in tidymodels you can reuse objects sort of over and over and over again, here we can reuse all of our objects from earlier, add some information about adding noise to our predicted values, add some constraints, and synthesize.

Recap and call to action

So just as a recap, administrative data are useful but difficult to access. And I think there's a huge potential to understand more about our world if we can make it easier to access those data.

As a proof of concept, we've created two fully synthetic data sets with the IRS. And we've built this package called tidysynthesis so that other agencies, nonprofits, and businesses can implement these methods without having to write, you know, thousands of lines of code.

Now, the data privacy community and the synthetic data community are pretty small, and they especially don't have that many statisticians or data scientists. So if this is something that gets you excited, if you're interested in joining the raft of privacy penguins, please don't hesitate to reach out to me after this talk here in person, on Twitter (@awunderground), or on GitHub. Maybe this could be you. Thank you.