Allen Downey - A future of data science
Transcript
This transcript was generated automatically and may contain errors.
Good morning. Thank you. It's really, it's great to be here. I've really enjoyed the conference and I'm happy to talk to you today. I want to say a little bit about how I got here, how we as data scientists got here, and then where do we go from here? What is a future of data science?
So this is an opinionated talk, and I want to give you a chance to talk back to me, not like during the talk, but we'll do Q&A and I'll be around afterward, so tell me what you think.
The first baby problem
I'm going to start with my most successful data science project. In 2003, when my wife and I were expecting our first baby, I googled this question. Are first babies more likely to be late or early or both? And what I got at the time was anecdotal evidence. My cousin had two babies and they were both early and therefore I don't know.
What we need here is data. And the National Survey of Family Growth has data. This is a survey conducted by the CDC. It started in 1973. They've done 11 repetitions. They have more than 100,000 respondents now. And for female respondents, they have details on their pregnancies and pregnancy length, including duration in weeks.
And so I thought, if we're asking this question around week 35, which is when this question becomes a little bit more pointed, we can ask what is the remaining lifetime, or remaining duration, of a pregnancy at that time. And you can see here the blue curve is for first babies and the orange curve is all other babies. And this is the distribution of total duration in weeks. And you can see that, in fact, first babies are a little bit less likely to be right at the nominal 39 weeks and a little bit more likely to be a couple of weeks late.
So now, if you Google this question, the first two hits on the page are me. And if you go farther down the page, you'll get this BBC article on the topic, which cites me. So I am now the world-renowned expert on this topic.
So why am I saying that this is my most successful project? Because I took a question from not answered and I put it over into the answered column of questions. And this was not hard to do. I needed data, and it was freely available, open data. Thank you to the NSFG. Simple methods, basic visualization, no fancy statistics, all free, open software, and a venue to make it visible, which was my blog. And that's all we need.
This is an example of what I hope data science is and can be, which is a set of tools and processes for answering questions, resolving debates, and making better decisions.
The Gartner hype cycle and data science
So that's how I got here. How did we get here, and where are we going? To think about this, I think the Gartner Hype Cycle will help. If you are not familiar with it, this is the idea that new things, new technologies, go through a sequence of phases with catchy names. There's a technology trigger that gets things started, a peak of inflated expectations, a trough of disillusionment, a slope of enlightenment, and then a plateau of productivity.
So what is the technology trigger that created data science? I'm going to suggest it's ENIAC. This is the first programmable, electronic, general-purpose digital computer in 1945. So how did this create data science? Here's my thesis, and here's the first opinionated part of this talk. Data science exists because statistics as a field missed the boat on computers.
Let me explain what I mean by that, especially if we look at statistics education. If you take the canonical introductory statistics class, you learn the central limit theorem, and you learn a few special cases where sampling distributions have a nice mathematical form. And then people graduate from that class, and they encounter data for the first time, and they ask for help. They go to the Reddit statistics forum, and they ask: which test should I use? This is the question, over and over, on Reddit statistics.
Which test should I use? Because their education has given them the idea that all of the problems have been solved. You just have to know which test to use. And if you don't, that's your fault. That's not true. I think the data science approach, the computational approach, says approximately none of the problems have been solved. What we need is a versatile set of tools to compose the solutions that we need. And what is the most versatile tool that we have? A programmable, electronic, general-purpose digital computer.
Teaching statistical inference computationally
So, I want to demonstrate this point, the difference between mathematical statistics and computational statistics, by teaching you all of statistical inference in 10 minutes. And we're going to do this by testing the variability hypothesis. If you're not familiar with this, it is the idea that in many species, males are more variable than females on many dimensions. It is a controversial idea. It has a long and interesting history.
So, if you're interested, do read about that. As my example, I'm going to try to use something minimally problematic, and we'll just talk about the difference in height between men and women. We need data. Again, the CDC is here to help us. The Behavioral Risk Factor Surveillance System, the BRFSS, is another repeated cross-sectional sample of adults in the U.S. They get more than 400,000 respondents during each cycle, and it includes self-reported heights and weights.
So, as a warm-up, let's just look at the difference in height between men and women. I'm going to do this by resampling. This is one of the core tools of computational statistics. It's the idea that you take the sample that you actually collected, use it to build a model of the population that you're interested in, and then use that model to generate lots and lots of synthesized samples. That's the idea of resampling. More specifically, with bootstrap resampling, we're going to do that by drawing from the original sample a new sample that is the same size, drawn with replacement.
And so now let's see what that looks like in code. It's a lot shorter to say in code than it was for me to say in words. There's what it looks like. It's a function that takes a pandas data frame, and it uses the sample method to generate a new data frame that is the same length, n, and it samples with replacement. So that's it. That's the bootstrap resampling process.
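A minimal sketch of the function he describes, assuming the data lives in a pandas DataFrame (the variable names here are ours, not from the talk):

```python
import pandas as pd

def resample(df):
    """Bootstrap resample: a new sample the same length as df,
    drawn with replacement using the pandas sample method."""
    n = len(df)
    return df.sample(n=n, replace=True)

# tiny illustrative dataset
original = pd.DataFrame({'height': [160, 165, 170, 175, 180]})
boot = resample(original)
```

Every row of `boot` is a row of `original`, and some rows typically appear more than once; that repetition is what gives the bootstrap its variability.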
Now we need a test statistic, and we're going to start with just height, average height. So here's a function now that computes my test stat, which is the difference in height between the two groups. It uses the group by method to divide the entire sample by sex and then extract the height column, and then the second line there computes the mean in each group, and then diff computes the difference between those two means. So it's the difference in means between the heights of two groups.
The last thing I'm going to do is repeat that a thousand times. So this list comprehension runs resample a thousand times, so it generates new data frames, and then for each one it computes that test statistic.
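Putting the pieces together, a sketch of the whole pipeline; the `sex` and `height` column names and the synthetic stand-in data are assumptions, since the actual BRFSS variable names differ:

```python
import numpy as np
import pandas as pd

def resample(df):
    """Bootstrap resample: same length as df, drawn with replacement."""
    return df.sample(n=len(df), replace=True)

def diff_means(df):
    """Test statistic: difference in mean height between the sex groups."""
    means = df.groupby('sex')['height'].mean()
    return means.diff().iloc[-1]

# synthetic stand-in for the BRFSS data (column names are assumptions)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'sex': ['F'] * 200 + ['M'] * 200,
    'height': np.concatenate([rng.normal(163, 7, 200),
                              rng.normal(177, 8, 200)]),
})

# repeat a thousand times: the sampling distribution of the statistic
sampling_dist = [diff_means(resample(df)) for _ in range(1000)]
```

The list comprehension in the last line is the "repeat a thousand times" step: each iteration generates a new resampled DataFrame and computes the same test statistic on it.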
Here's what the results look like. This is the distribution of those results. This is the sampling distribution of the difference in height between men and women, and it looks like men are taller than women on average. So you've all learned something this morning.
We can now use that sampling distribution to answer a couple of questions. So one is, how precise is that estimate? In other words, if we had run this experiment over and over, how much would we expect that estimate to vary from one sample to another? And we can do that by computing a confidence interval that contains 95% of those iterations that we just computed. And from that we can say men are taller by about 14.43 centimeters. I can put a confidence interval on that, and because my sample size is super big, the confidence interval is super small.
Second question is, if we had collected samples like this over and over, is it possible that it could have gone the other way? That in this sample, maybe women would have been taller than men, and not surprisingly, we'll find that that is unlikely. We can compute a p-value. Here's the function that does it. And this is based on an assumption that the tail of that sampling distribution behaves like a normal distribution. And in that case, I can use the CDF of the normal distribution right there to compute the probability that that difference we saw could have been on the other side of zero.
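A sketch of both computations, using `numpy.percentile` for the interval and the stdlib `NormalDist` for the normal CDF he mentions (the function names are ours):

```python
import numpy as np
from statistics import NormalDist

def confidence_interval(sampling_dist, level=95):
    """CI containing the middle `level` percent of the iterations."""
    tail = (100 - level) / 2
    return np.percentile(sampling_dist, [tail, 100 - tail])

def p_value(sampling_dist):
    """Probability the difference could have landed on the other side
    of zero, assuming the tail of the sampling distribution behaves
    like a normal distribution (and the observed difference is positive)."""
    mu = np.mean(sampling_dist)
    sigma = np.std(sampling_dist)
    return NormalDist(mu, sigma).cdf(0)

# synthetic sampling distribution centered near the 14.4 cm difference
rng = np.random.default_rng(0)
dist = rng.normal(14.4, 0.1, 1000)
ci = confidence_interval(dist)
p = p_value(dist)
```

With a large sample the sampling distribution is narrow, so the interval is tight and the p-value is effectively zero, matching what the talk reports.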
In this case, the p-value is super small, so we conclude that it is unlikely that we could see a difference that big by chance.
Now, if you know some mathematical statistics, and in this group I suspect that you do, you've probably looked at this example and said, wait a minute, you just did a difference in mean. That's a t-test. We could have done that in one line of code. I could have even looked it up in a table. Why are we doing all this?
Well, we didn't really care about the difference in means. We really care about the difference in standard deviation or some other measure of variability. So, okay, how are we going to do that? With mathematical statistics, we now have to go back to the drawing board and start over. Okay, it's not a t-test. What's the test for comparing the difference in standard deviations? I don't know. We've got to go back to Reddit and find out.
Whereas with computational statistics, I can take that example that I just showed you, seven lines of code, and I'm going to make it do the difference in standard deviations. Like that. It was kind of subtle, so let me make that a little clearer. Here's the difference in means. Here's the difference in standard deviations. This is the nice thing about computational statistics: I can use any test statistic just as easily.
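The swap is literally one pandas method, `std` in place of `mean`. A sketch, again with assumed column names and toy data of our own:

```python
import pandas as pd

def diff_stds(df):
    """Same pattern as the difference in means, with std swapped in."""
    stds = df.groupby('sex')['height'].std()
    return stds.diff().iloc[-1]

# toy data with known group standard deviations (3 for F, 8 for M)
df = pd.DataFrame({'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
                   'height': [160, 163, 166, 170, 178, 186]})
```

Everything else in the pipeline, the resampling loop, the confidence interval, the p-value, stays exactly the same.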
Here's what that distribution looks like. This is the sampling distribution of the difference in standard deviations. And once again, it's precise because my sample size is so big, and the p-value is quite small. So this is saying that at least as measured by standard deviation, men are more variable than women. But maybe standard deviation isn't the right thing to measure because almost any growth process, if it makes bigger things, it probably has more variability. So we might be more interested in standard deviation relative to the mean.
So maybe we should use the coefficient of variation instead, because that's the ratio of standard deviation to mean. So we could go back to the Reddit statistics forum and ask how to compare coefficients of variation. Or again, we can go back to this example and change it so that it looks like that.
So, not too bad. It got a little bit more complicated, only because coefficient of variation isn't a built-in function. So I had to implement it. But we're still at a grand total of nine lines of code. And now we can see that the confidence interval is small and the p-value is small. So this result is statistically significant. But the difference is really small and probably has no consequences in real life. And it might actually just be the result of some data errors. There are suspiciously short and tall people in this dataset.
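A sketch of that version; since the coefficient of variation isn't built into pandas, we implement it as a small helper (names and toy data are ours):

```python
import pandas as pd

def coef_var(series):
    """Coefficient of variation: standard deviation relative to the mean."""
    return series.std() / series.mean()

def diff_cvs(df):
    """Difference in coefficient of variation between the sex groups."""
    cvs = df.groupby('sex')['height'].apply(coef_var)
    return cvs.diff().iloc[-1]

# toy data: CV is 3/163 for group F and 8/178 for group M
df = pd.DataFrame({'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
                   'height': [160, 163, 166, 170, 178, 186]})
```

The only structural change from the earlier versions is the two-line helper passed to `apply`; the bootstrap loop around it is untouched.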
So I guess our answer is this dataset doesn't provide much support for the variability hypothesis. But that wasn't really the point of this whole thing. The point of this is that mathematical statistics only gets you so far.
Really, there is only one way to do hypothesis testing. And that is this framework. This is the there-is-only-one-test framework. It always starts with a dataset. And you have to choose a test statistic that quantifies the effect, the size of the effect that you are interested in. You use the data to create a model of the population. We did that by bootstrap resampling, but there are other ways to do it. You use that model now to generate lots of simulated datasets. For each dataset, you compute the same test statistic, collect all of the results. That's your sampling distribution. And from that, you can compute a confidence interval or a p-value.
That is it. That is all of statistical inference in 10 minutes. And nine lines of code.
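The framework above can be condensed into one generic function; this sketch uses placeholder names (`resample_func`, `test_stat`) that are ours, not from the talk:

```python
import numpy as np

def sampling_distribution(data, resample_func, test_stat, iters=1000):
    """The framework in one loop: simulate datasets from the model,
    compute the same test statistic on each, collect the results."""
    return [test_stat(resample_func(data)) for _ in range(iters)]

def confidence_interval(dist, level=95):
    """CI containing the middle `level` percent of the results."""
    tail = (100 - level) / 2
    return np.percentile(dist, [tail, 100 - tail])

# toy example: uncertainty of a mean, with the bootstrap as the model
rng = np.random.default_rng(1)
data = rng.normal(10, 2, 100)
resample_func = lambda d: rng.choice(d, size=len(d), replace=True)

dist = sampling_distribution(data, resample_func, np.mean)
ci = confidence_interval(dist)
```

Any model of the population and any test statistic can be plugged into the same two slots, which is the point of the "only one test" idea.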
So here's the analogy I like. If you have not watched this channel where they build Lego mechanisms, mathematical statistics is like the level one car that hits the first barrier, and it gets stuck, and it can't get over it. And computational statistics is like the last car that they build because it's got six wheels and a propeller, and it has stilts, and it can climb over anything.
So there's my claim. The technological trigger for data science was computation because statistics as a field missed the boat on computation. And with apologies to people in the room if you identify as a statistician, I'm really talking about the field. The field missed the boat, and it missed a lot of other boats.
The hype cycle: peak, trough, and beyond
So I think that's why data science came to be. And now the question is, have we hit the peak of expectations? Are we in the trough of disillusionment? Where are we here?
So for the peak of expectations, I want to nominate 2009 to 2012. In 2009, the Netflix Prize was awarded. Then in 2010 came the first iteration of Kaggle, which turned machine learning into a spectator sport. In 2011, we got Moneyball, which turned spectator sports into machine learning. In 2012, Coursera took machine learning courses and made them available to everybody. And then in 2013, everybody who graduated got the sexiest job of the 21st century, which I'm going to nominate as the peak of inflated expectations. And when we look at it now, it's kind of like the peak of cringe.
So if that was the peak, what about the trough of disillusionment? I'm going to nominate 2016 to 2018 as the trough. In 2016, Cathy O'Neil introduced us all to the dark side of big data. ProPublica got us thinking about algorithmic fairness in criminal justice. And famously, data scientists failed to predict the outcome of the 2016 election. Now, that's not exactly correct, but that is how people remember it.
2017, from former executives at Facebook, we got some of the first evidence of the harms of social media. We got alarms about facial recognition, fairness, and race. We got warnings about algorithms and fairness and gender. We learned that Cambridge Analytica had been misusing Facebook data. We learned that Google had been misusing medical data. And just to cap it all off, we learned that machine learning is ruining spectator sports.
So that was not great. And I think that's how we ended up where we are right now in the trough of disillusionment.
Reasons for optimism: data journalism and data literacy
So where do we go from here? How do we climb the slope of enlightenment? And what does the plateau of productivity look like? So I want to suggest two things that make me optimistic about where we go from here. One that makes me nervous, and then I have a plea for what you can all do to help us get up that slope.
So here's one thing that makes me feel good. In 1985, we had some of the first data journalism. The USA Today started publishing infographics, including this hard-hitting piece about which parts of laundry people really dislike doing. In 2015, The Upshot published this, which is a three-dimensional interactive representation of the yield curve, one of the most notoriously difficult ideas in economics. Now, I'm not sure I still understand the yield curve, but I really appreciate their optimism about their audience.
I think data journalism is sneakily improving the level of data literacy in general. So I really think that's good. They are also complementing the ability of governments to generate data and share that data. Just as two small examples, The Washington Post now has a better database of gun deaths than the FBI does. And the New York Times has a better dataset of traffic safety than the Department of Transportation does. So these are good things. I think we're improving data literacy, and the availability of data is continuing to grow and grow.
The happiness data and negativity bias
Here's the thing I'm worried about, and this comes from the General Social Survey. If you are not familiar with it, it is a really great source of data. They have been doing a repeated cross-section of adults in the U.S. since the 1970s. They have a total now of more than 70,000 respondents. And one of the questions that they have asked during every cycle since 1972 is this one. Taken all together, how would you say things are these days? Would you say that you are very happy, pretty happy, or not too happy?
So I grabbed that data, and I have plotted it over time. This is the fraction of people who said very happy. It was in the mid-30s when the survey started. It was declining slowly, and then around 2010 started to decline a bit more quickly. Now, if you look at that over time, it does not look great, but it's a relatively slow decline.
Let me show you now what that looks like by year of birth. This is the fraction of people who say that they are very happy, grouped by what year they were born in. So, for people born in the 1880s (because this dataset goes back a long way), they were very old when they were interviewed, but they were pretty happy. It was declining for a while, and it has been declining very steeply for people born in the 80s, 90s, and 2000s.
This is not just because we are interviewing them when they are young. And to see that, let me show you this, and this takes some unpacking. This is now grouped by decade of birth. So each line represents one decade of birth from the 1890s up to the bottom right-hand corner. Those are the people who were born in the 2000s, and following them over time, over the years of the survey. So there's a lot going on. Things kind of go up and down. For many groups, things have been declining recently, but the noteworthy thing in the lower right-hand corner there is that people born in the 90s and 2000s are more unhappy now than any previous generation at any age.
So why? Why is that happening? It's complicated. It's always complicated. There are going to be many factors. I want to talk about just one of them: it seems likely that at least part of this is excessive consumption of relentlessly negative media. Now, negativity bias is not new. There has always been negativity bias in the media and, fundamentally, in our heads; in our psychology, we have a built-in negativity bias. But the pattern of consumption is a new thing. We have a new word for it, doomscrolling: spending too much time consuming large quantities of media, especially negative media.
So from a data science point of view, I think this is a data bias problem. It is a bias in our media diet, which suggests that data could be an antidote or at least a partial solution to the problem. Now, the world champion of that idea, that data can be an antidote to negativity, was Hans Rosling. This is his famous video showing a bubble graph of world development in the last hundred years as a function of income and life expectancy.
If you have not seen it, I have four minutes and 47 seconds that you are really going to enjoy. If you have not read Hans Rosling's book, Factfulness, I recommend it very strongly. This is the antidote to a lot of incorrectly negative beliefs that many people have about the world. That is from 2018, so it is pretty current. But if you want the most up-to-date data, Our World in Data is the current champion of this idea. They do research and data to make progress against the world's largest problems. And if you explore their site, you will see lots of graphs where good things go up and to the right, like life expectancy, and bad things, like poverty, go down and to the right.
And what you'll find is that on long-term trends, almost everything is getting better. Now, people don't know this, and I want to do an experiment. This is from Gapminder. This is the same group that Hans Rosling started. They have a lot of tests that you can take to see if your perception of the world is accurate. And I'm going to give you one of them. So, how did the number of deaths per year from natural disasters change over the last 100 years?
If you said that it decreased to less than half, you are correct. And here's the data from Our World in Data. This is the raw number of deaths from natural disasters going back to 1900. And depending on when you start and how you compute the difference, it has declined by a factor of 5 or 10 in that time. At the same time, world population has gone up by about a factor of 5. So, as a rate, this has gone down substantially. This is good news.
People do not know this good news. When people take these quizzes, they do worse than chance. 84% of people got this one wrong. There were only three choices, so by chance, 33% should have got it right and 67% should have got it wrong. 84% is too many. So, people don't know this.
I gave one of these quizzes to my data science class. I had the students take the quiz and then write some reflections on it. This is from the Google survey. I don't expect you to read all of this, but I want to draw your attention to one phrase that appeared in almost every response: I was too pessimistic. And one person just wrote, I'm a pessimist. And I want to tell you what I told them. You are not a pessimist. You have been misled by a relentlessly negative media diet. On long-term trends, almost everything is getting better.
I have found that when I say that, people get angry. And my conclusion is, we need to say three things at the same time. On long-term trends, almost everything is getting better. And we still have serious problems to solve. And our history of solving problems suggests that we can solve these new ones too.
And when I say that, yes, I'm including climate change. So, where I think we are on climate change, we have not responded as quickly as we should have. We're still not doing everything we should be doing. But many environmental trends are already going in the right direction. And there are paths between now and the end of the century toward a stable, healthy environment and a good quality of life for everybody on the planet. So, we are not doomed. But a lot of people think we are.
If you think we're doomed, or someone in your life thinks we're doomed, please give them this book. I don't have a chance here to make a complete case about where we are here. But Hannah Ritchie, who is not coincidentally a researcher at Our World in Data, this book, I think, makes the case really well. Get a copy, read it out loud to somebody under 30.
A plea for data science on the slope of enlightenment
So, negativity bias, I think, is a serious threat to our well-being because it undermines our ability to address the important problems that we need to address. So, finally, here is my plea. Here is what I would like you all to do to help us get out of this trough of disillusionment and onto the plateau of productivity. First, stop telling kids they'll die from climate change. Second, stop reading the news. This is the other book I want to recommend. The title says it all. But if you have trouble mustering the strength to stop reading the news, this book might help.
Last thing, use data to understand the world better so that we know how to make the world better. There's a lot of data out there. And I'll tell you a secret. A lot of the organizations that generate open data sets, and especially government agencies, have enough resources to make the data set and publish it, and often not a lot of resources to do much with it. So, when you're the first person to go into one of these data sets and really look around and explore, you will inevitably find interesting things, like first babies are more likely to be late, or things that are more important than that.
So, take advantage of the data that's out there. And this group in particular has all the tools and processes that you need to answer questions, resolve debates, and make better decisions. It's the tools of open science. And let me turn that sideways so we can read it. Open data, open source software, open methodology, open peer review, open access, and especially from my point of view, open educational resources.
So, I want to end on that, because it's one of the things that I work on. I'll give you a few links and credit for some of the resources that I've used. If you are interested in that first baby example, that is from ThinkStats. And the third edition is what I'm working on right now. It's available free at that link. I'll also give you another chance to get these slides so that you can get those links. The data there, as I said, is from the National Survey of Family Growth run by the CDC.
The resampling example, if you're interested in that, that is from Elements of Data Science, also available free. And all of that data was from the BRFSS, also from the CDC. Finally, that happiness example is based on Chapter 10 of Probably Overthinking It. And that data is from the General Social Survey.
Last thing, all of the notebooks are available. Those links will take you to the notebook running on Colab, so if you want to replicate anything I did or run your own experiments, you can do that. And as of Monday, I learned enough R to translate my Python examples into R. So if you want to see the variability example in R, it's there. This is, I will admit, the first R program that I wrote beyond Hello World, and I am sharing it with all of you for a code review. So I would like to hear what you think about my first attempt.
So, I feel like I've had my chance to talk. Here are five ways that you can get in touch with me if you want to talk back, but we also have a chance to take some questions, and again, you can grab the slides using either that link or the QR code.
Q&A
Thank you. Thanks for a really inspiring talk, Allen. So, again, if you've got questions, you can ask them on the Slido link from the app, and we've got a few coming in already.
Okay, two, we're kind of starting at the beginning of the talk. So, two questions kind of related about bootstrapping. First of all, like survey data, you know, it's really important to account for the sampling design and the weights, and what about the case where you're looking at smaller data sets with heavier tails? Like, where do you kind of draw the line between computational statistics and mathematical theory?
Yes, all good questions. So, the first one, about the sampling design: I actually cut that from the slide, but it is in the notebook. You'll see the BRFSS uses stratified sampling, so I had to correct for that by using the sampling weights as part of the bootstrap process, and all I had to do was change the sample method to take an additional parameter, which is the sampling weights. So that's one of the nice things about the bootstrap: that kind of re-weighting is super easy to do.
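In code, that change might look like this; the weight column name is a placeholder of ours, not the actual BRFSS variable name:

```python
import pandas as pd

def resample_weighted(df, weight_col='sample_weight'):
    """Bootstrap resample that respects survey design weights: rows are
    drawn with probability proportional to their weight. The column
    name here is a placeholder, not the actual BRFSS variable name."""
    return df.sample(n=len(df), replace=True, weights=df[weight_col])

# toy data: the zero-weight rows should never be drawn
df = pd.DataFrame({'height': [160, 170, 180],
                   'sample_weight': [0.0, 0.0, 1.0]})
boot = resample_weighted(df)
```

The re-weighting really is just the extra `weights` argument to `sample`; the rest of the pipeline is unchanged.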
With small samples, you do run into the Achilles heel of bootstrapping, which is data diversity. If you have a small sample, and you draw new samples from it, the new samples will just all look the same, and you won't see enough diversity in the results. And in that case, one option is to switch to a parametric bootstrap, where instead of just doing sampling with replacement, you take your data, you build a model of the population, but the model now is a parametric model, and so it has some smoothness to it, and now when you draw samples, those samples will be continuous and diverse in all the ways that we want.
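A sketch of a parametric bootstrap, using a normal model as an illustrative choice (the answer doesn't specify a family):

```python
import numpy as np

def parametric_resample(sample, seed=None):
    """Parametric bootstrap: fit a model to the data (here, a normal
    distribution, an illustrative choice), then draw a new sample of
    the same size from the fitted model. Unlike the plain bootstrap,
    the draws are continuous, so a small sample still yields diversity."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(sample), np.std(sample)
    return rng.normal(mu, sigma, size=len(sample))

# a small sample that a plain bootstrap would keep recycling
small_sample = [158.0, 163.0, 171.0, 175.0, 181.0]
new_sample = parametric_resample(small_sample, seed=0)
```

Because the draws come from a smooth fitted model rather than the five original values, every simulated sample contains new values, which is the diversity the plain bootstrap lacks here.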
So, where do I draw the line? I really don't. I use computational stats for everything. One concept I learned about recently, and maybe you've heard of it too, is this idea of the bitter lesson. This is an idea from computer science and AI, and I'll probably paraphrase it incorrectly, but it says that simple methods that use computation end up winning over time, and it felt to me like that's what you're really saying about the bootstrap. It's a simple method. It requires computation, but over time, computation continues to get better and better, and we don't have to worry about deriving all these special cases.
Okay, next question. I think this question possibly comes from a pessimist. I'm going to stay there. I can cure you. How do you suggest we calibrate our worldview when you account for Simpson's paradox, where all of these large trends are positive, but if you break them down into smaller categories, they all go the opposite way?
I think that's possible in theory, but I don't think I've seen examples of it. In one of my examples, looking at poverty globally, the global trend was declining poverty, and when you divide it up by region, all of the regional trends are declining as well. There will always be exceptions. There will be short-term reversals, and there will be specific places and times where things are not necessarily positive, but the long-term trends are positive. I have not seen examples of Simpson's paradox there.
What was the hardest part of writing your first R script? Oh, you know what the answer is going to be. It was fixing my environment. I had an old, broken installation of R, and it took me an hour and a half to get it fixed. It's not Python. Most people don't have an environment.
What do you think that statistics educators need to do to get on the boat, and what do data science educators need to do to make sure their students stay statistically responsible and literate?
Well, so one piece of this, a lot of universities are creating data science programs, and they're going about it in a way that is how universities work, which is that they get a bunch of computer scientists and a bunch of statisticians and try to make them play together, and the assumption is that that is somehow going to be data science, and I think that doesn't work. I think data science is a different thing from both of those, and starting with people who are professionally attached to those identities is probably not the right starting place, so I think if you're going to do data science education, you need data scientists.
What do you sacrifice in your life to be so prolific in your output? I was amazed by all the books you've done. Oh, thank you. I really don't feel like it's a sacrifice at all. I really enjoy the work. I look forward to it. I make consistent progress, so I don't feel like I have to work in sudden bursts of staying up all night. I don't do all-nighters, that kind of thing, but I don't think I'm neglecting other parts of my life.
I will also say, as someone who's written quite a few books too, that for me personally the key is writing regularly, just making that steady progress. I've given that advice to everyone who wants to write a book, and so far, no one has successfully followed it. It's a very simple idea, but it's very, very hard to do. For me, it's the most relaxing kind of work that I do. So if I'm stressed about something, taking some time to focus and do some writing is a relief.
Humans and governments often only seem to act on big challenges, like climate change, when the challenge seems like a crisis. How do we inspire change without a sense of doom?
Yeah, and I do think that's part of how we got to climate doomerism: climate denialism was such a problem for so long that people felt like they had to turn up the heat, metaphorically, and insist more and more strongly, this is a crisis, we need to act now, and we've almost gone too far. So yeah, how do you get people engaged without invoking that overshoot? That is a really hard question. I don't know.
But it also reminds me, we do have a long history of actually solving problems. And so a recent, relatively current one is the hole in the ozone layer. In the 1980s, that was the big crisis. And when was the last time you even heard anybody say ozone layer? And the reason is that we have largely solved that problem. We banned most of the chemicals that were causing the damage. The ozone hole has been closing for decades, and will probably completely close by the end of the century. So we actually got together. The Montreal Protocol, I think, was the treaty, the international treaty. I think 190 countries signed on to it and got it done.
How do you reconcile a desire to be an informed and engaged citizen with advice that says to read less news to be happy? So I've started doing this. Within the last year, I consume almost no news media. I do read The Economist, so that's an exception. And what I've found is that if something is important enough that I really need to know it, I will hear about it.
Should we distinguish data science from science in general? Is there a meaningful distinction? Yeah. So, you know, is data science just doing science? Because when you're doing science, you're almost always working with data, and you are using that data to answer questions, resolve debates, and make better decisions. So, yeah, I think there's an argument there. That last part, though, might be a difference, which is that science is usually about creating knowledge and not necessarily designing things or making decisions. So maybe what that means is that data science is actually broader than science because it also includes those elements.
How do you reconcile optimism against an overwhelming amount of climate scientists that feel we are approaching an unreconcilable ecological disaster and whose historical recommendations have a bias towards being overly optimistic already?
So, yes, the one way that I could be wrong, if we look at long-term trends and we say, hey, things are going well, if we keep doing what we're doing... Because, you know, those long-term trends didn't happen automatically. We did things that made them happen. So if we keep doing what we're doing and keep solving problems as we confront them
