Allen Downey - A future of data science
Transcript
This transcript was generated automatically and may contain errors.
Good morning. Thank you. It's really, it's great to be here. I've really enjoyed the conference and I'm happy to talk to you today. I want to say a little bit about how I got here, how we as data scientists got here, and then where do we go from here? What is a future of data science?
So this is an opinionated talk, and I want to give you a chance to talk back to me, not like during the talk, but we'll do Q&A and I'll be around afterward, so tell me what you think.
The first baby problem
I'm going to start with my most successful data science project. In 2003, when my wife and I were expecting our first baby, I googled this question. Are first babies more likely to be late or early or both? And what I got at the time was anecdotal evidence. My cousin had two babies and they were both early and therefore I don't know.
What we need here is data. And the National Survey of Family Growth has data. This is a survey conducted by the CDC. It started in 1973. They've done 11 repetitions. They have more than 100,000 respondents now. And for female respondents, they have details on their pregnancies and pregnancy length, including duration in weeks.
And so I thought, if we're asking this question around week 35, which is when this question becomes a little bit more pointed, we can ask what is the remaining lifetime, or remaining duration, of a pregnancy at that time. And you can see here the blue curve is for first babies and the orange curve is all other babies. And this is the distribution of total duration in weeks. And you can see that, in fact, first babies are a little bit less likely to be right at the nominal 39 weeks and a little bit more likely to be a couple of weeks late.
So now, if you Google this question, the first two hits on the page are me. And if you go farther down the page, you'll get this BBC article on the topic, which cites me. So I am now the world-renowned expert on this topic.
So why am I saying that this is my most successful project? Because I took a question from not answered and I put it over into the answered column of questions. And this was not hard to do. I needed data, and it was freely available, open data. Thank you to the NSFG. Simple methods, basic visualization, no fancy statistics, all free, open software, and a venue to make it visible, which was my blog. And that's all we need.
This is an example of what I hope data science is and can be, which is a set of tools and processes for answering questions, resolving debates, and making better decisions.
The Gartner hype cycle and data science
So that's how I got here. How did we get here, and where are we going? To think about this, I think the Gartner Hype Cycle will help. If you are not familiar with it, this is the idea that new things, new technologies, go through a sequence of phases with catchy names. There's a technology trigger that gets things started, a peak of inflated expectations, a trough of disillusionment, a slope of enlightenment, and then a plateau of productivity.
So what is the technology trigger that created data science? I'm going to suggest it's ENIAC. This is the first programmable, electronic, general-purpose digital computer in 1945. So how did this create data science? Here's my thesis, and here's the first opinionated part of this talk. Data science exists because statistics as a field missed the boat on computers.
Let me explain what I mean by that, especially if we look at statistics education. If you take the canonical introductory statistics class, you learn the central limit theorem, and you learn a few special cases where sampling distributions have a nice mathematical form. And then people graduate from that class, and they encounter data for the first time, and they ask for help. They go to the Reddit statistics forum, and they ask: which test should I use? This is the question, over and over, on Reddit statistics.
Which test should I use? Because their education has given them the idea that all of the problems have been solved. You just have to know which test to use. And if you don't, that's your fault. That's not true. I think the data science approach, the computational approach, says approximately none of the problems have been solved. What we need is a versatile set of tools to compose the solutions that we need. And what is the most versatile tool that we have? A programmable, electronic, general-purpose digital computer.
Teaching statistical inference computationally
So, I want to demonstrate this point, the difference between mathematical statistics and computational statistics, by teaching you all of statistical inference in 10 minutes. And we're going to do this by testing the variability hypothesis. If you're not familiar with this, it is the idea that in many species, males are more variable than females on many dimensions. It is a controversial idea. It has a long and interesting history.
So, if you're interested, do read about that. As my example, I'm going to try to use something minimally problematic, and we'll just talk about the difference in height between men and women. We need data. Again, the CDC is here to help us. The Behavioral Risk Factor Surveillance System, the BRFSS, is another repeated cross-sectional sample of adults in the U.S. They get more than 400,000 respondents during each cycle, and it includes self-reported heights and weights.
So, as a warm-up, let's just look at the difference in height between men and women. I'm going to do this by resampling. This is one of the core tools of computational statistics. It's the idea that you take the sample that you actually collected, use it to build a model of the population that you're interested in, and then use that model to generate lots and lots of synthesized samples. That's the idea of resampling. More specifically, with bootstrap resampling, we're going to do that by drawing from the original sample a new sample that is the same size, drawn with replacement.
And so now let's see what that looks like in code. It's a lot shorter to say in code than it was for me to say in words. There's what it looks like. It's a function that takes a pandas data frame, and it uses the sample method to generate a new data frame that is the same length, n, and it samples with replacement. So that's it. That's the bootstrap resampling process.
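A minimal sketch of the function he describes, assuming the data lives in a pandas DataFrame (the variable names here are ours, not from the talk):

```python
import pandas as pd

def resample(df):
    """Bootstrap resample: a new sample the same length as df,
    drawn with replacement using the pandas sample method."""
    n = len(df)
    return df.sample(n=n, replace=True)

# tiny illustrative dataset
original = pd.DataFrame({'height': [160, 165, 170, 175, 180]})
boot = resample(original)
```

Every row of `boot` is a row of `original`, and some rows typically appear more than once; that repetition is what gives the bootstrap its variability.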
Now we need a test statistic, and we're going to start with just height, average height. So here's a function now that computes my test stat, which is the difference in height between the two groups. It uses the group by method to divide the entire sample by sex and then extract the height column, and then the second line there computes the mean in each group, and then diff computes the difference between those two means. So it's the difference in means between the heights of two groups.
The last thing I'm going to do is repeat that a thousand times. So this list comprehension runs resample a thousand times, so it generates new data frames, and then for each one it computes that test statistic.
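Putting the pieces together, a sketch of the whole pipeline; the `sex` and `height` column names and the synthetic stand-in data are assumptions, since the actual BRFSS variable names differ:

```python
import numpy as np
import pandas as pd

def resample(df):
    """Bootstrap resample: same length as df, drawn with replacement."""
    return df.sample(n=len(df), replace=True)

def diff_means(df):
    """Test statistic: difference in mean height between the sex groups."""
    means = df.groupby('sex')['height'].mean()
    return means.diff().iloc[-1]

# synthetic stand-in for the BRFSS data (column names are assumptions)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'sex': ['F'] * 200 + ['M'] * 200,
    'height': np.concatenate([rng.normal(163, 7, 200),
                              rng.normal(177, 8, 200)]),
})

# repeat a thousand times: the sampling distribution of the statistic
sampling_dist = [diff_means(resample(df)) for _ in range(1000)]
```

The list comprehension in the last line is the "repeat a thousand times" step: each iteration generates a new resampled DataFrame and computes the same test statistic on it.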
Here's what the results look like. This is the distribution of those results. This is the sampling distribution of the difference in height between men and women, and it looks like men are taller than women on average. So you've all learned something this morning.
We can now use that sampling distribution to answer a couple of questions. So one is, how precise is that estimate? In other words, if we had run this experiment over and over, how much would we expect that estimate to vary from one sample to another? And we can do that by computing a confidence interval that contains 95% of those iterations that we just computed. And from that we can say men are taller by about 14.43 centimeters. I can put a confidence interval on that, and because my sample size is super big, the confidence interval is super small.
Second question is, if we had collected samples like this over and over, is it possible that it could have gone the other way? That in this sample, maybe women would have been taller than men, and not surprisingly, we'll find that that is unlikely. We can compute a p-value. Here's the function that does it. And this is based on an assumption that the tail of that sampling distribution behaves like a normal distribution. And in that case, I can use the CDF of the normal distribution right there to compute the probability that that difference we saw could have been on the other side of zero.
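A sketch of both computations, using `numpy.percentile` for the interval and the stdlib `NormalDist` for the normal CDF he mentions (the function names are ours):

```python
import numpy as np
from statistics import NormalDist

def confidence_interval(sampling_dist, level=95):
    """CI containing the middle `level` percent of the iterations."""
    tail = (100 - level) / 2
    return np.percentile(sampling_dist, [tail, 100 - tail])

def p_value(sampling_dist):
    """Probability the difference could have landed on the other side
    of zero, assuming the tail of the sampling distribution behaves
    like a normal distribution (and the observed difference is positive)."""
    mu = np.mean(sampling_dist)
    sigma = np.std(sampling_dist)
    return NormalDist(mu, sigma).cdf(0)

# synthetic sampling distribution centered near the 14.4 cm difference
rng = np.random.default_rng(0)
dist = rng.normal(14.4, 0.1, 1000)
ci = confidence_interval(dist)
p = p_value(dist)
```

With a large sample the sampling distribution is narrow, so the interval is tight and the p-value is effectively zero, matching what the talk reports.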
In this case, the p-value is super small, so we conclude that it is unlikely that we could see a difference that big by chance.
Now, if you know some mathematical statistics, and in this group I suspect that you do, you've probably looked at this example and said, wait a minute, you just did a difference in mean. That's a t-test. We could have done that in one line of code. I could have even looked it up in a table. Why are we doing all this?
Well, we didn't really care about the difference in means. We really care about the difference in standard deviation or some other measure of variability. So, okay, how are we going to do that? With mathematical statistics, we now have to go back to the drawing board and start over. Okay, it's not a t-test. What's the test for comparing the difference in standard deviations? I don't know. We've got to go back to Reddit and find out.
Whereas with computational statistics, I can take that example that I just showed you, seven lines of code, and I'm going to make it do the difference in standard deviations. Like that. It was kind of subtle, so let me make that a little clearer. Here's the difference in means. Here's the difference in standard deviations. This is the nice thing about computational statistics: I can use any test statistic just as easily.
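The swap is literally one pandas method, `std` in place of `mean`. A sketch, again with assumed column names and toy data of our own:

```python
import pandas as pd

def diff_stds(df):
    """Same pattern as the difference in means, with std swapped in."""
    stds = df.groupby('sex')['height'].std()
    return stds.diff().iloc[-1]

# toy data with known group standard deviations (3 for F, 8 for M)
df = pd.DataFrame({'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
                   'height': [160, 163, 166, 170, 178, 186]})
```

Everything else in the pipeline, the resampling loop, the confidence interval, the p-value, stays exactly the same.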
Here's what that distribution looks like. This is the sampling distribution of the difference in standard deviations. And once again, it's precise because my sample size is so big, and the p-value is quite small. So this is saying that at least as measured by standard deviation, men are more variable than women. But maybe standard deviation isn't the right thing to measure because almost any growth process, if it makes bigger things, it probably has more variability. So we might be more interested in standard deviation relative to the mean.
So maybe we should use the coefficient of variation instead, because that's the ratio of standard deviation to mean. So we could go back to the Reddit statistics forum and ask how to compare coefficients of variation. Or again, we can go back to this example and change it so that it looks like that.
So, not too bad. It got a little bit more complicated, only because coefficient of variation isn't a built-in function. So I had to implement it. But we're still at a grand total of nine lines of code. And now we can see that the confidence interval is small and the p-value is small. So this result is statistically significant. But the difference is really small and probably has no consequences in real life. And it might actually just be the result of some data errors. There are suspiciously short and tall people in this dataset.
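A sketch of that version; since the coefficient of variation isn't built into pandas, we implement it as a small helper (names and toy data are ours):

```python
import pandas as pd

def coef_var(series):
    """Coefficient of variation: standard deviation relative to the mean."""
    return series.std() / series.mean()

def diff_cvs(df):
    """Difference in coefficient of variation between the sex groups."""
    cvs = df.groupby('sex')['height'].apply(coef_var)
    return cvs.diff().iloc[-1]

# toy data: CV is 3/163 for group F and 8/178 for group M
df = pd.DataFrame({'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
                   'height': [160, 163, 166, 170, 178, 186]})
```

The only structural change from the earlier versions is the two-line helper passed to `apply`; the bootstrap loop around it is untouched.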
So I guess our answer is this dataset doesn't provide much support for the variability hypothesis. But that wasn't really the point of this whole thing. The point of this is that mathematical statistics only gets you so far.
Really, there is only one way to do hypothesis testing. And that is this framework. This is the there-is-only-one-test framework. It always starts with a dataset. And you have to choose a test statistic that quantifies the effect, the size of the effect that you are interested in. You use the data to create a model of the population. We did that by bootstrap resampling, but there are other ways to do it. You use that model now to generate lots of simulated datasets. For each dataset, you compute the same test statistic, collect all of the results. That's your sampling distribution. And from that, you can compute a confidence interval or a p-value.
That is it. That is all of statistical inference in 10 minutes. And nine lines of code.
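The framework above can be condensed into one generic function; this sketch uses placeholder names (`resample_func`, `test_stat`) that are ours, not from the talk:

```python
import numpy as np

def sampling_distribution(data, resample_func, test_stat, iters=1000):
    """The framework in one loop: simulate datasets from the model,
    compute the same test statistic on each, collect the results."""
    return [test_stat(resample_func(data)) for _ in range(iters)]

def confidence_interval(dist, level=95):
    """CI containing the middle `level` percent of the results."""
    tail = (100 - level) / 2
    return np.percentile(dist, [tail, 100 - tail])

# toy example: uncertainty of a mean, with the bootstrap as the model
rng = np.random.default_rng(1)
data = rng.normal(10, 2, 100)
resample_func = lambda d: rng.choice(d, size=len(d), replace=True)

dist = sampling_distribution(data, resample_func, np.mean)
ci = confidence_interval(dist)
```

Any model of the population and any test statistic can be plugged into the same two slots, which is the point of the "only one test" idea.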
So here's the analogy I like. If you have not watched this channel where they build Lego mechanisms, mathematical statistics is like the level one car that hits the first barrier, and it gets stuck, and it can't get over it. And computational statistics is like the last car that they build because it's got six wheels and a propeller, and it has stilts, and it can climb over anything.
So there's my claim. The technological trigger for data science was computation because statistics as a field missed the boat on computation. And with apologies to people in the room if you identify as a statistician, I'm really talking about the field. The field missed the boat, and it missed a lot of other boats.
The hype cycle: peak, trough, and beyond
So I think that's why data science came to be. And now the question is, have we hit the peak of expectations? Are we in the trough of disillusionment? Where are we here?
So for the peak of expectations, I want to nominate 2009 to 2012. In 2009, the Netflix Prize was awarded. Then in 2010 came the first iteration of Kaggle, which turned machine learning into a spectator sport. In 2011, we got Moneyball, which turned spectator sports into machine learning. In 2012, Coursera took machine learning courses and made them available to everybody. And then in 2013, everybody who graduated got the sexiest job of the 21st century, which I'm going to nominate as the peak of inflated expectations. And when we look at it now, it's kind of like the peak of cringe.
So if that was the peak, what about the trough of disillusionment? I'm going to nominate 2016 to 2018 as the trough. In 2016, Cathy O'Neil introduced us all to the dark side of big data. ProPublica got us thinking about algorithmic fairness in criminal justice. And famously, data scientists failed to predict the outcome of the 2016 election. Now, that's not exactly correct, but that is how people remember it.
2017, from former executives at Facebook, we got some of the first evidence of the harms of social media. We got alarms about facial recognition, fairness, and race. We got warnings about algorithms and fairness and gender. We learned that Cambridge Analytica had been misusing Facebook data. We learned that Google had been misusing medical data. And just to cap it all off, we learned that machine learning is ruining spectator sports.
So that was not great. And I think that's how we ended up where we are right now in the trough of disillusionment.
Reasons for optimism: data journalism and data literacy
So where do we go from here? How do we climb the slope of enlightenment? And what does the plateau of productivity look like? So I want to suggest two things that make me optimistic about where we go from here. One that makes me nervous, and then I have a plea for what you can all do to help us get up that slope.
So here's one thing that makes me feel good. In 1985, we had some of the first data journalism. The USA Today started publishing infographics, including this hard-hitting piece about which parts of laundry people really dislike doing. In 2015, The Upshot published this, which is a three-dimensional interactive representation of the yield curve, one of the most notoriously difficult ideas in economics. Now, I'm not sure I still understand the yield curve, but I really appreciate their optimism about their audience.
I think data journalism is sneakily improving the level of data literacy in general. So I really think that's good. They are also complementing the ability of governments to generate data and share that data. Just as two small examples, The Washington Post now has a better database of gun deaths than the FBI does. And the New York Times has a better dataset of traffic safety than the Department of Transportation does. So these are good things. I think we're improving data literacy, and the availability of data is continuing to grow and grow.
The happiness data and negativity bias
Here's the thing I'm worried about, and this comes from the General Social Survey. If you are not familiar with it, it is a really great source of data. They have been doing a repeated cross-section of adults in the U.S. since the 1970s. They have a total now of more than 70,000 respondents. And one of the questions that they have asked during every cycle since 1972 is this one. Taken all together, how would you say things are these days? Would you say that you are very happy, pretty happy, or not too happy?
So I grabbed that data, and I have plotted it over time. This is the fraction of people who said very happy. It was in the mid-30s when the survey started. It was declining slowly, and then around 2010 started to decline a bit more quickly. Now, if you look at that over time, it does not look great, but it's a relatively slow decline.
Let me show you now what that looks like by year of birth. This is the fraction of people who say that they are very happy, grouped by what year they were born in. So, for people born in the 1880s (because this dataset goes back a long way), they were very old when they were interviewed, but they were pretty happy. It was declining for a while, and it has been declining very steeply for people born in the 80s, 90s, and 2000s.
This is not just because we are interviewing them when they are young. And to see that, let me show you this, and this takes some unpacking. This is now grouped by decade of birth. So each line represents one decade of birth from the 1890s up to the bottom right-hand corner. Those are the people who were born in the 2000s, and following them over time, over the years of the survey. So there's a lot going on. Things kind of go up and down. For many groups, things have been declining recently, but the noteworthy thing in the lower right-hand corner there is that people born in the 90s and 2000s are more unhappy now than any previous generation at any age.
So why? Why is that happening? It's complicated. It's always complicated. There are going to be many factors. I want to talk about just one of them: it seems likely that at least part of this is excessive consumption of relentlessly negative media. Now, negativity bias is not new. There has always been negativity bias in the media and, fundamentally, in our heads; in our psychology, we have a built-in negativity bias. But the pattern of consumption is a new thing. We have a new word for it, doomscrolling: spending too much time consuming large quantities of media, especially negative media.
So from a data science point of view, I think this is a data bias problem. It is a bias in our media diet, which suggests that data could be an antidote or at least a partial solution to the problem. Now, the world champion of that idea, that data can be an antidote to negativity, was Hans Rosling. This is his famous video showing a bubble graph of world development in the last hundred years as a function of income and life expectancy.
If you have not seen it, I have four minutes and 47 seconds that you are really going to enjoy. If you have not read Hans Rosling's book, Factfulness, I recommend it very strongly. This is the antidote to a lot of incorrectly negative beliefs that many people have about the world. That is from 2018, so it is pretty current. But if you want the most up-to-date data, Our World in Data is the current champion of this idea. They do research and data to make progress against the world's largest problems. And if you explore their site, you will see lots of graphs where good things go up and to the right, like life expectancy, and bad things, like poverty, go down and to the right.
And what you'll find is that on long-term trends, almost everything is getting better. Now, people don't know this, and I want to do an experiment. This is from Gapminder. This is the same group that Hans Rosling started. They have a lot of tests that you can take to see if your perception of the world is accurate. And I'm going to give you one of them. So, how did the number of deaths per year from natural disasters change over the last 100 years?
If you said that it decreased to less than half, you are correct. And here's the data from Our World in Data. This is the raw number of deaths from natural disasters going back to 1900. And depending on when you start and how you compute the difference, it has declined by a factor of 5 or 10 in that time. At the same time, world population has gone up by about a factor of 5. So, as a rate, this has gone down substantially. This is good news.
People do not know this good news. When people take these quizzes, they do worse than chance. 84% of people got this one wrong. There were only three choices, so by chance, 33% should have got it right and 67% should have got it wrong. 84% is too many. So, people don't know this.
I gave one of these quizzes to my data science class. I had the students take the quiz and then write some reflections on it. This is from the Google survey. I don't expect you to read all of this, but I want to draw your attention to one phrase that appeared in almost every response: I was too pessimistic. And one person just wrote, I'm a pessimist. And I want to tell you what I told them. You are not a pessimist. You have been misled by a relentlessly negative media diet. On long-term trends, almost everything is getting better.
I have found that when I say that, people get angry. And my conclusion is, we need to say three things at the same time. On long-term trends, almost everything is getting better. And we still have serious problems to solve. And our history of solving problems suggests that we can solve these new ones too.
And when I say that, yes, I'm including climate change. So, where I think we are on climate change, we have not responded as quickly as we should have. We're still not doing everything we should be doing. But many environmental trends are already going in the right direction. And there are paths between now and the end of the century toward a stable, healthy environment and a good quality of life for everybody on the planet. So, we are not doomed. But a lot of people think we are.
If you think we're doomed, or someone in your life thinks we're doomed, please give them this book. I don't have a chance here to make a complete case about where we are here. But Hannah Ritchie, who is not coincidentally a researcher at Our World in Data, this book, I think, makes the case really well. Get a copy, read it out loud to somebody under 30.
A plea for data science on the slope of enlightenment
So, negativity bias, I think, is a serious threat to our well-being because it undermines our ability to address the important problems that we need to address. So, finally, here is my plea. Here is what I would like you all to do to help us get out of this trough of disillusionment and onto the plateau of productivity. First, stop telling kids they'll die from climate change. Second, stop reading the news. This is the other book I want to recommend. The title says it all. But if you have trouble mustering the strength to stop reading the news, this book might help.
Last thing, use data to understand the world better so that we know how to make the world better. There's a lot of data out there. And I'll tell you a secret. A lot of the organizations that generate open data sets, and especially government agencies, have enough resources to make the data set and publish it, and often not a lot of resources to do much with it. So, when you're the first person to go into one of these data sets and really look around and explore, you will inevitably find interesting things, like first babies are more likely to be late, or things that are more important than that.
So, take advantage of the data that's out there. And this group in particular has all the tools and processes that you need to answer questions, resolve debates, and make better decisions. It's the tools of open science. And let me turn that sideways so we can read it. Open data, open source software, open methodology, open peer review, open access, and especially from my point of view, open educational resources.
So, I want to end on that, because it's one of the things that I work on. I'll give you a few links and credit for some of the resources that I've used. If you are interested in that first baby example, that is from ThinkStats. And the third edition is what I'm working on right now. It's available free at that link. I'll also give you another chance to get these slides so that you can get those links. The data there, as I said, is from the National Survey of Family Growth run by the CDC.
The resampling example, if you're interested in that, that is from Elements of Data Science, also available free. And all of that data was from the BRFSS, also from the CDC. Finally, that happiness example is based on Chapter 10 of Probably Overthinking It. And that data is from the General Social Survey.
Last thing, all of the notebooks are available. Those links will take you to the notebook running on Colab, so if you want to replicate anything I did or run your own experiments, you can do that. And as of Monday, I learned enough R to translate my Python examples into R. So if you want to see the variability example in R, it's there. This is, I will admit, the first R program that I wrote beyond Hello World, and I am sharing it with all of you for a code review. So I would like to hear what you think about my first attempt.
So, I feel like I've had my chance to talk. Here are five ways that you can get in touch with me if you want to talk back, but we also have a chance to take some questions, and again, you can grab the slides using either that link or the QR code.
Q&A
Thank you. Thanks for a really inspiring talk, Allen. So, again, if you've got questions, you can ask them on the Slido link from the app, and we've got a few coming in already.
Okay, two, we're kind of starting at the beginning of the talk. So, two questions kind of related about bootstrapping. First of all, like survey data, you know, it's really important to account for the sampling design and the weights, and what about the case where you're looking at smaller data sets with heavier tails? Like, where do you kind of draw the line between computational statistics and mathematical theory?
Yes, all good questions. So, the first one, about the sampling design: I actually cut that from the slide, but it is in the notebook. You'll see the BRFSS uses stratified sampling, so I had to correct for that by using the sampling weights as part of the bootstrap process, and all I had to do was change the sample method to take an additional parameter, which is the sampling weights. So that's one of the nice things about the bootstrap: that kind of re-weighting is super easy to do.
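In code, that change might look like this; the weight column name is a placeholder of ours, not the actual BRFSS variable name:

```python
import pandas as pd

def resample_weighted(df, weight_col='sample_weight'):
    """Bootstrap resample that respects survey design weights: rows are
    drawn with probability proportional to their weight. The column
    name here is a placeholder, not the actual BRFSS variable name."""
    return df.sample(n=len(df), replace=True, weights=df[weight_col])

# toy data: the zero-weight rows should never be drawn
df = pd.DataFrame({'height': [160, 170, 180],
                   'sample_weight': [0.0, 0.0, 1.0]})
boot = resample_weighted(df)
```

The re-weighting really is just the extra `weights` argument to `sample`; the rest of the pipeline is unchanged.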
With small samples, you do run into the Achilles heel of bootstrapping, which is data diversity. If you have a small sample, and you draw new samples from it, the new samples will just all look the same, and you won't see enough diversity in the results. And in that case, one option is to switch to a parametric bootstrap, where instead of just doing sampling with replacement, you take your data, you build a model of the population, but the model now is a parametric model, and so it has some smoothness to it, and now when you draw samples, those samples will be continuous and diverse in all the ways that we want.
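A sketch of a parametric bootstrap, using a normal model as an illustrative choice (the answer doesn't specify a family):

```python
import numpy as np

def parametric_resample(sample, seed=None):
    """Parametric bootstrap: fit a model to the data (here, a normal
    distribution, an illustrative choice), then draw a new sample of
    the same size from the fitted model. Unlike the plain bootstrap,
    the draws are continuous, so a small sample still yields diversity."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(sample), np.std(sample)
    return rng.normal(mu, sigma, size=len(sample))

# a small sample that a plain bootstrap would keep recycling
small_sample = [158.0, 163.0, 171.0, 175.0, 181.0]
new_sample = parametric_resample(small_sample, seed=0)
```

Because the draws come from a smooth fitted model rather than the five original values, every simulated sample contains new values, which is the diversity the plain bootstrap lacks here.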
So, where do I draw the line? I really don't. I use computational stats for everything. One concept I learned about recently, and maybe you've heard of it too, is this idea of the bitter lesson. This is an idea from computer science and AI, and I'll probably paraphrase it incorrectly, but it says that simple methods that use computation end up winning over time, and it felt to me like that's what you're really saying about the bootstrap. It's a simple method. It requires computation, but over time, computation continues to get better and better, and we don't have to worry about deriving all these special cases.
Okay, next question. I think this question possibly comes from a pessimist. I'm going to stay there. I can cure you. How do you suggest we calibrate our worldview when you account for Simpson's paradox, where all of these large trends are positive, but if you break them down into smaller categories, they all go the opposite way?
I think that's possible in theory, but I don't think I've seen examples of it. In one of my examples, looking at poverty globally, the global trend was declining poverty, and when you divide it up by region, all of the regional trends are declining as well. There will always be exceptions. There will be short-term reversals, and there will be specific places and times where things are not necessarily positive, but the long-term trends are positive. I have not seen examples of Simpson's paradox there.
What was the hardest part of writing your first R script? Oh, you know what the answer is going to be. It was fixing my environment. I had an old, broken installation of R, and it took me an hour and a half to get it fixed. It's not Python. Most people don't have an environment.
What do you think that statistics educators need to do to get on the boat, and what do data science educators need to do to make sure their students stay statistically responsible and literate?
Well, so one piece of this, a lot of universities are creating data science programs, and they're going about it in a way that is how universities work, which is that they get a bunch of computer scientists and a bunch of statisticians and try to make them play together, and the assumption is that that is somehow going to be data science, and I think that doesn't work. I think data science is a different thing from both of those, and starting with people who are professionally attached to those identities is probably not the right starting place, so I think if you're going to do data science education, you need data scientists.
What do you sacrifice in your life to be so prolific in your output? I was amazed by all the books you've done. Oh, thank you. I really don't feel like it's a sacrifice at all. I really enjoy the work. I look forward to it. I make consistent progress, so I don't feel like I have to work in sudden bursts of staying up all night. I don't do all-nighters, that kind of thing, but I don't think I'm neglecting other parts of my life.
I will also say, as someone who's written quite a few books too, that for me personally the key is writing regularly, just making that steady progress. I've given that advice to everyone who wants to write a book, and so far, no one has successfully followed it. It's a very simple idea, but it's very, very hard to do. For me, it's the most relaxing kind of work that I do. So if I'm stressed about something, taking some time to focus and do some writing is a relief.
Humans and governments often only seem to act on big challenges, like climate change, when the challenge seems like a crisis. How do we inspire change without a sense of doom?
Yeah, and I do think that's part of how we got to climate doomerism: climate denialism was such a problem for so long that people felt like they had to turn up the heat, metaphorically, and insist more and more strongly, this is a crisis, we need to act now, and we've almost gone too far. So yeah, how do you get people engaged without invoking that overshoot? That is a really hard question. I don't know.
But it also reminds me, we do have a long history of actually solving problems. And so a recent, relatively current one is the hole in the ozone layer. In the 1980s, that was the big crisis. And when was the last time you even heard anybody say ozone layer? And the reason is that we have largely solved that problem. We banned most of the chemicals that were causing the damage. The ozone hole has been closing for decades, and will probably completely close by the end of the century. So we actually got together. The Montreal Protocol, I think, was the treaty, the international treaty. I think 190 countries signed on to it and got it done.
How do you reconcile a desire to be an informed and engaged citizen with advice that says to read less news to be happy? So I've started doing this. Within the last year, I consume almost no news media. I do read The Economist, so that's an exception. And what I've found is that if something is important enough that I really need to know it, I will hear about it.
Should we distinguish data science from science in general? Is there a meaningful distinction? Yeah. So, you know, is data science just doing science? Because when you're doing science, you're almost always working with data, and you are using that data to answer questions, resolve debates, and make better decisions. So, yeah, I think there's an argument there. That last part, though, might be a difference, which is that science is usually about creating knowledge and not necessarily designing things or making decisions. So maybe what that means is that data science is actually broader than science because it also includes those elements.
How do you reconcile optimism against an overwhelming amount of climate scientists that feel we are approaching an unreconcilable ecological disaster and whose historical recommendations have a bias towards being overly optimistic already?
So, yes, the one way that I could be wrong, if we look at long-term trends and we say, hey, things are going well, if we keep doing what we're doing... Because, you know, those long-term trends didn't happen automatically. We did things that made them happen. So if we keep doing what we're doing and keep solving problems as we confront them
