PydyTuesday | Getting Data from the TidyTuesday Repo with Python
Transcript
This transcript was generated automatically and may contain errors.
Hi, I'm Libby. I'm a data science practitioner, I'm a data community builder, and I also teach Python and R for data science. And I love TidyTuesday, so I cannot wait to introduce it to you if you've never used it before. TidyTuesday is a social challenge in the data community where we all work on the same data set in different ways, and then we share on social media what we worked on. It's a very low-stakes way to participate in the data community, to make friends, to get feedback, and even to build a portfolio and just learn new things by using diverse types of data. There are hundreds of data sets that have a lot of context about them, so it's a great way to practice your skills understanding different data sets, explaining different types of data phenomena, and doing all kinds of different projects that you might not get to do with your everyday data.
Navigating the TidyTuesday repo
The easiest way to find it is just to head to Google, so let's start there. All right, we are at Google. Let's just type in TidyTuesday, all one word, and the first thing that we get is exactly what we want, the official repo for the TidyTuesday project. Anytime you are on the TidyTuesday repo, you're going to see a lot of files. If you're not used to GitHub repositories, there are always lots of files and folders at the top. Just scroll down until you find the readme, and you'll see as I click around, I will continually scroll down. This will tell you a little bit about the TidyTuesday project, and then you will see the data sets for this year.
There's one new data set every single week going all the way back to 2018, so you can click around and really explore some stuff. You'll always have a week number, a date, the data set name, the source of the data, and also an article, maybe a blog post, something that uses the data or talks about how it was collected. This is really helpful context, and it's not something that we usually get when we just pull a data set off of Kaggle and maybe have no context for it or no data dictionary. Really meaningful data dictionary, I should say.
Let's go back to 2021, and I can just show you what it looks like when you click one of these years. Again, you're going to have to really scroll down, because we have all of these data folders for each of the weeks, but eventually we will get a readme where we have full navigation, and this is really helpful. So you can scroll down and see we have wealth and income data from the Urban Institute. We have video game and sliced data from Steam. We have Netflix titles from Kaggle. So, so many different diverse types of data. There's always something different to do with it. If you ever want to get back to the main page and you are not used to GitHub readmes or repos, you can go to this tidytuesday root folder link here and click that.
This week's data and file structure
All right. So we're back at the main repository, and let's look at this week's data. I'm going to scroll down and click agencies from the FBI crime data API. What I really want to explain is just how to pull the data in and how to navigate this file structure on the left-hand side. So you're going to see two chunks here for this week's data, one in R and one in Python. I will show you how I use this code in just a second. On the left-hand side, you'll see our file that has our data file, agencies.csv. You'll see a PNG file. That is the image that's on this page, and then you'll also see a readme and a meta.yaml file. There are different files for these different datasets. This is just what this week's happens to be.
So we have a couple of options for reading in this CSV data. I'm going to use the pydytuesday package. You can see it imported here and the get_date function used here. It's really convenient. It's just going to pull all of these files into my project folder for me. But you can also go grab the CSV file locally, or you can use this raw link to the CSV file. So if I go click on this CSV file over here, I can click the raw button, or I can right click the raw button, just like this, and hit copy link address. This works the same way on Mac and Windows, but I happen to be on a Mac. So if I click copy link address, you'll see that I could open this up, paste it in, and it's just the raw CSV link. This can be used in your code to directly access it in pandas read_csv. You can also just click this raw button and it will take you directly to that same page.
There's one more option here. You can just download the raw file. So this will download the CSV file directly to your computer, and then you can just move it wherever you want it into your project structure and use it from there.
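The raw-link option described above can be sketched in Python as follows. This is a minimal sketch, not something shown in the video: the helper names raw_csv_url and read_week_csv are hypothetical, and the URL pattern is an assumption based on how raw GitHub links for the TidyTuesday repo usually look. If in doubt, copy the real link from the raw button instead of building it.

```python
# Assumed base of the raw links for the TidyTuesday repo (not stated in the video).
RAW_BASE = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data"


def raw_csv_url(week_date, filename):
    """Build the assumed raw link for one file in a week's folder.

    week_date is the folder name, e.g. "2025-02-18"; its first four
    characters are the year folder.
    """
    year = week_date[:4]
    return f"{RAW_BASE}/{year}/{week_date}/{filename}"


def read_week_csv(week_date, filename):
    """Read the CSV straight from the raw link with pandas (needs network access)."""
    import pandas as pd  # imported here so the URL helper works without pandas

    return pd.read_csv(raw_csv_url(week_date, filename))


# Only builds the link; nothing is downloaded until read_week_csv is called.
url = raw_csv_url("2025-02-18", "agencies.csv")
```

Calling read_week_csv("2025-02-18", "agencies.csv") would then return the data as a DataFrame, which is the same thing as pasting the copied raw link into read_csv yourself.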
Pulling data in with Positron
Let's head over to Positron, and I will show you how I pull it in. So in this project environment, I have pandas, requests, and pydytuesday installed, and I'm going to import pandas as pd and pydytuesday as pdt. That will allow me to use the pydytuesday function get_date, and I'll just use the date for our repo. If I go back to my Chrome tab, you can see right up here is that date 2025-02-18. So we'll head back to Positron here, and we can run this, and watch the left-hand side here in Positron. You'll see all those files pop up when this runs. So there we go. It tells me that all of these files were downloaded, and here they are. So I can see my CSV file right there, and what I want to do is read it in with pandas read_csv. Since it's in my project structure, I can just access it with the direct relative path, and when I run that, over in my variables pane you can see the agencies data set.
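The steps in this part of the video can be sketched roughly like this. It assumes the package is importable as pydytuesday and that its get_date function takes the week's date as a string and downloads the files into the current project folder, as described above. The helpers check_week_date and download_and_read are hypothetical names added for illustration, and the date check is an extra safety step that is not in the video.

```python
from datetime import date


def check_week_date(week_date):
    """Make sure the date is written as YYYY-MM-DD before asking for files.

    date.fromisoformat raises ValueError for anything else, which catches
    typos early. This check is an addition, not part of the video.
    """
    return date.fromisoformat(week_date).isoformat()


def download_and_read(week_date, csv_name):
    """Download the week's files, then read one CSV by its relative path."""
    import pandas as pd
    import pydytuesday  # assumed import name for the package used in the video

    # Assumed behavior: the week's files land in the current working directory.
    pydytuesday.get_date(check_week_date(week_date))
    return pd.read_csv(csv_name)
```

In a project set up like the one in the video, download_and_read("2025-02-18", "agencies.csv") would stand in for the two lines that are run one after the other in Positron.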
Now because I'm in Positron, I get the Data Explorer, which is amazing. Just click this little button right here, and it's going to open up a pane where I can explore my data in tabular form on the right-hand side, and then on the left-hand side, I get my variables. Their types are over here as symbols. I also get a distribution of my data, so I get a little histogram, and I get a percentage of missingness, which is fantastic. So, so helpful.
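If you are not in Positron, a rough version of that missingness percentage can be worked out in plain Python. This is only a sketch: percent_missing is a hypothetical helper, the sample rows and column names are made up rather than taken from agencies.csv, and treating empty cells and "NA" as missing is an assumption that may not match what the Data Explorer counts.

```python
import csv
import io


def percent_missing(csv_text, missing_tokens=("", "NA")):
    """Return {column name: percent of rows where that column is missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    total = 0
    for row in reader:
        total += 1
        for name in columns:
            value = row[name]
            # Short rows give None; empty cells and "NA" also count as missing.
            if value is None or value.strip() in missing_tokens:
                missing[name] += 1
    if total == 0:
        return {name: 0.0 for name in columns}
    return {name: 100.0 * count / total for name, count in missing.items()}


# Tiny made-up sample, not the real agencies.csv.
sample = "name,state,latitude\nA,TX,31.0\nB,,\nC,CA,NA\nD,NY,40.7\n"
summary = percent_missing(sample)

# With the data already loaded in pandas, agencies.isna().mean() * 100
# gives the same kind of per-column percentage.
```

Here summary maps each column name to its share of missing rows, which is the number the Data Explorer shows next to each variable.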
I really hope that this has been helpful. I hope that you will hop in and join the TidyTuesday revolution with us, and when you use the TidyTuesday hashtag, you are joining a big community. If I go over to Bluesky right now and I search TidyTuesday, I can already see people, because it's Tuesday, interacting and playing with the data and posting about it. So get out there, have fun, share the TidyTuesday love, and I cannot wait to see the amazing things that you create and share.


