Resources

Priyanka Gagneja | Exploratory Data Analysis | RStudio

video
Dec 8, 2021
58:39

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

All right. Okay, so as Rachel said in the introduction, today we're here to talk a little bit about the project workflow for exploratory data projects. I originally gave this talk last month with R-Ladies Bergen, and that was actually my first talk, I should say. So I'm repeating it, but with a few additions since then: I've found two more packages, which I've added to my last slide. I'm not talking about them too much, because I didn't have enough time to explore them, but they look promising, so I added the links.

So if you come back to this deck later — which I'll be sharing with Rachel and, I think, in the meetup group — you'll have access to those links. All right, let's get going.

So to begin with, about myself: I'm a data science practitioner, and I also do some freelancing on the side. I currently work as a data science consultant with OnPoint Insights, consulting for a client in the biomedical devices space and leading their reporting needs on the quality and complaints side of things. I'm also currently TAing for an applied linear modeling class at the University of Pennsylvania for the social sciences department. I'm an avid R user — I always have two or three R projects open at one time on my laptop — and I try to be as active as possible in R-Ladies and the wider R community. I studied at Boston College in the US, and my prior education was in India. I'm also a big proponent of women in data science and analytics, and I've been an ambassador for some of these initiatives in the past few years.

Motivation for the talk

And then I'm ready to jump in. The big motivation for this talk came from the projects I was working on earlier this year. I ended up with a bunch of very open-ended exploratory projects, where I just had to figure something out and didn't know where to start. What happened was, I realized after my third project that I was doing a lot of repetitive work. I tend towards automation, hence my idea of trying to streamline or automate as much of my EDA workflow as possible. The other challenging piece was that it was taking so much time to get the data and generate all the plots over and over again that I was left with not enough time to actually look at the outputs and generate insights from them — which would have been a better use of my time.

And hence, I started streamlining my process, and along the way I came across lots of packages — some of which you may already know, some of which you may not. My talk is basically divided into two parts. The first is the absolute first steps: the packages in the next few slides are the ones I prefer when I come across a dataset absolutely fresh, when I don't know much about it and I'm literally exploring what's in it — how many columns there are, what the unique values look like. The second phase is when we have a sense of the data and start to visualize it a little more specifically.

Phase one: DataExplorer package

So coming back to phase one, the absolutely-new-data exploration phase: the first package I discovered and want to share with everyone is the DataExplorer package. This package helps you with three main things. It automatically does a lot of exploratory data analysis for you; the results it provides allow you to move into faster, quicker feature engineering; and it also provides functions you can use for data reporting purposes — the low-level details of what each variable in the data is, and how the bivariate and multivariate relationships look in the dataset you have at hand. There are lots of functions in this package, but my go-to has been the create_report function, and I'll do a little demo of it. txhousing is one of the datasets available in the ggplot2 package, which I've taken as the example.

The slide also has links to all the packages. DataExplorer also provides a way to replace missing values. I'm probably not going to go through all the details, but this is one of the most promising packages I've come across. I could demo it first, or maybe I'll just go through the next package as well, finish phase one, and then demo the rest of the packages.
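For reference, the call really is a one-liner — a minimal sketch, assuming the DataExplorer and ggplot2 packages are installed (the `output_file` name is just an illustration):

```r
library(DataExplorer)
library(ggplot2)  # for the txhousing dataset

# Generates a self-contained HTML report covering structure, missing values,
# univariate distributions, QQ plots, correlation analysis, and PCA.
create_report(txhousing, output_file = "txhousing_eda.html")
```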

dataReporter package

So moving on, the next package I want to talk about is the dataReporter package. It's one I heard about at a Why R? conference, I guess more than a year ago. This package very specifically focuses on — or is built with the mindset of — doing quality checks on your data. It has some of the same things you would see in your DataExplorer results, but I think its stronger part, or its better focus, is the quality of the data. So beyond just missing values, it also tells you, for example, which values look like outliers in each column of the dataset.

And again, among the many functions available, if I had to pick just one that is helpful, or that brings out the essence of this package for me, it's the makeDataReport function. We'll go through the demo with the same housing dataset and see what additional value it brings. So just to reiterate: this package has some similarity to DataExplorer, but its additional value is in how it surfaces data checks and helps you identify possible errors in the data you have at hand.
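The basic call mirrors DataExplorer's — a minimal sketch, assuming the dataReporter package is installed:

```r
library(dataReporter)
library(ggplot2)  # for the txhousing dataset

# Produces a PDF report (by default) with per-variable quality checks:
# missing values, suspected outliers, leading/trailing whitespace, and
# possibly misclassified variable types.
makeDataReport(txhousing)
```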

skimr package

Moving on — the skimr package is pretty famous. With, again, just one function, skim, it helps you look at the different values in your columns, providing summary statistics with very easy-to-modify defaults. And one conspicuous difference between these packages — or between the results of their report functions — is that DataExplorer's create_report gives you a nice HTML report, dataReporter by default gives you a PDF output, and skimr's skim function actually prints its results in the console. So that could be another reason for choosing one or the other: the ease of staying within your RStudio window, if that's what you want.
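Skimming the same dataset is again one call — a minimal sketch, assuming skimr is installed:

```r
library(skimr)
library(ggplot2)  # for the txhousing dataset

# Prints a console summary: row/column counts, variable types, missing
# values, summary statistics, and inline sparkline histograms.
skim(txhousing)
```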

Demo: DataExplorer create_report

So with just one function, I get this rich report that tells me a lot about the dataset I have. Let's say I've not looked at txhousing at all. It tells me how many rows it has, how many columns, what the data types are — numeric, discrete, continuous — and how many missing values I have. After that quick table, it also tells you the percentage of missing values in each column — in how many columns, and in how many observations, meaning rows — and then gives a quick view of the structure of your data. By each column, it shows what percentage of rows are missing, as a quick bar chart with the percentages on top. It covers everything you would want in univariate distributions and bivariate analysis; it even goes to the extent of giving you QQ plots to test for normality. Correlation analysis too — I think a lot of the time we do need those numbers. I personally haven't used the PCA from these results, but it goes up to that extent. So it's pretty detailed and pretty time-saving, I would say — one day's worth of work for me, when I would otherwise be making all of these charts and exploring every individual column within the dataset.

Demo: dataReporter output

So that's just DataExplorer. Going back to the dataReporter package: this is how the PDF output looks. It gives you how many rows and columns you have, and then — I guess this is the check I was talking about — it reports how many missing values you have, and things like prefix and suffix whitespace. That gives you a quicker sense of what data frame or column you're looking at and what data type each variable is. If you know the data, you can quickly decide whether there are any changes you want to make.

Again, there's a summary table of the number of unique values and missing values, and it tells you if there are any problems — some of the problems it highlights include outliers and misclassified numeric or integer variables, things like that. I think that's pretty useful and powerful. In a similar way, it gives you a univariate distribution with a histogram and a little information about the variable itself. The one thing I do want to draw attention to is these values it identifies as outliers. They may not necessarily be outliers in your data, but since it specifically highlights them, it gives you — I'm looking for the right words — a little nudge that maybe this is something you want to go back to and check. It's a sign that you can go back to this column and look a little more in detail.

Demo: skimr's skim function

So this gives you a lot of information, some of it repeated — how many rows and columns, how many character and numeric variables. It also allows groupings; in this case, it says the grouping variable is "None". I think that's one of skimr's strongest suits compared with the other two. Now, this piece is the summary statistics I was talking about, and the part I personally like a lot is these quick sparklines — it gives you the histogram of each distribution right there in the console. That's especially useful when, as with the listings and inventory columns here, you can see that these are skewed, while this one is roughly uniformly distributed — a good piece of information to have.

And going back to the grouping piece: you could also group your data — let me do it by city or, actually, by year, because I want to keep it small. You can see how quickly — so the grouping column is year — for each year it gives you the summary statistics for all the rest of the columns, again including all those histograms at the end. So it's another way of quickly looking at your data, especially if there are more numeric variables than others.
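The grouped version is just a dplyr group_by piped into skim — a minimal sketch:

```r
library(dplyr)
library(skimr)
library(ggplot2)  # for the txhousing dataset

# Grouped skim: one set of summary statistics (and inline histograms)
# per year, for every remaining column.
txhousing %>%
  group_by(year) %>%
  skim()
```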

Q&A: large datasets and package performance

I'm just curious if you've tried these summary reports on very large datasets, and if so, how that works. It looks really interesting, but if you've got something with, like, a million rows, is it not reasonable? — Yeah, I wouldn't say it's not reasonable, but I'd agree it takes relatively longer. I've personally experienced that with dataReporter. When I tried it, it was with a dataset of roughly the size you're asking about, and dataReporter — because of what goes on in the background, I think — is generally slower than the DataExplorer or skimr packages. It honestly might take five to ten minutes, whereas DataExplorer would be relatively faster, maybe under a minute or so. That's my experience.

Thank you. — Yeah, but with dataReporter, I remember how excited I was to share those results with my team. It was worth it, I guess, even if I just ran it and maybe went to get a coffee.

Phase two: rpivotTable

So on that note, these are the packages I found useful in phase two. rpivotTable is essentially synonymous with Microsoft Excel's pivot table, which I'm sure a lot of us used before we moved on to our R journey. I personally found it very useful because I was working with a lot of user-level data — users' daily activity on an app — which meant that for each user I had multiple records. Every time I did some filtering, or any sort of transformation or wrangling of my data, from the QA perspective I always wanted to see how many users I had lost in that funnel or analysis chain. So I was always looking at the unique number of users. Instead of having to write "select user ID, distinct, then give me the count of the final number" — I had created this little chain of dplyr code — and having to copy-paste that again and again, I figured this package was what solved my problem.

So within this package — I created these little notes on how I find it useful — it allows me to do a lot of quick-and-dirty exploration, like I was saying. Every time I filtered the data, I would pipe the rpivotTable function onto it. From that point onwards, whatever the state of my data at that point — whether it's at the granular level, whether it has been grouped, or whatnot — I could always just pick and choose the things I wanted to look at.
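That pipe-at-any-point workflow looks like this — a minimal sketch, assuming the rpivotTable package is installed (the filter and the row/column choices are just illustrations):

```r
library(dplyr)
library(rpivotTable)
library(ggplot2)  # for the txhousing dataset

# Pipe the current state of the data into an interactive pivot widget.
# In the UI you can then switch the aggregator (e.g. to "Count Unique Values"
# on a chosen column) and drag fields into rows and columns to break it down.
txhousing %>%
  filter(year >= 2010) %>%
  rpivotTable(rows = "city", cols = "year", aggregatorName = "Count")
```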

So again, back to the txhousing data: you can see it brings up pretty much the same setup as an Excel pivot table. At first it gives me the count of the entire number of rows, which is the default. But I can always look at the count of unique values by some column — in this case, how many unique cities I have; in my own case, I would go back and look at user IDs. We could look at how many years of data we have — it has 16 years of data in txhousing, which is great. We could quickly go to the sum of the sales values — how much money are we talking about here? I could spread this by year, break it up by cities and years, and all those things.

So again, this is pretty simple. I haven't even bothered to look whether there are more functions in this package, because in the use case and the situation I was in, being able to do this quickly and see "oh, okay, this looks very small, this looks very high — maybe this is a specific area of issue I want to focus on" was all I needed in the exploratory process. When I come across these things, I make a quick note of what I'm observing, and then I move on to the next step.

And one thing I want to bring up is the reason I'm talking about all these packages and the different iterations of using them: when you're working on an exploratory project, you don't always need everything polished for your stakeholders or your final presentations. So even though I am very much an automation, pro-code person, I've started to appreciate that I don't always need to write all the code — I don't always need ggplot2 code to be able to look at an output. These quick analyses help me move a lot faster toward what I'm trying to achieve.

esquisse package

So moving on to esquisse — I really like this package. It allows very quick exploration in a visual manner, which I think people will appreciate. It's basically a package that still generates ggplot2 plots, but with user-level interaction — a Tableau-like UI. It also allows you to do a lot of data wrangling on the go, and it does code generation, which addresses the limitation I was mentioning earlier.

And one good thing to know is that when you run the esquisse function, it actually brings up a Shiny app for you.
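Launching it on a specific data frame is a single call — a minimal sketch, assuming the esquisse package is installed:

```r
library(esquisse)
library(ggplot2)  # for the txhousing dataset

# Launches the esquisse Shiny app with txhousing preloaded; you build a
# ggplot2 chart by dragging columns, then copy or insert the generated code.
esquisser(txhousing)
```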

So once that Shiny app opens up — oh, I don't have any data frames loaded right now. All right, one second. I had to close this first.

So now I'm bringing up txhousing again. When you bring this up for the first time, it will always ask you to pick which data frame you want to work with. You can change the environment, and you can also get data from Google Sheets and some external files, but personally I've mostly used something from my environment. So we go ahead and say "Import data", and then this quick Tableau-style UI appears. It brings in all the columns from your dataset and lets you make different charts and plots. As you can see, it gives you all the defaults — all the options you normally get in a ggplot — because, like I said, it actually runs ggplot2 in the background for you.

So let's see — let me just drag in the sales column. It also gives you a little warning about things to be aware of — here, the missing values that have been removed, which ggplot2 in general also warns you about. By default, since sales is a numeric variable, it plotted a histogram of the entire column. You can choose to change the chart type; in certain cases, some plot types will not be available if the data type doesn't support them.

A couple of other things I want to mention: it gives you all the beautification options — the theme and filtering options you would normally write code for to get to your final chart. Most of that is covered in these options here. You can change your x label, y label, and the caption and title of your chart. Within the plot options, you can vary all those things you would normally add as a layer in your ggplot2 code: you can change the transformation, flip the coordinates if you want, and there's an option to add a smoothed line.

Again, from the wrangling perspective, just by quickly selecting or deselecting values of a categorical variable, you can do all that filtering of your data. All of these things make it so easy and quick to get the kind of output you're looking for. And then the magic part: you get all the code for what you did here, which you can click to insert into your script right away, or copy to the clipboard and paste back into your code. That's why this is one of my favorite packages — it lets you play around with things.

But yes, the challenge is that as your data gets bigger and bigger — say a million rows — it starts to lag, or hang and take much longer. In general, when working with data that big, what I've always been advised by my bosses is to start with a smaller subset and play around with that. If something seems useful, you can take the generated code and run it on the bigger dataset in the next section or code chunk. But still, even if your data is too big, I feel there's definitely a lot of value in using this and saving time.
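That subset-first workflow can be sketched like this (assuming dplyr 1.0+ for slice_sample; the sample size of 1,000 is arbitrary):

```r
library(dplyr)
library(esquisse)
library(ggplot2)  # for the txhousing dataset

# Explore interactively on a random sample, then rerun the generated
# ggplot2 code on the full dataset once the chart looks right.
small <- txhousing %>% slice_sample(n = 1000)
esquisser(small)
```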

chronicle package

And then the final one is the chronicle package. It's a package that allows you to do quick exploration and report generation. What it does is provide you with a lot of add_ functions, plus a render_report function and a report_columns function. The add_ functions are actually wrappers around make_ functions — this is how it looks. So even with chronicle, the visuals you're looking at still have ggplot2 under the hood. The function you write would be, for example, make_barplot or make_lineplot, and things like that. What goes on under the hood is really a bar plot being built up with lots of customizations, which you can pass in directly. So it's a way of wrapping the creation of a bar plot — with the layers and styling options — into one function call.

So in a way it makes it shorter and quicker to get to the final outputs you're looking for. Again, like I was mentioning earlier, it depends on the use case whether plain ggplot2 code is better or whether a function from chronicle gets you there faster. What I was trying to say is that it reduces the time to generate plots: it's still code, but it brings everything together, so you don't have to go looking for what your continuous scale option was, or which theme options you were trying to use. You have all those options within one function call, and you can choose to change, or not change, any of those specifics.

So this is how chronicle functions look. For example, if I'm creating a report, I could use the add_text function; I could add a table, a raincloud plot, and a scatter plot. You're basically just piping together whatever functions you want, instead of adding layers with the plus operator as in your ggplot. To me, it's much stronger and more powerful in that sense: at this point you can have tables and charts and everything all in a single report. And then, like I was mentioning earlier, the render_report function allows you to generate more than one output format.
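A sketch of that pipeline, assuming the chronicle package is installed — the argument names here (`dt`, `value`, `groups`, `x`, `y`) follow the package's add_* helpers as I understand them, but treat them as assumptions and check the signatures in the version you install:

```r
library(chronicle)
library(magrittr)

# Build the report by piping add_* calls, then render it to one or
# more output formats. This is a sketch, not the speaker's exact code.
my_report <- add_text(text = "A quick look at iris") %>%
  add_table(table = head(iris)) %>%
  add_raincloud(dt = iris, value = "Sepal.Length", groups = "Species") %>%
  add_scatterplot(dt = iris, x = "Sepal.Length", y = "Petal.Length", groups = "Species")

render_report(report = my_report, title = "Iris EDA", output_format = "pdf")
```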

So this is how the output of the chronicle demo looks, on the iris dataset — I didn't get to change a lot on this one. What you're looking at is all the add_ functions we used. First we added the text, which is the title; then we added the table of what we're looking at, with its title — this is how it looks. Then we added a raincloud plot, which I think is the most important part for me, because I've personally struggled a lot with creating raincloud plots, and this was so simple. If nothing else, I think this is my go-to reason for using this package. Then we added a scatter plot. The report basically just contains all of that, and then I say: give me this as a report — I want PDF and other output formats. This is the PDF output I'm showing you.

Bonus packages and wrap-up

So yeah, that's pretty much it. These are some of the resources and links I've used — pretty much the things I opened and showed you in between. The bonus is these two packages I've come across since I gave this presentation last month. Just yesterday I saw another package that looks pretty strong — I think it also uses a Shiny app in the process. They're the dlookr package and the descriptr package. I think descriptr was shared by Indrajit Patil, so I should give credit to him — it came up on my LinkedIn feed. And with that, I think I'm good. That's all I wanted to share with you.

That was awesome, Priyanka. Thank you so much. It's crazy how many packages there are, and to keep up to date on all of them is almost impossible.

Q&A: text-heavy datasets

So I had just one small query. When you do EDA on these various datasets, is there any particular package, or choice of packages, that's more suited to text-heavy data exploration — datasets with more columns or values that are characters rather than numerics? For example, I use skimr a lot, and that's what I've observed: it's very good, obviously, if there are more numeric variables, but with text- or character-heavy datasets it's a struggle for me.

Based on the list of packages I've talked about, I would say DataExplorer would potentially be good, because with create_report it will produce bar charts with the percentage, or number, of records you have for each category within a variable. So for example, if a month column has January, February, March, it will show you how much of your data falls in each month. Otherwise, rpivotTable could be useful, the way I see it.
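For categorical columns specifically, DataExplorer also exposes that piece of create_report on its own — a minimal sketch:

```r
library(DataExplorer)
library(ggplot2)  # for the txhousing dataset

# Bar charts of category frequencies for every discrete column —
# the same view create_report includes for character/factor variables.
plot_bar(txhousing)
```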

And I see Matthew has asked: are you able to share the sample code from the presentation with us as well? A prior version of this, and most of the report samples I talked about, are already on my GitHub. But I'll also upload the deck I've shared today — the slides I was talking through — to my GitHub, and then I'll share the link with Rachel.

Q&A: Python equivalents

Hi, Priyanka, I'm back again. By any chance, have you come across similar packages for Python? — I think there is probably a parallel for DataExplorer with the same name for Python. And you'll be really excited to know there is a Lux package — I recently attended a Data Umbrella session on it — and the Lux package in the Python ecosystem is amazing. It seemed very promising, and I was like, oh my god, why don't we have this in R? You should explore it. It's, I guess, a step forward from the things we've talked about today: it adds a lot of intuition about what you've done and the data it sees, and then gives you options for what to do next. So it's really powerful.

ggplot2 is an R package you use for data visualization. It follows the theory of the grammar of graphics, a layered approach where you first write the ggplot function and then use a plus operator to add more specific customizations. So you start with the ggplot function, then you keep adding what kind of plot you want, and then specific details about how your visual should look. The theme lets you change how the text looks, the background color, the grid lines, and whatnot.
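That layered pattern looks like this — a minimal sketch using the same txhousing data (the choice of variables and theme is just an illustration):

```r
library(ggplot2)

# Start with ggplot(), then add layers with `+`: a geom, labels, and a theme.
ggplot(txhousing, aes(x = date, y = median, group = city)) +
  geom_line(alpha = 0.3, na.rm = TRUE) +  # one faint line per city
  labs(title = "Median sale price by city over time",
       x = "Date", y = "Median price") +
  theme_minimal()
```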

There it is — rpy2. I was close; I had the R and the Py, just not in the right order. But if you want to mix and match, I would stick with reticulate and R Markdown. In fact, I think I'll actually do a demo for a lightning talk on doing some pretty advanced reticulate work with R Markdown, and building web apps with it, for the next chat.

plotnine, that's it. plotnine is ggplot for Python. — Oh, interesting. Awesome. Thank you. I appreciate that.

Thank you all so much. Have a great rest of the day. Bye. — Thank you so much, everyone, for coming. I hope everybody gained something or other. I learned a lot in the process, so I appreciate this opportunity.