Resources

Priyanka Gagneja | Exploratory Data Analysis | RStudio

video
Dec 8, 2021
58:39

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

All right. Okay, so as Rachel said in the introduction, today we're here to talk a little bit about the project workflow for exploratory data projects. I originally gave this talk last month with R-Ladies Bergen, and that was actually my first talk, I should say. So I'm repeating it, but with a few additions since then: I've found two more packages, which I've added to my last slide. I'm not talking about them too much, because I didn't have enough time to explore them, but they look promising, so I added the links.

So if you come back to this deck later — which I'll be sharing with Rachel and, I think, in the meetup group — you'll have access to those links. All right, let's get going.

So to begin with, about myself: I'm a data science practitioner, and I also do some freelancing on the side. I currently work as a data science consultant with OnPoint Insights, consulting for a client in the biomedical devices space and leading their reporting needs on the quality and complaints side of things. I'm also currently TAing for an applied linear modeling class at the University of Pennsylvania for the social sciences department. I'm an avid R user — I always have two or three R projects open at one time on my laptop — and I try to be as active as possible in R-Ladies and the wider R community. I studied at Boston College in the US, and my prior education was in India. I'm also a big proponent of women in data science and analytics, and I've been an ambassador for some of these initiatives in the past few years.

Motivation for the talk

And then I'm ready to jump in. The big motivation for this talk came from the projects I was working on earlier this year. I ended up with a bunch of very open-ended exploratory projects, where I just had to figure something out and didn't know where to start. What happened was, I realized after my third project that I was doing a lot of repetitive work. I tend towards automation, hence my idea of trying to streamline or automate as much of my EDA workflow as possible. The other challenging piece was that it was taking so much time to get the data and generate all the plots over and over again that I was left with not enough time to actually look at the outputs and generate insights from them — which would have been a better use of my time.

And hence, I started streamlining my process, and along the way I came across lots of packages — some of which you may already know, some of which you may not. My talk is basically divided into two parts. The first is the absolute first steps: the packages in the next few slides are the ones I prefer when I come across a dataset absolutely fresh, when I don't know much about it and I'm literally exploring what's in it — how many columns there are, what the unique values look like. The second phase is when we have a sense of the data and start to visualize it a little more specifically.

Phase one: DataExplorer package

So coming back to phase one, the absolutely-new-data exploration phase: the first package I discovered and want to share with everyone is the DataExplorer package. This package helps you with three main things. It automatically does a lot of exploratory data analysis for you; the results it provides allow you to move into faster, quicker feature engineering; and it also provides functions you can use for data reporting purposes — the low-level details of what each variable in the data is, and how the bivariate and multivariate relationships look in the dataset you have at hand. There are lots of functions in this package, but my go-to has been the create_report function, and I'll do a little demo of it. txhousing is one of the datasets available in the ggplot2 package, which I've taken as the example.

The slide also has links to all the packages. DataExplorer also provides a way to replace missing values. I'm probably not going to go through all the details, but this is one of the most promising packages I've come across. I could demo it first, or maybe I'll just go through the next package as well, finish phase one, and then demo the rest of the packages.
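For reference, the call really is a one-liner — a minimal sketch, assuming the DataExplorer and ggplot2 packages are installed (the `output_file` name is just an illustration):

```r
library(DataExplorer)
library(ggplot2)  # for the txhousing dataset

# Generates a self-contained HTML report covering structure, missing values,
# univariate distributions, QQ plots, correlation analysis, and PCA.
create_report(txhousing, output_file = "txhousing_eda.html")
```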

dataReporter package

So moving on, the next package I want to talk about is the dataReporter package. It's one I heard about at a Why R? conference, I guess more than a year ago. This package very specifically focuses on — or is built with the mindset of — doing quality checks on your data. It has some of the same things you would see in your DataExplorer results, but I think its stronger part, or its better focus, is the quality of the data. So beyond just missing values, it also tells you, for example, which values look like outliers in each column of the dataset.

And again, among the many functions available, if I had to pick just one that is helpful, or that brings out the essence of this package for me, it's the makeDataReport function. We'll go through the demo with the same housing dataset and see what additional value it brings. So just to reiterate: this package has some similarity to DataExplorer, but its additional value is in how it surfaces data checks and helps you identify possible errors in the data you have at hand.
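The basic call mirrors DataExplorer's — a minimal sketch, assuming the dataReporter package is installed:

```r
library(dataReporter)
library(ggplot2)  # for the txhousing dataset

# Produces a PDF report (by default) with per-variable quality checks:
# missing values, suspected outliers, leading/trailing whitespace, and
# possibly misclassified variable types.
makeDataReport(txhousing)
```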

skimr package

Moving on — the skimr package is pretty famous. With, again, just one function, skim, it helps you look at the different values in your columns, providing summary statistics with very easy-to-modify defaults. And one conspicuous difference between these packages — or between the results of their report functions — is that DataExplorer's create_report gives you a nice HTML report, dataReporter by default gives you a PDF output, and skimr's skim function actually prints its results in the console. So that could be another reason for choosing one or the other: the ease of staying within your RStudio window, if that's what you want.
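Skimming the same dataset is again one call — a minimal sketch, assuming skimr is installed:

```r
library(skimr)
library(ggplot2)  # for the txhousing dataset

# Prints a console summary: row/column counts, variable types, missing
# values, summary statistics, and inline sparkline histograms.
skim(txhousing)
```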

Demo: DataExplorer create_report

So with just one function, I get this rich report that tells me a lot about the dataset I have. Let's say I've not looked at txhousing at all. It tells me how many rows it has, how many columns, what the data types are — numeric, discrete, continuous — and how many missing values I have. After that quick table, it also tells you the percentage of missing values in each column — in how many columns, and in how many observations, meaning rows — and then gives a quick view of the structure of your data. By each column, it shows what percentage of rows are missing, as a quick bar chart with the percentages on top. It covers everything you would want in univariate distributions and bivariate analysis; it even goes to the extent of giving you QQ plots to test for normality. Correlation analysis too — I think a lot of the time we do need those numbers. I personally haven't used the PCA from these results, but it goes up to that extent. So it's pretty detailed and pretty time-saving, I would say — one day's worth of work for me, when I would otherwise be making all of these charts and exploring every individual column within the dataset.

Demo: dataReporter output

So that's just DataExplorer. Going back to the dataReporter package: this is how the PDF output looks. It gives you how many rows and columns you have, and then — I guess this is the check I was talking about — it reports how many missing values you have, and things like prefix and suffix whitespace. That gives you a quicker sense of what data frame or column you're looking at and what data type each variable is. If you know the data, you can quickly decide whether there are any changes you want to make.

Again, there's a summary table of the number of unique values and missing values, and it tells you if there are any problems — some of the problems it highlights include outliers and misclassified numeric or integer variables, things like that. I think that's pretty useful and powerful. In a similar way, it gives you a univariate distribution with a histogram and a little information about the variable itself. The one thing I do want to draw attention to is these values it identifies as outliers. They may not necessarily be outliers in your data, but since it specifically highlights them, it gives you — I'm looking for the right words — a little nudge that maybe this is something you want to go back to and check. It's a sign that you can go back to this column and look a little more in detail.

Demo: skimr's skim function

So this gives you a lot of information, some of it repeated — how many rows and columns, how many character and numeric variables. It also allows groupings; in this case, it says the grouping variable is "None". I think that's one of skimr's strongest suits compared with the other two. Now, this piece is the summary statistics I was talking about, and the part I personally like a lot is these quick sparklines — it gives you the histogram of each distribution right there in the console. That's especially useful when, as with the listings and inventory columns here, you can see that these are skewed, while this one is roughly uniformly distributed — a good piece of information to have.

And going back to the grouping piece: you could also group your data — let me do it by city or, actually, by year, because I want to keep it small. You can see how quickly — so the grouping column is year — for each year it gives you the summary statistics for all the rest of the columns, again including all those histograms at the end. So it's another way of quickly looking at your data, especially if there are more numeric variables than others.
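The grouped version is just a dplyr group_by piped into skim — a minimal sketch:

```r
library(dplyr)
library(skimr)
library(ggplot2)  # for the txhousing dataset

# Grouped skim: one set of summary statistics (and inline histograms)
# per year, for every remaining column.
txhousing %>%
  group_by(year) %>%
  skim()
```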

Q&A: large datasets and package performance

I'm just curious if you've tried these summary reports on very large datasets, and if so, how that works. It looks really interesting, but if you've got something with, like, a million rows, is it not reasonable? — Yeah, I wouldn't say it's not reasonable, but I'd agree it takes relatively longer. I've personally experienced that with dataReporter. When I tried it, it was with a dataset of roughly the size you're asking about, and dataReporter — because of what goes on in the background, I think — is generally slower than the DataExplorer or skimr packages. It honestly might take five to ten minutes, whereas DataExplorer would be relatively faster, maybe under a minute or so. That's my experience.

Thank you. — Yeah, but with dataReporter, I remember how excited I was to share those results with my team. It was worth it, I guess, even if I just ran it and maybe went to get a coffee.

Phase two: rpivotTable

So on that note, these are the packages I found useful in phase two. rpivotTable is essentially synonymous with Microsoft Excel's pivot table, which I'm sure a lot of us used before we moved on to our R journey. I personally found it very useful because I was working with a lot of user-level data — users' daily activity on an app — which meant that for each user I had multiple records. Every time I did some filtering, or any sort of transformation or wrangling of my data, from the QA perspective I always wanted to see how many users I had lost in that funnel or analysis chain. So I was always looking at the unique number of users. Instead of having to write "select user ID, distinct, then give me the count of the final number" — I had created this little chain of dplyr code — and having to copy-paste that again and again, I figured this package was what solved my problem.

So within this package — I created these little notes on how I find it useful — it allows me to do a lot of quick-and-dirty exploration, like I was saying. Every time I filtered the data, I would pipe the rpivotTable function onto it. From that point onwards, whatever the state of my data at that point — whether it's at the granular level, whether it has been grouped, or whatnot — I could always just pick and choose the things I wanted to look at.
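That pipe-at-any-point workflow looks like this — a minimal sketch, assuming the rpivotTable package is installed (the filter and the row/column choices are just illustrations):

```r
library(dplyr)
library(rpivotTable)
library(ggplot2)  # for the txhousing dataset

# Pipe the current state of the data into an interactive pivot widget.
# In the UI you can then switch the aggregator (e.g. to "Count Unique Values"
# on a chosen column) and drag fields into rows and columns to break it down.
txhousing %>%
  filter(year >= 2010) %>%
  rpivotTable(rows = "city", cols = "year", aggregatorName = "Count")
```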

So again, back to the txhousing data: you can see it brings up pretty much the same setup as an Excel pivot table. At first it gives me the count of the entire number of rows, which is the default. But I can always look at the count of unique values by some column — in this case, how many unique cities I have; in my own case, I would go back and look at user IDs. We could look at how many years of data we have — it has 16 years of data in txhousing, which is great. We could quickly go to the sum of the sales values — how much money are we talking about here? I could spread this by year, break it up by cities and years, and all those things.

So again, this is pretty simple. I haven't even bothered to look whether there are more functions in this package, because in the use case and the situation I was in, being able to do this quickly and see "oh, okay, this looks very small, this looks very high — maybe this is a specific area of issue I want to focus on" was all I needed in the exploratory process. When I come across these things, I make a quick note of what I'm observing, and then I move on to the next step.

And one thing I want to bring up is the reason I'm talking about all these packages and the different iterations of using them: when you're working on an exploratory project, you don't always need everything polished for your stakeholders or your final presentations. So even though I am very much an automation, pro-code person, I've started to appreciate that I don't always need to write all the code — I don't always need ggplot2 code to be able to look at an output. These quick analyses help me move a lot faster toward what I'm trying to achieve.

esquisse package

So moving on to esquisse — I really like this package. It allows very quick exploration in a visual manner, which I think people will appreciate. It's basically a package that still generates ggplot2 plots, but with user-level interaction — a Tableau-like UI. It also allows you to do a lot of data wrangling on the go, and it does code generation, which addresses the limitation I was mentioning earlier.

And one good thing to know is that when you run the esquisse function, it actually brings up a Shiny app for you.
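Launching it on a specific data frame is a single call — a minimal sketch, assuming the esquisse package is installed:

```r
library(esquisse)
library(ggplot2)  # for the txhousing dataset

# Launches the esquisse Shiny app with txhousing preloaded; you build a
# ggplot2 chart by dragging columns, then copy or insert the generated code.
esquisser(txhousing)
```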

So once that Shiny app opens up — oh, I don't have any data frames loaded right now. All right, one second. I had to close this first.

So now I'm bringing up txhousing again. When you bring this up for the first time, it will always ask you to pick which data frame you want to work with. You can change the environment, and you can also get data from Google Sheets and some external files, but personally I've mostly used something from my environment. So we go ahead and say "Import data", and then this quick Tableau-style UI appears. It brings in all the columns from your dataset and lets you make different charts and plots. As you can see, it gives you all the defaults — all the options you normally get in a ggplot — because, like I said, it actually runs ggplot2 in the background for you.

So let's see — let me just drag in the sales column. It also gives you a little warning about things to be aware of — here, the missing values that have been removed, which ggplot2 in general also warns you about. By default, since sales is a numeric variable, it plotted a histogram of the entire column. You can choose to change the chart type; in certain cases, some plot types will not be available if the data type doesn't support them.

A couple of other things I want to mention: it gives you all the beautification options — the theme and filtering options you would normally write code for to get to your final chart. Most of that is covered in these options here. You can change your x label, y label, and the caption and title of your chart. Within the plot options, you can vary all those things you would normally add as a layer in your ggplot2 code: you can change the transformation, flip the coordinates if you want, and there's an option to add a smoothed line.

Again, from the wrangling perspective, just by quickly selecting or deselecting values of a categorical variable, you can do all that filtering of your data. All of these things make it so easy and quick to get the kind of output you're looking for. And then the magic part: you get all the code for what you did here, which you can click to insert into your script right away, or copy to the clipboard and paste back into your code. That's why this is one of my favorite packages — it lets you play around with things.

But yes, the challenge is that as your data gets bigger and bigger — say a million rows — it starts to lag, or hang and take much longer. In general, when working with data that big, what I've always been advised by my bosses is to start with a smaller subset and play around with that. If something seems useful, you can take the generated code and run it on the bigger dataset in the next section or code chunk. But still, even if your data is too big, I feel there's definitely a lot of value in using this and saving time.
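That subset-first workflow can be sketched like this (assuming dplyr 1.0+ for slice_sample; the sample size of 1,000 is arbitrary):

```r
library(dplyr)
library(esquisse)
library(ggplot2)  # for the txhousing dataset

# Explore interactively on a random sample, then rerun the generated
# ggplot2 code on the full dataset once the chart looks right.
small <- txhousing %>% slice_sample(n = 1000)
esquisser(small)
```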

chronicle package

And then the final one is the chronicle package. It's a package that allows you to do quick exploration and report generation. What it does is provide you with a lot of add_ functions, plus a render_report function and a report_columns function. The add_ functions are actually wrappers around make_ functions — this is how it looks. So even with chronicle, the visuals you're looking at still have ggplot2 under the hood. The function you write would be, for example, make_barplot or make_lineplot, and things like that. What goes on under the hood is really a bar plot being built up with lots of customizations, which you can pass in directly. So it's a way of wrapping the creation of a bar plot — with the layers and styling options — into one function call.

So in a way it makes it shorter and quicker to get to the final outputs you're looking for. Again, like I was mentioning earlier, it depends on the use case whether plain ggplot2 code is better or whether a function from chronicle gets you there faster. What I was trying to say is that it reduces the time to generate plots: it's still code, but it brings everything together, so you don't have to go looking for what your continuous scale option was, or which theme options you were trying to use. You have all those options within one function call, and you can choose to change, or not change, any of those specifics.

So this is how chronicle functions look. For example, if I'm creating a report, I could use the add_text function; I could add a table, a raincloud plot, and a scatter plot. You're basically just piping together whatever functions you want, instead of adding layers with the plus operator as in your ggplot. To me, it's much stronger and more powerful in that sense: at this point you can have tables and charts and everything all in a single report. And then, like I was mentioning earlier, the render_report function allows you to generate more than one output format.
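A sketch of that pipeline, assuming the chronicle package is installed — the argument names here (`dt`, `value`, `groups`, `x`, `y`) follow the package's add_* helpers as I understand them, but treat them as assumptions and check the signatures in the version you install:

```r
library(chronicle)
library(magrittr)

# Build the report by piping add_* calls, then render it to one or
# more output formats. This is a sketch, not the speaker's exact code.
my_report <- add_text(text = "A quick look at iris") %>%
  add_table(table = head(iris)) %>%
  add_raincloud(dt = iris, value = "Sepal.Length", groups = "Species") %>%
  add_scatterplot(dt = iris, x = "Sepal.Length", y = "Petal.Length", groups = "Species")

render_report(report = my_report, title = "Iris EDA", output_format = "pdf")
```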

So this is how the output of the chronicle demo looks, on the iris dataset — I didn't get to change a lot on this one. What you're looking at is all the add_ functions we used. First we added the text, which is the title; then we added the table of what we're looking at, with its title — this is how it looks. Then we added a raincloud plot, which I think is the most important part for me, because I've personally struggled a lot with creating raincloud plots, and this was so simple. If nothing else, I think this is my go-to reason for using this package. Then we added a scatter plot. The report basically just contains all of that, and then I say: give me this as a report — I want PDF and other output formats. This is the PDF output I'm showing you.

Bonus packages and wrap-up

So yeah, that's pretty much it. These are some of the resources and links I've used — pretty much the things I opened and showed you in between. The bonus is these two packages I've come across since I gave this presentation last month. Just yesterday I saw another package that looks pretty strong — I think it also uses a Shiny app in the process. They're the dlookr package and the descriptr package. I think descriptr was shared by Indrajit Patil, so I should give credit to him — it came up on my LinkedIn feed. And with that, I think I'm good. That's all I wanted to share with you.

That was awesome, Priyanka. Thank you so much. It's crazy how many packages there are, and to keep up to date on all of them is almost impossible.

Q&A: text-heavy datasets

So I had just one small query. When you do EDA on these various datasets, is there any particular package, or choice of packages, that's more suited to text-heavy data exploration — datasets with more columns or values that are characters rather than numerics? For example, I use skimr a lot, and that's what I've observed: it's very good, obviously, if there are more numeric variables, but with text- or character-heavy datasets it's a struggle for me.

Based on the list of packages I've talked about, I would say DataExplorer would potentially be good, because with create_report it will produce bar charts with the percentage, or number, of records you have for each category within a variable. So for example, if a month column has January, February, March, it will show you how much of your data falls in each month. Otherwise, rpivotTable could be useful, the way I see it.
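For categorical columns specifically, DataExplorer also exposes that piece of create_report on its own — a minimal sketch:

```r
library(DataExplorer)
library(ggplot2)  # for the txhousing dataset

# Bar charts of category frequencies for every discrete column —
# the same view create_report includes for character/factor variables.
plot_bar(txhousing)
```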

And I see Matthew has asked: are you able to share the sample code from the presentation with us as well? A prior version of this, and most of the report samples I talked about, are already on my GitHub. But I'll also upload the deck I've shared today — the slides I was talking through — to my GitHub, and then I'll share the link with Rachel.

Q&A: Python equivalents

Hi, Priyanka, I'm back again. By any chance, have you come across similar packages for Python? — I think there is probably a parallel for DataExplorer with the same name for Python. And you'll be really excited to know there is a Lux package — I recently attended a Data Umbrella session on it — and the Lux package in the Python ecosystem is amazing. It seemed very promising, and I was like, oh my god, why don't we have this in R? You should explore it. It's, I guess, a step forward from the things we've talked about today: it adds a lot of intuition about what you've done and the data it sees, and then gives you options for what to do next. So it's really powerful.

ggplot2 is an R package you use for data visualization. It follows the theory of the grammar of graphics, a layered approach where you first write the ggplot function and then use a plus operator to add more specific customizations. So you start with the ggplot function, then you keep adding what kind of plot you want, and then specific details about how your visual should look. The theme lets you change how the text looks, the background color, the grid lines, and whatnot.
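That layered pattern looks like this — a minimal sketch using the same txhousing data (the choice of variables and theme is just an illustration):

```r
library(ggplot2)

# Start with ggplot(), then add layers with `+`: a geom, labels, and a theme.
ggplot(txhousing, aes(x = date, y = median, group = city)) +
  geom_line(alpha = 0.3, na.rm = TRUE) +  # one faint line per city
  labs(title = "Median sale price by city over time",
       x = "Date", y = "Median price") +
  theme_minimal()
```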

There it is — rpy2. I was close; I had the R and the Py, just not in the right order. But if you want to mix and match, I would stick with reticulate and R Markdown. In fact, I think I'll actually do a demo for a lightning talk on doing some pretty advanced reticulate work with R Markdown, and building web apps with it, for the next chat.

plotnine, that's it. plotnine is ggplot for Python. — Oh, interesting. Awesome. Thank you. I appreciate that.

Thank you all so much. Have a great rest of the day. Bye. — Thank you so much, everyone, for coming. I hope everybody gained something or other. I learned a lot in the process, so I appreciate this opportunity.