Martin Henze | We R KaggleRs - At the Intersection of Data Science Communities

Transcript

This transcript was generated automatically and may contain errors.

Five years ago, I decided to learn more about two exciting frameworks that I kept hearing about more and more often, everywhere really, in bars, restaurants. It got really spooky. The first is the R Tidyverse from Hadley's and Garrett's great book, R for Data Science. Some of you might have heard of it, maybe. And the second one is the Kaggle machine learning and data science platform. And little did I know at the time that a combination of those frameworks would transform my career and open up a whole new world of exciting challenges.

And much like for our Dragon Ball friends there, this fusion really boosted my power level. So today, I would like to show you how the Kaggle community can help you to grow, to learn, and to inspire others. And whether you're a regular in Kaggle, have only ever dipped your toes into the community, or have never heard of it at all, I hope that my talk and my story will inspire you to make the most out of the amazing opportunities that the Kaggle community provides. And that, in turn, you will then share your great ideas and creativity with the rest of us.

What is Kaggle?

Now, that sounds great, you might say. I have a very small question, though. What is Kaggle? Well, I'm glad you asked. Kaggle.com was launched 13 years ago as a website for machine learning competitions. And that means that the goal is to build competitive predictive models on a specific machine learning problem, with the best models in the end scoring highest on an objective leaderboard. And while competitions remain one of the main cornerstones, today, Kaggle covers many key aspects of the data science landscape.

Users can upload and curate their own data sets. Lively discussion forums host many conversations about machine learning and data science. And Kagglers can build, run, and publish their own notebooks directly on the site. And there's so much to learn here, so much to discover. But I also realize that all these possibilities can be a little bit confusing, right? Maybe intimidating when you're joining the site at first. And I get it. I really do.

When I joined Kaggle back in the day, I felt the same. I felt a little bit hesitant when I saw all these smart people and their clever ideas. So much so, in fact, that I decided to join up anonymously under the pseudonym Heads or Tails. Nobody knew who I really was. And that helped to alleviate much of the pressure that I would have otherwise put myself under when it comes to actively participating in the community. So with this little trick, I felt free to work my regular job during the day and be on Kaggle in my secret identity in the evenings and, well, nights.

But what I quickly realized is that, much like the R community, the Kaggle community was exceptionally welcoming and supportive of its new members. And it remains so to this day.

Starting with notebooks

The other trick that helped me was at first to only focus on one aspect of Kaggle rather than trying to do everything at once. And in my case, I gravitated towards the notebooks. You see, I'm a very visual person and my academic background had trained me to write coherent reports. And my goal on Kaggle was to learn new tools and new methods. And I always learn best when I can apply these new tools to a practical problem. So it came naturally to me to start writing and publishing notebooks to learn and to share what I learned with others.

And so today my talk will be revolving around the notebooks. But much of what I'm telling you can be translated to Kaggle's other aspects as well, like discussions or competitions or also courses, which might be more relevant for you and your goals. And all these aspects are representative of Kaggle as a whole. And in turn, they also contain smaller areas that are representative of them. So in a sense, much like a cauliflower, this is a fractal approach and a fractal talk in which I would like to show you that I can be more than just a one-dimensional speaker.

So I started learning the R Tidyverse and writing Kaggle notebooks to practice what I had learned. And I also started reading the notebooks that other people had written, the notebooks that other people were putting out. And that really expanded my horizons. There was so much creativity, so many great ideas out there.

To give you an example, I remember when I first discovered this notebook written by Jonathan in R on U.S. flight tracking data. And I had never worked with geospatial data before. And his map visuals, they just blew my mind. Right? The way that he presented his insights was so clean and well thought out. There's a lot going on in this map. The different cities, the flight routes. But the design choices make this visual very clear and accessible.

So I wanted to learn about geospatial data. And then this competition launched, which was about predicting taxi trip durations in New York City. But it was really about writing notebooks that other people would find useful. Exploring the data, breaking it down, finding all the hidden insights. So I took my exploratory approach and I started to learn from scratch about maps and geo coordinates and all these things. And this is my very first interactive leaflet map that I built for this notebook that I wrote. And I'm still using leaflet today based on what I learned from this experience. It's a great tool.

And even better, my notebook ended up receiving the most community votes in this competition and winning me a prize. That was awesome. I mean, I was over the moon.

And I found my new passion. So over the following months, I jumped into almost every new competition that was launched so that I could learn about the specific tools that were needed to understand the data. And my goal was to be the first person to write a comprehensive exploratory data analysis for the competition data. So that when new people joined the competition, they could get a jump start. So EDA became my focus. And it allowed me to discover and to share lots of cool stuff.

So for instance, I joined a time series forecasting competition to learn about methods like ARIMA or tools like Prophet. And this particular example was about predicting web views for specific international Wikipedia pages. So here in this example plot, you see that the views for the Wiki page of the band Twenty One Pilots exploded dramatically in early 2016. I don't really know them. I don't know why that happened. But there are some interesting lag effects here as to how this rise happened in the different countries that are coded by different colors. And some interesting lags in these spikes too that pop up, right? So features like this made this competition interesting and challenging.

And I also joined my first NLP competitions, learning how to use Julia's and David's great tidytext package. And in this particular one, I was flexing my storytelling muscles. And I wrote my own chilling short story around the analysis of language used by classic horror writers like Shelley or Lovecraft, right? So you see that love and death are there at the center of most of these stories, intertwined in an eternal battle.

And with every new notebook and every new competition, I was learning something new. New tools, new tricks, and new ideas from reading other people's notebooks and exchanging thoughts with them. You see, one of the great things about Kaggle notebooks is that you can fork and tweak them directly on the website using this little button up there. And you see this often with competition modeling code when people build on each other's models and make them better step by step.

The Kaggle community is very good at learning from each other and at standing on each other's shoulders to see further than we otherwise could. And myself, I owe a lot to the support of the Kaggle community, which made me the very first notebooks grandmaster, which is a title that you can earn through community votes on your notebooks. And then you can take your achievements and your work and you can showcase it to the world through your Kaggle profile. For instance, to recruiters, right? So this is my profile, which kind of summarizes my top notebooks and data sets and so on in a very comprehensive way.

And I'm pretty positive that it helped me a lot in my transition from the ivory towers of academia to doing data science in the real world. In a lot of the interviews that I had with various companies, the lessons that I had learned on Kaggle, they were invaluable. And I'm sure the whole grandmaster thing didn't hurt either, right? But yeah, so I believe that the use of Kaggle as a portfolio remains very underrated even today. It's a great way to showcase your work, especially in notebooks. And I would really like to encourage you to make your Kaggle notebooks public. I've read a lot of notebooks over the last five years, and virtually all of them had something interesting that I could learn. And I felt very grateful towards the authors.

Hidden gems initiative

In fact, two years ago, I decided to start an initiative around the Kaggle notebooks. What I'm talking about is an initiative to discover and highlight great Kaggle notebooks, which, for whatever reason, haven't gotten the attention and kudos from the community that they deserve. I'm calling them hidden gems. And after reading hundreds of notebooks and writing quite a few myself, I would like to share with you what Kaggle taught me on how to write a great notebook, right? So here are a few tips that will be illustrated by different hidden gems.

So first of all, you want to introduce your work to capture your reader's attention and draw them into your analysis. A couple of topical images or illustrations go always well with a few opening paragraphs that explain the goals, the context, and set the scene for what's to follow. So like Laura here, you could use an image of cocoa beans to lead into your flavorful analysis of chocolate taste ratings.

Explaining your methodology will help your reader to follow your analysis through its various steps. And this can be part of your introduction. And you can even go as far as Ram Shankar did and build a flow chart out of little pictograms that explain the different steps of your analysis. And this is really some next level methodology communication right there.

Great visuals, of course. They are at the heart of many fantastic notebooks. And whether it's for exploration or for communication, you want your data visualizations to be clear and accessible in communicating the insights that you derived. And that can mean simple visuals, but it can also mean complex but well designed. So in this example here, Daniel was creating a visual for climate models and rising CO2 levels that would not look out of place in a Science or Nature paper or on some high-profile website. It's meticulously designed with insets and annotations and definitely takes a little time to digest. So like with all of them, if you want to learn more, definitely check out the notebook.

But visuals themselves are not enough, right? You also want to document your findings and what you read in those visuals, not just for your audience but also for yourself. So that if you come back to your notebook maybe six months later, you will thank past you for writing down what was so interesting about these plots that you made. And Parul here demonstrates this very well with her points to note, complete with a little pictogram as a visual anchor. And documentation is, of course, the cousin of narration. So you always want to have a nice narrative flow that leads your reader through your analysis from one step to another.

As my final tip, for today at least, you can take your visualizations and exploration from the exploration of data to the explanation of models. In one of the most outstanding hidden gems, Jack here was charting the journey of an image through a neural network, which is a fantastic title, by the way. Titles are important too. But what I want you to focus on is the way in which he uses the data visualizations to show how this handwritten digit gets transformed through the different layers of the neural network all the way down to the classification layer and what happens in between. All right, so if you want to learn more about deep learning or if you just want to read something really cool, check out Jack's notebook. It's great.

R on Kaggle and getting started

So some of you might have noticed that all of these examples that I showed you for hidden gems were from Python notebooks, as a bit of a contrast to the earlier R notebooks that I showed. And today Python is definitely the most popular language on Kaggle, especially for deep learning applications. But there's still plenty of great R content around. In the 300 hidden gems that I have collected so far, I found that three quarters of the notebooks were in Python and one quarter in R. And it shouldn't stop you if a notebook is in Python; you can still learn a lot from it.

In fact, one neat way of learning might be to take the ideas from a Python notebook and translate them into R, right? Or vice versa. That might be a good way for you to get started on Kaggle. But for the closing thought of this talk, I might have an even better way for you to start your own Kaggle journey. Because I've put together a Kaggle dataset specifically for this conference. And it contains the names of all the packages that are currently on CRAN together with quite a bit of the metadata and all their release history. So there are quite a few features in there that could be interesting for R users. And you can start digging into these features directly on the website by pressing this button there, which will create a new notebook and will get you to an editor and you can start analyzing the data right there on the website.

So I would like to invite you to go to the dataset page, for instance, by scanning this handy QR code that's there in the corner. Open up a new notebook and start your very own Kaggle journey right here, right today. And I will be looking forward to seeing all the great ideas and creativity that you will bring in. I would love to welcome all of you to the Kaggle community. And right now, I welcome your questions. Thank you.

Martin Henze | We R KaggleRs - At the Intersection of Data Science Communities | RStudio (2022)

Featured software

rstudio