Liz Roten | Oddly Satisfying - Find delight in the mundane | RStudio (2022)

video
Oct 24, 2022
17:14

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, my name is Liz Roten, and today I'm going to talk to you about how you can take the project that you really hate and turn it into something oddly satisfying.

So this adventure starts with a truck. Or rather, a truck corridor study. My manager pitched this project to me as a straightforward, quick data science win. The only thing we needed to do was update the data. All of the methods and extra analysis were settled. It was going to be very easy. And it wasn't.

And my first inkling of this was when I started to look into the project materials. They were bad. They were pretty bad. And how bad are we talking here?

So on the left, you can see a bunch of files, and the font is way too small, because there are simply too many. And there are a couple of things I want to point out. So there are multiple final reports. And the problem is that when you have multiple things and you name all of them final, the word final stops meaning anything. There's no telling what actually truly is the final document.

There are also multiple documents that were marked as revised. And marked revised maybe August 11th or August 14th. However, when this project first took place, it took place over more than one year. So was it August 2017 or 2018? I don't have any information about that. And I don't know who revised it, when they revised it, or why it was revised.

And there is a single Excel workbook with all the analysis, including a worksheet with over 15,000 rows and a whole bunch of nasty formulas that will make your skin crawl.

Oddly satisfying as a mindset

And so now that I've introduced you to this mess, I'd like to pivot to something a little better. And it's this Internet phenomenon of things that are oddly satisfying. So oddly satisfying things are a genre of content. And it's generally something like someone taking something super dirty and making it pristinely clean. Or painting something very precisely.

I, in my opinion, find this tiny crab holding up a tiny fish very satisfying. With the composition, you have the yellow and the blue and the foreground and the background. I think it's satisfying.

So this first video is of a person with a smoothie. They add this purple powder into the smoothie. And you see the color come into the smoothie as it's, like, swirling around. And it's very mesmerizing. And lovely.

And the next one is of a person using a turntable to ice a tart. So you see the little tart, you see the turntable turning and a person using a piping bag to pipe this perfect spiral. And then once the spiral is complete, they grab a blowtorch, get a little color on there. And finally, they pan over to all of the other absolutely beautiful, identical tarts. And there's something about these videos that just soothes my soul.

And so I had this moment of what if I take the truck study and make it this satisfying? I essentially decided to take my truck that made me very sad and turn it into the best truck that has ever existed.

Three steps: assess, do, and leave a gift

And I learned a few things along the way that I'd like to share with you. And they come in three. First, assess the damage. Second, do and document. And finally, leave a gift.

So let's assess the damage. You're essentially going to complete a project intake. Start by reading through everything. And I mean actually everything. The more you read through now, the fewer surprises you're going to have later on and the better feel you're going to have for the project itself.

As you're doing this, you're going to make a hate list. Which is a list of all of the things about this project that you hate. Maybe the documentation is vague. Maybe the documentation and the calculation aren't doing the same thing that they should be doing. Whatever it is, write it down and put it aside for later.

You're also going to replicate your findings or the results from the previous study, if applicable. And this is going to test your knowledge of this project. And it's going to give you an opportunity to work through it hands on. And even if the numbers don't align exactly, you need to be able to explain why they don't align exactly. Was it a rounding issue or a data source issue? Whatever it is, you just need to explain it.
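One way to make the replication step concrete is a small check that flags every figure that does not match the earlier study, so that each gap can be explained. This is a minimal Python sketch: the corridor names, the values, and the tolerance are hypothetical and chosen only for illustration.

```python
import math


def compare_results(previous, replicated, tolerance=0.05):
    """Return {key: (old, new)} for every value that differs by more than `tolerance`."""
    mismatches = {}
    for key, old in previous.items():
        new = replicated.get(key)
        # A missing value, or a gap larger than the tolerance, needs an explanation.
        if new is None or not math.isclose(old, new, abs_tol=tolerance):
            mismatches[key] = (old, new)
    return mismatches


# Hypothetical numbers, not taken from the actual study.
previous = {"corridor_a": 12.3, "corridor_b": 27.0}
replicated = {"corridor_a": 12.31, "corridor_b": 25.8}
mismatches = compare_results(previous, replicated)
```

Here `corridor_a` passes as a rounding-level difference, while `corridor_b` lands in `mismatches` and becomes something to explain, for example a change in data source.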

And the most important part of your intake is finding your thing. And your thing is this one aspect, or multiple aspects, of your professional life that makes you excited. Something that you know you can work on and you can be proud of. You're going to find a way to apply your thing to this project.

Maybe you're really into plots. You're going to go out and find all of the ggplot2 extensions and make beautiful plots. Or perhaps you're into maps. You're going to make some maps with some excellent legend work and it will all look wonderful. Or maybe you're into speed and efficiency. Break out all of your vector-based programming skills and make this thing lightning fast. Whatever it is, it's important that it's yours and it's important that you get excited about doing it.

Your thing is what's going to power this project and it will also keep you sane.

And my thing is documentation. If you're ever unsure of what your thing could be with a project, try documentation. And that leads us to the second part. To do and document.

Do and document

So you're going to do the thing. Complete the work. And you're going to document along the way. There is one sentence that I think encapsulates why documentation is so important better than anything else. And that is this quote: documentation is a love letter to your future self. And this is the attitude you want to go into documentation with.

And there are folks across this conference that will preach to you the gospel of documentation. But there are a couple things I would like to add in. One of them is to modularize. Which we did hear about a bit this morning in the keynote.

So like this beautiful tart, every slice can be divided up equally and stand on its own. You want to treat your code like this. Generally, how do you know if you need to modularize your code? If you have a single script with more than 500 lines, too long. You should consider modularization. If you're doing more than two scrolls, it's not going to be sustainable.

So on the left, there is some code that does some spatial processing and will take a while to run. It also depends on loading in a bunch of datasets and having libraries in your environment. And that's fine. It will run. And the script is going to take a long time. But if you modularize, you can create scripts that are named according to what they do. So load packages does nothing but load packages for a solid 15 lines. Load data only loads the data. It doesn't do any extra modeling or spatial processing. It just loads the data. And they don't have to be completely independent. If they need to run sequentially, you can name them 1, 2, 3, and that gives you a signal that they need to be run sequentially.
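The numbering convention can be sketched in a few lines. This is a minimal Python illustration, assuming hypothetical script names: it orders scripts by their leading number, so that a script numbered 10 runs after 2 rather than before it, as plain alphabetical sorting would have it.

```python
import re


def run_order(script_names):
    """Sort script names by their leading number."""
    def prefix(name):
        match = re.match(r"(\d+)", name)
        # Scripts without a numeric prefix sort last.
        return int(match.group(1)) if match else float("inf")
    return sorted(script_names, key=prefix)


# Hypothetical file names, named according to what each script does.
scripts = [
    "10_make_plots.R",
    "2_load_data.R",
    "1_load_packages.R",
    "3_spatial_processing.R",
]
ordered = run_order(scripts)
```

Zero-padding the prefixes (01, 02, ..., 10) would give the same order in an ordinary file listing without any sorting code.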

I'd also like to advocate for a bespoke readme or a very fancy readme. A readme is an explanatory document that accompanies computer files or software, according to Wikipedia. And I think it gets overlooked in data science from my observation. There are three parts of a readme that make it truly bespoke and fancy. Context, structure, and key information.

So your context should answer the questions, what even is this thing? Why does this thing exist? And what is its history? Answer these questions in the beginning of your readme, and it will ground this project in the truth and the true meaning of why you're even working on this.

The second is structure. How is everything organized? Where does the data live? Where do you actually do the thing to run the analysis and get the output? And this will also force you to take a step back from your code and think about it in a more structural manner.

And finally, key information. This is where you break out the hate list, and you're going to do the exact opposite of the hate list. You're going to turn it into a love list. And your key information should answer the question of, at the most fundamental level, what is the operative component and how is it done? So with the truck corridor study, it was a weighted mean, which wasn't apparent from the beginning. And so instead of leaving it a mystery, I added in a snippet of code that doesn't run by itself, but it gives the most important information.
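A readme snippet of this kind might look like the following. This is a minimal Python sketch of a generic weighted mean; the talk does not say which values were weighted or what the weights were, so the inputs here are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical inputs: three values, the last one counted twice as heavily.
score = weighted_mean([10.0, 20.0, 40.0], [1.0, 1.0, 2.0])
```

With these inputs the result is (10 + 20 + 80) / 4 = 27.5, compared with an unweighted mean of about 23.3.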

Leave a gift

And that's my next and final part: to leave a gift, not a mystery.

This is the energy you want to bring into your project debrief. The first thing to do is to take a break. Once the project is done, really truly actually done, shut it, put it away, and don't look at it for at least two, three weeks. And then you're going to come back in and condense and clarify.

So as I was working through the truck corridor study, I made something like nine R markdowns, and most of them did not go anywhere. Either they were a dead end or some exploratory analysis or they were worked into some other part of the analysis. But you wouldn't know that just from looking at these files. And if you were going to work on this project, you would have to read through all of these. So instead, I broke it down into three R markdowns. A readme, a bespoke readme, a summary, and methods. And it's much more clear and concise, and we have a couple aspects that are inside those markdowns.

In the markdowns, I've documented not only what, but why. So what? We used 2019 traffic data. That could be a full sentence. If I'm looking at this in the future, and I see that this project was taking place in 2021, and they'd already released the 2020 data, why in the world would they use 2019? And it's likely obvious to most of us that 2020 was quite anomalous, also known as a garbage fire, which is why we used the 2019 data.

Additionally, all of the corridors were given scores, and those scores were broken into tiers. And we set the tier breaks at 15.4 and 26.2 points. Those are very specific. And it was important, it's not 15.5, it's 15.4. And it's not 30, it's exactly 26.2. So why would we do that? Because our transportation planners used their best judgment given a wide array of factors outside of just these numbers.
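Recording the what and the why together can be as simple as a comment next to the break points. This is a minimal Python sketch: the two break values come from the talk, while the tier labels, and the choice that a score exactly on a break falls into the higher tier, are assumptions made for illustration.

```python
from bisect import bisect_right

# What: tier breaks at 15.4 and 26.2 points.
# Why: set by the transportation planners using their best judgment,
# given factors beyond the scores themselves.
TIER_BREAKS = [15.4, 26.2]

# Hypothetical labels, ordered from lowest scores to highest.
TIER_LABELS = ["tier 3", "tier 2", "tier 1"]


def assign_tier(score):
    """Return the tier label for a score (a score on a break goes to the higher tier)."""
    return TIER_LABELS[bisect_right(TIER_BREAKS, score)]


tiers = [assign_tier(s) for s in (15.3, 15.4, 26.2, 30.0)]
```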

I also really advocate for reproducibility. You want to record package versions. And there are lots of ways that you can do this, including tools like groundhog, pak, or renv. Take your pick. Just be consistent. I'm also a big fan of a good timestamped output. You can integrate a timestamp into the name of the file that you're outputting, or you can implement it in, like, a column in a data frame. Whatever you do, you need to have a year, month, and day. All three.
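Both habits fit in a short sketch. The tools named in the talk are for R projects; the Python below is a stand-in that uses only the standard library, and the file stem and function names are hypothetical.

```python
from datetime import date
from importlib import metadata


def timestamped_name(stem, extension, on=None):
    """Build a file name that carries year, month, and day, all three."""
    day = on or date.today()
    return f"{stem}_{day:%Y-%m-%d}.{extension}"


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# A fixed date is passed here so the example is repeatable.
output_name = timestamped_name("truck_scores", "csv", on=date(2021, 8, 11))
```

Writing the result of `installed_versions` to a file next to the outputs gives a future reader both the date and the package versions behind each result.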

So this project is done for me for now. And I would love to tell you about how I came to love trucks and see their importance in the Twin Cities region and how they're perfect for our future economic opportunities, but I still don't care about trucks. They're just not my thing.

But every single person in my office knows about the documentation I made for this project. And I am so proud of the work I did about something I don't care about. And so at the end of it, I'm feeling less like, oh, I don't care about this. I feel a lot like this little crab. But instead of a fish, I'm holding up a truck. Thank you.