Resources

Rebecca Hadi | Exploring Query Optimization: How a few lines of code can save hours of time

video
Oct 24, 2022
15:34

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

My name is Rebecca Hadi, and I'm here today to tell you about a time I crashed a database, and about the techniques I learned to keep that from ever happening again, which I'm going to share with you.

I was working on the 31st floor of an office building in downtown Seattle, with a beautiful view of Puget Sound and the Olympic Mountains. I was on the marketing analytics team at a large fashion retailer, working on a project to migrate marketing attribution reporting from our old clickstream vendor to an in-house solution.

This involved doing a backfill. Has anyone in the room ever done a data backfill? Yeah, I see a few hands. Has anyone ever had performance issues with that backfill? I certainly have.

So this project required a backfill of a year and a half of historical clickstream data. At that point, I had been a data analyst for about three years, and through that time I had written SQL queries almost every day. So naturally, I thought I was an expert.

So I sat down at my desk, put on my headphones, turned on a score by Trent Reznor and got to work.

The fateful backfill

I was working in an Amazon Redshift database with page view level website data. And for those of you who may not be familiar with website data, any time that you hover or click or view a page, it's generating data. So you can imagine that that can be a pretty large set of data for a single person, not to mention a website that has millions of visitors a year.

In the past, I had worked at health insurance companies. And so I was used to working with medical claims data. And that data was more in the millions of rows. This data set was 4 billion rows.

So I had my query, and I sat down at my desk with my two monitors. On the right-hand side, I had my SQL IDE, and I set off my backfill. On my left monitor, I had some documentation open that I was working on. I updated some JIRA tickets. And let's be honest, checked my Twitter feed.

After getting sucked into that for about 45 minutes, I turned to my screen and I noticed that my query was still running. And I thought, hmm, that doesn't seem very good.

So I tried to cancel it. And I got the rainbow wheel of death and had to kill it. And then I tried to run it again. And about an hour later, there was still a problem. So I tried to cancel it again. And again, it crashed.

So I thought to myself, what is going on with this? I'm a SQL expert, and obviously my code is perfect, so it can't be anything that I'm doing. The 31st floor held the entire data science and analytics team, about 100 people made up of data analysts, engineers, and scientists. So I thought, it's probably just a busy time on the database. I'm probably running into some concurrency issues, and it's just slow in general.

Then I had a brilliant idea. I'll just run it at night. Traffic on the database will be low and everything will be fine. So I went home, took the bus, and before I went to bed that night, I set off my query to run into the night.

I woke up a little bit early the next morning to a Slack notification on my phone from the database administrator who had never direct messaged me before. And all it said was, hi.

Oh, fudge, what have I done? My heart sank because I knew that something was wrong. It turned out that the query I had set off ran for over seven hours. And not only that, it took compute resources from other nightly ETL jobs, which made them run slower, which made them finish later, which ended up delaying all of our business reporting for that day.

I was mortified. I had not only made my life harder, I had made the database administrator's life harder, my team's life harder because their stakeholders' reporting was delayed as well. And needless to say, the business was not very happy with me.

But lucky for me, the DBA reached out to me and offered to help me rewrite this query. So I swallowed my pride, walked down to the 25th floor where the engineering team sat, pulled up a chair, and had a one-hour meeting that changed the course of my SQL life forever. And what I learned, I'm going to share with you today.

Explain plans, distribution styles, and sort keys

Imagine for a second that you're going to explore a dark cave. What items might you want to help you? For me, I would want a map. I would want to know where I'm going. Does this cave even have an exit? How many tunnels are there?

I would also want a team or a cavern crew. I would want people to help carry equipment and make the journey easier. It's also very dark in this cave, so a flashlight would be helpful to be able to see where I'm going.

An explain plan is like a map. It returns a step-by-step list of actions that your query will take to return your dataset. It also returns the estimated computational cost of that query. It's very easy to run. You only have to put the word explain in front of your query and run it.

The first time I saw an explain plan, I was very overwhelmed. I swear my head exploded. But it turns out it's a very helpful tool, because that computational cost can tell you whether the changes you're making to your query are actually helping. In my case, instead of running my query, waiting an hour, and killing it, over and over, I could have run explain plans and seen whether my changes reduced the computational cost without waiting that extra hour. So explain plans are very useful.
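As a sketch of how easy this is, you prefix the query with the word explain and Redshift returns the plan instead of executing anything (the table and column names below are illustrative, not from the talk):

```sql
-- Prefix any query with EXPLAIN to see its plan and estimated cost
-- without actually running it against the data.
EXPLAIN
SELECT user_id, COUNT(*) AS page_views
FROM page_view_events
WHERE event_time >= '2022-01-01'
GROUP BY user_id;
```

Redshift returns one row per plan step, each with a relative cost estimate, something like `XN Seq Scan on page_view_events (cost=0.00..937500.00 rows=... width=...)`. Comparing those cost estimates before and after a change tells you whether the change helps, without waiting for the full query to finish.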

For our cavern crew or cave team, we want everyone to be contributing equally. We don't want one person to be carrying all of the backpacks and the other people to not be carrying anything. That doesn't seem, one, very nice, or two, very efficient.

From a query perspective, compute nodes are like your cave team. You want all of your compute nodes contributing equally and leveraging all of their resources. To make sure that happens, you need to specify an appropriate distribution style, which determines how your data is spread across the compute nodes. A good choice keeps that distribution even, without significant skew.

But how do you do that? How do you know what's the right distribution style? You think about which column you're going to join on.

If you're planning on joining on a unique ID, such as an order ID or a customer ID, then dist style key is what you want to use. And the choice of your dist key depends on which column you are going to use to join to another dataset.

Dist style all can be dangerous. It copies your entire table onto each of the compute nodes. For a small, rarely changing lookup table that you join to frequently, that can be fine, because every node has a local copy. But for anything large, you're paying that storage and load cost on every node, so I personally wouldn't reach for it by default. And if you're not sure what you're going to join on, maybe you have a table with a lot of rows but no good unique ID or candidate for a dist key, and it's a combination of fields that's making it have a lot of rows, then dist style even, which spreads rows round-robin across the nodes, is a good candidate.
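A minimal sketch of what each distribution style looks like in Redshift DDL (the table and column names are made up for illustration):

```sql
-- DISTSTYLE KEY: rows with the same dist key hash to the same node,
-- so a join on that column can avoid moving data between nodes.
CREATE TABLE web_orders (
    order_id   BIGINT,
    session_id BIGINT,
    user_id    BIGINT,
    order_date DATE
)
DISTSTYLE KEY
DISTKEY (session_id);

-- DISTSTYLE ALL: a full copy of the table on every node; only
-- reasonable for small lookup tables.
CREATE TABLE marketing_channels (
    channel_id   INT,
    channel_name VARCHAR(64)
)
DISTSTYLE ALL;

-- DISTSTYLE EVEN: rows are spread round-robin; a safe choice when
-- there is no single good join column.
CREATE TABLE raw_events (
    event_payload VARCHAR(2000),
    loaded_at     TIMESTAMP
)
DISTSTYLE EVEN;
```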

And the beauty of the explain plan is that when we make these changes to our table and try to run our query, we can run an explain plan to know if those changes actually made a difference.

Your sort key is like a flashlight. It helps you only search relevant data. Imagine that you have a dataset that has three years of data, 2020, 2021, and 2022. I only care about 2022. I'm living in the present. So I have a where clause that filters on where year equals 2022. And my result set would be filtered.

However, if the table I'm querying from does not have a sort key, then even though my result set is filtered, the query had to scan the entire table to find it, which is not very efficient. When you specify a sort key, your query knows immediately where to find the data you're looking for, so it reduces the computation needed to pull back your result set.

To decide on a sort key, you think about which column you will filter on. A lot of times that's a date.
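As a sketch, the sort key is declared in the table DDL, and a range filter on that column lets Redshift skip the blocks that can't match (names here are illustrative):

```sql
-- Sorting on event_time means a date-range filter only touches the
-- blocks covering that range instead of scanning the whole table.
CREATE TABLE page_view_events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_time TIMESTAMP
)
SORTKEY (event_time);

-- This filter now reads only the 2022 portion of the table.
SELECT COUNT(*)
FROM page_view_events
WHERE event_time >= '2022-01-01';
```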

Applying the lessons to the real dataset

This is an example of the dataset that I was working with on that fateful night when I ruined everything. There's a user ID, a session ID, an event ID, and an event time. There's also an event type and a marketing tag. I was working on the marketing analytics team, and so I cared about traffic and sales driven by marketing.

Based on what we just learned, we know that distribution style key is a good fit for this table because it has unique identifiers. But which unique identifier would we choose as a dist key? The answer is: it depends. If I were keeping the dataset at this level, I would want to specify the most granular ID as the dist key. With this dataset, I would most likely join to another table on the event ID, so that should be the dist key.
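Putting that choice into DDL, a sketch of this page-view-level table might look like the following (the column names come from the talk; the types and sort key choice are assumptions):

```sql
CREATE TABLE page_views (
    user_id       BIGINT,
    session_id    BIGINT,
    event_id      BIGINT,       -- the most granular ID at this level
    event_time    TIMESTAMP,
    event_type    VARCHAR(32),
    marketing_tag VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (event_id)              -- the column we'd join on at this grain
SORTKEY (event_time);           -- the column we'd filter on
```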

Pulling together everything that we have learned, on the left-hand side, I have a query that takes that page view level data that we were just looking at and aggregates it to the session level. If you can imagine landing on a website, you can have multiple page views during a session, and so I cared about what marketing is at the session level.

And so when I'm creating this temporary table, I've specified my distribution style as key, and I've specified my dist key as session ID because that's what I'm going to use to join to another dataset. So I know that all my compute nodes will be contributing equally. I also have my flashlight or my sort key on session start time because that's what I'm going to use when I end up filtering from this temporary table.

On the right, I have a query that leverages the table I just created and joins it to another table called web orders, joining on user ID and session ID. Because we specified session ID as the distribution key, the matching rows already live on the same compute nodes, so the join can run without shuffling data between them, and it runs much more efficiently. And because the sort key is session start time, and I'm filtering on anything greater than January 1st of 2022, my query doesn't have to scan data it doesn't care about. So we've saved time there as well.
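The slides themselves aren't in the transcript, so here is a sketch of what the two queries described above might look like (table and column names are reconstructed from the description, not copied from the talk):

```sql
-- Left side: aggregate page-view data to the session level, with the
-- dist key on session_id (the future join column) and the sort key on
-- session_start_time (the future filter column).
CREATE TEMP TABLE sessions
DISTSTYLE KEY
DISTKEY (session_id)
SORTKEY (session_start_time)
AS
SELECT
    user_id,
    session_id,
    MIN(event_time)    AS session_start_time,
    MAX(marketing_tag) AS marketing_tag
FROM page_views
GROUP BY user_id, session_id;

-- Right side: join the session-level table to web orders. Assuming
-- web_orders is also distributed on session_id, the join happens
-- locally on each node, and the range filter uses the sort key.
SELECT
    s.marketing_tag,
    COUNT(DISTINCT o.order_id) AS orders
FROM sessions s
JOIN web_orders o
    ON  o.user_id    = s.user_id
    AND o.session_id = s.session_id
WHERE s.session_start_time >= '2022-01-01'
GROUP BY s.marketing_tag;
```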

Results and lessons learned

We've run our query. We have our result set in a reasonable amount of time. We've made it to the light at the end of the tunnel on the other side of the cave, which is a big win.

Some other wins that came out of this experience. For one, my query that had run overnight for seven hours ended up running in under one hour during a high traffic time on the database. So ultimately, the runtime was reduced by six hours because of these changes. I also built a strong relationship with the database administrator that was very useful for future cross-functional projects between the data science team and the engineering team.

My favorite lesson came from documenting my findings and sharing them back with my team. I helped them learn these techniques and apply them to their own queries so they could avoid the mistakes I made. And the process of sharing my experience with my team also gave me a paradigm shift toward embracing mistakes. Because mistakes, even ones with a real negative business impact, can help you learn a lot.

So if you find yourself with a long-running query, there are a few things you can do. You can run an explain plan. You can check the distribution style. And you can specify sort keys. And that is how you can save hours of time with just a few lines of code. Thank you.