Kshitij Aranke - Demystifying Data Modeling
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone. I'm Kshitij, a senior software engineer at dbt Labs, based out of London, so it's been great to come all the way here. Our community has really influenced a lot of how I think about data, so it's really encouraging to come back and contribute a little bit about data modeling and why it's important for data engineers, and, given the poll in the last talk, for data scientists too.
The agenda today is fairly simple: we're going to talk about what data modeling is, why it's important, and how you can get started. Before I start, a quick show of hands: if you know what data modeling is, how many of you would be comfortable explaining it to one of your colleagues? That's good. That's really the point of this talk, to get you there, so that you can tell them you learned this one thing at a conference, apart from other things.
The Jaffle Shop scenario
So welcome to the Jaffle Shop. This is an example we use at dbt a lot. A jaffle is really just an Australian grilled cheese, and this is our toy project: you've been hired at this jaffle shop to fix a problem, which is that sales have slowed a bit, and you want to dig in and reaccelerate them.
The prompt you've gotten from your boss is: can we identify high value customers with no recent orders, so we can email them a coupon? They may have just fallen out of the habit of ordering, and we want to email them and get them back. Notice that both of the phrases in italics, high value customers and no recent orders, are fairly vague, and your job as a data engineer is often to turn these into metrics you can go and measure.
So what do you have as raw ingredients? You have three CSV files, maybe a bit small to read here: customers, orders, and payments, which have come out of different systems. You're also given an ERD showing the relationships between them: orders.id maps to payments.order_id, customers.id maps to orders.user_id, and so on. So you've been given these two things.
What you decide to do is break up the two parts of the prompt: high value customers and no recent orders. For high value customers, you decide to look at the number of orders and the lifetime value, which is the sum total of how much a customer has spent with you. For no recent orders, you decide to look at the dates of their first order and their most recent order.
Building the analysis
I'm going to speed through these next few slides because most of you are probably very familiar with this, but this is the data extraction step. You read the CSVs in, you prep the orders by grouping by customer ID and computing some aggregates (first order, most recent order, number of orders), and you do the same on the payments: group by customer ID, and convert from cents to dollars. Very basic transformations.
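The prep steps described here can be sketched in pandas. This is a minimal illustration, not the speaker's actual code: tiny inline DataFrames stand in for the raw CSVs (in practice you'd use `pd.read_csv`), and the column names (`user_id`, `order_date`, `amount`) are assumptions based on the ERD.

```python
import pandas as pd

# Tiny stand-ins for the raw orders and payments CSVs
orders = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [10, 10, 20],
    "order_date": ["2024-01-05", "2024-02-01", "2024-01-20"],
})
payments = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [1500, 2000, 900],  # cents
})

# Prep orders: per-customer first order, most recent order, and order count
customer_orders = (
    orders.groupby("user_id")
    .agg(
        first_order=("order_date", "min"),
        most_recent_order=("order_date", "max"),
        number_of_orders=("id", "count"),
    )
    .reset_index()
)

# Prep payments: join back to orders for the customer key,
# sum per customer, and convert cents to dollars
customer_payments = (
    payments.merge(orders[["id", "user_id"]], left_on="order_id", right_on="id")
    .groupby("user_id")["amount"].sum()
    .div(100)
    .rename("total_amount")
    .reset_index()
)
```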
That gets us the number of orders, the first order, and the most recent order, and we still need customer lifetime value. So you take all of the data frames you've built and assemble them: you do left joins from customers onto orders and payments, rename total amount to LTV, and do a select to get the columns in the right order. This should be pretty standard for most of you.
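The assembly step might look like this in pandas. Again a sketch with assumed column names, where small inline frames stand in for the per-customer aggregates produced by the prep step.

```python
import pandas as pd

# Hypothetical per-customer frames, as produced by the prep steps
customers = pd.DataFrame({"id": [10, 20], "first_name": ["Ada", "Grace"]})
customer_orders = pd.DataFrame({
    "user_id": [10, 20],
    "first_order": ["2024-01-05", "2024-01-20"],
    "most_recent_order": ["2024-02-01", "2024-01-20"],
    "number_of_orders": [2, 1],
})
customer_payments = pd.DataFrame({"user_id": [10, 20], "total_amount": [35.0, 9.0]})

# Left joins from customers onto the aggregates, then rename and reorder
final = (
    customers
    .merge(customer_orders, left_on="id", right_on="user_id", how="left")
    .merge(customer_payments, on="user_id", how="left")
    .rename(columns={"total_amount": "customer_lifetime_value"})
    [["id", "first_name", "first_order", "most_recent_order",
      "number_of_orders", "customer_lifetime_value"]]
)
```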
Now you have everything you need, so you write it out. In my case I'm writing it out to a CSV, but you could write it to a database or anywhere else. Then you visualize it, in this case on a Shiny dashboard. There's a natural break around $30 of lifetime value, so you decide to take that as your break point.
You can play around with this a lot more, but long story short, we've converted this vague prompt of high value customers with no recent orders into customers with lifetime value greater than $30 and a most recent order more than 30 days ago. That second part is something you figure out as a reasonable assumption. Cool.
So you do that and it's a hit: the emails work on customers who had forgotten that they love your product, so they come back and order more from you, and all is well.
The analytics workflow
What we've actually stumbled on here is the analytics workflow: you have raw data, which is disorganized and maybe a bit messy; a transformation step, which is cleaning and reorganizing it, and testing and documenting the workflow and your data; and a final deployment step that produces clean data sets you can then use downstream in your BI tools, your ML models, or any other kind of operational analytics you may want to do.
The next thing that happens is you're about to go on holiday, and your boss asks you: can you document and test this workflow for the rest of the team? And so you say okay.
The first thing you do is refactor it a bit so it's cleaner, which hopefully you're doing anyway, and you create this new is_high_value field. Then you decide to document it. This is just Markdown, some notes: what is customers? It's a dimensional model. Dimensional model is data modeling terminology for saying this is an object, a noun rather than a verb: a thing, like a customer, that doesn't really change. How often does it get refreshed? Daily. And who maintains it? Growth and pricing.
Then the different columns on it, like customer ID, and of course, as you can see, we have to write some tests: the customer ID should always be unique, and it should never be null. Then a few other columns: first name, last name, first order, most recent order. For those last two it's useful to document that they're UTC datetimes, because time zones are hard, and we also want to add a constraint that the most recent order should be greater than or equal to the first order. Then number of orders and customer LTV; if people aren't familiar with that term, you should document what it is: the sum of the totals of all orders placed. And the new flag we've introduced, is_high_value: is the customer's LTV greater than or equal to 30?
I think one of the beautiful parts of documenting things this way is that your tests just naturally flow out of your documentation. It's not like you have to think of brand new tests to write. From that documentation we get: customer ID must be unique and not null; high value means lifetime value greater than or equal to 30; most recent order greater than or equal to first order; and one new test, because we said the data is refreshed daily, a loaded-at freshness test that says the data must be fresher than 36 hours in this case.
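Those documentation-derived tests can be written down almost mechanically. Here's a sketch with plain pandas assertions (dbt would express these as schema tests, but the idea is the same); the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical final customers table, shaped as documented
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "first_order": pd.to_datetime(["2024-01-05", "2024-01-20"], utc=True),
    "most_recent_order": pd.to_datetime(["2024-02-01", "2024-01-20"], utc=True),
    "customer_lifetime_value": [35.0, 9.0],
    "is_high_value": [True, False],
})

# customer_id must be unique and never null
assert customers["customer_id"].is_unique
assert customers["customer_id"].notna().all()

# most_recent_order must be >= first_order
assert (customers["most_recent_order"] >= customers["first_order"]).all()

# is_high_value must agree with the documented threshold (LTV >= 30)
assert customers["is_high_value"].equals(customers["customer_lifetime_value"] >= 30)

# freshness: the data must be fresher than 36 hours
loaded_at = pd.Timestamp.now(tz="UTC")  # in practice, read from load metadata
assert pd.Timestamp.now(tz="UTC") - loaded_at < pd.Timedelta(hours=36)
```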
Write, audit, publish
There are a million different ways to deploy a data engineering workload, so I'm not going to talk about any of them. I'm just going to talk about a common pattern that's pretty useful to think about when you do this: write, audit, publish. The write step is that when you have new data, you write it to a separate place, in staging; you don't write it to production. Then you audit: you run tests on this new staging data to make sure the data quality is high and it meets all the checks you'd expect. And finally you publish it to production.
The goal of this is that there should never be bad quality data in production. The trade-off you're making is that the data in production might be slightly stale, and that's okay: you'd rather bias towards high quality data, and then you can go fix whatever isn't.
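The write-audit-publish pattern can be sketched as follows; this is a toy illustration in which in-memory dicts stand in for the staging and production schemas, which in a real warehouse would be separate schemas or tables.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> None:
    """Quality checks that must pass before data reaches production."""
    assert df["customer_id"].is_unique
    assert df["customer_id"].notna().all()

def write_audit_publish(new_data: pd.DataFrame, staging: dict, production: dict) -> None:
    staging["customers"] = new_data                 # 1. Write to staging, never production
    audit(staging["customers"])                     # 2. Audit the staged copy
    production["customers"] = staging["customers"]  # 3. Publish only if the audit passed

staging, production = {}, {}
write_audit_publish(pd.DataFrame({"customer_id": [1, 2]}), staging, production)

# A batch with a duplicate key fails the audit;
# production keeps the last good (slightly stale) data.
try:
    write_audit_publish(pd.DataFrame({"customer_id": [3, 3]}), staging, production)
except AssertionError:
    pass
```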
Standardizing definitions
So we're back at this part of the diagram, the transformation block: we've developed this analysis, we've tested and documented it, and we've deployed it. We've gone through that whole life cycle, and you're at happy hour when one of your colleagues from another team comes up to you and says, hey, we have something similar: we tag customers with more than three orders as prime. You probably know where this is going.
Now you have a problem: you have two different definitions for what is essentially the same business metric. What you have to do now is standardization, which is a really, really hard organizational challenge, not a technical one. You have to get two different teams, often two different VPs, to talk to each other, agree on a definition, do the analysis for how metrics might change, and communicate that change to all your different stakeholders.
This is hard work, but it is so very worth it, just so that when a member of another team talks about a metric, you know exactly what they're saying and what the caveats of that metric are going to be. This is an important part of a data engineer's job that often isn't really captured, partly because it isn't technical, but I think it's important all the same.
I had to put this slide in because these things often take a lot of time. One eternity later, you decide on a standardized definition, which is that you do both: lifetime value greater than or equal to 30, and number of orders greater than three. You've redefined the is_high_value flag to include both definitions. It doesn't always go this way, sometimes the definitions stay different, but this is what you've landed on.
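Once you have the per-customer aggregates, the standardized definition is easy to express. A sketch with hypothetical data; note the customer with a high LTV but only two orders no longer qualifies under the combined definition.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "customer_lifetime_value": [45.0, 32.0, 10.0],
    "number_of_orders": [5, 2, 7],
})

# Standardized definition: BOTH conditions must hold
customers["is_high_value"] = (
    (customers["customer_lifetime_value"] >= 30)
    & (customers["number_of_orders"] > 3)
)
```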
What we've actually stumbled on is that we've worked backwards into a definition of data models that I like: a data model is a structured representation (in our case a CSV file; in your case it might be a database table, or any other format you put your data in) that organizes and standardizes data to enable and guide human and machine behavior, inform decision-making, and facilitate actions. We've combined data from different sources (your raw customer data, your raw payments data, your raw transactions data), and we've standardized the data across two teams that had two different definitions of what a high value customer is. I think that last part, facilitating actions, is the most important part of this definition. The reason we all do data work is so that we can facilitate some action within our organizations.
And I think it's really important to keep that in our sights when we build a data model: what are we building it for?
Why model data and how to get started
So hopefully I have you convinced, but if not, here's why you should model data in the first place. All this talk is fine and dandy, but what are the practical benefits? For me, it all boils down to this: it helps you scale. This is a pretty picture of a DAG, a directed acyclic graph of all your transformations, where you can think of each node as one of the transformations you've done.
Even at this scale, it's tricky to get it to work reliably if you don't do the things we've been talking about: documenting your code, testing your code, using write, audit, publish. So it's hard at this scale, but it's almost impossible at this next scale, which is my company's internal data transformation framework. There are just too many lines going everywhere, too many nodes; you can't even read what's on the screen. Without these practices, it's essentially impossible to go in, figure out what's going wrong, and fix it in a timely fashion. This is where the tool of data modeling really helps you scale and operate at this level.
So, cool, how do you get started? I think the most important thing to take away is that data modeling is a process; there's no end state. It's something you keep doing within your organization. So the tips I have: start small. You don't have to take two weeks off your sprint and just document everything; that doesn't really work. If there's something that's confusing to newcomers, start documenting that. If there's a dataset or a metric that's not super intuitive, start documenting that.
Keep iterating: different teams and different organizations will have different things that work for them, so figure out what works for you. And lastly, because it's a process, you should have fun with it. There's nothing better than going off on holiday, having your team change something, and coming back and immediately knowing what changed, because the documentation was there, the tests were there, things were standardized, all of that.
The toolkit is the stuff we've already talked about, the three things you can start doing: documentation, testing, and standardization. Standardization I've drawn as a present, because it is a present to your organization.
And the last thing I'll say is that this is a tool-agnostic practice. You can do data modeling in Excel; I'm sure none of you here will, but you could. That's kind of the point: you can do it in any tool. The company I work at, dbt Labs, makes a tool for data modeling with all of these practices built in in a natural way, so that's an easy way to get started. You can write Python or SQL in dbt, and it works on a bunch of different data warehouses.
So congrats, you're now equipped to model data from new sources (if you bring in a new payment source, say), with compound metrics (if you want the ratio of customer acquisition cost to lifetime value, say), and of course with changing definitions, because even if your code is right, your organization will change over time, and your definitions and your data need to change with it.
And lastly, if you want to explore further, I think Fundamentals of Data Engineering by Joe Reis and Matt Housley is a good textbook on the topic. And that's all I had. Thank you.
