Kshitij Aranke - Demystifying Data Modeling
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone. I'm Kshitij, a senior software engineer at dbt Labs, based out of London, so it's been great to come all the way here. Our community has really influenced a lot of how I think about data, so it's really encouraging to come back and contribute a little bit about data modeling and why it's important for data engineers, and, given the poll in the last talk, for data scientists too.
The agenda today is fairly simple: we're going to talk about what data modeling is, why it's important, and how you can get started. Before I start, a quick show of hands: if you know what data modeling is, how many of you would be comfortable explaining it to one of your colleagues? That's good. That's really the point of this talk, to get you there, so that you can tell them you learned this one thing at a conference, apart from other things.
The Jaffle Shop scenario
So welcome to the Jaffle Shop. This is an example we use at dbt a lot. A jaffle is really just an Australian grilled cheese, and this is our toy project: you've been hired at this jaffle shop to fix a problem, which is that sales have slowed a bit, and you want to dig in and reaccelerate them.
The prompt you've gotten from your boss is: can we identify high value customers with no recent orders, so we can email them a coupon? They may have just fallen out of the habit of ordering, and we want to email them and get them back. Notice that both of the phrases in italics, high value customers and no recent orders, are fairly vague, and your job as a data engineer is often to turn these into metrics you can go and measure.
So what do you have as raw ingredients? You have three CSV files, maybe a bit small to read here: customers, orders, and payments, which have come out of different systems. You're also given an ERD showing the relationships between them: orders.id maps to payments.order_id, customers.id maps to orders.user_id, and so on. So you've been given these two things.
What you decide to do is break up the two parts of the prompt: high value customers and no recent orders. For high value customers, you decide to look at the number of orders and the lifetime value, which is the sum total of how much a customer has spent with you. For no recent orders, you decide to look at the dates of their first order and their most recent order.
Building the analysis
I'm going to speed through these next few slides because most of you are probably very familiar with this, but this is the data extraction step. You read the CSVs in, you prep the orders by grouping by customer ID and computing some aggregates (first order, most recent order, number of orders), and you do the same on the payments: group by customer ID, and convert from cents to dollars. Very basic transformations.
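The prep steps described here can be sketched in pandas. This is a minimal illustration, not the speaker's actual code: tiny inline DataFrames stand in for the raw CSVs (in practice you'd use `pd.read_csv`), and the column names (`user_id`, `order_date`, `amount`) are assumptions based on the ERD.

```python
import pandas as pd

# Tiny stand-ins for the raw orders and payments CSVs
orders = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [10, 10, 20],
    "order_date": ["2024-01-05", "2024-02-01", "2024-01-20"],
})
payments = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [1500, 2000, 900],  # cents
})

# Prep orders: per-customer first order, most recent order, and order count
customer_orders = (
    orders.groupby("user_id")
    .agg(
        first_order=("order_date", "min"),
        most_recent_order=("order_date", "max"),
        number_of_orders=("id", "count"),
    )
    .reset_index()
)

# Prep payments: join back to orders for the customer key,
# sum per customer, and convert cents to dollars
customer_payments = (
    payments.merge(orders[["id", "user_id"]], left_on="order_id", right_on="id")
    .groupby("user_id")["amount"].sum()
    .div(100)
    .rename("total_amount")
    .reset_index()
)
```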
That gets us the number of orders, the first order, and the most recent order, and we still need customer lifetime value. So you take all of the data frames you've built and assemble them: you do left joins from customers onto orders and payments, rename total amount to LTV, and do a select to get the columns in the right order. This should be pretty standard for most of you.
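The assembly step might look like this in pandas. Again a sketch with assumed column names, where small inline frames stand in for the per-customer aggregates produced by the prep step.

```python
import pandas as pd

# Hypothetical per-customer frames, as produced by the prep steps
customers = pd.DataFrame({"id": [10, 20], "first_name": ["Ada", "Grace"]})
customer_orders = pd.DataFrame({
    "user_id": [10, 20],
    "first_order": ["2024-01-05", "2024-01-20"],
    "most_recent_order": ["2024-02-01", "2024-01-20"],
    "number_of_orders": [2, 1],
})
customer_payments = pd.DataFrame({"user_id": [10, 20], "total_amount": [35.0, 9.0]})

# Left joins from customers onto the aggregates, then rename and reorder
final = (
    customers
    .merge(customer_orders, left_on="id", right_on="user_id", how="left")
    .merge(customer_payments, on="user_id", how="left")
    .rename(columns={"total_amount": "customer_lifetime_value"})
    [["id", "first_name", "first_order", "most_recent_order",
      "number_of_orders", "customer_lifetime_value"]]
)
```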
Now you have everything you need, so you write it out. In my case I'm writing it out to a CSV, but you could write it to a database or anywhere else. Then you visualize it, in this case on a Shiny dashboard. There's a natural break around $30 of lifetime value, so you decide to take that as your break point.
You can play around with this a lot more, but long story short, we've converted this vague prompt of high value customers with no recent orders into customers with lifetime value greater than $30 and a most recent order more than 30 days ago. That second part is something you figure out as a reasonable assumption. Cool.
So you do that and it's a hit: the emails work on customers who had forgotten that they love your product, so they come back and order more from you, and all is well.
The analytics workflow
What we've actually stumbled on here is the analytics workflow: you have raw data, which is disorganized and maybe a bit messy; a transformation step, which is cleaning and reorganizing it, and testing and documenting the workflow and your data; and a final deployment step that produces clean data sets you can then use downstream in your BI tools, your ML models, or any other kind of operational analytics you may want to do.
The next thing that happens is you're about to go on holiday, and your boss asks you: can you document and test this workflow for the rest of the team? And so you say okay.
The first thing you do is refactor it a bit so it's cleaner, which hopefully you're doing anyway, and you create this new is_high_value field. Then you decide to document it. This is just Markdown, some notes: what is customers? It's a dimensional model. Dimensional model is data modeling terminology for saying this is an object, a noun rather than a verb: a thing, like a customer, that doesn't really change. How often does it get refreshed? Daily. And who maintains it? Growth and pricing.
Then the different columns on it, like customer ID, and of course, as you can see, we have to write some tests: the customer ID should always be unique, and it should never be null. Then a few other columns: first name, last name, first order, most recent order. For those last two it's useful to document that they're UTC datetimes, because time zones are hard, and we also want to add a constraint that the most recent order should be greater than or equal to the first order. Then number of orders and customer LTV; if people aren't familiar with that term, you should document what it is: the sum of the totals of all orders placed. And the new flag we've introduced, is_high_value: is the customer's LTV greater than or equal to 30?
I think one of the beautiful parts of documenting things this way is that your tests just naturally flow out of your documentation. It's not like you have to think of brand new tests to write. From that documentation we get: customer ID must be unique and not null; high value means lifetime value greater than or equal to 30; most recent order greater than or equal to first order; and one new test, because we said the data is refreshed daily, a loaded-at freshness test that says the data must be fresher than 36 hours in this case.
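Those documentation-derived tests can be written down almost mechanically. Here's a sketch with plain pandas assertions (dbt would express these as schema tests, but the idea is the same); the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical final customers table, shaped as documented
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "first_order": pd.to_datetime(["2024-01-05", "2024-01-20"], utc=True),
    "most_recent_order": pd.to_datetime(["2024-02-01", "2024-01-20"], utc=True),
    "customer_lifetime_value": [35.0, 9.0],
    "is_high_value": [True, False],
})

# customer_id must be unique and never null
assert customers["customer_id"].is_unique
assert customers["customer_id"].notna().all()

# most_recent_order must be >= first_order
assert (customers["most_recent_order"] >= customers["first_order"]).all()

# is_high_value must agree with the documented threshold (LTV >= 30)
assert customers["is_high_value"].equals(customers["customer_lifetime_value"] >= 30)

# freshness: the data must be fresher than 36 hours
loaded_at = pd.Timestamp.now(tz="UTC")  # in practice, read from load metadata
assert pd.Timestamp.now(tz="UTC") - loaded_at < pd.Timedelta(hours=36)
```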
Write, audit, publish
There are a million different ways to deploy a data engineering workload, so I'm not going to talk about any of them. I'm just going to talk about a common pattern that's pretty useful to think about when you do this: write, audit, publish. The write step is that when you have new data, you write it to a separate place, in staging; you don't write it to production. Then you audit: you run tests on this new staging data to make sure the data quality is high and it meets all the checks you'd expect. And finally you publish it to production.
The goal of this is that there should never be bad quality data in production. The trade-off you're making is that the data in production might be slightly stale, and that's okay: you'd rather bias towards high quality data, and then you can go fix whatever isn't.
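The write-audit-publish pattern can be sketched as follows; this is a toy illustration in which in-memory dicts stand in for the staging and production schemas, which in a real warehouse would be separate schemas or tables.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> None:
    """Quality checks that must pass before data reaches production."""
    assert df["customer_id"].is_unique
    assert df["customer_id"].notna().all()

def write_audit_publish(new_data: pd.DataFrame, staging: dict, production: dict) -> None:
    staging["customers"] = new_data                 # 1. Write to staging, never production
    audit(staging["customers"])                     # 2. Audit the staged copy
    production["customers"] = staging["customers"]  # 3. Publish only if the audit passed

staging, production = {}, {}
write_audit_publish(pd.DataFrame({"customer_id": [1, 2]}), staging, production)

# A batch with a duplicate key fails the audit;
# production keeps the last good (slightly stale) data.
try:
    write_audit_publish(pd.DataFrame({"customer_id": [3, 3]}), staging, production)
except AssertionError:
    pass
```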
Standardizing definitions
So we're back at this part of the diagram, the transformation block: we've developed this analysis, we've tested and documented it, and we've deployed it. We've gone through that whole life cycle, and you're at happy hour when one of your colleagues from another team comes up to you and says, hey, we have something similar: we tag customers with more than three orders as prime. You probably know where this is going.
Now you have a problem: you have two different definitions for what is essentially the same business metric. What you have to do now is standardization, which is a really, really hard organizational challenge, not a technical one. You have to get two different teams, often two different VPs, to talk to each other, agree on a definition, do the analysis for how metrics might change, and communicate that change to all your different stakeholders.
This is hard work, but it is so very worth it, just so that when a member of another team talks about a metric, you know exactly what they're saying and what the caveats of that metric are going to be. This is an important part of a data engineer's job that often isn't really captured, partly because it isn't technical, but I think it's important all the same.
I had to put this slide in because these things often take a lot of time. One eternity later, you decide on a standardized definition, which is that you do both: lifetime value greater than or equal to 30, and number of orders greater than three. You've redefined the is_high_value flag to include both definitions. It doesn't always go this way, sometimes the definitions stay different, but this is what you've landed on.
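Once you have the per-customer aggregates, the standardized definition is easy to express. A sketch with hypothetical data; note the customer with a high LTV but only two orders no longer qualifies under the combined definition.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "customer_lifetime_value": [45.0, 32.0, 10.0],
    "number_of_orders": [5, 2, 7],
})

# Standardized definition: BOTH conditions must hold
customers["is_high_value"] = (
    (customers["customer_lifetime_value"] >= 30)
    & (customers["number_of_orders"] > 3)
)
```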
What we've actually stumbled on is that we've worked backwards into a definition of data models that I like: a data model is a structured representation (in our case a CSV file; in your case it might be a database table, or any other format you put your data in) that organizes and standardizes data to enable and guide human and machine behavior, inform decision-making, and facilitate actions. We've combined data from different sources (your raw customer data, your raw payments data, your raw transactions data), and we've standardized the data across two teams that had two different definitions of what a high value customer is. I think that last part, facilitating actions, is the most important part of this definition. The reason we all do data work is so that we can facilitate some action within our organizations.
And I think it's really important to keep that in our sights when we build a data model: what are we building it for?
Why model data and how to get started
So hopefully I have you convinced, but if not, here's why you should model data in the first place. All this talk is fine and dandy, but what are the practical benefits? For me, it all boils down to this: it helps you scale. This is a pretty picture of a DAG, a directed acyclic graph of all your transformations, where you can think of each node as one of the transformations you've done.
Even at this scale, it's tricky to get it to work reliably if you don't do the things we've been talking about: documenting your code, testing your code, using write, audit, publish. So it's hard at this scale, but it's almost impossible at this next scale, which is my company's internal data transformation framework. There are just too many lines going everywhere, too many nodes; you can't even read what's on the screen. Without these practices, it's essentially impossible to go in, figure out what's going wrong, and fix it in a timely fashion. This is where the tool of data modeling really helps you scale and operate at this level.
So, cool, how do you get started? I think the most important thing to take away is that data modeling is a process; there's no end state. It's something you keep doing within your organization. So the tips I have: start small. You don't have to take two weeks off your sprint and just document everything; that doesn't really work. If there's something that's confusing to newcomers, start documenting that. If there's a dataset or a metric that's not super intuitive, start documenting that.
Keep iterating: different teams and different organizations will have different things that work for them, so figure out what works for you. And lastly, because it's a process, you should have fun with it. There's nothing better than going off on holiday, having your team change something, and coming back and immediately knowing what changed, because the documentation was there, the tests were there, things were standardized, all of that.
The toolkit is the stuff we've already talked about, the three things you can start doing: documentation, testing, and standardization. Standardization I've drawn as a present, because it is a present to your organization.
And the last thing I'll say is that this is a tool-agnostic practice. You can do data modeling in Excel; I'm sure none of you here will, but you could. That's kind of the point: you can do it in any tool. The company I work at, dbt Labs, makes a tool for data modeling with all of these practices built in in a natural way, so that's an easy way to get started. You can write Python or SQL in dbt, and it works on a bunch of different data warehouses.
So congrats, you're now equipped to model data from new sources (if you bring in a new payment source, say), with compound metrics (if you want the ratio of customer acquisition cost to lifetime value, say), and of course with changing definitions, because even if your code is right, your organization will change over time, and your definitions and your data need to change with it.
And lastly, if you want to explore further, I think Fundamentals of Data Engineering by Joe Reis and Matt Housley is a good textbook on the topic. And that's all I had. Thank you.
