
Nick Pelikan - Data Contracts: Keep Your Weekend Work-Free!

Oct 31, 2024
20:01


Transcript

This transcript was generated automatically and may contain errors.

So I'm here to talk about data contracts. Really quickly, just so I know: I've given this talk a couple of different times, and every time it ends up at about 30 minutes, so I ruthlessly cut it down before giving it today. So just a quick show of hands, so I know which slides to cut if Michael starts cutting me off. Who here works as, or has the job title of, data scientist? Who here is a data engineer? Who here is a data scientist but does a lot of data engineering? That's how it goes. That's how I got into this.

So as Taylor mentioned, I'm a solution architect at Posit. But before I was at Posit, I spent about 10 years working in data. I spent a lot of time doing data science, doing data engineering, and managing data teams of both data scientists and data engineers. And from that, I have a confession. I have lost so many weekends of my life to this gig. I have gotten so many texts like the one you're seeing down on the lower right there, where I hear from my boss way too late at night: hey, this really critical dashboard's breaking. We need your help. Get online now.

So my big takeaway from my time in this industry has been that working with data is super fun, super rewarding, super challenging, but it can be so frustrating. And just before I jump into it, one quick note about this talk. This talk really comes from the perspective of a data scientist, but I really hope that those of you who raised your hands and are working as data engineers will find it instructive.

Trust and the data input problem

One thing I've learned from being a data scientist and from being a data team lead is that trust is everything. Trust in your outputs is critical. This is probably not new information to anyone in this room, but without trust in your data outputs, your stakeholders aren't really going to want to use them. It's going to limit your impact in the business because they're going to instinctively question everything you do. And probably for good reason, if they constantly see something breaking.

Well, one question that I think every one of us has run into at different points is: how much trust can I have, and how much control do I have, over the inputs into my data outputs, into those beautiful Quarto dashboards you're making, into those Shiny apps? How much do you really know about what goes into them? And this level of disconnection between data input and output is accelerating. In a lot of mid to large organizations, and I know this is true for a lot of you, the people who are producing the data that you're putting into your data outputs are probably on different teams than yours. You can think of a team of data engineers. They're producing the data that you're consuming. You don't have control over that team, and the data industry is moving further and further in that direction. Just by show of hands, who here has heard of the concept of data mesh?

Data mesh is decentralizing and a lot of companies are moving towards more of a data mesh architecture, which means that there's going to be even more disconnection between people producing data and people consuming it.

Core problems data contracts solve

So all of those issues I just mentioned, all of those things that we're dealing with as data scientists, lead to a couple of different problems, and we deal with these every day. First up, reliability. We've all run into reliability problems. If I'm building an output, will that data be there when I go to consume it? Consistency. Do I have any guarantee that that data is not going to change? And autonomy. Can I build things myself on top of a product being built by another team without their help? And all these problems really boil down to one central issue. If you build a data output, if you build a machine learning model, a Shiny app, a Streamlit app, a Quarto document, by default, you're the one responsible for that data input.

So I propose one solution to this, which is data contracts.

What is a data contract?

So what is a data contract? Because I was a grad student once, of course, I looked it up in the dictionary. From the Oxford English Dictionary, a contract is a written or spoken agreement that's intended to be enforceable. And an agreement from the Oxford English Dictionary is harmony or accordance of opinion or feeling in a negotiated arrangement. Note that this definition does not include technology. A data contract is an agreement, not a piece of technology.


So see, just like Fred from Scooby-Doo there, surprise, this is not a talk about technology. This is a talk about organizational dynamics. Nobody left for the door. Awesome. But we will talk a little bit of technology.

What makes a good data contract

Let's talk about what makes a good agreement between a data producer and a data consumer. First up, people. People are the most important part of this puzzle, of the data input puzzle. They come first always. Every data contract should be people first. And there's three main people or groups of people that you really want to put into any good agreement about data.

First up, ownership. Who owns the contract? Who maintains the contract? If, let's say, somebody producing the data and somebody consuming the data disagree, who makes the call? Who decides what goes into a data contract and what's left out? That's the owner. Next, the person responsible. This is what most people think about when they think data contract. The person responsible is the person who gets the call when the contract is broken. When one of the expectations of a data contract is broken, who gets the 5 a.m. call? And consulted. Consulted is usually where you, the data scientists, as the data consumers come in. The people consulted for a data contract are the people using the data, whose work depends on it. Who needs to know when the contract is broken?
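As a minimal sketch of how those three groups could be written down next to the data, the following uses a small Python dataclass. The class name `ContractPeople`, its fields, and the team names are all hypothetical, chosen only for illustration; the talk does not prescribe any particular structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractPeople:
    """The groups of people named in a data contract (hypothetical structure)."""

    owner: str  # decides what goes into the contract and settles disagreements
    responsible: str  # gets the call when an expectation is broken
    consulted: tuple[str, ...] = ()  # consumers whose work depends on the data

    def notify_on_break(self) -> list[str]:
        """Everyone who needs to hear about a broken contract, responsible first."""
        return [self.responsible, *self.consulted]


# Hypothetical team names, for illustration only.
people = ContractPeople(
    owner="data-platform-lead",
    responsible="orders-pipeline-oncall",
    consulted=("analytics-team", "forecasting-team"),
)
print(people.notify_on_break())
```

Keeping the owner separate from the notification list reflects the distinction made above: the owner makes the call on what the contract says, while the responsible and consulted parties are the ones who need to hear about a break.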

Expectations in a data contract

And those of you who are familiar with data contracts may be expecting this next one. One of the core tenets of a successful data contract is really well laid out expectations. That's when you, the data consumer, and the data producer come to an agreement on what form that data should take.

So a couple of things that you should always have in a data contract, I propose. Number one, a schema. That's what most people think about. The data schema is: what form does the data take? What columns are in the data? What data types are those columns? Next, valid values. Really quickly, who was in Hadley's R in Production workshop on Monday? Hadley showed an example of a data input to a data output that had a temperature column that changed from Fahrenheit to Celsius. That's a great example of some valid values that you should have in a data contract. You should have an expectation of what values are going to go into your output.
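To make the schema and valid-values expectations concrete, here is a minimal sketch in plain Python. The table, the column names (`station_id`, `temp_f`), and the range bounds are assumptions made up for this example, and `check_rows` is a hypothetical helper, not part of any data contract framework.

```python
# Hypothetical contract for a weather table: column name -> expected Python type.
SCHEMA = {"station_id": str, "temp_f": float}
# Hypothetical valid range, assuming temperatures are reported in Fahrenheit.
VALID_RANGES = {"temp_f": (-80.0, 140.0)}


def check_rows(rows):
    """Return a list of readable violations; an empty list means the rows pass."""
    problems = []
    for i, row in enumerate(rows):
        # Schema: the row must have exactly the agreed columns.
        if set(row) != set(SCHEMA):
            problems.append(f"row {i}: columns {sorted(row)} != {sorted(SCHEMA)}")
            continue
        # Schema: each column must hold the agreed type.
        for col, expected in SCHEMA.items():
            if not isinstance(row[col], expected):
                problems.append(f"row {i}: {col} is not {expected.__name__}")
        # Valid values: numeric columns must fall inside the agreed range.
        for col, (low, high) in VALID_RANGES.items():
            value = row[col]
            if isinstance(value, float) and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return problems


rows = [
    {"station_id": "A1", "temp_f": 71.5},   # passes
    {"station_id": "A2", "temp_f": 900.0},  # out of range
    {"station_id": "A3", "temp_c": 21.0},   # column renamed upstream
]
problems = check_rows(rows)
for problem in problems:
    print(problem)
```

A plain range check will not catch every silent unit change, since a value like 20 is plausible in both Fahrenheit and Celsius. That is one reason to state the unit explicitly in the contract rather than rely on the range alone.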

SLAs. This is super critical, especially for those of you creating very business-critical pieces of information. An SLA is around how often data changes. What's the expected update frequency of the data? What's the expected lag time of that data? It's really important to put that out because, let's say you're building a Shiny app, you're expecting it to be consuming daily data, and that data changes monthly. The people actually looking at that Shiny app might be taken by surprise there.
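An update-frequency expectation like this can be checked with a simple freshness test. The sketch below assumes a hypothetical 24-hour SLA and made-up timestamps; `is_fresh` and `MAX_AGE` are illustrative names, not part of any library.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: the table is refreshed at least once every 24 hours.
MAX_AGE = timedelta(hours=24)


def is_fresh(last_updated, now, max_age=MAX_AGE):
    """True if the most recent update falls within the agreed window."""
    return now - last_updated <= max_age


# Made-up timestamps for illustration.
now = datetime(2024, 10, 31, 9, 0, tzinfo=timezone.utc)
daily = datetime(2024, 10, 30, 23, 0, tzinfo=timezone.utc)  # 10 hours old
monthly = datetime(2024, 10, 1, 0, 0, tzinfo=timezone.utc)  # about 30 days old

print(is_fresh(daily, now), is_fresh(monthly, now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and makes the comparison time zone explicit.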

The interface of the data, also super important. Interface is: where's the data stored? Is it in a database? Is it in an S3 store? How do I get it? Any needs on query performance, that's super important. If you're building a machine learning model and you need data quickly, that should be something you build out in your data contract.

Compliance, also super important. Most of us are probably working in industries that are regulated or soon to be regulated. So does your data need to comply with any regulatory requirements? Who here, show of hands, who's in healthcare? HIPAA. Who here's in finance? PCI. Who here works with any sort of customer data? Most of us. GDPR. Almost all data these days is going to have some compliance behind it. So it's really important that you make sure that is noted in your data contract, and that you know what those compliance effects are going to be. If data gets deleted because it needs to be deleted for compliance, that's something you should be keeping track of, and you should know that your outputs can deal with it.

And then probably most critical is interoperability. One of the things to look out for as you're starting to build data contracts into your workflow is: can that data contract be consumed by people and machines? Can your data outputs consume the output of your data contract? Can they tell if a data contract's failing? That can help you create failure states that your users expect, and help you make sure that your outputs are always giving users a good experience.
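One way to picture a contract that machines can consume: the contract check publishes a small status document, and the downstream app reads it to decide which failure state to show. This is a sketch under assumptions. The JSON shape (`contract`, `passed`, `failures`) and the `banner_for` helper are invented for this example and do not come from any specific tool.

```python
import json

# Hypothetical status a contract check might publish alongside the data.
status_json = (
    '{"contract": "weather_daily", "passed": false, '
    '"failures": ["temp_f out of range"]}'
)


def banner_for(status_text):
    """Turn a machine-readable contract status into a message for app users."""
    status = json.loads(status_text)
    if status["passed"]:
        return None  # the data met its contract, so there is nothing to show
    reasons = "; ".join(status["failures"])
    return (
        f"Data for {status['contract']} failed its contract ({reasons}). "
        "Showing the last good data."
    )


print(banner_for(status_json))
```

The point of the design is that the app never has to guess: a failing contract becomes a known, user-facing state instead of a broken dashboard.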

Who should build data contracts?

Really quick question. Now that I've kind of thrown some information at you about data contracts between consumers and producers, who thinks a data contract should be built by the data consumer? A few hands. Who thinks it should be built by the data producer? A few more hands. Who thinks it should be both? The correct answer. The answer is both.

Data contracts should always be built collaboratively. If they're not, if you join a company where data contracts are being built by one or the other, that's a red flag. Data contracts should be agreements. They should be mutually agreed upon. If they're being built by just the data consumers, that's a big organizational issue. I always think of that as a cry for help. That means that we have no idea what our data engineering team is doing, and we're not having a great time. If they're being built by just the producers, that again is an organizational problem. That's a slightly different problem. Usually that means that your data products probably aren't meeting the needs of the consumer.

Data contract technology

Quickly, let's talk technology. Most of you are probably familiar with data contract technology. Who here has heard of Great Expectations? Decent number of people. One thing I want to really stress here is that data contract technology is super helpful, but it's not everything. Thinking back to the slides we did before, data contract technology is really great at enforcing expectations on data. It's really good at enforcing things like schema, valid values, and SLAs. It's good at informing stakeholders, but it is not a substitute for defined ownership and responsibility. Most of the time, data contract tech does not do a very good job of putting together defined ownership and responsibility. Data contract technology, without that ownership and responsibility, is just really complicated data engineering integration tests.

I really think of data contract technology in two main buckets. First off, you've got the existing frameworks. That's your Great Expectations, your soda.io. These are great frameworks. They come with a lot of advantages. You can really easily plug them into your data stores. Some of them even come as SaaS products. But they're not without their disadvantages. They typically are single language. If you're an R team working with a Python data engineering team, you may have to learn Python, and a lot of these products are death by a million YAMLs. And again, they don't encode that ownership and responsibility.

Literate data contracts

The other bucket, what I really want to take some time here to look at, is the concept of literate data contracts. This is something that I've introduced to a couple of different teams with some pretty good success. If you scan that QR code down there, it'll take you to a repo on my GitHub that shows an example of a literate data contract in Quarto. A literate data contract, you can think of this almost like an extension of literate programming. What it does is it uses all of the code that builds a great data contract, that enforces all those expectations. And because it's built into a literate programming framework, it gives you all the advantages of a literate programming framework. You can use words. You can use words to incorporate and encode that ownership and responsibility into the actual data contract itself.

And not just that, because those words are usually going to be somewhere like a Git repository, you can use that Git repository's tools to actually track changes. First of all, you can track changes, which is great. Second of all, you can use that change tracking to really build consensus around a specific data contract. If you go to that repo that's linked there, I put an example there of a pull request to a data contract. Git pull requests are one of the most fantastic tools for building consensus around a data contract, for tracking the changes you're putting in there, and for making sure that everyone's on the same page, that you're building that organizational agreement, that everybody's bought into the data contracts concept. But, just like everything else, it's not without disadvantages. The biggest disadvantage to a literate programming framework is that it's more words. More words, more work.


Adopting data contracts in your organization

What I really want to stress here is that there is no right answer. If you're looking at adopting data contracts in your organization, pick the technology that works best for you. I'm hoping that this will give you some tools to really think about how to use data contracts, how to incorporate them, and how to think about different technologies, and make sure you're picking the best one. So now, hopefully all of you are bought in. We're all going to start data contracts at our companies next week. I just wanted to give you a couple of tools that I've found through adopting data contracts at a couple of different companies.

First up, expect resistance. Most of the time when I've been delivering data contracts, it has been as a data consumer, as a data science team lead. When you're creating data contracts, or trying to create data contracts as a data consumer, what you're really trying to do there is you're trying to move responsibility away from the organization. You're trying to move responsibility upstream. There's no way around that. You're trying to put more responsibility for their outputs onto your data producing teams. And that's often an uncomfortable process. I have received a lot of messages just like these, especially the bottom right one. You want me to do more now?

So one thing I'd suggest is really leading with empathy. Those of you that were in Hadley's workshop, Hadley's number one recommendation there was meet people. Meet your team. Get to know them. If you're in the office, take them to lunch. If you're remote, Zoom beers, meet and greets, any excuse you can find to meet that team, to meet your data producers, is fantastic. And learn about their pain points. That's one of the key things. If you can show that empathy towards their pain and how data contracts might help, that can be huge. Be ready. If you ask them what their biggest pain point is, it might be you. Been there. Take the note. Take the note and think about how what you do with data contracts can help with that.

So just a couple of general themes to think about as you're thinking about adopting data contracts. First off, articulate value. Business value is super important. Data engineers are typically very divorced from business value. So the more you can articulate that, the better. And keep it simple. Your goal here is to save time and build credibility. How much validation do you really need? Think about minimum viable validation. Do you need a data contract on every single table? Probably not. Do you need a data contract to cover every column of every table? Probably not. And think about how you can simplify your data producers' lives. And with that, thank you. If you scan this QR code, it takes you to a repo with a lot of materials on data contracts.

Q&A

Thank you, Nick. I know we're close to time, so I will try to get one quick question from our Slido. So where should data contracts live? Is GitHub a reasonable expectation for non-technical consumers?

I think it is. I definitely encourage you to check out the literate data contracts framework. I've had really good luck with that. And getting non-technical consumers through a few-minute course on Git and getting them writing in that is not too difficult. I've had non-technical consumers really get into that and have really good luck.

Well, thank you for the presentation, sharing your repo. I'm excited to dig into it soon. Yeah, thanks.