From Chaos to Clarity: Implementing Effective Data Stewardship

video
May 31, 2024
58:32

Transcript

This transcript was generated automatically and may contain errors.

Hi all, thank you so much for joining us today for our data stewardship discussion. In talking with customers and the community at Data Science Hangouts, we hear a few themes that come up in the questions asked. And regardless of our industries or company sizes, many of us are tackling common problems independently.

So I always love getting the opportunity to bring the community together to learn from each other and talk about what's working, what's not, and where we go from here. So today we'll kick off the conversation around the importance of data stewardship at the individual level and what that even means with two community leaders who will draw from their own experience first.

I'm so happy to be joined by both Jamie Warner, Managing Director of Data Science and Pricing at Plymouth Rock Assurance, and Dan Boisvert, Head of Data Stewardship at Biogen. There were so many awesome questions submitted ahead of time when you all RSVPed, so I've shared these with Jamie and Dan. And after they share a bit about what they're doing in their own organization and from their own experience, we'll jump over to some more of a Q&A open conversation.

So please feel free to also ask questions in the chat as well and share your own thoughts too. We wanna hear from all of you today.

I am also learning from the Hangouts that people really enjoy connecting with other attendees in the chat. So if you are interested in connecting with others, I wanna encourage you to say hello in the chat, briefly introduce yourself, your role or your base, and something you do for fun.

You will notice this session is being recorded, so we'll share it up to the Posit YouTube within the next week. I can also email the recording out to you.

Introductions

With all that, thank you so much for joining us today. I'd love to turn it over to Dan and Jamie to introduce themselves.

Yeah, thanks for having me. My name is Dan Boisvert. I head up a group called Data Stewardship at Biogen. Our group's responsible mainly for our clinical trial data, thinking about how we use and reuse our data. We also look at data anonymization, external data sharing, imaging ingestion, and data standardization. I also look a lot at data strategy and work on a project looking at data strategy across research and development.

Sure, I'm Jamie Warner. I lead data science pricing for Plymouth Rock Assurance at home, which includes kind of the implementation of our data science models, as well as a lot of our cloud migration, which is pretty exciting. And I also really love doing this stuff after work. So I teach at Northeastern in the data analytics as well as the HR analytics programs. But super passionate about this. I wish we were big enough to have someone like Dan as a data steward, but I think it'll be nice to see the balance of a company with that infrastructure versus kind of where we have to just pull it from other places.

What is data stewardship?

Yeah, and just a quick disclaimer here is that views that I express are my own and don't necessarily represent those of my company. But I want to talk a little bit about data stewardship here.

When I talk about data stewardship, people often ask, is this data governance? Is this data stewardship? What's the difference here? And my joke that I always say is that no one likes data governance. Whether you're an analyst or an executive, everyone thinks it's overkill, it's overhead. We don't need that. Why are you telling me what to do?

So very quickly in my journey, I said, we need to change how we think about this and really think about this as the individual owners of the data, individual people who work with the data and how we steward our own information. So this change to data stewardship is really with that in mind.

Internally at Biogen, I run a community of practice around data stewardship. And we did a crowdsourcing exercise last year to come up with what are the best practices. And so what I have on the screen is the best practices that came out of that work.

I'll start on the left, which is create data that will be used. I know this sounds kind of obvious, but it's really important to know that whatever you're creating as a data scientist or whatever as an analyst is going to be used later on. So you want to leave that in a state where it can be picked up and used. So you want to use well-defined standards, templates, metadata to make sure that whatever you're creating out of the data is able to be picked up and used later on.

From there, I can go underneath, which is have clear roles and responsibilities. There's always a little bit of question, a little bit of finger pointing. Is this your job? Is this my job? Is this their job? Who's doing these things? Who's accountable for data access? Who's accountable for standardization of data? The more clarity you can add here, the better.

So I think this is worthwhile to go through and make sure that nothing falls in between the cracks. And then you need to make sure that it's actually resourced because this is real work. This is hard. This is not someone's passion project. You want to make sure that this is resourced appropriately.

With that, I can go into the middle, which is that data should be protected and compliant. So this is thinking a little bit more about data defense. We need to make sure that the data is well-protected wherever it's used. And I think this is to make sure, at Biogen we work with highly sensitive data, but all data that you're going to be working with is going to have some level of sensitivity to it. And you just want to make sure that these permission controls persist as you use the data and perhaps move the data across the org.

The one under it is about data compliance, GDPR, and then other data privacy protection regulations have come into play. You really need to think about these. How can I use this data? If this is not familiar to you, get familiar with it quickly, find the person stewarding the data and understand it from them. But there are also more contractual data use agreements that come into play that can really change how data can and cannot be used.

I love tying that one back to your number one, right? Because if you're creating data that's going to be used, it's a lot easier to protect and make sure you're compliant on smaller volumes of data. So if you know exactly what it's being used for and why you're using it, not just storing data to store data, it's a lot easier to follow these other guidelines and make sure that you have protected data, especially around some of the more sensitive data. I know you guys do a lot of medical-related data. We have, in insurance, a lot of private PII data, which is personally identifiable information. And so we really have to be careful around it. And one of the ways to be careful is not to store the things you don't need.

Yeah, that's a great point. And I think that leads nicely into the stuff on the right, which is the single source of truth. You know, data should be shared in place and we should try to centralize it, not, like you're saying, have many copies of it all over the place, which is hard to protect and adds, you know, overhead onto it.

I do think single source of truth is a little bit of an aspirational goal. I think it's kind of our intention of where we're going, but there are, you know, there are technical limitations. There are departmental limitations that require different copies of data to be made. And there's just some sort of pragmatism that we have to take when we think about this.

And the last one is to consider producers and consumers of data. I do think as data users, we often think we're at the end of the line, like all the data gets generated, cleaned, created, and then it comes to us, and then we do something and that's it. But I think we should make sure that we know that we're in the middle of the line.

So like there's someone who produces something that we use, that we consume, but we produce something that someone else consumes. So I think when you understand this, that you're in the middle and not at the end, you start thinking about the data that you produce and how to make sure that it's used by the people downstream from you and how are you a better partner to what's being done upstream from you.

Yeah, I love that especially because I think people are gonna do what people are gonna do. So it's more, you know, more effective to meet them where they are sometimes and realize that they're actually gonna do these things with your data and figure out the best way you can manage that and keep aware of it rather than just say no, and then they go and they do it anyway. And in a less effective, less documented way.

So I also love that kind of idea of creating like the clear roles and responsibilities, but also creating like a toolkit around like, let's make sure this is documented. Let's make it easy for you to do this work. Let's make it easy for you to tell me what you're doing or make sure that I understand exactly where this data is going and how you're using it. So I can enable you rather than kind of trying to cut it off, which has opposite effects frequently.

Again, I think there's a lot about what's possible when you know what's there. And if you don't know what's there, you like abnormally constrict yourself. So just even knowing stuff is there, knowing who your consumers are, knowing who your producers are, having conversations with them, starting to learn about what's there, what information they have, how you can start accessing it. I think that starts breeding some innovative thought.

Communicating standards and managing resistance to change

But one of the first questions was, how do you effectively communicate data management standards to your team? And how do you engage with teams that are resistant to change?

Sure. I think the easiest way to do that is to make the change as easy as possible for them, right? And the way that I like to do that is make a toolkit or make a framework, right? If you have a really easy tool to document things, giving them that, even if it's something as simple as an Excel sheet, kind of meeting them where they are versus getting the best possible tool.

But in a lot of ways, to me, the reason people resist change in a lot of circumstances seems to not be because they don't want to change or they don't see the value. It seems to be a lot more around, I have a lot of other things to do and this is the worst thing on my list and it's not gonna help me, it's gonna help someone else later. So giving them tools that make it really easy for them to document or make it really easy for them to share or even if it's just having a common storage location where you can go find it later and deal with it, I think really helps along the process.

And that to me has gotten a lot further than trying to really get people to rethink the way that they work, which is a much longer term, harder thing to transition through. But you can do it a lot, especially when there's new hires, kind of teaching them that as you bring them in.

Oh, that's great. I think you're really hitting on the key points here. Often when you see resistance to change, people don't know what you're asking them to do. So I think the stuff that you're saying is like, just do this, like you give them something, just fill out this form. And that is much, much easier than saying, be a better data steward, right? And that does not evoke change.

But as I get further in my career, I think about how trying to make change in an organization is very, very hard. So let's make sure that when we are making change, there is a value statement behind it. And that we keep bringing that back up to help people really tie what they're doing back to what value they're creating out of it.

There is this change management framework that I really like. It's from this book called Switch by Chip and Dan Heath. And they have a nice picture about how change happens, which is you have an elephant, and then you have someone riding the elephant, and then the elephant is on a path.

So the elephant is the emotional mind. This is like how you feel about things. And the rider is your rational mind. And then the path is how you actually get there. So I think what Jamie was talking about is the path. You gotta make the path really easy for the elephant to plod along on.

And you can talk to the rational mind, saying, these things are important. This is a GDPR requirement. You have to do this. I'll show you the statistics of why this is more effective. But the rider on the elephant, he gets tired of pulling the elephant at some point. And eventually the elephant just goes down the path. So you do have to affect that emotional mind too, and then make the path really easy. So your best bet is when the elephant just goes and the path is just obvious for them.

But when you're trying to steer the elephant into a new direction, you have to think about the rational, the emotional, and the path, and try to move that forward.

Showing the pain and using regulation as a driver

I mean, I think one of the biggest data transformations I've ever seen came from California's right to be forgotten law, where companies had to be able to get rid of all record of someone. You could go and you could request to the company in California, hey, I don't wanna be in your data anymore. And one of the big challenges for companies was they actually couldn't connect a record across all their different systems.

And that was a financial and otherwise legal incentive where they actually had to make these huge transformations of the data systems to make that feasible. Now, those transformations actually enabled all their systems. It enables things like AI. If you wanna do that, you have to have your systems connected effectively. But none of those things are usually incentive enough to go through one of those massive transformations. They're expensive. They're really hard to do. The historic data is really hard to mess with.

And so I think finding some of those legal things that are starting to come out now, especially around all this governance, I know a lot of folks are like, oh no, governance, but I'm actually really excited about some of the stuff that's coming out because I think it's a great way to give an incentive. People seem really excited about investing in data science or investing in AI, but they seem much less excited in investing in the core systems that actually make that all possible, which are much more tedious, much more expensive and are what is actually gonna push us forward.

Yeah, no, this is great. I do think the regulations push us forward, and there are a lot of complaints for sure when these regulations come out that we have to abide by, but I do think that they move us forward in a good direction. And I think they do have that strategic mindset in place like you're saying, where this actually helped us in the long term because it connected all these systems where we never would have been in this place before.

One of the things in that change management thing is you need to solve a problem. You need to solve a problem that actually exists and you need to solve a problem that is recognized to exist. So GDPR is a great one. Here's this new law. Are you compliant with GDPR? Have we ever heard of GDPR before? No. So it's a clear problem to be able to go implement it. And I think as you look across all different problems like this, the more you can bring visibility to the issue, the more you can start to feel the pain a little and show the pain to the organization. That's where you start to get people to adjust their mindset and say, this is a problem, I see this as a problem. We need to change this. And you can start moving the organization forward like that.

Minimum metadata standards for datasets

Hey everybody, thanks a lot. I'm Tim York. I'm at Virginia Commonwealth University, a professor there. I was wondering, we run into this problem a lot. We store a lot of data. Are there guidelines or standards for the minimally reported information that must accompany a data set? Are there accepted standards out there that are published?

I don't know of one. That would be amazing, right?

I think we're all trying to figure out what good is, right? What's the minimum good that we should just abide by? I'd kind of ask that back to everyone on the call here.

I would say maybe just quickly, since you mentioned you're at a university, what your IRB has, that's a really good place. So the Institutional Review Board for research. For folks not used to university systems, those are the folks that approve research. They also typically have some data guidelines around what you're allowed to store, how you're allowed to store it, things like that. So that's a good place to start.

In insurance, we have really strict regs. So by industry, like what we're allowed to store about people, how long we're allowed to use it, things like that.

Yeah. I mean, I think a couple of things. Like who's the owner? Who should questions be directed to? Who has permission? Like you were talking about, who has ownership of the data? Who has permissions to do X, Y, and Z? Who is ultimately responsible for the data? I guess those are the types of metadata that should always travel with the data.

I think there's actually reasonably good research out on those sorts of high-level topics. And a few of the different data frameworks that exist have those. I think for a while there were the five Vs of data, but that's kind of moved on. But I like to think about, where did it come from? What was its intended use? So is it primary or secondary initially? And then who created it, and at what time did they create it? What time scale it goes over is a really big one, because a lot of times you can kind of think about how things will be changing over time.

They do have broad guidelines for what sort of information at that high level you should be keeping. But really time, location, and any sort of discrepancies or known potential issues. And then the reason you sourced it is the biggest one because, going back to Dan's slide, I think it's really interesting to say we have a single source of truth, because at a lot of companies, different departments will each consider a different calculation to be that same source of truth.

So if you think about like finance and actuarial, or maybe even within finance, they'll have a few different definitions of the same term sometimes. And so if you don't know why that data was being used originally, you don't know which definition it's going to be. And I presume that's for your type of research as well if another professor wants to go pick it up.

The only one I would add to Jamie's list there is who can use it. You know, I think sometimes we have data where anyone can use it. Sometimes we have data that has a very limited data use agreement. And so we need to restrict it. And so a little bit of that knowledge upfront, we were just talking about this, like having some sort of data list that has, you know, what the data set is, who to contact about it, what the data set means, where it is, and who can use it, is a pretty good first pass.
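The "data list" described here could be sketched as a small record type. This is only an illustration under assumptions: the class name `DatasetEntry`, its field names, and the example values are hypothetical and were not shown in the session.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """One row of a hypothetical data list; all field names are illustrative."""
    name: str          # what the data set is
    contact: str       # who to contact about it
    description: str   # what the data set means
    location: str      # where it lives
    # Groups allowed to use the data; an empty set is treated as "anyone can use it".
    allowed_groups: frozenset = field(default_factory=frozenset)

    def can_use(self, group: str) -> bool:
        """Return True if the group may use this data set."""
        return not self.allowed_groups or group in self.allowed_groups


# Made-up example entries: one restricted, one open.
restricted = DatasetEntry(
    name="trial_visits",
    contact="steward@example.com",
    description="One row per participant visit",
    location="warehouse/clinical/trial_visits",
    allowed_groups=frozenset({"biostatistics"}),
)
open_data = DatasetEntry(
    name="site_list",
    contact="steward@example.com",
    description="One row per study site",
    location="warehouse/reference/site_list",
)
```

Even a spreadsheet with these five columns would serve the same purpose; the point is that the restriction travels with the description of the data.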

Yeah, so just to do a quick intro, I head Enterprise Data and AI Services. I report to the CTO. I'm on the IT side, so responsible for anything about data and AI within a global organization. I'm in insurance. And so I think probably one of my favorite answers right now is, well, it depends, because it really depends on where you are. Are you a global company? Are you a regional company? Are you in a specific state that has certain laws or whatnot? Also, what industry you're in.

In pharmaceuticals, I used to work for a pharmaceutical company as well, and we had a lot more rules on what's needed to define a data set versus when I was at a college. So it very much depends on where you are. So that's why you won't really find a general set of requirements, but rather suggestions. And as you see here, everyone's commenting that here are the things you'll look at and whatnot, but a lot of this all comes from a data governance practice. So ideally there should be, and I say ideally because I don't think it happens everywhere, a data governance group of folks within an organization, a community of people that help define what this is.

Data governance vs. data enablement

So my strong view here is that there's no one on this call who likes data governance. Even I think when you talk to the people who do data governance professionally, they're like, yeah, but you know, it's not really, like they try to talk themselves out of it. There's something with the word governance that just holds a lot of baggage for people. Oh, you're gonna tell me how to do my work? I'm the analyst. I'm the data scientist here. You're gonna tell me what I can do with my data?

So I saw that really early in this journey, and I immediately rebranded it to data stewardship because of this baggage that's there. And I think what's happening in our data ecosystems is that we used to be all centralized, right? There's like centralized, centralized, centralized. And then you could reasonably have a centralized data governance over it. But what happens over time is it all spreads out, right? And now we're managing the mesh, right? And when you manage the mesh, you need people at the individual points of it to manage it.

And so I think that's where you get more of a view of a data stewardship. You're kind of accountable for your little node of the mesh. So if we were to go that further and call it data enablement, I guess my point here is we're on that journey for sure. And data enablement to me is a little too soft of a term, but I think it can be something that is like pointing us towards we are generating value. I think that's what it's trying to say. We are generating value through this work. This is not busy work just to overly restrict you.

Yeah, I would agree. And I think it depends on who I'm talking to. So for senior leadership, I think the idea of data governance is really just it's expensive, it gets in the way. And so data enablement to me is a great little advertising catchphrase to get us where we need to go and what we need investment wise.

But also I do think it's important to have really frank governance conversations with data scientists, data engineers, et cetera, and use that terminology to be really frank about the fact that this is something that's important that we have to care about. And this is something that is critical to the work that you do.

And so to me, the branding is more how we fund it, how we enable it. And the actual terminology is, I think sometimes we don't wanna be like Dan said, soft when we get down to actually brass tacks of fixing the data and cleaning the data and understanding the data. And I think a lot of that I feel like can tie back to what makes a good data engineer, data scientist, which is curiosity in my opinion.

And so if you're curious about your data, you're actually going to be doing data governance as you go through it, right? You're gonna be asking questions about what looks weird, how is this defined? Do I understand this? And so you being really good at the governance piece is you being really good at your piece of the work as well.

Single source of truth and managing multiple definitions

Another question that was asked ahead of time was, it's easy in theory that the organization should have a single source of truth of data, but in reality, each org has their own definition of measures and metrics and what to track under the same name. How do we overcome that?

So I think this is actually something that we should embrace a little more. And I don't know, Dan, if you're gonna knock my lights out for saying that, but it is really important that different definitions exist in different environments. The question is, how do we tie it all together? And so I think that organizations where they say, you can only get data from here and you can only get it in this way, end up with a lot of shadow IT stuff.

If instead we say, we know we have different definitions for this, how do we integrate this into our systems? That ends up with a much better outcome where people actually get what they need, which goes back to that first point that Dan had on his first slide. Like, why are we creating this resource to begin with? And I think sometimes in the act of governing and setting things up and doing all this work, we forget why we started pulling the data together in the first place, which is usability and functionality.

And so in this, I would encourage that and I would encourage them to write the definition down and then we start having our governance, right? So you want data a certain way? Yeah, let's do it. Let's write it down. Let's get it framed. Let's make sure we have really clear documentation and definitions. And then let's move forward with your department has a slightly different definition than this department and that's okay. It's in a different layer or it's in a different view or whatever, however you want to set it up. So I actually really think this is a good thing and something that should be encouraged because it is the reality of the way we work.
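One way to picture "write the definition down" is a shared glossary keyed by department and term, which makes it easy to see where the same name carries different meanings. This is a rough sketch under assumptions: the glossary structure, the function name, and the wording of every definition are invented for illustration.

```python
from collections import defaultdict

# Hypothetical glossary: (department, term) -> written definition.
# The definitions themselves are made up for this example.
GLOSSARY = {
    ("finance", "active_policy"): "policy with a premium booked in the current month",
    ("actuarial", "active_policy"): "policy with coverage in force on the valuation date",
    ("finance", "policy_id"): "unique identifier assigned at issuance",
    ("actuarial", "policy_id"): "unique identifier assigned at issuance",
}


def terms_with_multiple_definitions(glossary):
    """Return {term: {department: definition}} for terms whose definitions differ."""
    by_term = defaultdict(dict)
    for (department, term), definition in glossary.items():
        by_term[term][department] = definition
    return {
        term: departments
        for term, departments in by_term.items()
        if len(set(departments.values())) > 1
    }


# Terms that need an explicit "which definition?" note wherever they are used.
overlaps = terms_with_multiple_definitions(GLOSSARY)
```

The differing definitions are not treated as errors here; the list only flags where a separate layer or view, and clear documentation, are needed.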

Well, I think Jamie said it really well, but there's just some reality here. People are going to go do their jobs whether you like it or not. And so I guess I'm a big proponent of not creating a process where the only way to get your job done is to break the process, right? So you need to kind of live in reality here of, okay, so we're going to create these different things. These different things are going to exist. How do we just know that they all exist? How do we protect them all? There is a little bit of overhead every time you do create something.

So, and there's also a little bit of risk if we have two things that look the same but they're a little bit different, and people think that they're the same. That's where we could get into some kind of trickier situations that are hard to deal with.

I'm thinking back, this is a while now, but I did some work about looking at Semantic Web and they have this idea of collaboration without coordination. So there's a way to collaborate and kind of use the same data, but kind of at a meta level, connect it all together. And I think that's a really nice way to think about it because people will kind of be pulling the data out into their individual nodes, into their individual pockets, but can we, at this higher meta or semantic level, connect that data together so that we can manage it a little bit better?

Yeah, so when I started at this organization, we had just moved off of the on-prem world, I mean, moving onto cloud services. And so there was no defined enterprise data architecture. It was just a glop of different applications, and they had some integration pipelines to connect them.

And so the way that we moved forward was coming up, well, of course, with ethics of leadership and support from the engineers, we came up with a defined enterprise data architecture and a strategy on how to get there. And you build that roadmap and include everyone in that to get alignment. That's how you move towards something like that. So the one I mentioned here was a medallion architecture. So we do use it in our books and daily.

And so we show how it works from a general perspective, meaning that if I have external data, or data I want to bring in from applications, well, you have a raw layer. That's where everything comes in. No one manipulates it, no one does anything to it. Then you mature it and bring it to a common data model. So again, I think you mentioned earlier that every data set should have a purpose or a reason to use it. And so the reason for us in this common data model is to provide the business with a data set that has generally all use cases available to it, because it's the common data model: not anything for a specific use, but specific to the commonality that's out there.

Then you build separate data sets coming from that to be the ones that are specific for a given report or for given analytics purposes or whatnot. So then you can track down the lineage to the source of truth, which is our raw layer. So if someone's able to articulate that and be able to explain this at many different levels, then you can get this body of people who kind of support it and want to go in that direction.
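The layering described here (a raw layer, a common data model, then purpose-specific data sets, each traceable back to raw) might be sketched as follows. This is a toy model under assumptions: the `Dataset` class, the layer labels, and the data set names are hypothetical and do not reflect the speaker's actual platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    """A data set that remembers which data set it was derived from."""
    name: str
    layer: str                          # "raw", "common", or "purpose" (illustrative labels)
    parent: Optional["Dataset"] = None  # None for raw data, which has no upstream data set


def lineage(dataset: Dataset) -> List[str]:
    """Walk parent links from a data set back to the raw layer."""
    chain = []
    current: Optional[Dataset] = dataset
    while current is not None:
        chain.append(f"{current.layer}:{current.name}")
        current = current.parent
    return chain


# Made-up example: raw export -> common data model -> report-specific data set.
raw_claims = Dataset("claims_export", "raw")
common_claims = Dataset("claims", "common", parent=raw_claims)
loss_report = Dataset("quarterly_loss_report", "purpose", parent=common_claims)

trace = lineage(loss_report)
```

Here `trace` lists the report data set first and the raw export last, which is the "track down the lineage to the source of truth" step in miniature.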

Reducing duplicate data stewardship

I would first ask why. To me, that's always the first question when you see something like that: why is it happening? And if it's actually duplicative, I think that's actually more of an HR problem you're going to have to deal with, right? Do these two roles still need to exist? And are these legacy processes that are just happening because somebody needed them at one point and you can't combine them across? So to me, sometimes it's not a data problem, it's an organizational structure problem.

And we see that a lot with companies, especially insurers. We buy books of business and we absorb different pieces of companies. And so you'll realize suddenly, oh, this person pulls the same report as this person. So, do they both need to pull that report is my first question, before I even try to treat it as a data-focused question.

The way my brain went on this one is, in big organizations, the right hand and the left hand don't always know what the other is doing, right? So I think this could easily happen. We see it a lot: people in this group are doing this thing, and then people in this other group, who never talk to each other, are doing this other thing, and it turns out the first eight steps of it are exactly the same.

So here I'm going to say, talk. Get some organization around it. You should be able to at least grab a list of who has access to the data, right? And maybe just have those people come together a little bit once a month, once a quarter, to be like, hey, this is what we do with the data. What do you do with the data? And just start getting people together a little bit, because I do believe that no one wants to do duplicative work.
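The "grab a list of who has access" step could look something like the sketch below, which groups an access list by data set to find the teams worth bringing together. The shape of the access list and all team and data set names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical access list of (team, data set) pairs, e.g. exported from a permissions system.
ACCESS = [
    ("pricing", "policy_history"),
    ("claims_analytics", "policy_history"),
    ("pricing", "rate_tables"),
    ("marketing", "campaign_results"),
]


def teams_sharing_data(access_list):
    """Return {data set: sorted teams} for data sets that more than one team can access."""
    teams_by_dataset = defaultdict(set)
    for team, dataset in access_list:
        teams_by_dataset[dataset].add(team)
    return {
        dataset: sorted(teams)
        for dataset, teams in teams_by_dataset.items()
        if len(teams) > 1
    }


# Each entry is a candidate group for a monthly or quarterly "what do you do with this data?" chat.
shared = teams_sharing_data(ACCESS)
```

Shared access does not prove the work is duplicated; it only suggests who should be talking to each other.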

I feel like, as software engineers growing up through this, we're like, if someone's already done it, I'm just going to take it and then go from there, right? I don't need to redo these calculations that are already done. So I think there's some opportunity for collaboration there, and then just see what makes sense. Because sometimes it is duplicative and it can be shut off like Jamie's saying. And sometimes it's not really, and it doesn't work just because of the reality of our systems and how we work.

Yeah, that was going to be my own follow-up question there. It's like, how do you know when somebody else is doing the same work as you somewhere else in the organization? Reach out. I feel like this is kind of what you do with these meetups: if you're in your organization and you're a data scientist, you should know the other data scientists. If you do BI work, you should know the other people that do BI. If nothing else, in case your department gets shuttered and you need somewhere else to go at a minimum, which we're seeing a lot in the industry lately.

But at a maximum, like knowing what the other people that do your type of role across the company do