From Chaos to Clarity: Implementing Effective Data Stewardship

Transcript

This transcript was generated automatically and may contain errors.

Hi all, thank you so much for joining us today for our data stewardship discussion. In talking with customers and the community at Data Science Hangouts, we hear a few themes that come up in the questions asked. And regardless of our industries or company sizes, many of us are tackling common problems independently.

So I always love getting the opportunity to bring the community together to learn from each other and talk about what's working, what's not, and where we go from here. So today we'll kick off the conversation around the importance of data stewardship at the individual level and what that even means with two community leaders who will draw from their own experience first.

I'm so happy to be joined by both Jamie Warner, Managing Director of Data Science and Pricing at Plymouth Rock Assurance, and Dan Boisvert, Head of Data Stewardship at Biogen. There were so many awesome questions submitted ahead of time when you all RSVPed, so I've shared these with Jamie and Dan. And after they share a bit about what they're doing in their own organizations and from their own experience, we'll jump over to more of an open Q&A conversation.

So please feel free to ask questions in the chat and share your own thoughts too. We wanna hear from all of you today.

I am also learning from the Hangouts that people really enjoy connecting with other attendees in the chat. So if you are interested in connecting with others, I wanna encourage you to say hello in the chat, briefly introduce yourself, your role, where you're based, and something you do for fun.

You will notice this session is being recorded, so we'll share it up to the Posit YouTube within the next week. I can also email the recording out to you.

Introductions

With all that, thank you so much for joining us today. I'd love to turn it over to Dan and Jamie to introduce themselves.

Yeah, thanks for having me. My name is Dan Boisvert. I head up a group called Data Stewardship at Biogen. Our group's responsible mainly for our clinical trial data, thinking about how we use and reuse our data. We also look at data anonymization, external data sharing, imaging ingestion, and data standardization. I also look a lot at data strategy and work on a project looking at data strategy across research and development.

Sure, I'm Jamie Warner. I lead data science and pricing for Plymouth Rock Assurance at home, which includes kind of the implementation of our data science models, as well as a lot of our cloud migration, which is pretty exciting. And I also really love doing this stuff after work. So I teach at Northeastern in the data analytics as well as the HR analytics programs. But super passionate about this. I wish we were big enough to have someone like Dan as a data steward, but I think it'll be nice to see the balance of a company with that infrastructure versus kind of where we have to just pull it from other places.

What is data stewardship?

Yeah, and just a quick disclaimer here: the views that I express are my own and don't necessarily represent those of my company. But I want to talk a little bit about data stewardship here.

When I talk about data stewardship, people often ask, is this data governance? Is this data stewardship? What's the difference here? And my joke that I always say is that no one likes data governance. Whether you're an analyst or an executive, everyone thinks it's overkill, it's overhead. We don't need that. Why are you telling me what to do?

So very quickly in my journey, I said, we need to change how we think about this and really think about this as the individual owners of the data, individual people who work with the data and how we steward our own information. So this change to data stewardship is really with that in mind.

Internally at Biogen, I run a community of practice around data stewardship. And we did a crowdsourcing exercise last year to come up with what are the best practices. And so what I have on the screen is the best practices that came out of that work.

I'll start on the left, which is create data that will be used. I know this sounds kind of obvious, but it's really important to know that whatever you're creating as a data scientist or whatever as an analyst is going to be used later on. So you want to leave that in a state where it can be picked up and used. So you want to use well-defined standards, templates, metadata to make sure that whatever you're creating out of the data is able to be picked up and used later on.
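One way to picture "well-defined standards, templates, metadata" is a small check that a dataset carries a minimum set of metadata before it is handed on. This is a minimal sketch only: the field names in REQUIRED_FIELDS and the example values are hypothetical and are not a list given by the speakers.

```python
# Hypothetical minimum metadata fields (an assumption for illustration).
REQUIRED_FIELDS = ("title", "owner", "description", "standard")


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        # Treat a missing key, None, or whitespace-only text as missing.
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Example metadata with a blank description and no declared standard.
dataset_metadata = {
    "title": "Example study visits",
    "owner": "analytics-team",
    "description": "",
}

print(missing_metadata(dataset_metadata))  # ['description', 'standard']
```

A check like this could run wherever datasets are published, so gaps are caught by the producer rather than discovered by the next person who picks the data up.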

From there, I can go underneath, which is have clear roles and responsibilities. There's always a little bit of question, a little bit of finger pointing. Is this your job? Is this my job? Is this their job? Who's doing these things? Who's accountable for data access? Who's accountable for standardization of data? And the more clarity that you can add here, the better.

So I think this is worthwhile to go through and make sure that nothing falls in between the cracks. And then you need to make sure that it's actually resourced because this is real work. This is hard. This is not someone's passion project. You want to make sure that this is resourced appropriately.

With that, I can go into the middle, which is that data should be protected and compliant. So this is thinking a little bit more about data defense. We need to make sure that the data is well-protected wherever it's used. At Biogen we work with highly sensitive data, but all data that you're going to be working with is going to have some level of sensitivity to it. And you just want to make sure that these permission controls persist as you use the data and perhaps move the data across the org.
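The idea that permission controls should persist as data moves can be sketched by keeping the access policy attached to the dataset itself rather than to its location. This is an illustrative sketch, not how any particular platform does it: the Dataset class, the role names, and the store names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Dataset:
    name: str
    location: str
    allowed_roles: frozenset  # roles permitted to read this data


def move(dataset: Dataset, new_location: str) -> Dataset:
    """Relocate a dataset; the access policy travels with it unchanged."""
    return replace(dataset, location=new_location)


def can_read(dataset: Dataset, role: str) -> bool:
    """Check a role against the policy attached to the dataset."""
    return role in dataset.allowed_roles


# Hypothetical dataset readable only by one role.
trial = Dataset("trial_visits", "clinical-store", frozenset({"clinical_analyst"}))
moved = move(trial, "research-store")

print(can_read(moved, "clinical_analyst"))   # True
print(can_read(moved, "marketing_analyst"))  # False
```

Because the policy is a field on the dataset, a move cannot silently drop it, which is the property described above.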

The one under it is about data compliance. GDPR and other data privacy protection regulations have come into play. You really need to think about these. How can I use this data? If this is not familiar to you, get familiar with it quickly, find the person stewarding the data and understand it from them. But there are also more contractual data use agreements that come into play that can really change how data can and cannot be used.

I love tying that one back to your number one, right? Because if you're creating data that's going to be used, it's a lot easier to protect and make sure you're compliant on smaller volumes of data. So if you know exactly what it's being used for and why you're using it, not just like storing data to store data, it's a lot easier to follow these other guidelines and make sure that you have protected data, especially around like some of the more sensitive data. I know you guys do a lot of medical related data. We have, in insurance, a lot of like private PII data, which is personally identifiable information. And so we really have to be careful around it. And one of the ways to be careful is not to store the things you don't need.
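"Not storing the things you don't need" can be sketched as a filter applied before data is saved, keeping only the fields a stated purpose requires. This is a minimal sketch: the record fields and the NEEDED_FOR_PRICING list are made-up examples, not fields from either company.

```python
# Hypothetical fields needed for one stated purpose (an assumption).
NEEDED_FOR_PRICING = ("zip_code", "premium")


def minimize(records, needed_fields):
    """Keep only the needed fields in each record; drop everything else."""
    return [
        {key: record[key] for key in needed_fields if key in record}
        for record in records
    ]


# Made-up record that includes PII the pricing purpose does not need.
raw_records = [
    {"name": "A. Example", "ssn": "000-00-0000", "zip_code": "02110", "premium": 1200},
]

stored = minimize(raw_records, NEEDED_FOR_PRICING)
print(stored)  # [{'zip_code': '02110', 'premium': 1200}]
```

The sensitive fields never reach storage, so there is less to protect and less to account for under compliance rules.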

Yeah, that's a great point. And I think that leads nicely into the stuff on the right, which is the single source of truth. You know, data should be shared in place and centralized where possible, not, like you're saying, with many copies of it all over the place, which is hard to protect and adds, you know, overhead.

I do think single source of truth is a little bit of an aspirational goal. I think it's kind of our intention of where we're going, but there are, you know, there are technical limitations. There are departmental limitations that require different copies of data to be made. And there's just some sort of pragmatism that we have to take when we think about this.

And the last one is to consider producers and consumers of data. I do think as data users, we often think we're at the end of the line, like all the data gets generated, cleaned, created, and then it comes to us, and then we do something and that's it. But I think we should make sure that we know that we're in the middle of the line.

So like there's someone who produces something that we use, that we consume, but we produce something that someone else consumes. So I think when you understand this, that you're in the middle and not at the end, you start thinking about the data that you produce and how to make sure that it's used by the people downstream from you and how are you a better partner to what's being done upstream from you.
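Being "in the middle of the line" can be made concrete by recording what each dataset is built from, then looking both upstream and downstream of your own work. This is a sketch under assumptions: the dataset names and the BUILT_FROM mapping are hypothetical.

```python
# Hypothetical pipeline: each dataset maps to the datasets it is built from.
BUILT_FROM = {
    "cleaned_claims": ["raw_claims"],
    "pricing_features": ["cleaned_claims", "policy_table"],
    "pricing_report": ["pricing_features"],
}


def upstream(dataset):
    """Datasets that this dataset consumes directly."""
    return list(BUILT_FROM.get(dataset, []))


def downstream(dataset):
    """Datasets that consume this dataset directly."""
    return sorted(name for name, inputs in BUILT_FROM.items() if dataset in inputs)


print(upstream("pricing_features"))    # ['cleaned_claims', 'policy_table']
print(downstream("pricing_features"))  # ['pricing_report']
```

Even a small mapping like this shows who to talk to before changing a dataset: the producers of its inputs and the consumers of its outputs.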

Yeah, I love that especially because I think people are gonna do what people are gonna do. So it's more, you know, more effective to meet them where they are sometimes and realize that they're actually gonna do these things with your data and figure out the best way you can manage that and stay aware of it rather than just say no, and then they go and they do it anyway, in a less effective, less documented way.

So I also love that kind of idea of creating like the clear roles and responsibilities, but also creating like a toolkit around like, let's make sure this is documented. Let's make it easy for you to do this work. Let's make it easy for you to tell me what you're doing or make sure that I understand exactly where this data is going and how you're using it. So I can enable you rather than kind of trying to cut it off, which has opposite effects frequently.

Again, I think there's a lot about what's possible when you know what's there. And if you don't know what's there, you like abnormally constrict yourself. So just even knowing stuff is there, knowing who your consumers are, knowing who your producers are, having conversations with them, starting to learn about what's there, what information they have, how you can start accessing it. I think that starts breeding some innovative thought.
