Accountability and transparency in AI systems | Sam Tyner-Monroe @ DLA Piper | Data Science Hangout

Transcript#

This transcript was generated automatically and may contain errors.

Hi, everybody. Welcome to the Data Science Hangout. I'm Rachel Dempsey, and I lead Customer Marketing at Posit. Posit, if you have never heard of us before, is the open-source data science company building tools for the individual, team, and enterprise. The Hangout is our open space to hear what's going on in the world of data across different industries and connect with others facing similar things as you.

We get together here every Thursday at the same time, same place, except next week is the 4th of July holiday here in the U.S., so there will be no Hangout next week, just a reminder. But if you're watching this as a recording in the future and you want to join us live, there will be details to add it to your calendar below. I've started adding this because I know people really love connecting with other attendees, even if you're not jumping in live.

So if you are interested in connecting with others, I want to encourage you to say hi in the chat and introduce yourself, maybe share your LinkedIn if you want or your role, where you're based, something you do for fun. But at the Data Science Hangout, we're all dedicated to keeping this a friendly and welcoming space that you all have made it and love hearing from you no matter your years of experience, titles, industry, languages that you work in.

So there's three ways that you can ask questions or provide your own perspective. So first, you could raise your hand on Zoom. You can put questions in the Zoom chat and just put a little star next to it if you want me to read it or else I'll call on you to introduce yourself and jump in. And then lastly, we have a Slido link, which I'm sure Curtis has probably already shared here, where you can ask questions anonymously too.

I'm so excited to be joined by my co-host, Sam Tyner-Monroe, Managing Director of Responsible AI at DLA Piper. And Sam, I'd love to kick things off with having you introduce yourself and share a little bit about your role, but also something you like to do for fun.

So sort of regardless of historical status, or maybe the historical majority, or the historically overrepresented, or, um, overly advantaged, or, you know, on the other side, the historically disadvantaged groups, right, of any of those categories, they're all equally protected, sort of regardless of their historical status.

NIST AI Safety Institute Consortium

Yeah, absolutely. So we're, uh, very involved. We're one of, I think there's only two law firms. We are one of them, and there's 200 plus organizations that are a member of the NIST AI Safety Institute Consortium. And so the consortium is made up of several working groups, and those working groups are focused on various aspects of, um, generative AI and safety around that.

Yeah, the NIST has been very busy lately, so the consortium working groups are not super active at the moment. Um, you may have seen NIST recently put out, um, the gen AI, um, or generative AI risk management framework profile, um, so the risk management framework is something we work very closely with and it's put out a couple years ago, um, and then they recently added a piece for generative AI.

How sophisticated are clients about AI?

Yeah, um, lawyers get their information from the same places that, you know, we get our information from. And so typically, right, so if they're in their area of expertise, so they're looking at law review journals and they're looking at, you know, Westlaw and LexisNexis and all these legal databases and looking at court cases, right? So that's where they're getting their information from. But in terms of AI, they're getting it from where everybody else gets it from, right? You know, the news, um, Bloomberg Law, um, you know, various, you know, CNN.

And there are, there are definitely some, in terms of your question about, you know, the level of sophistication, it really runs the gamut. Um, so we do have several attorneys on our team who are very sophisticated and who were data scientists in a past life. And so they are totally on board. Like they get it, you know, a hundred percent. And then, yeah, we do have some people who say, oh, we want to do AI. Um, and then you sort of need to be like, okay, well, what's your use case? What data do you have available?

In a lot of ways, it's really about mutual understanding. And so really when someone says let's do AI, you kind of have to figure out what it is that they're actually asking for. Right. So that puts a lot of, you know, onus on you to really be curious and really understand and sort of try to come at it from their perspective and really understand their issues and be empathetic to their needs and so forth. And so, yeah, it's a combination of your own deep understanding and ability to explain things as well as, you know, your empathy for the other person.

Using Posit at DLA Piper

Yeah, absolutely. So we are using it, we use it to host tools. So we do a lot of, you know, repeated computations. So typically let's use, for example, when a company comes to us and says, hey, I want to test my system for racial bias. They might say, oh, but by the way, we don't collect race information from our customers. So we need you to figure it out for us. And so in that case, we'll use inference methodologies, specifically something called the Bayesian improved surname geocoding methodology, which we also augment based on other research using first name as well.

And so that we turn into a tool that draws on census data. And so we input, you know, first and last name, street address, city, state, and zip code. And that is able to give us some probabilities that according to census data and other data sources, these are probabilities that the person belongs to each of the different racial groups. And sort of on aggregate, we never use it on an individual, we always use it in the aggregate. And so we're comparing groups. So we're averaging these probabilities. And that methodology is highly accurate when you aggregate it and determining the racial makeup of a particular group.

So we have a tool that computes it for us that we use that is hosted on Posit Connect.

Testing for discrimination

Yeah. So I'll answer that question by talking about an example that we've done. There is a, and again, so everything that we do is driven by the legal sector. So if it's not illegal, we don't care. And so we're focusing on what is strictly legal that companies need to comply with. So a recent example is the legislation that was passed in Colorado a few years back, which said, you know, if you're using an external source of data and making any sort of decision about someone's insurance, whether that's health insurance, life insurance, if it's how the insurance product is being marketed, if it's how their claim is getting processed, if it's how they're being, like, if you're determining their premium price.

So if you're using any sort of external data and any sort of AI or automated decision-making system in that insurance context, in any sort of business aspect of insurance, you must not discriminate. And so something that we look at is what data is going into the model. They're not saying, they're not putting in race into that model because they don't have it. But because of how the, because of the world that we live in, there are patterns that algorithms are able to pick up on that result in disparate outcomes.

And so one example that I think is very widely known is an example of the credit scoring system. And insurance companies use credit-based insurance scores. So as a part of their business model, these credit broker companies, they sell data to companies that say, okay, based on, I have data of 200 million Americans. And I think that based on all this data that I have, this person has a higher mortality risk. And so therefore you should not offer them a life insurance policy at this price. You need to offer it at a higher price.

And so those are the kinds of things that we look at. And if you sort of take a step back, well, why is, what are the factors that are connected to somehow in this credit scoring system that are leading to that, right? So there's baseline levels of things like, because of just baseline levels of racial discrimination in this country, for example, certain groups have worse mortality outcomes, and it's directly because of discrimination. Now, the insurance company didn't do anything there. They're not causing that discrimination. They're not causing those health impacts, but those health impacts are still there.

And so what we can do is we can say, okay, well, we know that if we're using this credit score, it's going to affect this particular population in a worse way at the same value than at this other population. So what we can do is we can say, okay, maybe we'll lower this threshold, or we maybe will use a different variable.

Collecting race data and responsible AI

Yeah, there's lots of considerations that a company who is using their data to generate some sort of AI powered decision in terms of whether and how to collect that information from people. One is sort of the more traditional viewpoint or sort of old school viewpoint, which is I'm not going to collect the data period, then I don't have that data and no one can accuse me of any racial discrimination. That's old fashioned. That's not true anymore. We know that AI systems, if race is not an input or if gender is not an input, it can still make decisions that are highly correlated with race or with gender or with any other protected class.

And so it's no longer that's sort of the old school way of thinking. And so, yes, I agree, it is more effective and more efficient to just collect that information up front. So you can just do that testing on the back end. That's sort of a responsible AI way to do things. But again, companies still are very averse to collecting that data. And again, it also matters to there's a level of trust depending on sort of the product that you're selling.

How AI is evolving

Yeah, definitely. It's such a hard question. You guys all seen the, well, of course, not everybody has seen it because not everybody has seen everything. But there's a hype cycle, the Gartner hype cycle graph of AI. And so we're very much in like we're coming down kind of on the peak of the hype cycle right now. And so I think that there's a lot of things. And even within our own AI group, we're like, oh, yes, gen AI is totally going to be able to do this. And then we try it and we finishing the meeting, we tried to do it, it couldn't do it.

And so what we ended up using was a very traditional machine learning model that performed much more effectively for our purposes. And so it just depends on the use case. And, you know, I like to say focus on the use case first. Like, what is it that you want to do? Don't just focus on, oh, my gosh, I want to do AI. So cool. Like, let's just throw it. I'll limit this. Right. That's how you get into trouble, too.

That's actually something we advise our clients on all the time, because you've seen several famous cases now where a company has put up a GPT powered chat bot on their website and then it's agreeing to sell a brand new 2024 car for a dollar, or it's telling you, yes, you can definitely get a refund for this thing. And actually the company is saying, no, you can't.

And so it's very, very tricky. So I like to focus on what's the use case. Right. So focusing on the use case, I think, is going to be more useful in terms of figuring out where it's going. And I think is really going towards a place right now that I'm, to be perfectly honest, and this is my own personal opinion, I'm not wild about it, where we're trying to replace a lot of humans, human aspects of things, which I don't love. And there's a famous quote, I'm sure I saw it on Instagram or Twitter or something, but I want to help me do my laundry and, you know, pay my bills. I don't want to help me create art and be a creative person. Right.

I want to help me do my laundry and, you know, pay my bills. I don't want to help me create art and be a creative person. Right.

Communities and thought leadership

Yeah, so I'll say that, well, within our group, we do a lot of thought leadership. So we're very involved with a variety of institutions. We have a partnership with the Duke Institute for Health Innovation where we're working on bias and health care data. And specifically, there's a nonprofit that we're a part of with them that is geared towards, you know, helping low resource communities implement AI tools and health care in a way that's actually going to help them as opposed to make their lives more burdensome.

We work closely with an organization at Stanford University called the Center for Legal Informatics. So determining working with them on the legal and sort of technical implications of the use of AI. So how can AI be used in a legal context? Are there legal data sets that we can create with them? Various like what kind of guard rails do you need to put into place? So in order to protect yourself and your customers from, you know, an AI behaving badly, there are things you can do and we work with them on that.

We just had a big feature at the UN AI Summit for Good in Geneva, Switzerland. So we were a big part of getting that off the ground where the AI legal and the exact name is escaping me right now, but we're like the AI legal leaders of that organization.

Advice for getting started in data science

Yeah, it's an interesting question, right? Because data science, let's see, when I started my PhD program in 2012, I don't really know that data science was really like a big thing. It was sort of still developing, and it was not as big of a topic as it is now. And now you have people who have PhDs in data science or PhDs in AI, right? And those didn't necessarily exist just 12 short years ago when I started my PhD program.

And so one thing that I would recommend that I think I've highly benefited from sort of making the switch from statistics to data science, I think I have benefited from the statistical background, because that is hugely important to understanding the context and sort of what models you can use and what scenarios, and understanding how to do exploratory data analyses and things like that. So I would definitely recommend getting a good solid statistical grounding in so that you actually understand the mathematics behind some of these concepts and sort of why the assumptions that you make matter in an analysis.

Networking and working out loud

I do think that it's important to work out loud. There's, you know, it's good to have an online presence, you know, make LinkedIn posts, sort of have a website, write blog posts, right? Those sorts of things. But then also just going to events and learning and networking, I feel like it's kind of a bad rap, but it's really just asking people about themselves. And everybody loves to talk about themselves. So if you think about it as just being curious and learning about people, I mean, you don't have to, you can talk to a hundred people and you find the perfect job opportunity, right? So it's not necessarily like, oh, I have to talk to this exact right person. You just talk to people. And then that is what leads you to the path that you are destined for, basically.

Oh yeah, absolutely. And so I had a personal website that I host with, oh gosh, I haven't updated it in a while, but it's like, it's Hugo. I do it all in RStudio . And so I, you know, I'm using all of those packages to update it. And I do all the blogging on there and stuff. So I, when I had this job fall through, this really cool job that like, I was going to get paid to go to baseball games, right? Like that's like the dream, but then this fell through. And so what I did was I, you know, I beefed up my website. I have a pretty solid Twitter presence.

And so I put out a tweet and said, Hey, here's my website. I'm interested in these things. I like, you know, data for good. I like data visualization. I like communicating data science concepts to broad audiences. And then I got a DM that said, Hey, I have this job for you at this company called Prytora. And I were hiring for this role. And I think you might be really good at it because we need somebody who's sort of in this area of like data for good, like really passionate about, you know, making social progress and also has that data, that data expertise. And so then that turned into a job which then turned into this job that I have now. And so that really was an example of me having that online presence, me having that vulnerability to say, Hey, I need a job guys. And putting that out there, it got me to where I am now.

LLMs, tabular data, and learning

Yeah, I love that. I would love to expand on that. So I agree with you a hundred percent. And I think too, there was a really interesting article written a few years back. I think it was in Wired maybe. And the article is saying like, we've created exactly one generation that knows how to use computers. And that's sort of the millennial generation, right? Or some of the like Gen Xers, right? Younger people, everything is app based, right? So they don't really understand how the guts of a computer works.

And so to your point of like not knowing how to read in a CSV, well, they don't understand how file storage works on a computer because they just search for stuff. I have siblings who are much younger than me. I have siblings who are 12 and 14 years younger than I am. And so I'm able to go and ask them, you know, hey, Gen Z, how do you look at your files in your computer? Oh, I just search for them.

Well, I think a lot of it too, right, is sort of like what's flashy, what's sexy in the moment, right? Like, you know, data science was like super sexy, you know, five years ago and now it's all gen AI and LLMs and stuff like that, right? So part of it is just trends. But I think too, like you said, I totally agree with what you said about companies just needing some like very basic, like tabular analysis. By far, that has been the thing that has surprised me the most about this job and about some other jobs I've had where just creating good summaries of data is incredibly useful for companies.

We had an example recently where a client sent us, hey, the client sent us all this, or an attorney who's an attorney at the firm sent us, hey, our client sent us all this data. But we also found out that this same data might have been leaked. And so, hey, can you look at this data that was leaked and sort of see if it matches this client data? And it turns out we were able to just directly say, okay, match phone numbers. And we were just able to directly match the phone numbers. And even that, which is a very, very simple data science ask, but being able to find the 500 phone numbers that matched in a client data set of maybe 200,000 rows of data, that was very, very useful to the attorney.

I'll make a personal, I'll add a personal story to that, which is the last time I made a Shiny app, I asked chatGPT to do it for me. And then I changed a couple of lines of code and then bam, I had a Shiny app. And so I think there's a difference between doing and learning. So if you're learning something, if you already have the knowledge, a tool like chatGPT is great. I also spent several years writing Shiny and learning R and all of those things. So then in that moment when I quickly needed a way to do whatever it was I was doing in Shiny, that was great. That saved me a ton of time.

I think there's a difference between doing and learning. So if you're learning something, if you already have the knowledge, a tool like chatGPT is great.

But as a learner, it's definitely something I'm concerned about. So like I will, I have been guilty on occasion of going to chatGPT and saying, hey, how do I do this in Python? Because I don't know how to do this in Python because I like R better. And one way I help myself with that is I have a colleague who does use Python over R. And so we look through the code together and we talk about, you know, what is this good code and sort of what is this trying to accomplish? And is this the best way to do it?

Closing advice

Always learn new things. Always be learning new things, for sure. That's the number one piece of career advice because the more you learn, the more you can grow and do new things.

Well, thank you all so much for spending time with us today, hanging out with us. Huge thank you to you, Sam, for sharing your insights and experience. Have a great rest of the day, everybody, and a great 4th of July if you have it off. As a reminder, we won't be here next Thursday for the Data Science Hangout, but we'll see you the week after. Bye, everybody.

Accountability and transparency in AI systems | Sam Tyner-Monroe @ DLA Piper | Data Science Hangout

Transcript#

Sam's background and journey

Working at the intersection of data science and law

Responsible AI at DLA Piper

Protected classes and bias testing

NIST AI Safety Institute Consortium

How sophisticated are clients about AI?

Using Posit at DLA Piper

Testing for discrimination

Collecting race data and responsible AI

How AI is evolving

Communities and thought leadership

Advice for getting started in data science

Networking and working out loud

LLMs, tabular data, and learning

Closing advice