Matt Frazier @ Pie Insurance | Removing blockers for your team | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Welcome to the Data Science Hangout, hope everyone's having a great week. If we haven't met yet, I'm Rachel Dempsey, I lead our Pro Community here at Posit. This is our open space to chat about data science leadership, questions you're facing and getting to hear what's going on in the world of data across many different industries.
And we're here the same time, same place every Thursday. So if you are watching this as a recording sometime in the future, there'll be a link where you can add it to your calendar below as well. Together, we're all dedicated to making this a welcoming environment for everybody. And so we love to hear from everybody, no matter your level of experience, or area of work or industry.
I like to say that it's totally okay to just listen in, though, if you want. But there's also three ways you can jump in and ask questions, or provide your own perspective. So you can raise your hand here on Zoom, and I'll be on the lookout here. You can put questions in the Zoom chat. And if you're maybe in a coffee shop or your dog's barking or something, just put a little star next to your question in the Zoom chat, and I can read it instead.
Otherwise, I'll just call on you to jump in and add some context. And then lastly, we also have a Slido link where you can ask questions anonymously. And I see Hannah just shared that in the chat here.
But thank you so much, Matt, for joining us here today as our featured leader. My pleasure. Matt Frazier is the former Chief Analytics and Underwriting Officer and currently a Strategic Advisor at Pie Insurance. And Matt, I'd love to have you maybe kick us off by introducing yourself and sharing a little bit about your role and something you like to do outside of work, too.
Matt's background and path into data science
So I'm Matt Frazier. I actually started in data science in 2008, right after the financial crisis. Prior to that, I was more of an insurance guy. I actually have a background in philosophy, specifically metaphysics. And so I'm sort of a Karl Popperian at heart, if anybody knows who Karl Popper is, who was the first empirical skeptic. And so I sort of transitioned my empirical skepticism over into data science and started to learn it in 2008.
Before that, I held a number of executive positions at different insurance companies in the underwriting space. At that point in time, right, from let's say 1998 through 2008, I did not know of an insurance company that was actually doing any predictive modeling at all. And it is still the case that it's really the national carriers that are using AI, using data science, using predictive modeling. But most of the smaller mutuals, right, the ones that are only in a couple of states, are really using predictive-modeling software-as-a-service companies to do all of their predictive modeling.
So data science was a sort of an obvious transition after I made it, but I had no idea what the heck it was before I got into it. I was just a philosophy guy. I was a language guy. I love language. And I love understanding what is knowable and what isn't knowable, right. But I didn't really realize that you could translate that into mathematics until I got into a company called Valen Technologies.
So I sort of got out of the insurance industry and went into a company that was adjacent to the insurance industry, actually selling to insurance companies, predictive analytic and data science solutions, right. But before that, I was pretty good on the actuarial side. I'm not a classically trained actuary at heart, but I was the director of pricing at a number of different insurance companies as I sort of progressed through my career.
And so I got pretty good at using Excel, okay. I got pretty good at grinding through data. I got pretty good at philosophically understanding data and understanding, you know, what it meant if certain data had low integrity versus high integrity and things like that. So I was really good at data wrangling, but I didn't have any idea what a GLM was. And don't ask me to write any formulas down because, you know, I let the data scientists do that, right.
But I'm pretty good at picking models apart because of my empirical skepticism that I learned at a very early sort of age through college and high school. My dad was a physician. He always used to kind of hammer me on, hey, are you sure you know that? Do you know that or do you think you know that, right? Do you believe you know that? Because beliefs can be wrong.
And so when I was at Valen, I was able to get trained by a former chief modeling officer at Capital One. One of Valen Technologies' major investors was Nigel Morris, one of the co-founders of Capital One. And Capital One was sort of one of the first banks that got into machine learning, got into predictive modeling on the credit card side. And so they had a very robust understanding. They were one of the early adopters, let's say, in the banking space and the fintech space of actually using predictive models to create a competitive advantage or arbitrage opportunity within the credit card space.
And so we were able to really capitalize on that connection with Capital One. We were able to take some really smart people from Capital One, bring them into Valen. And I really learned on the job from one of the former chief modeling officers on the commercial auto and auto finance side from Capital One.
And the second I got into it, I absolutely loved it. I mean, Valen was my favorite job. The second I got into data science, I was like, oh, God, this is what I've been missing my whole life. Right. Like, I love the language. But to be able to blend the language and the math together, the language and the data together and actually produce something that gives another company a moat or a competitive advantage in a marketplace, man, that's really cool.
Like I can actually take things from the philosophy side where all you're really doing is creating more questions for yourself and actually translating that into real, durable value in the world, like something that's tangible, that can be seen through dollars and cents. I was super duper excited, you know, I got really excited about that.
And then I translated that into Pie Insurance. I was what I would call a silent co-founder at Pie. We took all of the learnings that we were able to extract from Valen Technologies, working with about 56 different insurance companies in the property and casualty space, and we translated that into building durable models that created a competitive advantage and an arbitrage opportunity for Pie Insurance.
Pie started in, oh, gosh, I want to say it was 2018. So we're now writing about 300 million. We're on the path to about 350, maybe 400 million this year. And I believe we're the fastest-growing and most quickly capitalized insurtech that has ever existed.
So somehow I was able to take that philosophy degree, translate it into data science, create a competitive advantage, raise a whole bunch of money, and create a company that employs, well, unfortunately, we just went through what we call a reduction in force, but we had about 460 employees, and we now have about 400. And I think that's the end of the cutting so far.
But I've been very pleased with what I've been able to do and what I've been able to learn in data science. And I'm just looking forward to a lot more learning opportunities as we move forward. You asked, what do I like to do? Skiing, fly fishing, hiking, I like to be outside. So as long as it's outside and it's more of a solo sport, I really like that kind of thing.
AI, ChatGPT, and the insurance industry
Something I really wanted to ask you about initially, because it's something that comes up quite a bit in the Data Science Hangouts. So I know you wrote your bio for the Hangout with ChatGPT, so I feel like this is a good place to start off. But I'd love to hear your thoughts on ChatGPT, but also AI in general, and how you think it may impact the insurance industry.
I'll start with in general, and then we'll go into insurance. There's a lot of people on both sides of one particular argument about ChatGPT as the latest sort of tangible instantiation of AI, right? Is AI going to take over the world? Are we going to live in a Terminator universe where AI eventually is significantly smarter than us and somehow turns against us?
I'm in the camp that I think that artificial intelligence and large language models and the like are going to be a benefit to society. They are not going to be a detractor to society. Now, do we have to be careful? Of course, right? Human beings are never at a space morally where we are technologically. And so the technology is going to force us to re-evaluate or trans-evaluate our scale of values, right? We're going to have to understand what can be done, and we're going to have to understand how we can make sure that we are distributing things like AI in some equitable manner.
In insurance, this is a real big problem, especially in the industry that I'm in. We have a lot of small carriers out there, but they don't have any of the data. There are about five or six or seven very large carriers that beat all of the small carriers up with data, okay? Now, we're not talking about large language models. We're literally just talking about simple GLMs. We're talking about any of the normal machine learning algorithms, whether they be gradient boosting, whether they be random forests, whether they be elastic nets or tree models, it doesn't really matter.
Whoever has the most data wins, usually, in insurance, right? In insurance, it's a little bit weird. We have a point mass at zero in all of the distributions that we're looking at. So in most cases, when you look at an insurance policy, about 75% to 97% of the time, the right answer is zero, right? People just don't have losses. But when they have losses, some of them can be very significant, and they can really hurt businesses, they can hurt individuals. And that's what insurance is for, right? Insurance is for that highly unlikely event that is catastrophic, right? That's where insurance really helps society.
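The zero-inflated shape Matt describes, where most policies never have a claim but the ones that do can be severe, is exactly what Tweedie-family GLMs were built for. This sketch wasn't part of the talk; it uses made-up data and scikit-learn's `TweedieRegressor`, where a power between 1 and 2 yields a compound Poisson-gamma distribution that places positive probability mass exactly at zero:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 3))  # hypothetical rating covariates

# ~90% of policies have zero losses; the rest draw a heavy-tailed severity.
has_loss = rng.random(n) < 0.10
severity = rng.lognormal(mean=8.0, sigma=1.5, size=n)
y = np.where(has_loss, severity, 0.0)

print(f"share of zeros: {(y == 0).mean():.2%}")

# A Tweedie GLM with 1 < power < 2 (compound Poisson-gamma) handles the
# point mass at zero plus the continuous severity tail in one model.
model = TweedieRegressor(power=1.5, alpha=0.1, max_iter=1000)
model.fit(X, y)
```

With roughly 90% zeros, an ordinary Gaussian or gamma GLM would fit this target badly; the Tweedie power parameter is what lets a single model carry both the zero mass and the heavy-tailed severities.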
The issue is that most of the carriers that are out there that aren't the big national guys don't have enough data in order to do that, right? And that's a big problem because you can use that to your advantage to become a monopoly. And if you see Hartford, if you see Travelers, if you see Liberty Mutual, if you see Swiss Re or some of the large reinsurers, they have all the data.
So they can almost always build better models, because at this point, the algorithm doesn't matter as much as how much data you have and the integrity of that data, right? It's really important to understand your problem space and to think very deeply about that problem space and where you can go get data in order to solve that problem space. And so at Valen Analytics, when I was there, we actually created a consortium data set, and that was specifically for small carriers. It was to allow small carriers and new entrants to the insurance market to be able to take advantage of large data sets with people who had subject matter expertise and data science expertise to be able to deploy models for those small carriers and democratize the insurance base to really allow new entrants and innovation into that space.
If anybody knows anything about the insurance space now, the one thing that is very, very clear is that it is probably one of the slowest moving industries on the planet. And the reason that it's one of the slowest moving industries on the planet is because there are five carriers that hold all the data and they don't have to move. They already have a durable advantage over everybody else. And so they're not really interested in innovating.
And what I think that's going to do is, first of all, it's going to disrupt the industry. That's why the insurance industry has coined a new term called insurtech, right? Most of those insurtech companies are not just tech. They should call it something like insur-data or insur-machine-learning or insur-AI, because that's really what all of the insurtechs are doing. They're using AI. They're using machine learning. It's really not AI. It's machine learning, right?
It's building models on large consortium or synthetic data sets in order to extract or provide that value to smaller carriers so they can compete with the big guys, because most of the small guys either just sort of limp along, barely soldiering on through the snow in Siberia, right? Or they just get subsumed or bought or acquired by some of the large guys. And so the large guys just get bigger and they continue their monopoly.
What I'm really interested in is making sure that data science and large available data sets can actually democratize the industry that I'm in, because I believe it's an unfair industry. And I believe it's a quasi cabal of monopolists that are actually controlling this industry. And that is true in many, many industries. And I think there's going to be a huge opportunity, especially with the advent of ChatGPT and these LLMs, for software as a service companies to go out and very easily get customers in order to optimize those standardized LLMs and do other data science and machine learning projects for these companies, deploy those models for them, and allow them to hit those models with APIs and bring that value to bear.
Now, there's a lot more insurance companies that are looking to create data science practices inside of their own carriers. The biggest issue is you have to get a critical mass of data in order to do that. I mean, Pie is running $350 million in premium a year, and we don't have enough data. We actually had to go to Valen to purchase additional consortium data in order to build a robust model, right, because of that point mass at zero. So you have to be not just a slightly big insurance company, you have to be about a $600 to $700 million insurance company that's been writing for three to five years at that level in order to have enough data to build a really robust model.
And so that's where I think consortium datasets, novel datasets, or even some of these new algorithms, like adversarial algorithms that create synthetic datasets very closely mirroring the underlying empirical data, can bootstrap and expand that data to make more robust models. There's a lot of companies that I've talked to that are coming out with things like that, and Pie is actively doing a lot of POCs with those types of companies to find out whether we can take our smaller dataset and expand it so that we have a much larger dataset to build models on.
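None of the vendor tooling Matt alludes to is named, so here is only the simplest possible stand-in for "expanding" a small book: a bootstrap resample with pandas. The column names and distributions are invented; real synthetic-data products fit a generative model (e.g. a GAN or a copula) so that new rows are plausible rather than literal copies of existing policies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical small book of 500 policies: payroll exposure plus
# zero-inflated incurred losses (~90% of rows have no loss).
small_book = pd.DataFrame({
    "payroll": rng.lognormal(mean=13.0, sigma=0.8, size=500),
    "loss": np.where(rng.random(500) < 0.10,
                     rng.lognormal(10.0, 1.5, size=500), 0.0),
})

# Simplest expansion: resample rows with replacement (a plain bootstrap).
expanded = small_book.sample(n=10_000, replace=True, random_state=42)

print(len(expanded))  # 10000
```

A bootstrap preserves the empirical distribution exactly but can only repeat observed rows; that repetition is the limitation the adversarial/generative approaches Matt mentions are trying to get past.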
Data privacy and LLMs in insurance
I see Liz had a follow-up question in the chat. So, I worked in the casualty space as an analyst for five years. I'm actually about to have an interview next week for a junior data scientist role, trying to get back in the space. But I was curious, regarding ChatGPT, I think, you know, there's a lot of people doing pair programming and stuff with that, and they maybe don't realize that when they submit a question or a prompt to ChatGPT, if it's not a locally hosted LLM, the host is owning that data, whether they're allowed to or not, and that creates a huge exposure. I'm curious if you have any thoughts or comments about that liability and that exposure.
Yeah, that's actually a massive liability. And I don't think that the insurance space is going to capitalize on ChatGPT, especially in the pricing and underwriting domain, until there are assurances that the data they put in to train specific instances of the LLM is going to be protected in some way and cordoned off from the larger language model. So right now, I know that several companies I'm working with are like, well, we don't want to do anything until we have absolute assurance that our data is safe and that no one else can gain any advantage through the use of us putting that data into, right, some container or some space that allows us to train those LLMs.
And I think that is where a lot of new companies are going to be extremely successful, is going to be prompt engineering and specialization of those larger language models, sort of the base model, and making sure that they're doing that in such a way that they can make sure that they have an understanding of the custody of the data and that they can either tokenize that data or make sure that that data doesn't get into sort of the larger pool of data that's being used in order to optimize the underlying base LLM. It's really important.
I think until that happens, ChatGPT in insurance is going to be used more for automating processes around changing policies, right, making sure that the language and the chat functionality goes back and forth very smoothly without any human intervention. But I don't think the insurance industry is really going to capitalize on ChatGPT as an underwriting agent, right, until all of that data is safe and is completely separated and mutually exclusive from the larger pool of data.
I mean, I know that I've already told my team at Pie, and I've worked with several other carriers, and I've told them all, don't put your data into ChatGPT right now, because then everybody gets to use it, right? Like, this thing learns. And it's also about, like, work product and stuff, you know.
Oh, totally. If you say, oh, hey, I need help fixing this, you know, chain ladder model or whatever, but you have a little bit of secret sauce in there, and then someone else asks, and then, boom, your secret's out and you lose an edge. That's exactly right. I mean, even in our business, we have a very specialized transformation of the target space that we use for our line of business, workers' compensation. And we don't want anybody to know that. That's a specialized transformation I haven't seen anybody else in the industry use. It works really, really well. I don't want to give that information to anybody, right? And the second I pop it in there, it's possible for someone to extract that.
Unless new technology and updated knowledge is actually given to the insurance industry by some of, you know, a new company, a new individual that's working with these companies that actually understands how to specialize these LLMs without delivering the data to the larger baseline model. I personally don't know how to do that yet, but that, I know that that's something that's being talked about very actively and heavily within the insurance space. It's, well, how do we take advantage of this? And if we're going to take advantage of this, how do we not give away our secret sauce and take advantage of it at the same time?
Bias in models and synthetic data
That was actually one of the anonymous Slido questions: how do you ensure that bias is not introduced when generating synthetic data? So it's an awesome question, and it depends on what you mean by bias, right? Because there's bias, whether it's statistical bias or some other bias, inherent in every single model that's built, right? What we try and do is we look very closely at the variables that are going into all of our models, and that's what's really hard to do with an LLM. I mean, you know, somebody else almost has to make sure that there's no bias in the baseline LLM, but it's all about the covariates.
So when we're building a machine learning model, we're taking a look at the covariates, and we're making sure that those covariates are highly empirical and that they're based on, you know, insurance data for the most part. Now, there's a lot of demographic data that can come in, especially in the line of business that I'm in, which is workers' compensation. If anybody doesn't know what workers' compensation is, it's essentially a coverage that employers buy so that their employees are covered if they have an accident, and it pays for all of the medical bills plus their weekly wage, as long as they cannot get back to work at the same role, right?
And when you think about that, there's a lot of demographics that come in. Things like average household size are actually important. Oddly, average hail size is important, and you can't imagine what kind of biases that actually introduces. So if you're a homeowner's insurer, average hail size really matters, right? Like the people in Colorado, right? They have higher premiums than the people in North Dakota because North Dakota has less hail than Colorado does, right?
But what you really care about is, okay, is there some group of people that already have it tough? And are we making it tougher because of something that's inherent in a covariate that we're using that just makes it more difficult for those people to purchase workers' compensation insurance? We try as best as we can to remove those variables, whether it's average wage, average household size, any race demographics, anything like that is completely just dumped out of our data set right at the beginning, even though those variables exist.
And so we do our best to try and minimize those biases. But again, there's always going to be inherent and potentially unknown or unknowable biases that actually exist within an overall algorithm estimation. So we do the best that we can. Nobody's perfect at this.
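A minimal sketch of the "dumped out of the data set right at the beginning" step Matt describes; every column name here is invented for illustration, not Pie's actual schema:

```python
import pandas as pd

# Hypothetical covariate table for a workers' comp book.
features = pd.DataFrame({
    "class_code": ["8810", "5403", "8810"],
    "years_in_business": [12, 3, 7],
    "avg_wage": [52_000, 61_000, 48_000],
    "avg_household_size": [2.4, 3.1, 2.2],
    "race_pct": [0.70, 0.60, 0.80],
})

# Demographic proxies are dropped before any model ever sees them.
protected = ["avg_wage", "avg_household_size", "race_pct"]
model_features = features.drop(
    columns=[c for c in protected if c in features.columns]
)

print(list(model_features.columns))  # ['class_code', 'years_in_business']
```

Doing the drop in one explicit, reviewable place (rather than scattered through feature engineering) makes it auditable that protected variables never reached the estimator.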
There's a lot of new companies that are coming out with novel data like behavioral data. And that's something that the industry is talking about very, very closely, right? It's like, hey, if we can capture what your media content consumption is, right, and that has a significant risk separation between the good risk and the bad risk, is it highly correlated with things like race? Is it highly correlated with other things? Because it may be the case that people of different backgrounds consume different media. And that may inherently have some sort of correlation to things that kind of matter with regard to the current Zeitgeist in terms of which groups are having trouble and which groups aren't.
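One way to act on that worry about novel behavioral data is a proxy audit: keep the protected attribute out of modeling entirely, but hold it aside to check whether a candidate feature is correlated with it. This is a simulated sketch, with the correlation deliberately baked in, and the 0.3 cutoff is a governance choice, not a statistical law:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data: a protected attribute held out purely for auditing,
# and a candidate behavioral feature built to be a partial proxy for it.
protected = rng.integers(0, 2, size=n)
feature = 0.8 * protected + rng.normal(size=n)

r = np.corrcoef(feature, protected)[0, 1]
print(f"correlation with protected attribute: {r:.2f}")

if abs(r) > 0.3:  # threshold is a policy decision
    print("flag: likely proxy, review before this feature enters a model")
```

Simple correlation only catches linear proxies; richer audits fit a model that predicts the protected attribute from the candidate feature set, but the principle is the same.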
Effective leadership and servant leadership
So there's a couple of things that I post on LinkedIn. Some of them are quite old. There was one that I posted that, and I'm sure everybody's seen this, the difference between a boss and a leader, right? A boss sits behind you with a whip and yells at you to get stuff done. And a leader is the one who's pulling you along. He or she is the one who is removing the blockers, who is actually serving the employees. And I consider myself a servant leader; I'm a strong believer in servant leadership, right?
I'm not anybody's boss. I am the person who is responsible for making sure that all of the conditions are set up so that each and every one of my employees can succeed and can fully explore the domain of their competencies. That's my job, right? My job is to find smart people and then set up the conditions and get the heck out of the way. And if there's a blocker that is in their way, I need to go find a way to remove that blocker, right?
Now, are there performance reviews and things like that? Yeah, every company has them. I am strongly against them. I hate performance reviews. I think that performance reviews should be done every single day between the individuals that are working together. And I don't think it should be between a boss and that boss's employees. I think it should come from the colleagues on the team who are working with one another. That's where that feedback should be coming from.
It should be a 360 degree feedback because performance reviews are a thing that can be weaponized so easily. And they say more about the person who's giving the review than they do about the person who's being reviewed, period. That is my position. Now, I'm happy to argue with anyone about that particular position because of how strongly I actually believe in that position.
I think that if we can create an egalitarian society where everyone is responsible to everyone else and everyone is giving constant feedback to everyone else on how to improve with empathy, and with honesty, that is where you can create an organization or a social structure and a social hierarchy that actually allows everyone to win and allows everyone to succeed and allows everyone to explore all of the levels of Maslow's hierarchy, right? That's how you become fully human and fully yourself.
Removing blockers and communicating needs
You said part of the role in a leader is removing some of those blockers for your team. And I'm wondering for some of us here who maybe are in a role where we've identified certain blockers for us, how would you recommend you go about like sharing that with your leadership to kind of get past those roadblocks?
Whether it's an I or a we, right? Sometimes there's a blocker for a whole team, and sometimes there's a blocker for an individual, right? I think what you need to do is, I don't know if any of you have taken any courses in nonviolent communication. It's kind of a big thing now. I've taken a lot of training in nonviolent communication. I don't believe all of it, but I think it is a tool set where you can mix and match some of the tools that come out of nonviolent communication. One of those things is to listen for needs and to express needs, right?
So you start with, I'm trying to do this for the company. I believe that this is going to be very valuable. Do you agree? Right? So first, you've got to get buy-in on what you're trying to do is valuable for the organization and valuable for the company. Okay? The second thing is, in order to do that, I have certain needs that are not being met. Okay? This is what those needs are. It could be an individual. It could be a group of individuals. It could be a dynamic. It could be technology that isn't there. It could be budget. It could be whatever, right?
And then you really express that as a need, not as a so-and-so is doing something to me and I'm mad, right? That's a very sort of a selfish communication. What you want to do is you want to try to abstract that as much as you can. Now, you can't always because sometimes it's just a person who isn't on board, right? And so you say, look, I really need this person to be on board, but I don't think this person is on board. And I want to ask you what we can do together in order to try and identify that other person's needs and discover why it is that they appear to be a blocker.
It may be in my own head and I'm just, you know, I'm just in an echo chamber, but let's have that conversation. And if you don't feel confident having that conversation yourself, go to someone who you think can have that conversation. And it doesn't have to be a manager. It could be a colleague who knows that person really well and can set up that meeting so that you can have that meeting and identify what those needs are.
There's another book that I would really strongly recommend, and unfortunately it's escaping me right now. The guy's name is Chris Voss. It's called Never Split the Difference. I don't know if any of you have read that. Strongly encourage you to read that book. Some of it's throwaway, right? But, you know, I follow the Pareto principle, so there's always 20% of, you know, absolutely fantastic nuggets in every book.
So that book basically teaches you to get to no. You talk to an individual and you have a dialogue with them, a dialectic with them, until you get to why they're saying no to whatever thing you want to get through, right? Because yeses that aren't really yeses don't matter. You have to get to the no, right? And so I think making sure that you understand what the other person's needs are, or the organization's needs if that's the blocker, and then making sure that you're constructing the right conversation and having those conversations through an expression of needs on both sides, a reciprocal expression of needs. Usually that is the black swan that will bust right through that blocker.
Regulation and data asymmetry in insurance
Matt, kind of following on the thread that you started with about these opportunities coming from data to democratize this industry. I wonder if there's a counterpart regulatory piece that you see. Are there regulatory gaps that get in the way of being successful in this industry that you described as a monopolistic cabal? Are there regulatory things that you think are really important, either for consumer protection or for pushing back against anti-competitive practices, that would make a difference?
Yes. And this is where my opinions are probably the most provocative. I believe that, by and large, insurance is massively overregulated. Massively. To the point that it's actually been regulated in such a way that it's extremely difficult for new entrants to get in, right?
I think we've democratized some of it with a lot of the things that companies like Valen have done and other companies constructing consortium data sets. But the regulatory world in insurance is extremely antiquated. They're basically operating on 1970s principles, by and large, okay? And those principles were set up by the insurance industry at that time, which was populated by a very few, extremely large carriers, specifically for the purposes of creating barriers to entry for new entrants, okay? So, I believe all of the regulations need to be completely rewritten, and they need to be completely rewritten in view of the latest technologies.
There is something that one has to understand about insurance, and that is that it is an asymmetric information game, okay? The insurance carrier gets to ask questions, and they have to ask questions mostly through a third party, through the insurance agent, right? And so, not only do you have information asymmetry, but you have information distortion as well.
And so, what the insurance companies are trying to do is put themselves on a level playing field, because the insured has all of the knowledge about what their exposures and what the risks are. The insurance agent has probably 20% of that information, and the insurance carrier probably has about 10% of that information. A lot of what's coming available now is actually increasing that 10% to maybe 30%, 35% of the information, right?
There's still unconscientious, disagreeable people out there that are lying about all of their information. They're a roofer, and they're saying that they're a computer programming company. They're a trucker, and they're saying that they're a retail shop, right? And that's where models are actually quite good at determining whether someone is actually lying to you or not, right? And identifying and shaving down the moral hazard.
Now, having said that, there are still ways now, and will be ways in the future, that insurance companies can take advantage of information that the insured doesn't even know they have. And so, I think that information transparency is a huge piece of the puzzle. I think that if an insurance company is making a decision, letting the insured know what information they're using in order to render that decision is probably going to be very important in the future.
But I think that the regulations around machine learning and filing models with the Department of Insurance and the scrutiny that they have to go through is massively, like exponentially slowing down the innovation that could occur in the insurance industry. It's also why most insurance companies that I've worked with are choosing not to file their models. There are very few people that are actually filing models in the commercial insurance space. In the personal auto space or the personal line space, you have to file all of your models. So, the Department of Insurance gets to look at it. They get to scrutinize it. They can say no on just about anything, and they do.
And there are some states that don't even allow for trade secret. So, once you have to file your model with the Department of Insurance, your competitor can actually get all of that information and rebuild your model, which I've done several times. So, I think there needs to be massive changes in regulation around data science, machine learning, and algorithms and how they're used in determining pricing and availability of insurance coverage.
Automation of underwriting and the future of data science roles
I wanted to get your opinion on how far along some of these job functions are in being automated, such as underwriting or application input, because AI has advanced so much and it's reading so much data from vast databases. What are your thoughts on that?
Yeah, so I don't believe that all risk decisions will be automated, though with the latest technology, my opinion is changing actively as I start to understand a little bit more about these LLMs and transformers and the like. I believe that probably 85% to 90%, maybe even 95% of the underwriting decisions that are currently being made by human beings in the world will be eliminated through the use of machine learning and AI. I mean, that's just a simple fact.
It is already the case that machine learning models can make a significantly better and more consistent decision than humans. Now, it is game theoretic, right? So what the models are not quite as good yet at doing, and obviously there's ways to solve this using models, they're not quite as good at understanding the market conditions and the changing market conditions with regard to what the best price might be, right?
Insurance carriers have to be able to understand good risk from bad risk, but they also have to write policies, right? And that means that sometimes you have to make trade-offs, and that's where human beings come in. They're making those trade-offs. They're saying, well, I might write this at a slightly higher loss ratio, but if I don't write this packet of business at a slightly higher loss ratio, I'm not going to have enough revenue, right?
So the game theoretic piece of the puzzle, sort of price optimization, which is a really bad word in insurance, but that's how we think, right? We're actually doing price optimization every single day. It's just that we're doing it through human underwriters. I think where the world's going to go in insurance is that a lot of the customer service personnel are going to be eliminated because LLMs are going to take on that, you know, chatbots and RPA and automation are going to be able to do probably 95 to 100 percent of policy changes, endorsements, things like that.
But on the underwriting side, I think that it will probably eliminate 85 to 90 percent of the decisions. That doesn't mean that the underwriter's role gets eliminated. It means that the underwriter's role changes significantly. It changes to a portfolio management role. It changes to a gap identification role. And it changes to a role where an underwriter is now the subject matter expert that works directly with the data science team in order to identify and explore the gaps that may exist in the model solutions and helps to fill those with novel data and perspective that sometimes a lot of the data science teams don't have because they're not on the front lines.
And that's a related point: it is really important in insurance that the data science team actually has a working knowledge of the line of business that they're looking at. So if you are building a model for commercial auto, you have to understand commercial auto. You have to understand how it's rated. You have to understand the coverages, et cetera.
I have a lot of PhD data scientists that I've worked with in the past that felt like, hey, I'm a data scientist. I don't need to know the underlying environment. I'm just going to grab the data and I'm going to build a model. I can tell you that in my experience, 90% of the models that are built in that fashion fail because they cannot be productionalized. The data scientists didn't find out whether the information was available at the time that the decision needed to be made or the target was not transformed in a way that was critically important because the data was centered or there was a treatment effect that they did not identify until after they built the model.
The model building process
Generally speaking, we will, on a regular basis, go out to each one of the department heads and ask them: what are your problems today? What is creating too much friction? What is a decision that's being made poorly today that you think could be made better? Then we'll get all of that information back.
Once we construct all of the potential opportunities for data science, then we start to look around at the available data sources that we have and we start to do some discovery around whether or not we can actually solve that particular problem with data. Do we have the data available to be able to solve that problem? Is that problem a big enough problem to solve?
Usually, once we go to the department heads, we'll then go to the business or department analysts within that particular department and say, hey, how big of a problem is this? Can we quantify the friction? Can we quantify how much this is costing us? How many minutes? How many hours? How many days? How many employee days does this take? Theoretically, if we were able to remove that and automate that, what would the savings be?
Then, of course, there are other questions. How much better can we make these decisions? Can I separate good risk from bad risk? Can I identify the probability that someone's actually going to pay their bills? Can I identify the cost if they don't pay their bills? Can I identify how many times they're not going to pay their bills and we've got to do something with that policy, whether it be to cancel it and reinstate it and all that stuff, because that costs the insurance company money.
Once you settle on a problem and you know that there's enough value there and you know that you have the data to be able to solve that problem, then it's a pretty standard process. You have to wrangle the data. You have to normalize the data. You have to work together with the experts in that department to identify what they think might be interesting or valuable covariates.
So usually we will do some unsupervised modeling just to figure out what the data looks like. Where might there be blooms of data that could be interesting? And then we'll do a correlation study, which is with as many covariates that we can actually put together and then bring that back to the department and then ask the subject matter experts what's not in here that you think should be in here. And then they'll start to think about and construct some additional features. Then you start the feature engineering phase.
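The correlation study described above can be sketched in a few lines. This is a minimal illustration, not Pie Insurance's actual pipeline: the covariate names (`fleet_size`, `driver_age`) and the toy loss-cost target are invented for the example, and a real study would use far more covariates and richer tooling.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(0)
n = 200
# Hypothetical covariates for a commercial-auto book
fleet_size = [random.uniform(1, 50) for _ in range(n)]
driver_age = [random.uniform(20, 65) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
# Toy target: loss cost loosely driven by fleet size
loss_cost = [2.0 * f + random.gauss(0, 5) for f in fleet_size]

# Rank candidate covariates by strength of association with the target,
# then bring the ranking back to the subject matter experts for review.
candidates = {"fleet_size": fleet_size, "driver_age": driver_age, "noise": noise}
ranked = sorted(candidates.items(),
                key=lambda kv: abs(pearson(kv[1], loss_cost)),
                reverse=True)
for name, series in ranked:
    print(f"{name:12s} r = {pearson(series, loss_cost):+.3f}")
```

The ranking is only a screening step; as Matt notes, the experts then look at what is missing and propose additional features before engineering begins.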
You do that with the subject matter experts in that department and you try and get as many features that you think might be useful as possible. And then, of course, there's some target transformation sometimes. Are you going to build a logistic model? Are you going to build a linear model? Are you going to build something that you're going to use for triage? Is it going to be sort of a categorical model where you've got five or six different decisions that need to be made and you're going to split those up? You kind of figure out what the best sort of opportunity is there.
And then we usually go through a sample and partitioning. How are you going to sample this? What's your strategy going to be? What is your validation strategy? So are you going to do cross-fold validation? What's your holdout going to look like? What's your train set going to look like? What's your testing set going to look like? Are those stratified correctly? Is anything time series based? Because that has a significant impact on how you're thinking about your sampling strategy.
Usually in insurance, depending on data size and the composition of the target, we use anywhere between three partitions and 10 partitions. If it's a more complicated problem and we've got a lot of data, we'll use up to 10 partitions and we'll hold three or four of them out. And then we'll use incrementally four or five or six or seven of those in order to increment through the overall model building process.
We'll use the first partition, maybe even split that up in order to do feature engineering and basic variable selection and reduction. Then we will build our initial model and then we'll test it on the next partition. And then normally what we do is we'll build on, let's say you have five different partitions. We build on one and two, test on three. Build on one and three, test on two. Build on two and three, test on one, right? And then once you're really confident that you've got the right model using that data, you add the next partition.
Now you build on one, two, and three and test on four; then on four, three, and two and test on one, and so on and so forth, right? And we're constantly looking at coefficients, whether they're shifting, whether they're going up and down. If it's a tree model, we're looking at what the splits look like. We're doing hyperparameter selection usually on those early partitions until you get to the final model, usually with two or three validation partitions still held out. And in insurance, we're usually using between five and eight folds of cross-validation at each one of those stages. There's a lot of volatility because of the zero-inflated loss data in insurance, so you really need to make sure that you've got enough cross-validation there, because you can have a very large loss that makes things look really weird in any one partition of the data that you're looking at.
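The rotating build/test scheme described above (build on one and two, test on three; build on one and three, test on two; then add the next partition and repeat) can be sketched as a schedule generator. This is an illustrative reconstruction of the rotation, not code from any actual insurance pipeline.

```python
def rotation_schedule(active):
    """For the currently active partitions, build on all but one and
    test on the held-out partition, rotating through each in turn."""
    return [(tuple(p for p in active if p != hold), hold) for hold in active]

partitions = [1, 2, 3, 4, 5]

# Start with the first three partitions, then add one partition per
# stage, repeating the full build/test rotation each time. The
# remaining partitions stay held out for final validation.
for stage_end in range(3, len(partitions) + 1):
    active = partitions[:stage_end]
    for build_on, test_on in rotation_schedule(active):
        print(f"build on {build_on}, test on {test_on}")
    print("--- add next partition ---")
```

Watching coefficients or tree splits shift across these rotations is what flags instability before the final held-out validation.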
Advice for aspiring data scientists
I mean, so I think a working knowledge of both Python, and I'm going to say it, R, right? R and Python. A working knowledge of both of those is really important. The biggest piece of the puzzle for me is an understanding of what I would call pragmatic, productionalizable model building, right? Do not narrow your scope or your interest down to research alone. You need to know how to get models into production so that they can be used by actors in the real world, right?
If you're doing descriptive statistics, that's great. I mean, if that's what you want to do, great. But you're going to cut yourself short if you don't know how to deploy a model. You're going to cut yourself short if you don't know what the process looks like in order to deploy a model. You don't necessarily need to know how to do the engineering, right? You can get an MLOps team for that. But you need to know, as a data scientist, how to do things like parity testing: comparing predictions between the production environment and the model build environment, right? You need to understand what that looks like, what the process looks like, right?
And you need to understand how to actually go the last mile, if you will, right? If you're just building generalized linear models and you're doing that descriptively and you're saying, look, I built a model and it looks great, right? The next question that any executive is going to ask you is, okay, how do we use this? How are we going to deploy this? Are we going to deploy it in SQL? Are we going to containerize it? Are we going to use Scala? Are we going to do something else in order to get that model into PMML, whatever the case may be? How do we get that model into production? And how do we know that that production model is going to score the same way as it did in the model build environment?
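Parity testing, as described above, boils down to scoring the same rows through the build-environment model and the production re-implementation and checking that the predictions agree. A minimal sketch with a hypothetical logistic model; the coefficients and feature names here are invented for illustration:

```python
import math

# Coefficients from the model-build environment (hypothetical values)
INTERCEPT = -1.2
COEFS = {"fleet_size": 0.04, "prior_claims": 0.9}

def score_build(row):
    """Reference scoring function from the model-build environment."""
    z = INTERCEPT + sum(COEFS[k] * row[k] for k in COEFS)
    return 1.0 / (1.0 + math.exp(-z))

def score_production(row):
    """Re-implementation as it might be hand-ported for production
    (e.g., translated to SQL or another runtime)."""
    z = -1.2 + 0.04 * row["fleet_size"] + 0.9 * row["prior_claims"]
    return 1.0 / (1.0 + math.exp(-z))

def parity_test(rows, tol=1e-9):
    """Return (passed, worst_diff): fail if any production score
    drifts from its build-environment score by more than tol."""
    worst = max(abs(score_build(r) - score_production(r)) for r in rows)
    return worst <= tol, worst

sample = [{"fleet_size": 10, "prior_claims": 0},
          {"fleet_size": 3, "prior_claims": 2}]
ok, worst = parity_test(sample)
print(f"parity ok={ok}, max abs diff={worst:.2e}")
```

In practice the production side would be a deployed endpoint or a SQL translation rather than a second Python function, but the check is the same: score a shared sample in both environments and alert on any divergence.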
I think the other one is having a really good understanding and an ability to articulate what the model building process looks like, all the way from finding the data sources through to validating that model and making sure that you don't have an overfit model, and that that model is really going to perform as you say it's going to perform in the real world on data that it hasn't seen before. Those are all really, really important.
The last point that I'll make is, it is critically important for you to show in any interview or to any business manager that you have a high level of curiosity in the business that you're in. Don't sell yourself short and just be a data scientist. You need to be a curious data scientist. If it's an insurance, you need to want to learn insurance. If it's in aeronautics, you need to want to learn aeronautics. If it's in process optimization for, I don't know, making candy bars, you need to really be curious and interested in what the process looks like to make a candy bar, right?
I think that that's really, really important is that data scientists can't just be data scientists. They have to be Renaissance people and constantly curious and autodidactic. That's the best advice that I can give is don't narrow yourself. Don't be an inch wide and a mile deep. Be a mile wide and more than a few inches deep.
Thank you so much, Matt. I know we quickly came to the top of the hour here, but I really appreciated this conversation and you jumping on here and joining us. Absolutely my pleasure. I had a fun time. Thank you all for all the great questions too. I'm going to put Matt's LinkedIn in the chat here if anybody wants to connect, but also I will share the recording, of course, to YouTube and to the Posit Data Science Hangout site. Have a great rest of the day, everybody.
