
Jamie Warner @ Plymouth Rock Assurance | Data Science Hangout

video
Feb 23, 2024
57:33


Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout. I'm Rachel, I lead Customer Marketing at Posit. So excited to have all of you joining us today. If this is your first time joining us, the Hangout is our open space to hear what's going on in the world of data across different industries, chat about data science leadership, and connect with others facing similar things as you.

So we get together every Thursday at the same time, same place. But I always like to say this: if it is your first time joining us, it really is so nice to meet you. And we'd love to have you say hi in the chat so that we can all welcome you in here as well. We're all dedicated to keeping this a friendly and welcoming space for everybody. And we love hearing from you no matter your years of experience, titles, industry, or languages that you work in.

It's also totally okay to just listen in here if you want. And you can be a part of the party that happens in the Zoom chat. You'll see there's lots of resources and comments that will get shared there. But there's also three ways that you can jump in and ask questions or provide your own perspective on certain topics. So you can raise your hand on Zoom, and I'll call on you to jump in. You can put questions in the Zoom chat. And then feel free to put a little star next to it if it's something that you want me to read out loud instead. And then lastly, we do have a Slido link, which Hannah or Tyler will share in the chat in just a second here, where you can ask questions anonymously too.

So thank you to Tyler and Hannah who are helping as my co-hosts in the background and will be helping me collect some of the questions. One other note, if you're watching this as a recording on YouTube at some point in the future and want to join us live, the link to add it to your calendar is going to be in the details below on YouTube. And as always, no rule that anybody has to stay on the whole time or talk. Come and go as it fits your own schedule.

But with all that, I am so excited to be joined by my co-host today, Jamie Warner, Managing Director at Plymouth Rock Assurance. Jamie and I are both usually in the Boston area, but we're both traveling. Jamie, I'd love to have you kick us off by just introducing yourself, sharing a little bit about your role, and also something you like to do outside of work.

Sure. So thank you for that intro, Rachel. So as she mentioned, my name's Jamie Warner. I work currently at Plymouth Rock Assurance, which if you're not in the New England area, you might not have heard of us, but we are an insurer that covers home and auto. And I've been in the insurance space for a while after being in the tech space initially. And I really love it because there's really amazing data science to be done in highly regulated industries. We have a lot of problems to solve. We have a lot of data and we have a lot of excitement around how to get to them. And Plymouth embodies that. We're a great test and learn environment.

I'm very lucky here to get to do a couple of things. One of the smaller things that we're doing is our cloud migration. But in addition to that, I also get to own the translatability and the implementation of our modeling. So how we take those really large-scale, unconstrained, best data science models and make them something that regulators will approve and that will work for us in terms of our pricing, our underwriting, things like that. Something that I do outside of work, I really like to scuba dive. That's actually why I'm down here. We're taking a three-day vacation with my partner so he can scuba dive for his birthday.

Data science in heavily regulated industries

Awesome. Well, have a great trip. I'm glad you're able to join us today too. So I get to kick off some of the questions. But I know something we talked about when we were sharing your hangout is about revolutionizing the way heavily regulated industries understand and adopt data science. And I'm wondering if you could just share a little bit about what that looks like for you today.

Sure. So I think something that data scientists sometimes stay away from is some of these places that, I don't know, are maybe viewed as a little less cool. I think insurance is cool, but I know maybe it's not the hottest topic that everyone on the phone was excited to talk about today. But something that we forget is that insurers have been collecting data for hundreds of years. And actually, if you think about actuaries, if you're not familiar with them, they do a lot of the background work in insurance and the historic data work; they're kind of like the original statisticians and data scientists. They used to do that work by hand, even folks that I work with today. And the industry has really evolved and we have tons and tons of data.

But a lot of times we can't do something like a tech company where we just throw out the old data and say, we're just going to use all the new data coming in because we need that history. Because claims and accidents and life, depending on what area of insurance you're in, you really do need what we call a long tail or like a lot of historic data to predict what's going to happen in the future. And so we have to figure out really creative ways to get that data, to use it, to put it in a format that's usable, and then figure out how we use that to best price or distinguish between different types of folks. And the other piece is that we do have really strict regulations around the way that we do that. So the type of data we can use. And so we have to be really creative in how we pick our data, we pick our techniques, and then how we make the story, which is kind of the most critical piece. I'm sure everyone's heard that tagline, it doesn't matter how good your model is if you can't tell the story of why. And that piece is really critical here. So it's kind of a fun space to be doing that in. And it's a space where there's a lot of opportunity, because historically, it hasn't been done as much.

I'm sure everyone's heard that tagline, it doesn't matter how good your model is if you can't tell the story of why. And that piece is really critical here.

Career transitions and research roles

I see a few questions starting to come in here. And one was from Arsenis, and I see a star there. So I'll read it for you. But the question is, I started out as an economist doing research and moved into being a data scientist, and now yearn to get back into research analysis. How would you recommend approaching the job market to make that change?

Sure. So it would be helpful to know what you mean by research analysis, because I actually think of all of our roles as being research roles. That's something that's, I guess, exciting to me about the type of work that we do is you don't have to be in academia to be constantly doing test and learn research in this space.

Good morning, all. So yeah, Jamie, so my situation was, I was actually an economist for the federal government. I worked at the US Bureau of Labor Statistics, ended up really kind of falling in love with R and data science in general, started building out a lot of tools for them to use and all that, some of which are still in use, which is kind of a little doff of the cap to me, I guess. I'm pretty proud of that. But, you know, I ended up going in a direction that was more tech, that was more like data science, building tools and doing all that kind of stuff and found myself getting further and further away from the research and analysis stuff that I was doing. I was publishing papers before doing research on macroeconomic trends and some other things. Also, consumer trends and things of that nature. And I kind of missed that research, research-y type of feel where I'm asking questions. You know, I always wanted to use data to ask and answer questions rather than to build tools, which is also a really important thing. But it's not, kind of, you know, my, it's not really my bag.

Yeah, and that's helpful context. So, one of the things I would say is to ask yourself, number one, if your job has to fill all of that. I personally also love the research component. And one of the things about insurance and finance is we're not always using the most relevant techniques or the newest techniques in our work. And so, I have to stay abreast of that somehow. We didn't talk about this in my intro, but I actually have a side gig, and that's teaching at Northeastern. I teach data science courses there at the graduate level, and I also do research there. I have active research grants that I work on with them as a side thing. And that's what keeps me up to date. It kind of tickles that research fancy, because especially if you're working in a corporate environment, you're usually not going to be publishing papers about the work you're doing, because your company wants to keep that secret. That's your whole advantage.

So, for me at least, one of the things that I did to kind of tickle that was to do work as an adjunct through a university or to help out on certain grants. And I won't be the lead author or anything like that, but I'll be a side author. I'll get to do some of the analysis. I'll get to get in on some of those conversations. And I'll get to see the techniques that I could be applying in our space right as they're coming out. So, for me, that's one way to do that. I know that might be a little bit higher barrier to entry, but as I think more deeply about your question, I would also consider what roles within a company look like, because there are roles that are very, very technical, and then there are data science roles that are more about explaining things. And you might find that you enjoy the roles that shift over toward the product side, where you're really explaining and looking at the impact of the data and models, not just, you know, building them.

Digital transformation and change management

Yeah, I see, Alan, you had a question in the chat. Want to jump in here next?

Yeah. Hi, everybody. Hi, Jamie. You mentioned really, really briefly in your intro that you're involved with digital transformation. And I'd love to hear more about how your team and your team's needs fit into that bigger effort. I'm curious how you fit into that effort and how you find your needs to like fit well or influence the broader thing. My experience is sometimes there's some tension and it's challenging to figure out how to balance all of those needs of a team versus a whole enterprise. And so just really curious about your experience there.

Yeah, and I saw another question in the chat about my background and training. Something that I learned going into those types of transformation projects, and I did a similar one previously when I was at Lincoln, is that you go from being a data scientist or a data person to being what I would call half psychologist, half salesperson, where it's more about negotiating the change management and the emotions around the change than it is about the actual data. And I would say one of the things that is really helpful around that is that frequently IT is trying to make a change that's an important change, and they need business leaders to help them drive it and find the use cases.

And frequently they have thought of some use cases, but maybe they aren't spot on just the same way as data scientists. Sometimes we'll think of a model and then the business will be like, that doesn't make any sense for this reason. The same thing is happening with IT. So I think the general experience tends to be that IT is trying to push this transformation or this move to cloud or this new technology, and they don't really know how it applies. And so I really love to grab onto that and be their biggest advocate. And then they become my biggest advocate. And that's really kind of how we drive the digital change forward.

I would say, yes, there's a huge distinction between what a team needs and what the business needs. I tend to like to set the standards, because then other people will just adopt them. It's easier to adopt someone else's standard than it is to make your own. So I love the ability to go first and make a best practice and then have other people just adopt it, rather than having to argue that an existing practice isn't the best one and try to switch it over. So I love to be a front runner.

Yeah. And I think it's really about making other people feel good at the end of the day, like rewarding IT, showing them business value, and then they're able to go to their managers and show that. And then they're super loyal to you when you do that.

To add on to that answer as well, I'm curious, when you were interviewing with organizations, what kinds of questions did you ask to figure out if you could be a driver in that organization?

That's a great question, especially in insurance. Some of it is just knowing the industry and asking different people that work there how they approach their work. I think some of the questions you could ask are about their technology suite. How much they work with their IT partners is usually a really big indicator. And asking, for example, when I interviewed with Plymouth, I asked, hey, could I talk to the head of IT? Can I have a conversation with them, or specifically with someone that works in this space? And I really was able to get their take. They weren't part of my interview panel. So asking sometimes for those additional resources is really helpful to get the full vision. And if there's a lot of disconnect between the business and IT, sometimes that's a sign in and of itself.

Domain expertise and business knowledge

Well, thank you, everyone. Good morning, Jamie. I'm really intrigued with the fact that you're working now in a highly regulated industry. My question deals with how did you obtain the domain expertise to complement your expertise as a data scientist? How long did that take? And when did you feel like you were an expert in the domain?

No, no, it's a great question. And I would say I have a little bit of my tech background to thank for that. So I started my career at Forrester Research, which is a market research firm. Going into that type of work, we would always pay a lot of attention to having to upskill really quickly on the needs of our customers. And so when I moved into insurance, I was building models initially for underwriters, the people that pick the initial price for the policy, and I had no knowledge of what underwriting was. So I actually did go get certified as an underwriter through a process called the CPCU, the Chartered Property Casualty Underwriter certification. And that was really beneficial, not just because it gave me a basic knowledge of what underwriters do; it was also helpful because I could go to their conferences, and you get no better feedback than when you are at a conference with underwriters and you announce that you are a data scientist building the models that they have to deal with every day. And it's not the kind of feedback you get formally in the business environment. It's the kind of feedback where I hate this and this and this, and you get a lot of information.

And I also understood all of the data in insurance. Something I love about it is my background is in survey data, and all data in insurance is survey data: a claims adjuster put it in, you put it in. Think about the last time you answered one of those questionnaires; you kind of fudged it a little bit, maybe. And so a lot of what we're doing when we clean the data requires us to understand the business expertise. I also do a lot of trying to mix across the business. So I love job shadowing; shadowing a claims adjuster, shadowing various folks across the business, will really help. And then the final piece is, I think, as a data scientist, you're used to being kind of like the smartest person in the room who knows everything. When you go into a business environment like this, it's important to lean into being the dumbest person in the room. And understand that, especially with historic processes, and this was a big one for me, when I first got in there, I was like, well, this is a stupid way to do it. Why do they do it that way? And as I started to ask questions, it came out that maybe there was a regulatory piece that was why they were doing it that way.

Or for example, when I was at Lincoln, we had one thing where we were sending out these really expensive mailings. And I was like, why don't we just email this? It's so much faster, and we actually know if they get it back. And it was a regulatory constraint; there was a regulatory requirement around a physical piece of paper. And so that means that you need to be able to partner across the business to get over that constraint and get the adoption you need. So really asking those questions and coming in with curiosity, rather than kind of an I-know-better attitude, I think helps a lot. I will never know as much as our product folks. So really working with them too.

In your university hat, your adjunct hat, do you mentor your students to take similar approaches? I really commend your insight in obtaining that certification you mentioned. And things like that, are you advising your students to do similar things in their careers?

Yeah, so from my perspective, at the end of the day, the data is the most important thing, right? You can use any model, but if you don't understand your data, there are probably weird biases and distributions that you never understood. And you understand that by understanding the subject matter, right? So by subject matter expertise. So I'm a huge proponent of that, and I really like it when students dig in there. And I think there were some questions earlier in the chat about getting a job in different places. When I post a data scientist role, I immediately get maybe 500 LinkedIn messages and 1,000 applications for that role. I cannot differentiate between all those people. But what I can differentiate is if a product analyst that I know comes up to me and says, hey, I connected with so-and-so and they, you know, are doing really interesting stuff. So a way to get great connections is to help out folks in those spaces, partner with them, and get to know them.

Code sharing, tools, and open source

But the question is, can you discuss how you and your team share code and models with the rest of your organization? I'm especially interested in whether or not your organization uses R with other tools like Python and how that affects everyone's workflows.

Yeah, it's a great question. We use anything and everything. I'm a big the-tool-fits-the-job kind of person. The biggest thing around that is I am, I would say, a jerk about code headers and code documentation. We typically use Git for a lot of the sharing because that's really easy; if people aren't familiar, it's just kind of like a SharePoint for code. And it tracks version control, which is a really big thing for me, to be able to have one version of the truth.

But with code sharing, and I'm not sure if your question is just about code sharing or about sharing the output and information, we tend to use a lot of comments. Because if you're a great Python coder and you want to look at someone's R code, if it has great comments, it doesn't matter that you don't know R as well; you can read it. Same thing with SAS; we have a lot of legacy SAS code. And we're definitely transitioning code between different languages, especially as we move to the cloud. I mentioned earlier, we're doing a big cloud transformation, and that means we have to upskill in new languages, like from Python to PySpark. So there's a lot of different code happening. And to me, it all comes down to the documentation in the code.
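As a rough illustration of the kind of header and comments described here, below is a minimal Python sketch; the project fields, file names, and columns are hypothetical, not Plymouth Rock's actual template.

```python
# ------------------------------------------------------------------
# Project : Auto pricing data prep (hypothetical example)
# Author  : J. Smith              Created : 2024-02-01
# Purpose : Join prior-term claims to policy features and write the
#           modeling table used downstream.
# Inputs  : claims.parquet, policies.parquet
# Output  : modeling_table.parquet
# Notes   : Mirrors the logic of a legacy SAS prep job; see Git
#           history for earlier versions.
# ------------------------------------------------------------------
import pandas as pd

claims = pd.read_parquet("claims.parquet")        # one row per claim
policies = pd.read_parquet("policies.parquet")    # one row per policy term

# Left join keeps every policy term, including those with no claims.
modeling_table = policies.merge(claims, on="policy_id", how="left")
modeling_table.to_parquet("modeling_table.parquet")
```

With a header like this, a reader who mainly works in another language can still follow the intent, which is the point being made about comments carrying the code across languages.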

I know Bill actually had a question earlier that touches on what you just said, since you brought up SAS. Bill asked, what role have Excel and SAS played, or what role do they play currently in your company? And is there resistance to open source tools?

Yeah, I would say that the resistance comes from a couple of places, right? So Excel is what actuaries use. Actuaries love Excel. But they're also really smart, really technical people with statistical savvy, so you can upskill them if you give them the space to be upskilled. And I would say with SAS, and especially all these legacy languages, the first piece is, are the people you're working with able to understand the new code you're writing? And is there a fear component? A lot of times when there's a lot of resistance, it's because: I've been doing this job for 30 years, I know exactly how to do it in SAS, SAS has great customer support, and you're expecting me to download this tool that I don't even know how to download. And where does the notebook live? I don't understand it. So a lot of that, to me, is really proactively being aggressive with training to make people feel comfortable.

Because if you just give it to people and expect them to go back to their desks, like ideally, data scientists have a lot of curiosity and they would just go and break things. But people get nervous, especially if you're known for being really good in your space; suddenly not knowing a new code platform can be really scary. I would say also, you know, some of the stuff that SAS has innate in it is really helpful for something like an insurer, because the packages aren't open source, which means they have to be tested really extensively before they can get pushed out. And that's a really big risk when you move to open source code. Going back to that documentation piece, we have to have really good documentation, like YAML files and things like that, around what environment we were using at the time we built this code, because when we go back to refresh it, we want to still be able to run it. That's one of the risks of open source, so it does require a lot more analysis of the packages. Your SAS code, if you go back and run it 20 years later, it will still run, I guarantee you, versus your Python code, you run it 20 years later, it's not going to work; all your packages are going to have dependency issues, and same thing with R.
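One lightweight way to capture that environment record, sketched here in Python purely as an illustration; the output file name is made up, and many teams would instead use renv, conda, or a requirements file.

```python
# Snapshot the runtime environment alongside a model artifact so the exact
# package versions can be reviewed or rebuilt when the model is refreshed.
import json
import sys
from importlib.metadata import distributions

snapshot = {
    "python": sys.version,
    "packages": sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    ),
}

with open("model_environment.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```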

So really being aware of that. And picking some of the packages that are really more comfortable to use. But really, yeah, I'd say it's that training education piece. And there is part of it is just there's a lot of legacy code that runs and nobody quite knows who built it. And nobody quite knows why it runs, but it does. So breaking that apart takes a lot more effort. And it doesn't show a measurable outcome, necessarily. Like, if you fix it all and put it in a new technology, the code still just runs. So you do have to kind of balance those things.

Cloud migration and data governance

And the question was around what has been most challenging about using cloud, or maybe I'll add about the migration to the cloud for an insurance company.

Yeah, I think it was about the insurance data. All that kind of stuff is regulated and related to personal data, you know, very sensitive data. So I was wondering how you guys deal with that data, or what the challenge is for you. If you build a model from beginning to end, everything in the cloud, how do you deal with that and make sure you don't get, you know, an issue with the regulation or other stuff?

Yeah. So insurance, and most of the regulated industries, are special in that we use a lot of PII, which is personally identifiable information. And sometimes we have health information, we have Social Security numbers, we have all sorts of things like that. There's a couple of pieces to that equation. The first is making sure you're working with IT and you have really good data governance, because you can mask a lot of those fields you don't need; you can create IDs to replace things like Social Security numbers, and then have them crosswalk back later if you need them. So there are things like that you don't actually need to have in the cloud environment that everyone has access to. And again, I always say it goes back to the data: if you have good data standards around your governance, you can take out a lot of the information that you don't need.
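A minimal sketch of that surrogate-ID-plus-crosswalk pattern, assuming a pandas workflow and a keyed hash; the column names, key handling, and storage details are illustrative, not the actual governance setup described in the hangout.

```python
# Replace a sensitive identifier with a surrogate ID before the data moves to
# the shared cloud environment; the crosswalk stays in a restricted location.
import hashlib
import hmac

import pandas as pd

SECRET_KEY = b"kept-in-a-restricted-secrets-manager"  # illustrative only


def surrogate_id(value: str) -> str:
    """Deterministic keyed hash, so the same SSN always maps to the same ID."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


policies = pd.DataFrame({"ssn": ["123-45-6789"], "premium": [1200]})
policies["member_id"] = policies["ssn"].map(surrogate_id)

# The crosswalk sits behind tighter permissions; the shared table drops the raw field.
crosswalk = policies[["ssn", "member_id"]]
shared = policies.drop(columns=["ssn"])
```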

Now you still typically do have a lot of information that is private. So we do have a lot of different compliance pieces, but one of the things I like to remind people of is that just because you have it on local doesn't mean it's secure. And I think people tend to forget, like, yeah, it's on John's machine over there and Sam's machine over there, and it's getting sent by email; you know, that isn't necessarily more secure than putting it in an AWS or Google Cloud or Azure environment. In those environments, we can really restrict security and add permissions and know who's touched data when. And so a lot of those things are how I make the case around it. But it is definitely a big step for people. And I think comparing it back to the current state and really making people aware of what the current state is does help.

To actually follow on to that too, I know there's something that came up in a hangout with Biogen on the data governance or data stewardship piece. How do you, I guess, come up with best practices for how people handle work on their desktop or the ways that they're handling data?

Yeah, it's a really good question. I was lucky enough to initially grow up in a very strict data culture in my first role at Liberty Mutual, where we had really high standards. And so for me, a lot of it has been repeating those standards over and over and over again. But I also go back to, actually, in the insurance industry, one of the things that's really stirring things up right now is that the National Association of Insurance Commissioners just released their letter on AI, how they're thinking about AI, how they're thinking about data. And we really always want to be following a lot of those standards. And I think that goes back a lot to knowing what's in your data, knowing what your different fields are, and knowing how much you actually need them. Even if you have a data dictionary, do you understand which fields are sensitive and which fields aren't? Do you have basic protocols to not email your data?

And just a little bit of awareness, because I think, especially coming out of school, and I hire a lot of folks that are PhDs, they're used to data that's already cleaned or anonymized, or can be emailed around between different groups, or can be uploaded to a Google share or whatnot. And our data really doesn't act like that. We have to be a lot more strict. So a lot of it is creating that protocol around, this is the process, this is how we follow it. If you're doing work on your machine, are you saving it back? Are you, you know, doing a pull and a push and making sure you're saving it back every day? That is a big piece of it, because we do a lot of testing locally before we run something up. So that's really pushes and pulls. And I've definitely caught people on that before. I think once you make the mistake of not doing it, and you lose something, you never make that mistake again.

Thank you. I know there was just an anonymous question that came in while you're answering that too, which was, how do you keep track of which fields are sensitive? So PII or PHI, how do you ensure those fields are not shared outside of having documentation in place?

I think that definitely depends on the scale of your organization. If you're a larger organization, you might have a tool like Snowflake or some other sort of software where it's really easy to tag things and label them. With a lot of insurance data, a lot of it's historic files that have been stored somewhere, or some of it's uploaded at specific times. So I think it's really about documentation and then masking a lot of things as you push them out. You do it at the first step rather than at every model step, and then have people come back and request fields if they need them. So instead of generally having one modeling data set with everything in it that everyone can work from, the set everyone works from has already been transformed, and the sensitive pieces are kept hidden. Then if people need those fields, they can come to you, rather than leaving the fields in by default.
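A rough sketch of that "mask at the first step" idea, assuming field sensitivity is recorded in a simple data dictionary; the tags, columns, and file names here are made up for illustration.

```python
import pandas as pd

# Hypothetical data dictionary recording which fields are sensitive by default.
FIELD_TAGS = {
    "policy_id": "public",
    "zip_code": "internal",
    "date_of_birth": "sensitive",
    "ssn": "sensitive",
}

raw = pd.read_parquet("raw_policies.parquet")

# The default modeling set drops everything tagged sensitive up front;
# analysts request those fields separately if a use case requires them.
default_columns = [
    col for col, tag in FIELD_TAGS.items() if tag != "sensitive" and col in raw.columns
]
raw[default_columns].to_parquet("modeling_set.parquet")
```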

So sensitive fields we tend to remove. It's also really important in highly regulated industries to know there are some fields you can never have, even if a vendor tries to send them to you, and those include all the protected class information. So just being really aware of that. And if there's an issue, letting people know that we're in an environment where they can come and tell you they've made a mistake, because the worst thing is to pull that data over and then keep having it rather than flag it. That culture of, hey, I can raise my hand and let you know I accidentally had this data open, is really important.

But Jeb said, I'm also in a heavily regulated industry and the pace of tech change is wildly different at different layers of the business. One side is bleeding edge and pushing the edge to new places. The other might still need to keep an AS400 running because that's the one that's approved. Do you need to navigate this kind of dynamic in insurance and how do you do so?

Yeah, it's absolutely a dynamic that I'm navigating. And I like to be the first mover and then prove that we've got a business case, that we can save money. If I'm in the slower group, the first thing I do is look at the company goals for the year, or the organizational goals for the three-year period, whatever they are. And then I try to think about whether there are any projects that we're working on that would make a material impact on those goals. Usually it's a cost savings or a revenue generation item. And then I use that project to try to justify why I need new tech. That's the only way I've ever seen it work: really focusing on that piece of, this tech is going to drive this ROI, and it's going to cost us money, but it's also going to save us this much money, and it ties to your objectives up there. That's really how we make noise on that.

And then when your part of the company is doing that and seeing success, you're able to say, hey, why haven't they adopted it? But I tend to think of it as a rising tide lifts all boats, right? So if you're moving up and you're getting new technology, being the people that upskill and train and help the other departments is great. If they don't want to come along for the ride though, they don't have to, and you don't need to drag them. I used to try to do that earlier in my career, and dragging them can really drive you down and slow your progress. So make it accessible to them, but don't put in too much time if they're really resistant and they're not really tied to your business objectives.

Hiring, onboarding, and workforce analytics

Yeah, so I noticed, Jamie, that you taught a class around human resources. I think I was looking at your LinkedIn page, and I'm curious if you've ever done any modeling or work around capacity planning, either the demand side or the supply side.

Yeah, and funny short story on that. I actually do teach for the HR program at Northeastern because when they realized that HR professionals would need an analytics course, they also realized that none of the professors wanted to teach a math course. So I moved over there to teach that course, and that's how I ended up, and I actually do a lot of my research is in that space and the adoption of tools in that space. So when we think about the operational piece though, that's a big piece of modeling that I've done historically, and that's across like things like claims and call centers, and also groups like underwriting and staff planning, especially around sales.

Well, one thing that we're starting to look at in my office is around resource capacity planning. So the idea that managers sort of submit something early on that says this person's going to be spending this many hours on a project and then trying to reconcile that with what that resource actually does submit and sort of plan ahead to try to line up those expectations about where people are going to be, which projects people are going to be spending their time on.

Yeah, that's definitely a fun one, and one thing I would caution is to make sure that your metrics have kind of two levels. What I mean by that is everyone who works in any environment is great at optimizing the benefit to them. In a call center, for example, if you optimize the number of calls taken, people will actually just start hanging up on customers to get more calls taken, and if you optimize on a quality metric, people will stay on the phone as long as possible and rack up as much quality as possible. So having kind of a marriage of the two: if I think about your time tracking, I might also think about a project success metric or something else that you can track to really balance that, so that you're not penalizing people that are spending a little extra time but making a lot more impact versus other folks.
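As a toy illustration of marrying a volume metric to a quality metric so neither can be gamed alone; the cap, weights, and metric names below are invented, not anything described in the hangout.

```python
# Combine a volume metric and a quality metric so neither can be gamed alone.
# A geometric mean collapses toward zero if either component is very low.
def balanced_score(calls_per_hour: float, quality: float, max_calls: float = 12.0) -> float:
    volume = min(calls_per_hour / max_calls, 1.0)   # normalize volume to 0-1
    return (volume * quality) ** 0.5

print(balanced_score(calls_per_hour=11, quality=0.55))  # fast but low quality -> ~0.71
print(balanced_score(calls_per_hour=7, quality=0.95))   # slower, high quality -> ~0.74
```

In this toy setup the slower agent with much higher quality scores better than the fast agent with weak quality, which is the balancing effect being described.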

Also, you said that you, you know, have done work on that and that you've published some papers. I don't know if it was directly on that, but where do you publish your work?

That's a great question. Some of my work I publish in insurance journals. I'm working on two grants right now at Northeastern, and we're going to publish probably in a number of journals, and then we'll probably also do some releases. One of those grants is with the Koch Foundation and one is with the Walmart Foundation, so they'll both probably also do a variety of publications in partnership with various consulting groups or businesses as those evolve.

Okay, Rick, I see you asked a question with the star, so I'll read it, but Jamie, is the digital transformation initiative company-wide or is the effort restricted to the data science function? Also, to what degree are future-ish data science considerations considered in this roadmap?

Sure, so the initiative is technically company-wide, but I am front-running my area in it. The company-wide initiative looks like it will probably take another few months to even get an idea of what tool suite they're going to go with. They're doing a lot of analysis, they're doing a lot of thinking, so I offered to be a test guinea pig and said, what if we just front-run you guys? We'll go in there, we'll test and learn, we'll use everything. So it is technically a global initiative coming, but within our initiative, we'll be alone for probably the first couple of years.

In terms of the broader piece, I think one of the things that's really powerful for getting leadership on board with the level of spend you have to use to migrate into something like a cloud environment is that new-age stuff, right? They love the buzz, the AI, the gen AI, the future of whatever. And if you think about something like, for example, home insurance, today we might use a lot of features that are point in time, like the square footage of your house. But if you think about the devices you have in your home today, in the future we might want to be able to take actions based on real-time data, and we can't do that with the infrastructure we have today. So a lot of it is around justifying that we need to start building now if we want to be able to do that in five or six years, and that's the unfortunate truth. Even if we wanted to do it in a decade, we'd probably have to start building now. That case resonates a little bit more with the long-term-planning executive suite than the case about today.

Someone had sent me a direct question around what you look for when hiring, so somebody just getting out of school, but I also wanted to add into this because I saw that you were hiring a data engineer recently, and so I thought it would also be good to chat a little bit about that and just understand what are you looking for in that role?

Sure, so first kind of in general hiring, the number one thing I look for is curiosity. I think that you can learn, if you already know a programming language, you can learn any other language if you're curious, but I cannot necessarily teach you to be the kind of person that sees a weird pattern in the data and wants to know more. I think that that's kind of like you're either that kind of person or you're like, oh man, there's a weird pattern. I wish it would go away, and so that's really what I look for when I hire, especially for data scientists, is the drive to understand the why of what they're seeing and that test and learn kind of interest.

Sure, so first kind of in general hiring, the number one thing I look for is curiosity. I think that you can learn, if you already know a programming language, you can learn any other language if you're curious, but I cannot necessarily teach you to be the kind of person that sees a weird pattern in the data and wants to know more.

When I think about the data engineer role that we're hiring for, I'm really excited about it, because we have historically never had any data engineer roles. And this is a really common thing at lots of companies, because they hire a bunch of data scientists and they're like, aren't you data scientists? Why can't you ingest and prep the data as well as build the models? I like to think of data scientists as the people that could play a data engineer on TV. They know enough to be dangerous and also to make really badly unoptimized pipelines. It's a totally different skill set, especially as we move into some of these cloud technologies. There's so much really cool, interesting optimization to be done, and you need to understand what to ask for to do that, especially with your IT partners. You need to understand what the capabilities are, because you won't always have that control; in Excel or in SAS, you would have almost total control of everything that was happening. In an environment like AWS, you really have to rely on your infrastructure people, and they don't necessarily know what's best for things like a data pipeline. So that is really what that role is about: rethinking the way we ingest our data and process it, especially so that we can do it in a way that's automated. There are so many good new tools now to look at things like flagging what you're missing, identifying imputation opportunities, and then actually doing it and keeping a great record of it in the system.
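A small sketch of what flagging missing values, imputing them, and keeping a record could look like in pandas; the columns, file names, and fill rules are hypothetical.

```python
import pandas as pd

df = pd.read_parquet("policy_features.parquet")

# Profile missingness first so the gaps are visible before any modeling.
missing_report = df.isna().mean().sort_values(ascending=False)
print(missing_report.head())

# Impute a couple of fields and log exactly what was done, so the pipeline
# keeps an auditable record of every imputation.
imputation_log = []
for column, strategy in [("annual_mileage", "median"), ("roof_age", "median")]:
    n_missing = int(df[column].isna().sum())
    if n_missing:
        fill_value = df[column].median()
        df[column] = df[column].fillna(fill_value)
        imputation_log.append(
            {"column": column, "strategy": strategy,
             "fill_value": float(fill_value), "rows_filled": n_missing}
        )

pd.DataFrame(imputation_log).to_csv("imputation_log.csv", index=False)
```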

Sure, thanks. Just following up on the hiring questions, I was wondering what are the things that you do to onboard your new data scientists? I work in insurance as well, and I know when people come in, they're sort of a little shocked at the amount of data that we've got and all of the different topics that everything encompasses, compared to a lot of different industries. What does that look like for you?

Absolutely. It's a great question. The number one thing, and I go back to this, is that having really good historic documentation and holding my team to a certain standard makes onboarding very easy, because we have the documentation. Now, for example, I moved to Plymouth about six months ago, so I don't have that legacy of me being aggressive on documentation historically. So, a couple of things. Going back to one of the earlier questions, I really think business knowledge is important. One of the first things I have data science folks do is shadow a bunch of the key business users that create the data that they're using. For example, if they're working in an agency space, I have them go watch the agents input whatever they're inputting. Same thing with a claims adjuster or an underwriter. I'll have them do a shadow in the first couple of weeks, understand that person's role, understand the intricacies of the data. So then when they get in the data, they recognize, oh, here's what I'm looking at, what I'm trying to see.

I'll always get them a buddy on the product side or someone somewhere else in the business. Those people usually love to hear more about data science, and the new hire gets the added benefit of learning more and having someone they can go to really quickly for business-side questions. And then just documentation, documentation, documentation. I usually like to give them a tiny project that uses the data we like to use, but in a way where they'll find some issues and errors that I already know about, so I can see how they treat them and get a feel for them too, right? When I hire someone, all I know about them is what they've told me. I don't actually know their quality, what their gaps are, what they're good at. So, giving a little project where I roughly know where the outcome should fall can really help me identify, okay, here's where you're good, here are the growth areas I need from you.

I really love the point about shadowing, and I see Libby had commented about literally begging to shadow some users, but being told no. Did you have to go through anything internally to set that up, or how does that work?

Yeah, it definitely is. I know when I first started in insurance, the first time I asked to shadow an underwriter, my manager looked at me like I was insane. They were like, no, we don't let the data people near the underwriters. And I think that's really a conversation the data science leaders that you have should be having with their peers. I think a lot of times the reason we don't get adoption is because people don't think that we understand the work that they do. And showing that curiosity helps; I have almost never had a peer across the business turn me down when I say, I really want my people to understand this better because I want them to be able to make the best outcome for you, and I feel like we don't understand what you do effectively. They're always going to agree that we don't understand what they do. So, that's a much better way to do it.

But again, if you can't get that from inside your company, I would totally recommend going to the professional networks. There are networks of claims adjusters; there's the underwriting network that I'm a part of. They have all sorts of different meetups and trainings and things like that. And you can go and meet, you know, underwriters from some other company if the ones from your company won't talk to you. That does take a little bit of a proactive push and a little bit of research. But that piece, if you can't get it internally, I would definitely push to get it externally. And if your manager is not pushing for it, I would consider whether they have the right vision. Sometimes data science managers can be very, very technical and not as good at this particular area. And that's when it's really helpful to have a buddy on the product team, where maybe their manager can go have that conversation.

Data de-identification and unstructured data

Yeah, I was just curious; I'm doing a little market research. I've got a lot of background in data de-identification, particularly with PHI, and I'm wondering what types of things you'd like to see, particularly with LLMs. There's a lot of background in working with structured data, but de-identification when it comes to unstructured data, if you follow a lot of the buzz about vector databases and things like that, is kind of the holy grail. Everyone's looking for some type of universal, solve-everything de-identification solution.

Yeah. So, I think it's helpful to tell a story. If you guys remember, Obama had a chief data scientist, DJ Patil. And when he left his role, he initially decided he was going to tackle the problem of data in insurance. He was going to create some sort of system to handle all the data that we have, really streamline it, get it all in a single source, and really improve outcomes that way. And then he gave a talk maybe a year later and he said, you know, we got there and we realized all the data is in PDFs, and maybe this isn't what we want to tackle. And you've similarly seen Amazon recently make a foray into insurance; they shut it down a couple of months later. They just shut down the UK operation. Google, same thing. The data problem in these highly regulated industries is enormous, and it's very, very challenging. You're thinking of like the fourth stage, and the first stage hasn't even happened. The aggregation, effectively aggregating and combining data across different sources, hasn't even happened yet. There are a couple of startups starting to do this in the health space, where they aggregate your health data across all your various technologies, like your watch and whatever. But if you think about insurance, we collect data from the motor vehicle records. We collect data from your health, you know, different types of doctors and providers. We collect data from all these places where the data is actually really bad, and we don't have good system linkages; think about trying to get your hospital records and what that data looks like. And so to me, the first problem to be solved is not even masking it, it's getting it and putting it in a format that's actually usable and consistent from all these different places.

And we've seen some providers do that in very specific spaces, like prescription data. There's a provider that has worked to like aggregate all of that. Or like motor vehicle records, there's a provider that has worked to aggregate, but those pieces to me have to be answered. And when you answer those pieces well, then when you start doing these giant kind of other types of models and structures, you actually have the foundation to be able to tag, flag, whatever, versus trying to start from like step four.

I was just going to give some background. I've just started a new gig, but for the last several years I've been at a healthcare company where that is what we were doing: taking claims, EHR, practice management data, pharmacy data, et cetera, and building a common data model. There are tools like OMOP and others, you know, that are used on the healthcare side; I can't necessarily speak for all of insurance. But the next dilemma then is, now you've got this common model. The holy grail you're trying to get to is a large data set that can be used to build predictive modeling and get towards precision medicine. Instead of having all the patients in one hospital, you could have every breast cancer patient in America in a data set, and you could get really, really fine-tuned analysis, and that's the holy grail. But to get there, once you get this big data repository, you find out, okay, we can handle the structured data, we can do de-identification on that. Now comes the unstructured: all the doctor's notes, all the other lab results, the PDFs. As you said, everything's in a PDF and they all have proprietary mechanisms to try to decode them. And there's work being done there, but then once you get to that, it's like, okay, how do we de-identify this and how do we do it reliably?

Yeah. And there are some really cool startups doing work in the space of pulling data off PDFs and even doctors' terrible handwriting, really cool LLM and gen AI work on that, some of them local to Boston. So that is really exciting to me, because five years ago, nobody was doing that.

Data engineering cross-training and career paths

Yeah, I guess it's more kind of a general question of, have you ever heard of a data scientist cross-training as a data engineer?

Yeah. Oh my gosh. A hundred percent. I have people going back and forth in both directions. I think there should be cross-training, period. Back when I went back to school to get my master's, a bunch of the courses I took at that point were in database infrastructure, which is less relevant now, but being able to understand the database infrastructure really helps with implementation. And if you can ask the right questions and use the same language as IT, that's super helpful. Those skills are so useful back and forth. And I could definitely see cross-training, especially because for the last decade or so, the data scientist title has been viewed as sexier than the data engineering title for no particular reason. They're both critical to the process. And so I think people would be very welcoming to bring you the other direction. We usually have the opposite problem, where you hire a data engineer and they only want to be a data scientist. So yeah, I would welcome someone going the other way.

So in terms of getting that training or knowledge, would you recommend just talking to the subject matter experts in your company or are there any independent resources that you know of to get started?

Yeah. I mean, I would start with kind of informational. The way I always start whenever anyone thinks about any development across job areas is some informational interviews with people doing that work, asking them what tools they use day-to-day, what is their job really like day-to-day. And I would do it across a few different companies. I would ask the data engineers at your company, but also really think about where it is you want to land. Because even if I think about data science, people ask me, how do I become a data scientist? I get that question all the time. And my response is, well, is it an insurance? Is it in tech? Is it in... Because in some of those spaces, you're going to want a PhD and a specialization. In some of those spaces, you're going to want... You can come right out of undergrad and do XYZ and you'll be working with certain types of techniques. So I would say data engineer means so many different things that I would really start to hone in first on what's your industry space and what does data engineer mean to you. And then where are those types of roles and how do I start working with that peer group? And then there's really great volunteer organizations. An example would be Code Across America, where you can come in and work on little projects. And there'll be other data engineers in the group where you can learn from them informally.

Cloud tools and closing advice

Jamie, there was an anonymous question I missed earlier, which was about migration to the cloud. It just asked if you could share a little bit about the tools that you're using for that.

So we happen to be an AWS infrastructure company. No pros or cons there. I don't think I can necessarily share the proprietary tool set we use inside that. But I would say if you do decide to use AWS versus Google Cloud versus Azure, just be aware that AWS is like Legos. There's a bunch of pieces everywhere and you have to put them together the way you want them, and if you don't know how to do that, it can be really hard. So I would say we have lots of different Legos, and then you can also hook in other types of tools. If you go with something more like Azure, a lot of it's more pre-built, and it's a little bit more predefined in Google Cloud. But yeah, we're using AWS as part of our tech stack, and then we have a bunch of other tools stacked in there.

Well, thank you. As we get to the end here, I would love to ask you if you could share a piece of career advice that kind of sticks out to you in your mind, or maybe it's something you received or that you've given to someone on your team.

Sure. So my biggest piece of career advice would be to lean in to being the dumbest person in the room rather than focusing on being the smartest person in the room. And really just keep asking why. Because a lot of times a legacy process or a model or whatever, if someone can't explain it to you, that means that there's probably issues that you can find in it that you can fix.