The skydiver to data scientist pipeline | Kevin Dalton | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heron, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern time. If you're not joining us live, you're missing out on the amazing chat that goes on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience. Can't wait to see you there.
Hey everybody, happy Thursday and welcome to the hangout. I'm filling in for Libby today, if you can't tell. If we haven't had a chance to meet yet, I'm Rachel, I lead customer marketing at Posit, and I'm so excited to bring in our featured leader today, Kevin Dalton, senior data scientist at Great American Insurance Group. Kevin, would you introduce yourself and maybe share a little bit about the work that you do today, as well as something you do for fun?
Sure, absolutely. So welcome everybody, excited to be here at the hangout; I've been a big fan for a lot of years. My name is Kevin, I'm an insurance data scientist. So what does that mean in practice? It means I do a lot of work with actuaries; insurance companies like to consider themselves really old-school data science, we've been doing this for a long time. My background was originally in economics. I was trained as an economist, financial economics. I worked in the insurance industry for many years, then took some time off to become a professional skydiver. So I worked as a skydiver for about 15 years, which is pretty fun, though not as glamorous or as well paid as you might imagine. And now I've been back working in insurance data science focused areas for about six years. So that's a brief history of me.
I work for Great American Insurance Group. We're a very large American insurer, and I work in the predictive analytics group. We're what's generally referred to as a corporate resource: we're a very decentralized company, so underwriting groups can come to us for their data science and analytics needs. We have several different groups, one of which builds what would be considered traditional insurance actuarial models; that's our bread and butter. How do we price? How do we segment? How do we market? Those kinds of things. But we also have, I think, some of the more exciting things now, which are computer vision, natural language processing, and agent models in the AI sense as well. So that's what I do day to day. I'm an individual contributor, so about half of my time is spent on theory and half on implementing it. I'm also big into MLOps now. So hopefully that wasn't too much or too fast.
Oh, that's great. I think that's the first professional skydiver we've had. Are there any other professional skydivers in the chat? I think some people maybe want to go skydiving with you.
It's a good time. Oh, what do I like to do for fun? I actually am a big kind of nerd about programming and stuff like that. So I've been writing a lot of code recently, but I have little kids and I like to go out and be outdoors with them. So I live in Boulder, which if you know anything about Boulder, you know, it's all, everyone's outdoors with their dogs all the time. So it's fun.
Love it. Well, Kevin, I had the privilege to meet you a few months ago at the posit::conf registration desk. And I just want to say thank you for your kind words about the hangout. I was looking back at my notes on my phone, because I can't remember anything from conf, and I was like, get Kevin on as a featured leader! Lots of exclamation marks. So thank you for being here.
No problem. Yeah, I'm a big fan of RStudio, and of Posit now. I told Connor this: I remember I was sitting there when they announced it, and I was like, this is so smart, I'm so glad they're doing this. I think you guys have done amazing work for the community. I've been involved with the R community for a long time, and you guys are doing the work that needs to be done. I really appreciate it.
Little P and big P production
Oh, thank you. One of the other notes I had from that conversation was about little-p and big-P production, which is something I want to talk about today. But before we even do that: what does it actually mean to put models into production at an insurance company?
Yeah, that's a great question. I appreciate it. You'll see, and I've seen this a lot, everything from the very unsophisticated to the very sophisticated. A model for an insurance company can be a very simple linear model, or it can be something very sophisticated like a telematics model, and most of it is heavily weighted towards the simple linear models. We're trying to segment different risks based on different classifications. I always use the 'blue cars are better than red cars' example, right? What do we know about our young drivers, those kinds of things.
And we have this great analysis. Now, how do we get it on the road? How do we put that out there? For a long time, insurance models were very static, and some of them still are. In the US they have to be filed; they have to go before a regulator who signs off on them. So it was okay to hard-code them, or put them in a spreadsheet, or put the coefficients in there. That could be "being in production," right? You could get a data set every month and say, hey, score this for me, and that would be that. Now it's moving more towards automated production, automated inference, towards cloud-centric, real-time inference engines where an underwriter, for example, can input data into an underwriting system, and part of that underwriting system takes that data, hits an inference engine, and gives them a score back. That's where I see the industry moving. But a lot of it is still, you know, sometimes my production system is a notebook I give new data to. Whatever works. We're moving towards that, but notebooks are hard to productionalize, as I'm sure you all know, and it's not easy to go back and say, why did we do this? Which is where we want to be now. But that's what it means for insurance companies.
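The inference-engine pattern Kevin describes can be sketched in a few lines. This is a hypothetical toy rating model behind a JSON endpoint, not any real underwriting system: the coefficients, field names, and scoring formula are all made up, and a real deployment would sit behind Posit Connect or a container service rather than Python's built-in server.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up coefficients for a toy rating model.
COEFS = {"intercept": 0.05, "blue_car": -0.02, "young_driver": 0.10}

def score(risk: dict) -> float:
    """Score one risk record from its (hypothetical) rating indicators."""
    return round(COEFS["intercept"]
                 + COEFS["blue_car"] * risk.get("blue_car", 0)
                 + COEFS["young_driver"] * risk.get("young_driver", 0), 4)

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The underwriting system POSTs risk data, gets a score back.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        resp = json.dumps({"score": score(body)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread, then call it once.
server = HTTPServer(("127.0.0.1", 0), ScoreHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"blue_car": 1, "young_driver": 1}).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as r:
    print(json.loads(r.read()))   # {'score': 0.13}
server.shutdown()
```

The contrast with the static approach is the point: the coefficients live in one service, and every consuming system gets the same answer in real time instead of copying a spreadsheet around.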
Thank you. How do you actually explain that to people at the company that there's this need for two different types of production?
It's a great, great question. I haven't really had to explain it. I think the need has just been so obvious to everyone: why does it take so long to get something back? Why are we waiting? Why is it hard? It doesn't have to be, but you have to be willing to do the engineering, put the systems in place, think about the tools you're going to use, you know, Posit Connect, Posit Workbench, whatever it's going to be, or you're going to code it yourself. So my communication role is more about how we do it rather than why. Everyone wants to go faster. I didn't come up with this phrase, but I really like it: analytics at speed. Everyone wants to go faster and you can't go faster unless you have this productionalized pipeline.
Everyone wants to go faster and you can't go faster unless you have this productionalized pipeline.
Small language models and Bayesian modeling
Thank you. I'm starting to get a lot of questions coming into Slido, so I'm having to get back into the multitasking mode that Libby is so good at. But Edward, I see you asked a question that has had lots of upvotes here. Do you want to jump in next?
All right. Let me go to Slido and get the words. So, a lot of companies are integrating large foundation models, like LLMs, into their processing and analytics pipelines. But as an owner of an old computer, at a company that's kind of AI-averse, I do a lot of my LLM work with small language models that can run on my local machine, and I get great effect out of it. It's smaller, but have you, at a large company, found any application for smaller language models in production pipelines?
Not specifically those smaller trained models. Certainly I do some of that for fun. I just got an NVIDIA DGX Spark myself, right? So I've got a little ability to run these small models, and I'm trying that out. Ours are bigger at this point, I think, just because we have enterprise resources, but I second everything you said. I think it's really cool that you're doing that. So, thank you.
I see some questions coming in about skydiving.
Absolutely. Actually, I talk about that all the time, so I'll link these two, because I see somebody asked, what does a professional skydiver do? Most professional skydivers teach, so they take tandems. A lot of people have interacted with a tandem, right? You go to a skydiving place, we hook you up like a BabyBjörn, we go up in the airplane and we jump out, two of us, one parachute kind of thing. That's the day-to-day bread and butter. What I did a lot, because of my background, was military instruction, teaching military parachuting. And I always say that most of it is like being a guidance counselor. Skydiving is pretty easy. It looks like you're doing a lot, but mostly you're just falling, which everyone can do. Trust me, gravity takes over for that. Most of it is the psychology of it. And that has carried over into a new skill for me, well, something I've been working on a lot: the people side of data science, trying to understand my customers, trying to understand my co-workers. And I mean this in the best possible way: there are just a lot of quirky people in data science, and they're awesome and good to work with, but the communication styles are so different sometimes. Being able to see so many different people and talk to them, skydiving has really helped me do that. So I wouldn't have traded it; the people skills I learned have been very good.
But mostly I was a human carnival ride, if that makes sense.
I love it. I can imagine you have people who get up there and maybe need to be talked into it a little bit. I used to tell them, I'm not going to make you go, but just remember the skydive's free, it's the plane ride you paid for. So, I'm helpful: you're more than welcome to ride back down with the plane, but we're keeping your money.
Well, I see there is another anonymous question I'm going to jump to. That was, what recent innovative ML models in the insurance sectors have you brought live, if you're allowed to share?
Yeah, I am. I think some of the things that are innovative: we've been doing a lot of work with Bayesian hierarchical models in the insurance space, and actually getting that to work at scale is something that's been near and dear to my heart. We have some very weird distributions in insurance. If I can put my statistics geek hat on for a second: we have very non-normal, very right-skewed distributions. Most insurance policies are zeros, they don't have any losses, but the ones that do can be very right-skewed. So being able to put that together in a Bayesian framework and get it to work at scale has been near and dear to my heart, and really challenging to do, especially since we have things like the Tweedie distribution, which some of you may or may not know about, which has no closed-form density. So we have to do it all numerically. We've been working on that. That's really cool.
And, as you can imagine, we were working on natural language processing before it was really cool, before there was ChatGPT and those kinds of things, back with the original BERT models, because we take in a lot of text, a lot of unstructured documents. So we've been working to get those on the road too, and those are really cool.
Thank you. As someone who's not a data scientist, I've noticed that whenever Bayesian is mentioned, there's this community around it that gets very excited. And I was wondering, why is it like that? Or can you explain to me a little bit more about what it is?
Yeah. I look at it as probabilistic programming. It's a different way. For those not familiar, the way statisticians frame it is that there are two camps: the Frequentist camp and the Bayesian camp. I don't think that's true so much anymore, because graduate statistics programs are now pretty much built around this probabilistic, Bayesian framework, and it goes a long way. I think it's become super popular because of Richard McElreath's book, Statistical Rethinking. A lot of non-statisticians and non-data scientists have picked it up, in epidemiology and elsewhere. And it's just the right way to look at things. It's one of those things you get passionate about; I'm very passionate about it. I always explain to people: everyone's a Bayesian. It's how you cross the street. Well, here's my prior, and here's my likelihood. And I get a big kick out of people explaining their modeling strategy to me: oh, well, I have a thought, and then I have some data, and then I'm going to do this. I go, okay, so you've just described a prior and a likelihood. Why don't we just do it that way?
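The "you've just described a prior and a likelihood" point can be shown in a few lines of arithmetic with a conjugate Beta-Binomial update. All the numbers here are made up for illustration:

```python
# Prior belief about a claim rate: Beta(2, 8), centered on 20%.
prior_a, prior_b = 2, 8
prior_mean = prior_a / (prior_a + prior_b)

# Data (the likelihood): 7 claims observed out of 30 policies.
claims, policies = 7, 30
data_rate = claims / policies

# With a conjugate prior, Bayes' rule is just adding counts:
post_a, post_b = prior_a + claims, prior_b + (policies - claims)
posterior_mean = post_a / (post_a + post_b)

print(prior_mean, round(data_rate, 4), posterior_mean)
# The posterior mean (0.225) lands between the prior (0.2) and the data (~0.233).
```

That pull between "what I thought" and "what I saw" is exactly the informal strategy people describe to Kevin; the hard part he mentions is that real insurance likelihoods are not conjugate, which is why the models have to be fit numerically.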
But it's always been super hard to do, unless you wanted to do some toy example. Now it's not. Well, it's still hard to do, but now we can do non-toy examples. There's only one R community I've found more passionate than the Bayesian group, and that's the spatial group. They're fun, too.
R vs Python and using Positron
Michael, I see you asked a question over in Slido. Do you want to jump in here next?
Yeah, sure. Thanks. I've been using R for, gosh, at least 15 years. At the firm I'm at now, when I started we were fairly small, maybe 70 people or so; now we're close to a thousand. So I had the luxury at that point of being able to dictate what we use, and a lot of the core infrastructure was R. But I feel like I'm in a bit more of a minority now. Trying to integrate workflows for people who are more focused on Python, things like reticulate can get you pretty far, but sometimes that only takes you so far. So I'm curious to hear from you, or others in chat, who at their core spend a lot of time in R but need to work with folks who are maybe more Python: am I at the point where I need to force myself to just get better at Python and come to terms with that? Or have people found other effective ways of compartmentalizing, using certain things for what they're good at? It's hard, because if everyone's not speaking exactly the same language, there are frictions, especially when it comes to production and deployment. The research side is maybe a little less critical, but even there you'd like to be able to reuse research code as much as possible, and generative AI certainly makes it easier to refactor. Anyway, I'm curious if there are strategies you've found effective at bridging the gap.
That's a great question, and I don't know if I have any great strategies, but I can tell you what I do. I use a lot of Python, I use a lot of R, and I go back and forth between the two. I might misquote Emil here from a conversation a couple of months ago, but he said the difference between R and Python is which flavor of C wrappers you prefer. And I kind of look at it the same way. I always tell people: what's the right tool for your job? Sometimes it's R and sometimes it's Python. People thought that was a funny joke, but I think it's true. I learned C a million years ago when dinosaurs walked the earth, and like you, I started using R when it was basically S-PLUS, before there was even RStudio. To me the two languages seem very similar and very familiar. And my local strategy recently has been: use Positron.
And I have to admit that when it came out, I started using it. As soon as they released it, I downloaded and built it from source. And I have a little note, I'm going to frame it one day, that says: who is this for? Right? Because I just knew how passionate people are about RStudio, and I use VS Code a lot. And now I finally, completely, 100% get it. I'm working on a project now that's all in the same repo: it's got uv in it, it's got rv in it, which I'm a huge fan of, and Positron has both environments going. I'm writing Python code, I'm writing R code. I like to tell people that maybe 20 years from now, there'll be two dialects of the same data science language. But to circle back to your question: my strategy is to use the right tool for the job. Sometimes that's Python, and often it's R. If you sit down and think about it, you can accomplish the same things in both languages; it's just a question of which is more convenient for you. But thanks, it's a great question.
Thanks, Kevin. And I was just about to ask you about Positron, so I'm glad you already got there. What was the main thing that took you from "who is this for?" to using it?
I think I just started to use it, right? I don't know what your experience has been with VS Code and R. I had high hopes for it, because I use a lot of Jupyter notebooks in Python and I really like that, but it never really worked well; it was just hard to use. I think Positron, and the work that team has done to make it R-centric as well as Python-centric, has been great. It's my daily driver. I find myself using VS Code less and less, unless I have a pure Python or C focused project, and even then I know I could do the same thing in Positron. I joked with Connor once: you guys are going to be maintaining RStudio forever still though, right? Because there's such a passionate crowd around it, and I think it's great; I've used RStudio for a long time and I thought it was really good. But I think Positron's the next step. And yeah, I did wonder who it was for. I said that because I know people who are like, you will take RStudio when you pry it from my cold dead hands, right? And I'm sure they're out there. Yeah, absolutely. And we love RStudio too.
Tool stack at Great American Insurance Group
We love it. Actually, while we're talking about tools a bit, could you share a little bit about what your tool stack looks like at Great American Insurance Group?
Sure. To the extent I can talk about this: we're building out our Posit Workbench and Posit Team deployment now. Like a lot of enterprises, we're heavily invested in Snowflake, so that's primarily what our tech stack looks like. But individual contributors also have the ability to work locally, and we're trying to move that over into the MLOps side. Down in the weeds, as I alluded to, my own stack now is a lot of Python with uv, which is written in Rust and which I'm a huge fan of, and R with rv, which I know is still kind of newish, but for me it works really, really well; dare I say it, better than renv. I really like it. So that's my tech stack, built up from there. I do still write some Fortran and some C, so there are compilers in there too.
Thank you. I might need some help from people in the chat, sharing the links to those packages. But so you actually run Posit within Snowflake. Is that right?
We do. Yeah, we run it as a containerized service inside Snowflake, which is a great setup. We're big Snowflake users; Snowflake is super powerful from a data engineering side and from an inference side, it can do a lot of good things, and Snowpark Container Services is great. You can serve Python models from there, you can serve R models from there, once you work out some of the idiosyncrasies of the way Python treats lists versus the way R treats lists. Feel free to hit me up if you ever run into that roadblock; I spilled a lot of brainpower on it, but you can do it. And it's great: you get all the authorization, you get all of that for free, which is part of why we like it. Our hope is to move to the full Posit Team stack and to an actual deployment, you know, from a small-p deployment to a big-P production environment kind of thing.
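The list idiosyncrasy Kevin mentions usually bites at the JSON boundary: R has no scalar type, so whether a length-1 vector arrives as `1.5` or `[1.5]` depends on serializer settings (for example jsonlite's `auto_unbox` option). A hypothetical Python-side normalizer, field names invented for illustration, is one way to defuse that class of bug:

```python
import json

def as_list(value):
    """Force scalar-or-list JSON fields into list form at the service boundary."""
    return value if isinstance(value, list) else [value]

# A payload as an R service might emit it: one field unboxed to a scalar,
# one kept as an array.
payload_from_r = json.loads('{"score": 1.5, "factors": ["age", "territory"]}')

# Normalize every field so downstream Python code can always iterate.
normalized = {k: as_list(v) for k, v in payload_from_r.items()}
print(normalized)   # {'score': [1.5], 'factors': ['age', 'territory']}
```

Doing this once at the edge, rather than sprinkling `isinstance` checks through the scoring code, keeps both language runtimes honest about the contract.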
Model guardrails and drift monitoring
My question for Kevin is I come from a data engineering and AI engineering background. And one of the first things that we did when deploying models is we would put guardrails around them just as you know, in case it starts spitting out nonsense that nothing falls apart. What sort of guardrails are you using that are specific to insurance to guard against that kind of drift?
Let me see what I can talk about. In general terms: we have a model monitoring process that works as real data comes in, to make sure it meets certain standards, that as the data crosses the threshold from data engineering to model inference it meets certain requirements. There are a couple of different packages out there, in both Python and R, that do really well at funneling that in and making sure everything's good to go. And a lot of it, we stress to the data scientists and the developers: your model can't be brittle to this sort of thing. If you're predicting blue cars versus red cars, it can't break when somebody says "space shuttle," right? It can't just break and say, I don't know what to do with a space shuttle; it has to fail intelligently. So we do that. Different kinds of insurers have different standards: a health insurer, which we're not, is going to have very specific legal and ethical standards around what kind of data they can use and what sort of model drift is acceptable. We have less of that; we're insuring buildings and cars and things like that. But we're also very sensitive to models of all stripes being, quote unquote, discriminatory, so we have some setups around that. So to answer your question, it's on three fronts: the data; then making sure the inference itself isn't out of line with what we'd expect, and fails intelligently rather than brittlely when it can; and then model checks for discrimination.
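The "fail intelligently on a space shuttle" idea can be sketched as a thin guard around the scoring call. This is a hypothetical toy, not Great American's monitoring stack: the field names, ranges, and fallback policy are all assumptions.

```python
# Categories and ranges the (imaginary) model was trained on.
KNOWN_COLORS = {"blue", "red", "green"}
FALLBACK_SCORE = None   # could instead be a portfolio-average score plus a flag

def guarded_score(record: dict) -> dict:
    """Validate a record before scoring; degrade gracefully instead of raising."""
    issues = []

    color = record.get("car_color")
    if color not in KNOWN_COLORS:
        issues.append(f"unseen car_color: {color!r}")   # the "space shuttle" case

    age = record.get("driver_age")
    if not isinstance(age, (int, float)) or not (16 <= age <= 110):
        issues.append(f"driver_age out of range: {age!r}")

    if issues:
        # Fail intelligently: a fallback score plus the reasons, so the
        # monitoring process can count and inspect these events.
        return {"score": FALLBACK_SCORE, "ok": False, "issues": issues}

    # Stand-in for the real inference call.
    return {"score": 0.12 if color == "blue" else 0.30, "ok": True, "issues": []}

print(guarded_score({"car_color": "blue", "driver_age": 40}))
print(guarded_score({"car_color": "space shuttle", "driver_age": 40}))
```

In a real pipeline the `issues` list would feed the drift dashboards Kevin describes, so a sudden spike in unseen categories shows up as a data problem rather than a silent scoring problem.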
One of the things I've been working on lately, speaking of generative AI, is using generative AI to build synthetic data sets that test for model drift: data sets designed to be edge cases, designed to come up with clever ways to break things. It's almost like having it write unit tests for you, except now it writes model monitoring tests. So beyond unit tests for software, and beyond integration tests, there are model tests. Groups like YData are doing great things with generating simulated, synthetic data sets; using generative AI to bang on a model, basically, and see where it falls apart. But that's a great question. Thanks.
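The shape of a "model monitoring test" built from synthetic edge cases can be sketched without any LLM at all. Here plain random sampling stands in for the generative model that proposes adversarial records, and the scorer and field names are invented; the invariant under test is the one Kevin describes, that garbage input must hit a fallback path, never raise.

```python
import random

# Values an LLM-driven generator might propose as adversarial field values.
EDGE_VALUES = [None, "", "space shuttle", -1, 1e9, float("nan")]

def make_edge_case():
    """Build one synthetic record mixing valid and adversarial values."""
    return {"car_color": random.choice(["blue", "red"] + EDGE_VALUES),
            "driver_age": random.choice([40] + EDGE_VALUES)}

def score(record):
    """Toy scorer: returns a float, or None as an intelligent-failure sentinel."""
    try:
        age = float(record["driver_age"])
        if not 16 <= age <= 110:
            return None                          # out-of-range: fall back
        return (0.1 if record["car_color"] == "blue" else 0.3) + 0.001 * age
    except (TypeError, ValueError, KeyError):
        return None                              # unparseable input: same sentinel

random.seed(0)
cases = [make_edge_case() for _ in range(200)]
fallbacks = sum(score(c) is None for c in cases)
print(f"{fallbacks} of {len(cases)} synthetic records hit the fallback path")
```

Swapping the random generator for an LLM prompted to invent realistic-but-broken policy records gives the "unit tests for model drift" Kevin is describing, while the assertion stays the same: score or sentinel, never a crash.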
I do, but I'm not an expert in the field; that's the road I'm going down. I'm not a developer by training. Certainly I've picked up a couple of things, and I'm always working to round out my education. Unit tests, for example, which are probably simple for a lot of people, took me forever of banging my head to get, and now it's so easy, right? I'm developing with two developers working beside me that happen to be generative AI, and I'm like, okay, great, write a unit test, and it happens as we're going. People say, oh, you can have an opinion about that. My opinion is that it just makes me way better as a developer. Or it makes me a developer, period, versus whatever I am now. And yeah, the synthetic data I think is fascinating, because even well-tested models fail, and I see them fail because of edge cases. So the ability to go out there and hit those edge cases I think is super cool. Thank you. Awesome. Yeah, thank you.
Synthetic data and public data sources
I don't do anything with geopolitical risk. I think I know where you're going with that; there are insurance carriers that do that as a specialty, but I've never worked in it. I've worked in a lot of areas of insurance, just never anything like that. We do use a lot of public data sources, certainly for spatial statistics; there are a lot of great APIs out there. We have a large crop insurance division, and there are a lot of public data sets available for that. So, recent hiccups aside, the US government produces a lot of spatial data, a lot of data specific to what you use. But I'm always searching, so I'd actually turn this question around: why don't you tell me some great sources you're using?
Yeah, we'd love to hear that in the chat. And I guess on this same topic, I think there was another Slido question that was, if you want to learn more about using the tools you mentioned for building synthetic data, where would you recommend that people start? Are there specific packages that you really love or like other than just Googling build synthetic data set?
No, I think Hugging Face is a great place; there are some great smaller models there. It's part of the reason I bought my NVIDIA DGX local supercomputer, much to my wife's chagrin: what is this package in the mail? I put these models on it and run them locally, so I'm not paying $200 a month to generate these data sets. There used to be a company, I think it's called YData now, whose package was called pandas-profiling; now it's ydata-profiling. It would take in Pandas data sets and give you a profile of them. Now they've added AI to their name and pivoted over to helping build generative AI synthetic data sets. That's a good place; that's where I started, and I'm sure there are better places. Also, if I can make a plug for it, I've been working with some of the new Codex GPT models as a large-scale model to guide me with that. But Hugging Face is a great place to find some of these models, I think.
Fraud detection and AI's impact on insurance
So hi, Kevin, Trupti here. I worked for an auto insurance client back in 2019, and we worked on identifying underwriting fraud: misstated mileage, an incorrect number of drivers, or an incorrect zip code used to lower the premium amount. We used clustering techniques to identify the clusters where there was fraud. With AI evolving, what kinds of techniques are being used now? I would like to learn more about that.
Yeah, that's a great question, and I haven't worked on that recently; the last time was probably around the same time you worked on it, using the same sort of clustering techniques. So unfortunately I'm just going to have to say I don't really know what's being done there lately, because I've been working on other things. But it's a great question, and I'm sure some very clever people are working on it, because fraud is a huge problem. We build fraud models, like banks do, to find those sorts of things, and the level of sophistication just keeps going up and up. But I haven't used anything generative-AI based to look for underwriting fraud. I see. Yeah.
Broadly, how is AI affecting your day to day work in the insurance industry?
Well, it is affecting the insurance industry, though I can't speak for the whole industry; I see it day to day a lot. There are a lot of startups; insurtech, I guess, is what they call our little branch of fintech, and they're thick on the ground, especially for unstructured data. We generated a lot of paper back in the day, and now we generate a lot of PDFs, and the ability to take those in and process them quickly is huge; most of the money is being thrown at that. From a data science perspective, it's changed my day to day tremendously. Like I said, on both the theory and the development side, I sit there with at least one, probably two, different models running alongside me, and I don't think it's a shame to admit it. I'm certainly not a great developer or coder, or even a great statistician, but I'm constantly using those tools to make me better at what I do. I think that's changed for everyone, certainly on the development side; you can see it in the tools that are out there, in Positron, right, it's being implemented there. I think it's a huge deal for your career to keep learning, certainly in this field: learning the basics, how to train my own models, how to run my own models locally, how to build agentic AI and train your own agents; LangGraph and LangChain are really great for that too. And I don't do that just because I like it, although it is fascinating to me; I do it to ask, okay, how can I be a better data scientist, a better developer? How else would I approach this problem? And I think some of the foundational models out there are getting spookily good at it.
I mean, for a glorified, you know, for a glorified piece of linear algebra, it's getting pretty good.
Career advice and domain expertise
Yeah, I think the piece of advice that's most important to me, and that I would give, is that, especially with the tools out there now, what you really need to focus on is your business area. If you want to work in insurance, learn about insurance. The data science and the statistics are there; you can learn them, and you should always keep learning them, but really focus on understanding the domain. When I came back in from skydiving to learn data science, I always heard, well, domain expertise, whatever, right? Like, we're here to build models, do theory, and get the math and the statistics right. But I think I underplayed it. Now I'm an individual contributor; I enjoy the theory, I enjoy the models, and I hope that's what I get to keep doing and developing. But the ability to understand the problem itself probably remains understated. So keep learning, and get as much expertise in your domain as you can. That would be my advice.
The ability to understand the problem itself probably remains understated.
Yeah. Sorry, go ahead. How would you advise people to do that before they've had a chance to work in that industry?
I don't think you can. There's no easy way to bootstrap your way into it, I think, for people joining data science in an industry.
Yeah, I've been a professional recruiter for more than 30 years, and I would recommend everything. Send notes to people saying, hey, can I shadow you for a day, and take that day off to do it. I saw something recently I just loved: reading gets you the general stuff; articles get you the applied stuff. Read and study articles, go to meetings and meetups, form a group if you have to, but don't buy into any of that "oh gee, you can't do it" talk. You can do anything. Put your energy and your enthusiasm into it, and it's amazing what you can actually do. And as a person who's hired more than 15,000 people, I can tell you I don't necessarily hire for skills. I hire for passion, drive, enthusiasm. So if you say, hey Russ, I did this on Saturday on my own time, guess what? You're screaming at me that you're interested and want to transition into such-and-such. Great. So yeah, there are lots of ways: read, study, join groups, shadow people, all that kind of stuff, so that when you walk in, you can do it. And one other thing, I'll make this really quick: the vast majority of people, when they apply for a job, send a resume, and their resumes are crappy, by the way; that's a whole other issue. But the people who get hired send a resume, a letter of recommendation, stuff they've posted on GitHub and LinkedIn. They show: hey, I care, I'm passionate, I know how to do this. I might never have gotten paid to do this, but ta-da, I'm Babe Ruth and I'm going to hit a grand slam every time you give me the ball.
Cool, thanks. So I agree with everything Russ just said about reading everything, joining groups, all of that. The one thing I would add is: pick a related dataset and do a small project, as small as it might be. The reason is that you begin to understand how the data play into some of the questions you might have to deal with in the real world. And I say this as a person who has kind of the opposite experience of Russ: he's hired 15,000 people; I've been hired, not quite 15,000 times, but I've been through a lot of industries, from aviation to clinical sciences to economics and what have you. The way I've been able to successfully go from one industry to another, though I think I'm done doing that, is by doing small projects before I actually start the job. That also gives me a little insight into what I still need to learn before jumping in, so it tells you where your weaknesses are as well.
Thank you. Kevin, glad to have you back. Are you good now? There you go. Okay, sorry about that. So what did I miss? No, those all sounded like great advice. I will just add that I haven't myself used the approach of building a portfolio. I see it out there, and I don't think it's bad advice. I'll just speak to what I'm trying to do for myself, which is get involved in open source. If you can find a hook into the open source software and analytics side of the community where you'd like to develop domain expertise, you can learn a lot. You can learn a little about the data, and it certainly gives you an opportunity to show your work as a data scientist and developer. That's all I would add.
Evergreen skills and change management
I see that Direction had asked a question in Slido: as an experienced data scientist, which skills would you say are evergreen, irrespective of the tools that you're using? Not really about the tech stack, but skills like soft skills and good data practices.
Yeah, that's an amazing question, and a very good one. Like I said, I think you can learn the tech, you can learn the statistics, you can learn what you need to know; you can even learn the domain expertise. What's going to allow you to last in the field, if that's your goal, and it's trite, and I know people say it over and over, is the people skills. It's the ability to work, and I mean no disrespect and say this in no pejorative way, with a quirky bunch of sometimes neurodiverse people who are developers and data scientists and data engineers. The ability to work with them and communicate with them is something that will put you in good stead, and it's a very soft skill. I myself have been taking a lot of courses and seeking a lot of mentorship in change management. People don't like change, surprisingly enough, right? Like, all change is bad change. So how do you get them over that? How do you move them past it? That's a phenomenal question; thank you very much for it.

That is always a big thing for us, change management. Tell us, what are you learning about change management?
You know, I apply it in a corporate setting. I think it's just a great skill to have in life, but in a corporate setting, right, we work across different groups, and sometimes it's herding cats: people say, I don't have to do what you say, et cetera. So the ability to lead people without being an appointed leader, or to influence people without formal authority, I think is extremely important. I can't overstate it. If you want to last in the industry, you're either going to have to be so good at writing code and doing data science that people just can't ignore you, or you're going to have to be able to do most of that and then also be great with people and great with the domain. Honestly, I haven't hired 15,000 people; I have hired some people. But I think we can get there on the tech stack. I would much rather hire someone I know I can work with, who will work well with the team, and who I can count on to have soft people skills.

