Resources

Why regression still matters | Keith McNulty | Data Science Hangout

video
Jun 19, 2025
55:38

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you're not joining us live, you're missing out on the amazing chat that goes on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I'm just really super excited to introduce our featured leader today, Keith McNulty, Analytics Leader at McKinsey & Company. Keith, welcome. Will you tell us a little bit about yourself, what you do, and what you like to do for fun?

Yeah, thanks, Libby, and thank you guys for inviting me. It's a lot of fun to spend an hour speaking with such a diverse and interesting community here. Yeah, my name is Keith McNulty. I navigated a path into the data science world in what's probably a slightly unusual way. I think nowadays people are coming into data science having done some sort of data science-related qualification, whereas I started out as a mathematician. I then went into management consulting, and then moved into psychometrics, which is a fascinating field that I'm still very involved in: the study of employee behavior and how it affects the workplace. And then about maybe seven years ago, I caught the data science bug and decided to retrain myself as a data scientist, and built a team at my employer, which is now a fairly large team that focuses on the applications of analytics to solving questions related to people and talent.

And in terms of what I like to do for fun, I mean, I just love engaging with the open source community. One of the reasons I've managed to have a lot of success in my current career is the contributions of a lot of open source individuals, and the fact that my team and I have been able to take advantage of that. So I believe a lot in giving back, and I spend a lot of time developing materials and publishing open source books and things like that to make sure that the things I've learned are passed on to other people so they can take advantage of them as well. And I think, for me, one of the things that attracts me to working in this space is the amazing open source community and the way people help each other.

Keith's path into data science and LinkedIn presence

I first learned about you via LinkedIn. You're very active on LinkedIn. Were you that active before you were in data science? Were you like already out there spreading knowledge, writing books, doing all the things that you do?

I certainly wasn't very active before I got involved in data science. I think part of that is because my involvement in data science coincided with when LinkedIn became more of a social network. But certainly there was this very exciting period on LinkedIn, probably between 2016 and 2018, where lots of people were discovering a lot of this tooling for the first time. And there was a huge amount of sharing of open source knowledge, and a lot of that happened on Twitter as well during that period. And so one of the things I try to do is keep that going. I think, unfortunately, LinkedIn has become a bit more salesy, and its culture has changed a little bit since that time. But I certainly try personally to keep that spirit going, and to make sure that what I share there is useful to people and is something they can take away with them and potentially use themselves.

Learning resources and self-retraining

So, hi, Keith. Thanks for sharing your information today. I was just wondering if you could maybe talk a little bit about your most favorite resources that you used while retraining yourself as a data scientist. Things that seem to be the most helpful or most enjoyable, engaging.

It's a great question, Tony, and I'm not sure my answer would necessarily align with how people would retrain themselves today. The reason I say that is because I retrained in 2016, and there was a significant limit to the amount of public information that was available at that time in terms of example code, or how you might achieve a particular task with your code. And what I found was that the learning during that period came from a huge amount of trial and error. So I'd write code and I'd just watch it fail. And then I would dig into Stack Overflow and some of those kinds of resources that date back quite a while to try to work out how it failed. And sometimes I wouldn't be able to find the answer, and I would just keep changing my code until eventually it worked. And then I'd realize why it worked.

And that type of process was a huge part of the learning for me, getting underneath why your code works. And of course, the resources that got me started are the resources that a lot of people still rely on today. I started out in R. I'm a big user of both R and Python now, but I started out in R. And of course, Hadley and Garrett's book was huge at that time. The early editions of that were out, and they got me started with the basics of the language. There were a few online resources, but they only took me so far. It was really having my own data set and trying to do my own tasks with my own data that forced me to learn. And I learned almost entirely through trial and error.

And I think there's an element of that that we still need to have today. So I think that if programmers are not learning to program through trial and error, then they miss out on a lot of learning. In particular, one of the things that I see is people are using Copilot to auto-complete their code lines. And then their code lines don't work, and they don't know why they don't work. And so they have to go back and dig into it. And that process is very valuable, right? It's how you actually learn what's going on under the hood.

And then what I found is later on, after I'd gone through that initial hump and worked out how to get my code working, I always say there's this point of success, where you've passed the initial hump. And that point is when you type a line of code and you expect it to work. So there's a period you go through where you type a line of code and say, I kind of know this is going to fail, because I don't understand it well enough yet. But then there comes a point where you type a line of code and you're surprised if it fails, right? When you hit that point, that's kind of your first step up point in learning to code, I think.


And then once I passed that point and got further past it, I started to use a lot more advanced resources. For example, Hadley's more advanced books on R. I started to really get under the hood of that language before I moved on to Python. So it happened in various stages, right? But I think the message I want to get across is that the resources themselves are not helpful if you're not actually applying them to your own context and data sets, because that's how you really learn, I think.

Keith's books on regression and network analysis

Yeah, sure. There are two areas that I learned very quickly were critical to the space I work in, which is the understanding of people and talent and how they operate within organizations, but they obviously have much broader applications. One area is regression, and the other is organizational networks, or network analysis and graph theory in general.

So regression is something that I think people have lost sight of a little. And one of the reasons why I wrote a book on regression was to try to get it back to the forefront for people who are particularly interested in doing data science to explain things. We ended up getting into this situation, I think, in 2015 to 2018, where a lot of the new Python scikit-learn toolkit came on board. There was a huge amount of machine learning algorithms that weren't very explainable coming onto the scene, and Python and scikit-learn made them very easy to use. So we got into this situation where data scientists were jumping on and just running algorithms and getting results from them. But a lot of people were having problems understanding those results, and in particular, understanding that those results were predicting something, but they weren't necessarily explaining something.

Now, in a lot of fields, like my field, and in econometrics, sociology, and a whole bunch of others, the explanation of what's going on is more important than the prediction. And regression is just an incredible Swiss army knife for doing that type of analysis, for being able to look inside a phenomenon and understand what's actually going on. How do I explain how one thing leads to another? And so one of the reasons I wrote the book was to get it back into the consciousness of data scientists and say, there's a whole toolkit here which you're kind of forgetting about, and which has a lot of applicability in explaining the problem you're dealing with.

And the book doesn't just cover the code and how to execute it. A large proportion of the book covers what you do with the results when you get them. What do they mean? How do you explain them to others? How do you make yourself a compelling data scientist by being able to explain the outputs of your models really well, right?
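To make the explanatory framing concrete, here is a minimal sketch, not code from the book, and with entirely hypothetical data, of a one-predictor least-squares fit in pure Python, where the fitted slope is the quantity you interpret and explain:

```python
# Illustrative only: simple one-predictor least-squares regression,
# fit with the closed-form formulas. The tenure/score data is made up.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing squared error of y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: years of tenure vs. an engagement score.
tenure = [1, 2, 3, 4, 5, 6]
score = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

b0, b1 = fit_simple_ols(tenure, score)
# b1 is the explanatory payoff: "each extra year of tenure is associated
# with roughly b1 more points of engagement, all else equal".
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

With real data you would reach for `lm()` in R or statsmodels in Python, which also report standard errors and confidence intervals, the pieces you need in order to say how sure you are about that slope.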

The other one is more niche, which is around network analysis and graph theory. It's a huge area. I'm fascinated by the fact that data can be structured in ways that are different from rows and columns. The idea behind a graph is that your entities are stored in nodes and the relationships between them are stored in edges. It's just a beautiful data structure to work with, and it solves a lot of problems that typical tidy tabular data does not solve. And so the role of that book is to get people who are interested in those types of data structures, and who have applicability for them in their work or their studies, fluent in how to use them in R and Python.
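As a generic illustration of the nodes-and-edges idea, not code from the book and with made-up names, a tiny graph can be held as an adjacency mapping in plain Python:

```python
# Illustrative only: a tiny undirected graph stored as an adjacency dict.
# Nodes are entities (here, hypothetical employees); edges are relationships.
from collections import defaultdict

edges = [("ana", "ben"), ("ben", "carla"), ("carla", "ana"), ("carla", "dev")]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# A question tabular data answers awkwardly but a graph answers directly:
# who are a node's neighbours?
print(sorted(adjacency["carla"]))  # ['ana', 'ben', 'dev']
```

In practice you would use igraph in R or networkx in Python, which layer algorithms and layouts on top of exactly this kind of structure.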

Extracting personality traits from text

So the question I put in Slido was, given your background with psychometrics and data science, what do you think about the possibility of extracting latent character traits about people from text? The reason I'm asking that is this is something that I'm thinking about. There's a project that I'm thinking about that would be helpful for employers looking at individuals from recommendation letters, all that. So if I gave you 10 letters about me, is it possible to extract character traits rather than just my skills?

I mean, in theory, the answer is yes, because the way people express themselves in text, the words they write, the order in which they write them, and the types of words they use are indicative of their background and personality. But the problem is that you can't really identify that if you don't have an appropriate model of personality that you're trying to map it against. So you can't really say to a model: here's a bunch of text, tell me what this person is like, right? The model is unlikely to give you anything useful or structured from that. But it is absolutely possible to say: here are a bunch of constructs of people's personalities, and here's how we're defining those constructs; can you use the language in this data source to map individuals to these constructs?
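A deliberately naive sketch of that construct-mapping idea, where the construct names and keyword lists are invented purely for illustration (a real psychometric model would be validated and far richer than keyword counting):

```python
# Illustrative only: score text against *defined* constructs rather than
# asking "what is this person like?". Constructs and keywords are made up.
constructs = {
    "collaboration": {"team", "together", "helped", "supported"},
    "initiative": {"launched", "initiated", "proposed", "founded"},
}

def score_text(text, constructs):
    """Count how many of each construct's keywords appear in the text."""
    words = set(text.lower().split())
    return {name: len(words & keywords) for name, keywords in constructs.items()}

letter = "She proposed the redesign and helped the team ship it together"
print(score_text(letter, constructs))
```

The point the sketch carries over is structural: the constructs and their definitions come first, and the text is mapped onto them, rather than asking an open-ended question of the data.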

I'm also very interested in the role of AI in this space. I'm not a massive AI convert, right? I approach AI very carefully because of my under-the-hood knowledge of how it operates. But the areas where I think AI excels are where it can identify patterns that are beyond the capacity of the human brain to identify, because the amount of data is just too large to handle. So, for example, a lot of us know situations where there are large company surveys where tens of thousands of people have written text, and it's just not possible for one person to read through those and synthesize them, right? So this is where AI, and large language models in particular, have a really big role to play.

Balancing self-learning with work

The math problem every morning kind of started by accident. I started to do a little bit of math teaching about three years ago, and I just enjoyed the cut and thrust of working with a student to solve a problem. Some of these problems were very hard, right? And I enjoyed the euphoria of solving them. It brought me back to what it was like when I was a student myself. And then I thought, you know, if I did this every morning, what a confidence boost that would be, to get started every morning having said, look, there's a hard problem which I've already solved. And so I got into that habit. It doesn't work out every morning; sometimes I have to go 10 mornings in a row before I finally solve the problem. But it is a really good confidence boost and exercise for the brain, I think.

Because first of all, if you're in work that really interests you, then work time is learning time, I think. Because you're doing problems and cracking things that require you to stretch your brain and do new stuff. And one of the things that I've always been very conscientious about is: if, as part of solving something in my day-to-day work, I've come across something that's reproducible, you know, a method or even a few lines of code where I thought, I can see how I could use this for a lot of other things, then I'm very deliberate about grasping that and making sure I've recorded it, so it's part of my toolkit.

In the last maybe two or three years, my job became extremely busy, so it became harder and harder to actually do real technical work. Because one of the things you get in more senior positions is you get dragged into a lot of meetings and a lot of things that take you away from hands-on keyboard work. And during that period, one of the things I deliberately did (I tried my best, though it wasn't always successful) was designate my Fridays as coding day. And I would make a lot of people feel bad about interrupting my coding day. And so that created an environment where I did get a lot of time on Fridays to actually do technical work, and I would hold things off until Friday if I wanted to do them. So if any of you on the call are struggling with that balance, because I know it's a really common problem (where do you find time to do technical work versus having to sit in meetings where you can't get hands-on keyboard?), one tip is to designate a block of time for it. And another tip is: don't be afraid to make your colleagues feel guilty about interrupting that time.

Network analysis in people analytics

I guess whether people are using network analysis and that type of approach really, really depends, in my experience, on the data they have available to them. One of the big blockers to people using a network-based approach to their analytics is the fact that they don't have data in any form that you can put in a graph. And I actually dedicate a whole chapter in my book to how you transform rectangular data into graph data for the purposes of solving a problem, because that's the biggest blocker that I've seen. You know, if you go into a random organization and you show them a graph and a way of working with things in graphs, they'll say, that's great, but none of my data looks like that.
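One way that transformation can look, as a hypothetical sketch rather than the chapter's actual method: project a rectangular (person, project) table into a person-to-person edge list via shared membership. The names and projects are invented.

```python
# Illustrative only: turning rectangular (person, project) rows into
# person-to-person edges via shared projects. Data is hypothetical.
from collections import defaultdict
from itertools import combinations

rows = [  # the kind of table most organizations *do* have
    ("ana", "apollo"), ("ben", "apollo"),
    ("ben", "basecamp"), ("carla", "basecamp"), ("dev", "basecamp"),
]

members = defaultdict(set)
for person, project in rows:
    members[project].add(person)

# Project the two-mode table into a one-mode collaboration edge list:
# two people are connected if they sat on a project together.
edges = set()
for people in members.values():
    for a, b in combinations(sorted(people), 2):
        edges.add((a, b))

print(sorted(edges))
```

The resulting edge list is exactly the form that graph tooling like igraph or networkx ingests directly.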

But that said, if you can get your data into that form, there's an immense number of really impactful problems you can solve with graphs. And some of them are even problems that you wouldn't think you would be able to solve with graphs, right? So, an example. A few years ago, I was faced with a really open problem: if we were to reorganize our organization into different organizational units than the ones we have today, what kind of organizational units would be most optimal to ensure that people collaborate with each other better?

And I basically created a massive graph of a bunch of employees, got data on their current collaboration, populated the edges with that, and used various community detection algorithms to try to identify where the hotspots of collaboration were. And very quickly, it was obvious that there was one way of organizing that was particularly conducive to high levels of collaboration. And to be able to put an analytic point of view behind that is really powerful, because most people view organizational network analysis and organizational behavior as a very touchy-feely thing that's not very data driven. So to be able to actually drive a data-based approach to that is huge. And to be able to put good visualizations behind it, and this is along the lines of what Libby was saying earlier around making things intuitive: if you can visualize a bunch of dots, with those dots coming together more in one particular framework than in others, people can actually see the dynamics of what's going on in front of their eyes. And that can be hugely impactful to them believing what you're putting in front of them.
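To hint at the shape of that kind of analysis, here is a toy sketch with invented names, using connected components as the crudest possible stand-in for real community detection (algorithms like Louvain or label propagation, as shipped in igraph and networkx, go much further and split a connected graph into densely linked clusters):

```python
# Illustrative only: find "communities" in a collaboration graph in the
# crudest sense, as connected components via breadth-first search.
from collections import defaultdict, deque

edges = [("ana", "ben"), ("ben", "carla"),   # one cluster of collaborators
         ("dev", "emi"), ("emi", "fay")]     # another, disconnected cluster

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def components(adjacency):
    """Return the connected components of the graph as a list of sets."""
    seen, comps = set(), []
    for start in adjacency:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adjacency[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print([sorted(c) for c in components(adjacency)])
```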

AI skills and trustworthiness

The first thing I would say is that AI is moving so quickly, and is at such an early stage, that anything I say now could be very dated in a year or two, right? And I think that's important to call out. But I think there's a balance here. One part is learning the technology behind AI. There's a lot of very rapidly moving technology there. As these models develop, you've got things like tool calling; you've got to learn things like data validation. There's a whole set of tooling involved in building an AI agent, for example, that you could think about skilling yourself up with, right?

But on the other hand, there's the whole issue of the trustworthiness of AI and the likelihood that the response you get from a large language model is usable without a lot of risk to your organization or to the work that you're doing. And that's much less of a technical, at least it's much less of a coding problem and much more of a kind of how knowledgeable are you about how these models work and what the benefits and risks of using them are.

I would probably say that if you are not in a situation where you have to actively build AI workflows or AI technology right now, I would limit the amount of time you're spending learning how to code agents and things like that. Because if you're not doing it right now, you don't really have a use case for it. And by the time you do have a use case, it could have changed a lot between now and then. So for that side of things, I would say: stay in touch with the latest developments, but you don't necessarily have to have fluent hands-on keyboard skills unless you're directly working on building something related to it, right?

On the other side of things, I think these issues of how trustworthy are the results of large language models, when they're safe, when they're not safe, those issues will persist and they'll persist for a long time, possibly forever. And so being knowledgeable about that and being able to, like, advise appropriately and, you know, to be able to deal with super AI enthusiasts, to be able to, like, pull them back and say, hold on, there's a few things you need to think about here before you go gung-ho on this. That skill is very useful in all contexts, I think. So that's one I would index on heavily. It's like, stay in touch with what the experts are saying. Look for a set of people that you trust in terms of what they're saying about this, because there's a lot of hype out there, which you have to be very careful of.


Measuring talent and white collar productivity

I would break it into two components. There are a set of constructs which organizations impose that relate to what they regard as success for their employees, and I'm specifically referring to white collar employees here. Those are usually designed by psychometricians and talent teams, and they'll often be part of development structures or evaluation forms, all of those things. And those tend to be developed on a theoretical basis using research, usually qualitative research with your employee base. But then you have this question, which is more interesting for me as a quantitative individual: what data can you get to validate those constructs?

So one of your constructs might be that you're successful in a particular job over a particular time period. And there are several measures you could use to validate that construct for individuals. One is performance ratings. But what if your performance rating system is rubbish, and everyone gets the middle rating and nobody gets anything else, which is quite common in many environments? You also have promotion, right? But what if you're in an environment where promotion is automatic after a certain period of time? Then that's not a great differentiator of success. So a lot of my work involves how we take data that could be indicators of a certain construct and use it in some intelligent way to try to answer some of these questions. And often there's a lot of data that we'll throw out because it's just not useful from an analytic perspective, and other data that turns out to be very related. And one of the things, going back to network analysis: being able to understand colleagues' networks and how they interact with other people turns out, in many, many use cases, to be a very valuable indicator of a lot of things with white collar workers, right? If they have large networks, or if they're very central to their networks, those sorts of indicators I've found are often very valuable.
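The "large network" indicator mentioned above can be as simple as degree centrality, the count of distinct colleagues a person is connected to. A hypothetical sketch with made-up names (richer measures such as betweenness or eigenvector centrality are standard in igraph and networkx):

```python
# Illustrative only: degree centrality as a simple "how connected is this
# person" indicator on a hypothetical collaboration edge list.
from collections import defaultdict

edges = [("ana", "ben"), ("ana", "carla"), ("ana", "dev"), ("ben", "carla")]

degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

most_central = max(degree, key=degree.get)
print(most_central, degree[most_central])  # ana has degree 3
```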

Moving from academia to industry

It was a real baptism of fire to move from academia into industry. And I think anybody who's done it can probably relate to this, particularly as a mathematician, and a pure mathematician at that. The way I describe it is that academia is a very enclosed and protected environment, and because of the intellectual nature of that environment, there's very little investment in how you communicate as an individual. A lot of the attitude is: if you don't understand what I'm saying, that's because you're not as smart as I am, right? And that's a kind of protective envelope, particularly in mathematics, that prevents you from having to work hard at how you communicate, because you've got that comfort blanket.

But when you move into industry, sometimes more than half of getting to the right outcome is how you communicate your approach and how you solve the problem. Because if people are not bought into the solution, they won't go along with it, and all that work is wasted. They have to believe what you're saying. And I had to learn that skill, because it was not something that was in any way taught to me in my academic career. It took maybe a year or two to really build my confidence there. And it was incredibly enriching, because I don't think I'd be the person I am today if I hadn't gone through that.

And I think I've taken that through with me, because it taught me the value of communication. There's a danger that if I'd gone straight from academia into data science, I might have still had that mindset of: look, you don't understand my code, that's because you're not as smart as I am. But I know now that writing the code is one thing; explaining the outcome of the code, being able to convince people that you know what you're doing and that they can trust you because you have good knowledge and a good foundation, being able to listen to their questions and understand what's concerning them, and being able to give tailored responses to that, those are something else. All of those things I learned in those first three years of my career, I think, and they have served me well in the long run.

Imposter syndrome and career transitions

I mean, imposter syndrome is a really big thing in many, many fields, right? Anytime you change your environment and have to join a bunch of people who you perceive to have substantially greater knowledge than you in the field, that's a massive opportunity to feel imposter syndrome. And it's perfectly normal, I think, to feel it. I've felt it myself on many, many occasions.

I honestly believe that you overcome imposter syndrome through, first of all, you have to be in the right field, right? If you are not feeling comfortable in the field you're working in, if you don't have that passion for it, if you're not interested in it, it's likely that that imposter syndrome will continue because the incentive is not there for you to become more knowledgeable and to become more fluent at it because you don't enjoy it, right?

If you are lucky enough to be working in a field that you love, you get over your imposter syndrome through good hard work and collaboration with your colleagues and learning, right? Where you go through enough experiences that you get that pattern recognition so that you can say, you know, I've seen this before. I know what to do. And over time, you start to have success from that. And that's the confidence building, which helps you get over imposter syndrome. So I mentioned previously, like, this period when I was learning to code and where, like, I got to a point where I realized, hold on. Now, when I'm typing my lines of code, I actually expect them to work. That's, like, a huge step up for me. That was, like, step one of me overcoming my imposter syndrome, right? Because then I could get on calls with my colleague and I could, like, live code and not be worried about embarrassing myself, right?

So there's various ways in which you can overcome it. But the first thing I would say is you're probably not going to overcome it if you're not working in a field that you enjoy. If you are working in a field you enjoy, it's hard work and learning. And, you know, putting yourself out there and making the mistakes that help you learn is what helps you overcome imposter syndrome, I think.

The future of data science vs. software engineering

I think there's just one topic which I want to get across to people, which relates to something that we spoke about earlier. You know, I have a little bit of a concern about what data science is becoming in the world of large language models, right? And I see a lot of people who are data scientists, but what they're actually doing is really software engineering. They're interacting with language models and trying to get them to do stuff for them. But if you look at the code that they're writing, it's software code. It's not data science code.

For me, data science is working with data, understanding the patterns in that data, using those patterns to drive insights, and taking those insights to your business or your organization to have impact. And I just see that gray area coming up now where a lot of people have the data science label, but really everything they're doing is software engineering. And that's fine, right? Some people might prefer the software engineering, and that's cool. They should go and do software engineering. So one of the things for you guys to think about in your career is: as we move towards this idea where people are asking data scientists to interact with large language models, how much of the data science part of your job are you giving up in that? Are you actually moving more towards a software engineering path in the work you're doing? And are you happy with that? And if you're not happy with that, how do you get yourself back onto the data science track? I think you said it, Libby, when we connected earlier: the science word, right? If you're wrangling large language models for software purposes all the time, you're not actually doing any science, right? So I'll leave that as a career thought for you guys to think about, I think.
