Inside Sports Analytics | Nick Wan | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
Can't wait to see you there. I am super excited today to announce our guest, Nick Wan. Nick is the Senior Director of Baseball Analytics at the Cincinnati Reds, but he's also a data science streamer on Twitch, and he's been streaming on Twitch for many, many years. Nick, would you like to introduce yourself? Tell us a little bit about what you do and also what you like to do outside of work for fun that's not streaming.
Yeah, thanks for having me at the Data Science Hangout. I'm Nick. I am at the Cincinnati Reds. I've been there since 2017. I came in as a data scientist, and I have helped build out the department there. I oversee, depending on how you count it, anywhere from 11 to 15 analysts, data scientists, data people. And we work to support the team on the field, so it's not business related. We're not working with ticketing. It's nothing like that. It's how to produce runs. It's how to prevent runs. How to develop players. All using data and quantitative analysis and strategies to do that. So that's my day and a lot of my nights. And when I'm not doing that, I'm streaming. So I'm over on Twitch streaming under NickWan_DataSci. I've been doing that since 2016. End of 2015, actually, but 2016.
Really long time. Really long time. And we stream all sorts of data science stuff. Anything from small projects to big projects. We do game show stuff. We do all sorts of different events. We've done conference coverage. We've done a lot of different things. So right now we're working on NFL Draft related stuff, and the NFL Draft's tonight, so that's been kind of fun. And there's several people I see in chat from stream, so hello, chatters.
For fun? I don't know. I don't have a lot of free time. I guess when I'm not doing streaming or data science stuff, I'm probably in the gym lifting.
Types of data in baseball analytics
All right. Well, I was hoping that you could give us a little bit of context. You've already given us a little bit more, but I don't know anything about baseball. And I really don't know anything about baseball analytics or sports analytics. And we are all coming from a bunch of different industries here. So could you help us understand what type of data you use? Where the data comes from? What exactly you do with it? What decisions get made based off that data? Stuff like that.
So the data, it really is like a lot of the data that people see when you go to like ESPN or if you're using any of the different sports data aggregate websites like Pro Football Reference or Sports Reference or FanGraphs. So you get like that box score data where at the end of the game, if you're like, I don't know who reads newspapers, but in the sports section, you have like the score and how many things happened and something that contextualizes the game and gives a reader an ability to kind of catch up on the game without actually having to watch it. So you can take all that data. They've been collecting box score data since the late 1890s. So you have all this data. And then ultimately, depending on how detailed that box score data is, you can actually start building out predictions of whether it's runs scored the next game or whether it's how many hits are going to be achieved the next year or the next game. All sorts of different things, just given like very simple things like counting stats that people track just by watching the game.
So the data since the 1890s has evolved, as we can all imagine, with the advent of technology. So you have stuff like the physics of the pitch. So how fast the ball goes, but also like the spin rate on a ball. So if you have a baseball, depending on how much force you apply to it, you could actually spin it, and the RPMs actually give different value to the quality of the pitch. You can have differences in the environment. So if you're in higher altitude, like in Colorado, versus lower altitude, like in San Francisco, you can have differences in the projectile motion of the ball. And you use all this information to ultimately either predict events or evaluate players. So you could evaluate players by the events that happened in the game using data. So is a pitcher, for instance, throwing really hard? Is there a particular inning where the velocity of their pitches starts tailing off? Are there particular strategies you can employ depending on the quality of your pitchers?
For your batters, are they swinging a bat in a certain way? Are certain batters matching up against a certain team optimally? If not, is there something you could do with the lineup to optimize the number of at-bats a particular player might get throughout a series depending on the team matchup? So the data itself, whether it's starting relatively small, or you have all these physics-type layers that we could capture through, whether it's computer vision or some sort of radar tracking, you could actually build pretty complex models to describe the events that are happening in the game. And ultimately, not just quantify them, but have pretty good predictions on what's happening, whether it's in-game at the pitch level or whether it's at the season level.
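As a rough illustration of the question raised above about a pitcher's velocity tailing off in a particular inning, here is a minimal Python sketch. It is not the Reds' method: the function names, the 1.5 mph threshold, and the sample pitches are all made up for illustration, and a real analysis would work from pitch-tracking data and account for pitch type and rest.

```python
from collections import defaultdict
from statistics import mean


def velocity_by_inning(pitches):
    """Average velocity per inning from (inning, mph) pairs, in inning order."""
    by_inning = defaultdict(list)
    for inning, mph in pitches:
        by_inning[inning].append(mph)
    return {inning: mean(speeds) for inning, speeds in sorted(by_inning.items())}


def first_fade_inning(pitches, drop_mph=1.5):
    """Return the first inning whose average velocity sits at least
    drop_mph below the first inning pitched, or None if there is none.
    The 1.5 mph default is an arbitrary, hypothetical threshold."""
    averages = velocity_by_inning(pitches)
    if not averages:
        return None
    innings = list(averages)
    baseline = averages[innings[0]]
    for inning in innings[1:]:
        if baseline - averages[inning] >= drop_mph:
            return inning
    return None


# Made-up fastball velocities for one start: (inning, mph).
sample_pitches = [
    (1, 95.2), (1, 95.0),
    (2, 94.8), (2, 94.9),
    (3, 94.1), (3, 93.9),
    (4, 93.2), (4, 93.0),
]

print(velocity_by_inning(sample_pitches))
print(first_fade_inning(sample_pitches))
```

Comparing every inning to the first one is the simplest possible baseline; a rolling average or a per-pitcher baseline built across many starts would be less sensitive to one noisy inning.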
Motion capture and injury prevention
Yeah. So we do have skeletal data. Back in the day, and still to this day, you'll see people wearing styrofoam balls and a motion capture suit against a green screen or something. And so given the advancements in computer vision, if you set up high frame rate cameras, enough of them, you kind of get that motion capture markerless now. So you don't need the foam balls or anything. So at all major league parks, there's 12 cameras that are set up that capture the events on the field. And several of them are focused on whether it's the pitcher or whether it's the batter or different players on the field. And that's capturing 29 points, things like your head and your neck and places on your wrist and elbow and down your spine and stuff. And we could actually end up quantifying the biomechanical space, as they call it. So with the skeletal data that comes in, you can now quantify not just like how fast you throw it, but like how much torque or how much force you're putting on your elbow or your shoulder. And tracking that can actually help prevent injury or at least curb injury.
You know, there was an anonymous question in Slido, actually, that sort of asked this. It's like, what kind of mechanisms, like all the hundreds of cameras in the game, allow for these detailed measurements and stats of in-game metrics? I think that that question also kind of applies to other stuff. Like how do you catch all of these things? Whose job is it to watch all of these things and like record everything? But you know, people have been recording baseball stats forever.
Yeah, it's luckily not hundreds of cameras. It's like maybe a dozen cameras. So you have the 12 cameras that do the motion capture and then you have like maybe another like eight cameras, several of them are like broadcast cameras. And then you have some that we set up for coach use. So we're trying to evaluate plays and stuff. And most of it's just computer vision. So like you said, people have been watching baseball for a long time. A lot of people are, I mean, the league spends money on people who are official scorers. So they're tagging events that happen. And all of that actually can be treated as labeled data. So if you're doing any like classification modeling, you can take certain events and say, yeah, this is this label. And then as a play develops, computer vision models can actually start tagging those events given what kind of motions are happening, where the ball is in space, all of that kind of stuff. And then there is some manual review. And shout out to everyone at the advanced media team at the league who spends a lot of time on quality assurance and quality checks.
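To make the idea of quantifying skeletal data a little more concrete, here is a small Python sketch that computes the angle at one joint from three tracked 3D points. This is only the geometric first step: the torque and force measures Nick mentions need a full biomechanical model that is not described in the conversation. The function name, the coordinate values, and the choice of shoulder, elbow, and wrist are assumptions for illustration, not the league's data format.

```python
import math


def joint_angle_degrees(a, b, c):
    """Angle at point b, in degrees, between the segments b->a and b->c.

    Each point is an (x, y, z) tuple, for example shoulder, elbow, wrist.
    """
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0:
        raise ValueError("two of the points coincide, so the angle is undefined")
    # Clamp to guard against tiny floating point overshoot outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


# Made-up keypoints for a single frame, in arbitrary units.
shoulder = (0.0, 0.0, 0.0)
elbow = (1.0, 0.0, 0.0)
wrist = (1.0, 1.0, 0.0)

print(joint_angle_degrees(shoulder, elbow, wrist))
```

Computing this angle frame by frame gives a time series of elbow flexion, which is the kind of derived signal that workload or mechanics tracking could then be built on.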
Streaming and community
Yeah, the streaming, so I did my grad school at Utah State. And at the end of it, when I was working on my dissertation, I was kind of inspired by one of my friends who, when they were trying to finish their PhD, had, like, this tracker of how many words they typed on their dissertation. A lot of people in this chat probably know, it's Chris Albon. So he did this whole thing of, like, how many words he put into his dissertation, and people kind of kept him accountable to finish by tracking his progress in that sheet. So, you know, I'm in my dissertation at this point, trying to finish it. And so rather than doing it this, you know, Google Sheet tracking way, I decided to just stream it. So for, you know, several months, I would just go live, writing my dissertation. It was literally me just screen sharing a Word doc in my lab. And people would come in, not a lot, as you can imagine, people were, like, super interested in, like, very esoteric neuroscience work. But it was good, because of the people who were interested, some of those folks ended up sticking around for longer. So I talked about my dissertation, and it kept me honest about putting words on paper and rewriting things, explaining things. And then when I finished writing, I still liked streaming, because the people who were coming into chat were fun, and it was kind of like, you know, a nice social outlet in the middle of the day for me. So I continued to stream, mostly just streamed, like, random stuff. I used to stream, like, generative art. And like, back before there was this whole, like, large language model thing, we would make, like, bots that talk like other people and, like, deployed bots on Twitter and stuff.
We would code all this random stuff, and then started doing NFL work, because the NFL started releasing these, like, gigantic play-by-play datasets. And I ended up messing around with it. And then a bunch of people ended up coming in for that. And then they realized, like, oh, this guy actually is also a sports analytics person who isn't, you know, shy about sharing, like, what materials are being used, and what methods are being used, and all of this stuff. So that was, like, 2016, 2017. And then ever since then, I've been streaming random data science projects and sports analytics projects. I saw someone in chat ask about cricket data science. And while we haven't had much cricket available, we've done a lot of different sports. We've done bull riding. We've done F1. We've done a lot of esports data. We've done, we did an entire thing on The Bachelorette.
Yeah. So we have data science game shows. So I think back in '21, our most popular stuff was called Sliced. And that's on YouTube now in VOD form. But we would stream once a week. It would be four data scientists. And they would have two hours to basically complete a Kaggle competition that we made. They had no idea what the data was. They would get the data, like, five minutes before the clock started at 9pm. And they had from 9pm to 11pm to put together a... Yeah, that's right. Greg's in chat. He was in Sliced. But we had 16 data scientists. A lot of people in here probably know a handful of them. We had Julia Silge on there. And we had drob on there. Jesse Mostipak was on there. So a bunch of people. Greg was on there, of course. And 16 data scientists. And they competed against each other twice. And then we had a finals. And it was pretty big. It was like, maybe the turnout for this Hangout, but like, rather than talking about data science, everyone was like picking sides and like, cheering on people. It was pretty cool.
Communicating analytics to coaches and players
Yeah, we actually were talking about this over the last like week on my stream. There's a lot of the time where you finish a project, and you hand it off, or you have a product or a report or a dashboard and you hand it off, and there isn't much communication as to like, what went into it. It's just kind of, here's the analysis, right? So we went through like the past week of like, say you create this like, quality of pitch model, this value of like, how valuable a particular pitch would be. How would you communicate that to a coach? And so for like, a couple streams, we had people say like, all right, who's your favorite pitcher, and we put it in. And then here's what the pitcher's, you know, arsenal of pitches is, you know, they throw a fastball, a curveball, a slider, a changeup. And here's where they plot on this value map. What would you recommend they do for each pitch in order for their pitch, say a fastball, to be more valuable, given the model recommendations? And so we went through the process of like, oh, well, the model says move it, you know, four inches higher. So they just have to throw it four inches higher. It's like, well, it's not that easy, right? Like, the coach is now going to come to you and say, well, the only way to do that is if you get them to change their mechanics of throwing or maybe change an arm slot. Maybe they have to do something differently that they haven't done, or haven't needed to do for like 20 years, because they're a professional baseball player in the major leagues now.
So it's really difficult sometimes to communicate, like, what is objectively the best thing for a player versus like what a player might actually be able to accomplish. So how to build that trust is like, on one side, yes, like always try to stay true to like the most accurate interpretations of your data and what you should do to accomplish the best outcomes. But you also should at least try to start seeing it from the perspective of the stakeholder or the perspective of the end user, right? The person you're applying the data to. So if you tell a pitcher like, hey, your slider could use, you know, four more inches of horizontal break, it needs to sweep a little more for it to be even more valuable, they're just going to ask you, how do you do that? And if you don't know, if you're just like, I don't know how to do that, just do it, I guess, you lose a lot of credibility on the communication side. And it's like, you don't want to lose that because you've spent so much time putting in so much effort into your model and your analysis, right? So that sort of communication layer is its own project. And I try to emphasize that, like, yes, step into the shoes of the person you're talking to a lot of the time. Like, what does it mean to move a pitch a few inches? Like, is it a grip change? Like, are they going from like a grip like this to like up the seams a little more? Is it an arm slot change where like, that's a bigger mechanical change, and it doesn't just affect a slider or a fastball, but every single pitch that they have?
And the best way to build buy-in is like, to level, right? To level with the coach, to level with the player of like, I understand that this is a difficult thing. And if we do want to get better at this one thing, this is the recommendation. And at times, I think another thing, just thinking about last night's stream, because we were doing this, someone was like, yeah, we should move it more, you know, horizontally. But really, the better answer was like, actually, you should probably move it like more vertically, not because it's in a good area, it's actually still in like a bad area. But it's better than where it was, it's relatively much better. Even though it's not in a good, valuable area, it's less bad, right? And sometimes less bad is the best we can do. And like, I know, in the world of quantitative analysis, like we always want to be best and like, settling for less than like optimal is like frustrating at times. But sometimes that is the answer. Like, it's better to just be better. And sometimes it's better to just do better.
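The "less bad" idea above can be sketched in a few lines of Python: instead of asking for the model's single best pitch shape, score only the adjustments a pitcher could realistically make and pick the best of those. Everything here is hypothetical. `toy_pitch_value` stands in for a real pitch-value model, the break values are invented, and the list of feasible changes would in practice come from a coach, not from code.

```python
def best_feasible_adjustment(current, candidates, value_fn):
    """Return (candidate, gain) for the feasible candidate with the highest
    modeled value, or None if no candidate beats the current pitch."""
    baseline = value_fn(current)
    best = max(candidates, key=value_fn, default=None)
    if best is None or value_fn(best) <= baseline:
        return None
    return best, value_fn(best) - baseline


def toy_pitch_value(shape):
    """Hypothetical stand-in for a pitch-value model.

    shape is (horizontal_break_in, vertical_break_in); value peaks at 0
    for a made-up ideal of (10, 14) and goes negative away from it.
    """
    horizontal, vertical = shape
    return -((horizontal - 10) ** 2 + (vertical - 14) ** 2) / 100


current_slider = (2, 6)
# Changes a coach says are realistic, for example small grip tweaks.
feasible = [(4, 6), (2, 9)]

print(best_feasible_adjustment(current_slider, feasible, toy_pitch_value))
```

In this toy example every option scores below zero, so the recommended vertical change is still in a bad area of the map. It is simply the largest improvement available, which mirrors the point about less bad sometimes being the answer.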
Evaluating the analytics team's impact
Yeah, department KPIs, it's, you know, unlike other industries, where you could be evaluated on like, does your ROI improve given your marketing budget or something? It's hard to quantify or evaluate, like, is an analytics team improving the organization as a whole? And I think the best way to evaluate whether or not that's true is a couple things. Like, the first thing is, analytics not only provides like quantitative analysis, but it also provides like, guardrails, right? A lot of the time, analytics isn't about like, what's the right answer, but it's about the less wrong answers. So if you're able to eliminate being in the gutter, and you're just on the fairway the entire time, like, that's a lot better, right?
So one evaluation metric is, are you able to showcase that throughout all of these departments, there are like statistical or analytically driven boundaries of like, this is like a red flag zone, or this is a no go area. And if so, like, what are the reasonings? What is the quantitative analysis behind that? What is the objective reason why we shouldn't do certain things? And usually those things, like nine times out of 10, coaches believe too, they're like, oh, yeah, like, those are bad things. We shouldn't do that. Or like a staff member might say like, yeah, like, those are things we should avoid. Sometimes we do them. But now I know we should never do them. So setting up more boundaries, I think, is one way to evaluate like, especially the width, the reach of analytics, is like, how much involvement is there in like the discussion of what is the go zone versus the no go zone.
And I think that's like a pretty easy thing, not just to achieve in baseball or in sports, but like really any company, right? Like, while I've been at the Reds for a long time, I also worked at Kentucky Fried Chicken as their manager of data science for two years. And it was similar there too, like the objective is to increase profit margins and do it in a time where people weren't really going out to eat. So how do you improve that? One of the, you know, behind the scenes things that happened at KFC was stop buying things that aren't profitable, which, like, sounds easy. But data really helped sell that story of like, yeah, let's replace wedges with fries. Let's make sure that pot pies are a seasonal item, because no one's buying pot pies in July. Let's pivot towards more chicken sandwiches, because they sell faster, they're easier to stock. And that ended up being fairly impactful for that brand for that period of time. So being able to just set up boundaries is like one way to show like, yeah, analytics is doing research, they're able to implement like rules of the road for a lot of different departments and decision making tools. And then from there, once that gets adopted, I feel like then you can kind of start working towards that, like, idea of analytics should be predicting the right thing, as opposed to suggesting what you shouldn't do.
Tools and tech stack
So the tools we use, our team is split, half of us are doing all of our work in R. Actually, the majority of our repo is probably R. And then the other half are in Python. And then we're always in the database, and we're accessing the database through whether it's SQL or whether it's Spark, depending on the environment we're in. The environment that we're mostly using is our Posit pro enterprise thing. So everyone just kind of loads up Workbench there and they go to town. And then if we're not in there, if we got to go on more of the data engineering side of things, our data engineering team works mostly out of Databricks. So we jump in there and help them out when we can.
We try to centralize everything in a data warehouse through Databricks. So that's a lot of it. There are other places we store data just because of availability of resources. So we have a Snowflake database that also helps us prioritize certain data feeds for other departments. For example, when you're playing a game and you have a game every day, data needs to be available, but not just the data, but the resources to run all the reports and generate all the things that the coaches and the players need for the day. So if you're competing for resources all in the same cluster, it's hard to really guarantee execution times. So we've moved some of that to its own resourced environment through Snowflake. So we have some of that. And then Databricks is where a lot of our dev work happens. So our Projections team, our Innovations team, our Insights team, they're all building out not things on a daily basis, but reworking the backend, updating to a certain new model or a certain new versioning of something.
I try not to touch the code base these days. I feel like it's like bad luck for me to go in there. And like, you know, maybe I look at a PR every now and then, but I don't think I have to be in there. You know, I don't know how many people relate to this, but as I've grown with this department, and now that I lead it, like, people are just so smart. And I've hired absolutely like some of the most talented people, and they could code circles around me now. So it's more of a detriment if I'm touching the code base than if I don't. You know, the people I hire, they're going to be able to do things faster and better than I can. So I don't touch it too much.
Sports betting and public data
No, not really. I think like, if anything, it's probably grown the availability of data in the public space. While, you know, my day job is working with a ton of data for a baseball team, on stream I try to promote as much public work, whether it's in baseball or other sports, as possible. And maybe the number one thing that I'm noticing with like an influx of like, gambling companies and sportsbooks, and that kind of stuff, entering the mainstream sports space now, it's probably like the availability of data that's grown in the public space, more than that it's affected the private sector, the team-side sector. So you have like access to way more data and way more things than you've ever had before. That's not just like a, you know, sign of the times in terms of what people want. Like, I don't know how many people are out there casually looking up the spin rate of fastballs. But like, if you build models, and you're trying to like, bet against certain things, maybe some of these things are going to be helpful for you and your predictions on, you know, whether it's like daily fantasy, or whether it's like some sort of like moneyline stuff. And that's not just baseball, but that's football, too. That's really all sports. So I think like, there's more of an impact there. Like for me, and for, you know, people who work on teams, we're always going to have more data than, like, not just the gambling companies, but the public, right? So while, you know, we've been talking about all this stuff, like pose data and biomechanics data, there's really not much of that. Really, there is none of it available through the league right now, other than what's provided to the team.
Exit velocity, bat speed, and new data
Yeah. Thanks for the question, Tony. The first part in terms of exit velocity, there's both exit velocity and there's the new stuff right now with bat speed. So shout out to Alan Nathan, physicist who's a professor emeritus at University of Illinois, right? Chat, you could correct me if I'm wrong. But he's done a ton of work for the last two decades on exit velocity and the impact of hitting on the baseball and projectile motion of baseballs. And a lot of his work is about not just exit velocity, but how fast you swing and the transfer of energy from the bat to the ball and stuff like coefficient of restitution and drag factors on the ball, backspin versus no backspin produced on the ball, all sorts of different things on the ball that you can think of in terms of projectile motion. And it is a gigantic factor. And it is an important factor when it comes to evaluating batters. It's how hard do you hit it? And then can you produce a speed of your bat that allows you to hit the ball effectively harder? So there's a lot of that. And we actually just did a Kaggle competition over the last two months. I put one together for my community using bat speed data as one of the features. So MLB releases, like I said, they release a ton of new data now. And over the last year, they released a ton of new bat speed data. And in the coming weeks now, maybe this is a leak. I don't know. But it's okay. Tell Mike and Tom to yell at me. It's fine. But there is going to be more new data that's coming out through Baseball Savant, which is the main data resource for the public when it comes to baseball. So be on the lookout for new data feeds. They showed off some of this at the SABR Analytics Conference in February, March. And some of it's going to be related to attack angles. So like the angle of your bat as you swing through the zone, more bat speed metrics, more biomechanics metrics for batters. So there's going to be this entire batter analytics revolution coming soon.
And I'm pretty excited to see what the public is going to come up with when it comes to all this new batter stuff that's going to come out this year.
Career advice for aspiring sports analysts
Yeah. I mean, the two easiest things, one, like, if they have an interest in quantitative analysis of some sort, learning how to code sooner than later is always good. If they're already dabbling in Excel or Google Sheets or whatever spreadsheet service they use, that's a really good start. On my YouTube, there's a boot camp I put together. Seven videos, completely free. It takes you from using Excel to being able to manipulate stuff in pandas. And there's free data, and there's an entire boot camp homework section. And you can use the Discord to get homework help if you want. But being able to at least do things quantitatively and get used to that, whether it's like playing fantasy, like whether it's fantasy baseball, fantasy basketball, fantasy football, trying to quantitatively approach playing something like that, that's a pretty good space to get started. A lot of people end up doing that first.
And then I'd say the other side, like to just get in, is the more you have a project in mind, the more projects you can complete and put on, whether it's GitHub, being 13 with a GitHub is kind of crazy to think about, but there's a lot of people with that now. But being able to put projects out there and just say, like, I went from beginning to end with a project. And it was, you know, how to evaluate, like, when to draft a running back in fantasy football, when to draft a quarterback in fantasy football, like some questions that might, you know, have a lot of resources out there already, but coming up with your own answer and using your own methods, that shows a lot of initiative for hiring managers, I would say. And that's not just for sports, but that's, I think, for everything. Yeah, follow your curiosity. I would say, like, pick a thing that you're interested in and have a question about. And if he really loves sports, like, go find the data, see if you can answer it, even if it's imperfectly.
Applying baseball analytics to other domains
Oh, yeah. I think the easiest example from baseball to other fields, this actually happened at KFC for me. There's this idea of park factors. So, in baseball, to hit a home run, you gotta hit a ball really far over a wall. But that wall isn't standardized. It's not like all walls are, like, 400 feet. So, some walls are, like, as short as 320 feet, and some walls are as far as 421 feet. So, the idea of park factors really makes team construction, like, what kind of hitters you want, what kind of pitchers you want, pretty different from team to team based on the park they play in. So, if you think of, like, the effect of the park, how hard you gotta hit a ball, how far you gotta hit a ball to get a home run or to just get a hit, you can actually apply, like, the park factor value or at least the methodology to all sorts of things, right? Like, at KFC, I started doing not just, like, park factor analysis, but really, like, zip code analysis. Like, are there certain zip codes that have more of a predisposition to wings than other zip codes? And this actually ended up going as far as to suggest we should actually reduce sales of wings in the majority of the United States, except for KFCs that are within a certain radius of SEC schools in the South, because that's really, like, where the profitable margins are for selling wings.
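For readers who have not seen a park factor computed, here is a minimal Python sketch of one common simple form: the per-game rate of an event in a team's home park divided by the per-game rate in that team's road games. Nick does not describe the exact method he used at the Reds or at KFC, so the function name and the sample counts below are illustrative assumptions only, and production park factors typically add multi-year averaging and other adjustments.

```python
def park_factor(home_events, home_games, road_events, road_games):
    """Simple park factor: event rate per game at home divided by the
    event rate per game on the road (counting both teams' events).

    A value above 1.0 suggests the home park inflates the event,
    and a value below 1.0 suggests it suppresses it.
    """
    if home_games <= 0 or road_games <= 0:
        raise ValueError("game counts must be positive")
    if road_events <= 0:
        raise ValueError("road events must be positive to form a ratio")
    return (home_events / home_games) / (road_events / road_games)


# Made-up season: home runs by both teams in a club's home and road games.
print(park_factor(home_events=110, home_games=81, road_events=88, road_games=81))
```

The same ratio carries over to the zip code idea by swapping in, say, wing orders per store inside a region against wing orders per store everywhere else, though that mapping is a guess at the spirit of the analysis rather than a description of what was actually built.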
And so, if you apply this, like, idea of, like, what you need to do to hit a home run, what you need to do to sell wings to make them profitable, you could actually apply that to a lot of things. I actually ended up doing that with bull riding, too. So, when you have all this data on not just the riders, but also the bulls, you could actually put together, like, do certain bulls add to certain riders' scores? And so, there are certain bulls that actually, if you ride them, your score does improve by a certain margin. And for other bulls, they would, like, you know, decrease your score. And talking to certain cowboys, they would say, like, that's completely true, because some bulls look like they're hard to ride, but they're easier to ride, and that gives you a higher score. Whereas some bulls are just very difficult to ride, and they do it in kind of an ugly way, and that actually, like, decreases your score. So, it's pretty interesting when it comes to, like, applying something, like, from baseball to another field. I always think I'm taking so much stuff from other fields and applying it to baseball. But I think park factors is probably one of those things that I've seen in a lot of places that they might not call it park factors or environment factors, but I see it as park factors. I'm like, oh, that's park factors.
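The bull riding version of the same idea can be sketched as a simple adjustment: compare each ride's score to that rider's own average, then average those differences per bull. This is a rough stand-in rather than the analysis Nick ran, and the rider names, bull names, and scores below are invented. With real data, riders do not all face the same bulls, so a regression or mixed model would be a more careful choice.

```python
from collections import defaultdict
from statistics import mean


def bull_effects(rides):
    """Estimate how much each bull shifts a rider's score.

    rides is a list of (rider, bull, score) tuples. For every ride the
    rider's own mean score is subtracted, and the leftovers are averaged
    per bull. Positive means riders tend to score above their norm.
    """
    scores_by_rider = defaultdict(list)
    for rider, _bull, score in rides:
        scores_by_rider[rider].append(score)
    rider_mean = {rider: mean(scores) for rider, scores in scores_by_rider.items()}

    residuals_by_bull = defaultdict(list)
    for rider, bull, score in rides:
        residuals_by_bull[bull].append(score - rider_mean[rider])
    return {bull: mean(res) for bull, res in residuals_by_bull.items()}


# Invented rides: (rider, bull, score).
sample_rides = [
    ("Rider A", "Bull 1", 90),
    ("Rider A", "Bull 2", 80),
    ("Rider B", "Bull 1", 86),
    ("Rider B", "Bull 2", 78),
]

print(bull_effects(sample_rides))
```

Subtracting the rider's own average first is what keeps a bull from looking good just because strong riders happened to draw it.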
Amazing. Joey, this is a fantastic question. Thank you so much for asking it. And thank you so much for the millions of questions that we could not even get to. We got so many. Thank you so much to Nick. And next week we have Trevor Fry, lead data analyst at Pinterest. So, stick around for that next week. Nick, thanks so much for joining us. This was so much fun. I could have done this for another hour. Everybody go hang out with Nick in his stream.
Thanks, Libby, and thanks everyone at Posit for letting me come on and help host today. And yeah, for those who wanted to ask questions and couldn't get them in now, feel free to join us. We start the stream usually 8 p.m. Eastern, so if you're available then and you want to ask questions, join us in chat and I can always get you answers over there.
