Meghan Hall & Mitch Tanney | R in Sports Analytics | RStudio
Transcript
This transcript was generated automatically and may contain errors.
So, good morning, good afternoon, good evening. Judging by the chat responses, it looks like we have a very diverse crowd from all parts of the world, which is great. So, thank you for joining. The title of my talk today is Moving the Needle Toward Organizational Success. And I'll include just a general framework for decision making. I think it applies in multiple areas, not only in sports, but also in business.
So, just a few disclaimers as I get started. The field of sports analytics is an extremely broad, diverse field. The 15 to 20 minutes today that I'm going to talk is going to be a heavy bias toward American football. Trying to cover the entire field of sports analytics in 15 to 20 minutes just isn't possible. But ideally, the takeaway that I hope for you is that you're able to take something away from this that's a pragmatic solution to solving difficult problems, whether that's in sports, business, or just general life when you're forced to make a difficult decision and one general framework that you can apply to making that decision and then obviously using data to help inform that.
So, I had to include this in here at the beginning. So, for those of you that are familiar with the movie, this is from Moneyball. It's Jonah Hill's character. And if I had a dollar for every time someone had asked me, so you're like the guy in Moneyball, I'd probably be retired somewhere on a beach by now. In my mind, if you approach Moneyball as a general philosophy to using data to help inform decision making, I think that's a really good definition. If you limit Moneyball to just player acquisition, player selection, I think that's a more narrow focus. I think quite often sports analytics, the first term that everyone throws out is Moneyball. And while player acquisition and how you acquire players using data is certainly a piece of sports analytics, it's not everything.
The dice game: expected value in action
Okay, so if everyone can, please take a look at that Slido, we're gonna roll some dice here. So there's a, this is a shiny application. And what we're going to do is I'm going to go over the rules here just briefly. What I'd ask you to do is please go to that Slido site. Please vote whether or not yes, you'd want to play this game. No, you don't want to play this game.
Alright, so here's the concept, we're going to roll, we're going to simulate a dice roll of two dice. And then we're going to take the sum of those two dice. When you take the sum of those, nothing earth shattering here, it's going to yield a sum between two and 12. But there's going to be associated payouts with the results of those dice rolls. So the way this game is currently constructed, if you see here, is that my winning numbers are eight or less. So anytime that the sum of the two dice is eight or less, Rachel is going to owe me $45. On the flip side of it, the second component is that if the sum of the dice is nine or more, I'm going to have to pay Rachel $108.
So as you can see, we just selected one roll to simulate. And what we're going to focus on is the results here, this rolls detail section. So dice one was a two, dice two is a six. The sum of that's eight. So I win $45. And this is randomly generated. So I'm just going to continue to click roll them. And you'll see that there's some variance here.
So let's bump this up to 100 and roll this. Okay, now obviously, the payouts are getting a little bit bigger. I just won a little over 3000. But I had to pay Rachel 2700. So net difference is 675. So now as I'm starting to increase the number of rolls to simulate, what I'd ask you to focus on is this section in here, the summary results.
Okay, so when I ran this 10,000 times, so we just simulated 10,000 dice rolls. The more times we play this, the more I'm winning. Hopefully, that's the takeaway here. Another takeaway is that there's certainly variance that's associated. So let's go ahead and deconstruct what's going on here.
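The simulation described above can be sketched in a few lines. This is a minimal Python sketch, not the Shiny app shown in the talk (which, per the speaker, relies on `sample` in R). The function name `simulate_game`, its parameters, and the seed are illustrative assumptions; the payouts and the eight-or-less rule come from the game as described.

```python
import random

WIN_PAYOUT = 45     # player one collects this when the sum is 8 or less
LOSS_PAYOUT = 108   # player one pays this when the sum is 9 or more


def simulate_game(n_rolls, seed=None):
    """Simulate n_rolls of two dice and summarize player one's results.

    Hypothetical helper for illustration only. Returns a dict with the
    number of wins, the number of losses, and net winnings in dollars.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)  # sum of two fair dice
        if total <= 8:
            wins += 1
    losses = n_rolls - wins
    net = wins * WIN_PAYOUT - losses * LOSS_PAYOUT
    return {"wins": wins, "losses": losses, "net": net}


if __name__ == "__main__":
    # Same progression as in the talk: 1 roll, 100 rolls, 10,000 rolls
    for n in (1, 100, 10_000):
        result = simulate_game(n, seed=2021)
        print(n, result, "average per roll:", result["net"] / n)
```

Running it with increasing roll counts should show the same pattern as the demo: any single roll swings between winning $45 and losing $108, while the average per roll settles down as the number of rolls grows.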
Expected value framework
So the general framework for this game, and just in general, that I like to apply is that expected value is equal to the sum of the probabilities and the payouts. And some of you that are in academia might be looking at saying, hey, wait a second, there's this great summation notation that highlights exactly that. I put this on here in this context for two reasons. One, is that as an undergraduate math major that I reached a certain point where I was writing proofs. And I said, you know what, if I have to put some sort of formula, I'm going to try to deconstruct it into simpler terms. The second reason I put this on here in this context is that when you're presenting sometimes to non-technical audiences, whether that's a coach, an executive, someone in your business, I think sometimes the notation can be a little off-putting.
So when we add the players, you can see the advantage. The advantage is certainly to player one. Player one has more numbers. So there's 26 out of the 36 squares. That's where the advantage is to player one. But recall that player two has a much higher payout. So player two has fewer squares, winning squares, but there's a much higher payout that's associated with player two.
So again, going back to that calculation of why this works, why expected value. So when you go through the math, again, it's the sum of the probabilities multiplied by the payouts or the outcomes. For player one, again, 26 out of 36 with a payout of $45. Player two wins 10 out of the 36 times, and there's a negative payout associated for player one when player two wins. The way the game was constructed, it was not rigged from a variation standpoint. It's just using sample within R. But the way that the game was constructed was, it was rigged for player one. Player one, every time that the dice, those dice were rolled, player one had a $2.50 advantage.
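Written out with the numbers from the game, the calculation looks like this. The symbols are notation added here for illustration, not taken from the slides: $p_i$ is the probability of outcome $i$ and $x_i$ is its payout from player one's perspective.

```latex
\begin{aligned}
\mathrm{EV} &= \sum_i p_i \, x_i \\
            &= \frac{26}{36}(45) + \frac{10}{36}(-108) \\
            &= 32.50 - 30.00 \\
            &= 2.50
\end{aligned}
```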
So from player one's perspective, if somebody offered me that game and we could play that all day, I would play that game happily all day, every day, because I know that the more times I roll those dice, the more times I'm going to win. There's a small advantage and that's going to equate to a significant gain in the long run. Another piece is that there's variation. You got to be able to withstand the losses.
Win probability in sports
Ultimately win probability is what you're trying to maximize, but you're trying to add value with win probability. There's no shortage of information about fourth down decision-making. I wanted to take a slightly different approach here. And I wanted to talk a little bit about challenges. The short version of the summary of what you're seeing in this GIF is the head coach from an NFL team carries around a red flag in their back pocket. And essentially they have two opportunities a game to be able to challenge a ruling on the field.
The way that I approached it when I was working for teams was from an expected value standpoint, there's going to be a net change in win probability based on the decision, but there's also a probability associated with the likelihood of the reversal. So again, using our context for expected value, it's the sum of the probabilities and the payouts. There's essentially two different outcomes that can happen when you challenge. It can either be reversed or it stays the same and that's it.
There is a penalty associated with challenging unsuccessfully. If you challenge a play, you lose a timeout. So at various points of the game, that timeout takes on very different meaning. And that was one of the first questions that I was asked when I first started working for teams is what's the value of a timeout. And I thought it was a really interesting question because there's a lot of different ways to answer that question. And based upon where you are in the game, where you are in the game certainly makes a difference.
When you're in high leverage situations, ideally, this is close to one. And this is a huge swing in win probability. It doesn't always turn out that way, but when you're trying to maximize decisions and trying to make positive expected value, that's essentially the perfect context there.
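The challenge decision can be framed with the same expected value calculation. The sketch below is in Python, and every number in it is a made-up placeholder rather than a figure from the talk; the function names and parameters are hypothetical as well. It treats the challenge as having two outcomes: the call is reversed and win probability goes up, or the call stands and the lost timeout costs some win probability.

```python
def challenge_ev(p_reversal, wp_gain_if_reversed, wp_cost_if_upheld):
    """Expected change in win probability from throwing the challenge flag.

    p_reversal: estimated probability the ruling is overturned (0 to 1)
    wp_gain_if_reversed: win probability gained if the call is reversed
    wp_cost_if_upheld: win probability lost (the value of the timeout)
        if the call stands
    """
    if not 0.0 <= p_reversal <= 1.0:
        raise ValueError("p_reversal must be between 0 and 1")
    return p_reversal * wp_gain_if_reversed - (1.0 - p_reversal) * wp_cost_if_upheld


def breakeven_reversal_probability(wp_gain_if_reversed, wp_cost_if_upheld):
    """Reversal probability at which challenging has zero expected value."""
    return wp_cost_if_upheld / (wp_gain_if_reversed + wp_cost_if_upheld)


if __name__ == "__main__":
    # Placeholder inputs: 0.10 of win probability at stake, timeout valued at 0.02
    print(challenge_ev(0.5, 0.10, 0.02))
    print(breakeven_reversal_probability(0.10, 0.02))
```

Under these assumptions, the larger the swing in win probability relative to the value of the timeout, the lower the break-even reversal probability, so a challenge can have positive expected value even when a reversal is less likely than not.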
Other applications of sports analytics
It wouldn't be a presentation on football analytics without mentioning some of the fourth down work that's publicly available. There's some really good work historically from Brian Burke. Ben Baldwin recently did an updated version of this for The Athletic. And then the league office also posted a video after the playoffs talking about the importance of fourth down decisions. And as you go through these links, I would encourage you to go through them again. And it's essentially expected value based on the outcomes and the results associated with those outcomes.
Now there certainly are other applications. To my point earlier about Moneyball not being the end all be all. From the personnel standpoint, personnel staffs are responsible for player acquisition. So in an NFL front office that can be either on the college side or on the pro side. With professional baseball teams, it's often referred to as amateur scouting. Player performance is another extremely important piece, with all the tracking technology that's available, not only for practice, but also for games. And it's commonly referred to as sports science or load management. Load management is certainly the term in the NBA that you've heard publicly. And then the last major area is cap and labor, and that's how you value contracts. And again, these are all applications where that expected value framework can help your organization make positive gains in the long run.
Lessons learned
So I'm seven slides in and you might be wondering, this guy hasn't said anything about sports yet. I wanted to set that framework because it's an extremely important framework that applies to multiple decision-making processes within a front office, within a coaching staff, within a sports performance staff, within a contract and labor perspective.
The first one is consider framing your decisions as bets. And if you're not framing your decisions as bets and you're just looking at it at a deterministic level, I would highly encourage you to look at your decision making from a probabilistic mindset. Even with something that may be perceived as 100%, whether it's a reversal rate, it's a player selection, it's a coaching decision, it's a contract decision, I would argue that other things can happen and they do happen. Another key piece here is that small advantages can equate to significant gains in the long run. Small advantages, when you do that over longer time horizons, and you do that at length, those equate to significant gains.
I've already touched on this luck and variance piece. It's real, it happens. Things happen in sports, players slip, somebody falls down, everything else could go right on the play, but somebody slips, it's bad luck, it happens. But on the flip side, the ball can also bounce your way. And then the last piece is extremely important. It's valuing the process and placing a greater emphasis on the process rather than the outcomes. It's a really difficult thing to do, especially when you're talking about a 16 and now 17 game season for NFL teams. But if you value the process and understand that luck is going to happen, but you place an emphasis on the process and you can live with when you're making best decisions that are maximizing expected value, then that's, in my mind, a really good framework for operating.
Getting started in sports analytics
I'm sure there's a number of people on this call that are attending that maybe are unfamiliar with sports analytics, want to do something. So here are just some general recommendations on how to get started. One is first and foremost is go do something. If you're not familiar with the Big Data Bowl, I would highly encourage you to go look at this information. Credit to Mike Lopez at the NFL league office for starting this and continuing to push this forward. The volume of data has changed significantly in recent years for the NFL because of player tracking data.
Second piece is approaching subject matter experts with humility. If you're going to get into sports analytics, just one general piece of recommendation that I have is to approach subject matter experts with humility. Many coaches and executives may not have formal PhDs, but they've been around the game so long that they have what I like to call the equivalent of PhDs in their respective domains. And that before you put something in front of someone and essentially ask them to completely change their philosophy because of a research study that you did, I think it takes a certain level of humility to be able to approach those decisions or those conversations that way.
Last two pieces here, certainly related to the items above is if you can't clearly explain your work, don't expect somebody to buy in and use it. If it's not transparent, what's going on with your analysis, it's very difficult for people who maybe are non-technical to be able to understand it and actually buy in and use it. So place a huge emphasis on being able to explain your work. And then the last piece is that continue attending events like today. Everybody started from scratch at some point, whether you're just getting started as an undergraduate, you're continuing, you're thinking about a career change, I would highly encourage you to continue attending events like today.
Q&A with Mitch
Awesome. Thank you so much, Mitch. There are a few questions on the Slido and I'm noticing a few questions coming into the chat too. But Mitch, a few questions for you from the Slido. One from Chris is, do you recommend any datasets that detail when a timeout challenge was used throughout a given game? For example, we can know the score at the time of challenge.
Yeah. If you're looking for public datasets, the group at Carnegie Mellon, I want to give them credit, a few years ago started nflscrapR and then that's now turned into nflfastR, I believe I'm saying that right. So it's NFL F-A-S-T-R. It's a public R package, it's on CRAN. You can go grab that and it's play-by-play data. So yeah, it is on CRAN. So you can go grab NFL play-by-play data and that includes win probability, expected points, and a host of other things.
You can't get your hands on data directly from the league. And then another data source to go look at if you really want to test your analytical skills and just general framework for approaching large datasets is go use the Big Data Bowl data. So that's player tracking data that's available for players. That's movement data for all 22 players on the field, if I'm not mistaken. And it's a sizable dataset.
Thank you. One of the actually most upvoted questions I want to make sure I get to is, is teamwork online really the primary avenue into professional sports jobs, particularly the NFL? Are there any better suggestions?
That's a loaded question. I'd say a lot of it is right place, right time. I was working for Stats Inc. They've since been merged with Perform and it's changed ownership a little bit since I worked there roughly 10, 12 years ago now. But I was working in essentially a startup group at Stats working with teams and the Chicago Bears were a client of ours. And I developed a relationship through that. And next thing you know, I found myself sitting in Halas Hall working for the Bears. And again, a lot of that was luck being in the right place at the right time. I'd like to think that my work had certainly influenced it as well, but there was also a lot of luck involved.
So in terms of recommendations for people to get started, you have these unbelievable platforms with social media to be able to show your work, direct message people that are involved in the field, LinkedIn, Twitter, and so forth. So be able to just reach out. Teamwork Online is certainly one source, but there are others as well.
Another question is how are analytics used for rapid in-game decisions in your experience? That was a big part of my work when I was working for teams was being able to provide data-driven recommendations to the coaching staff on the field or that were also in the coaching booth with me. There are certain limitations with the league rules. The league rules currently prohibit any live data feeds. If that changes, I would anticipate the role of technology certainly changing for teams because you now have access to potential calculators for win probability and all sorts of things.
So, let's see, we have time for one more question. Is injury data available via the Big Data Bowl dataset, looking into games lost to injury? I don't know offhand. I'm not familiar though, if there is injury data associated with that.
Asma, I have a feeling you may know more information on that. Yeah, I just saw Mike Lopez chime in. So thank you, Mike, for attending and thanks for answering that question. So again, Mike Lopez is from the NFL.
No, thanks. Thanks, Mitch, for the great overview and for mentioning our event. We certainly encourage all folks, football fans, not football fans, to participate in the Big Data Bowl. Due mostly to sort of sensitivity around the data, we haven't included injuries in the past. I don't anticipate that changing in the future, but we do welcome folks to get your hands wet. It's not just good preparation for work in football, it's preparation for a career in really any sports analytics. We've had recent Big Data Bowl participants get hired in soccer, in the NBA, and it's just a good way to get your hands messy with tracking data, because really that's kind of where all sports are going.
So yeah, thanks. If I can chime in there too, I would argue that if you can handle the NFL's data, the tracking data, you're going to be well prepared to handle data in a lot of other industries, because it is an extremely challenging data set. So again, if you're just a data enthusiast and you want to challenge yourself, I would really highly encourage you to go grab that tracking data, because it's a really interesting data set.
Meghan Hall: extending R Markdown
All right. I'm going to assume everyone can see my slides just fine. First of all, thanks, Rachel, for such a nice intro and thanks just kind of overall for inviting me to speak. It is a true honor to speak about one of my great loves, R Markdown, especially at a meetup on R and sports analytics, because truthfully, most components of my career, and to be honest, like my hobbies and my social life, fit somewhere into the Venn diagram of those two things.
In terms of the R side, I am a data manager in higher ed, so I use R every single day, and R Markdown. I'm also, as Rachel said, currently teaching a data viz course at CMU that uses R, RStudio, ggplot, R Markdown, of course. And then more on the sports side, I am a data scientist at Zelus Analytics, where I focus specifically on data visualization and reporting in, you guessed it, R Markdown. I'm also a decently active member of the public sports analytics community, mostly focusing on hockey and also creating tutorials and resources for beginners or people who are trying to get started with R through the lens of sports analytics, which I think is a pretty nice way to kind of usher yourself into the R ecosystem.
But today, we're talking about R Markdown, which I think, again, we got Mitch's great talk kind of about the overview of decision making and sports analytics, and I'm going to talk about a particular technical tool from RStudio called R Markdown, that I think is really useful to tackle some of the specific things that Mitch talked about. So R Markdown is, at least in my mind, a really powerful and undervalued tool for a really powerful and undervalued component of data science, which is communication.
We all know, or most of us probably know, that the tidyverse and all of its related packages make the analysis side of data science much easier. And I'm hoping today to convince you, if you're not already convinced, that R Markdown and all of its associated packages do kind of the same thing in making the communication piece of data science much easier, similar to the way that, again, tidyverse does to the technical side. Because I personally think that a lot of data science education falls a little bit short in terms of really emphasizing the communication piece, which, again, is a really essential piece of your kind of general analysis pipeline, workflow, whatever, because without this communication piece, your analysis can kind of exist in a silo.
And this communication can mean communicating with yourself. Probably all of us have had the experience of opening an R script or whatever language you work in from like six months ago, and you're like, well, I wish I had commented that better. And you don't remember kind of the decisions that you made. So focusing on communicating with yourself, clearly, first of all, is super important, can really save time. That's also true if you're talking about communicating with your teammates, as in the immediate people who you work closely with on the same projects: being able to really clearly document, again, your decision making process, why you chose these certain elements, maybe why you chose this model, why you chose to visualize this data in a certain way.
And again, kind of most importantly, communicating to people who are above you on the organizational ladder. Because sometimes, or excuse me, most of the time, the person that is doing the analysis is not the person who is actually responsible for making the decisions based on those results. And so, being able to adequately explain your results to different layers of people, again, if you were explaining to someone, to a teammate, your analysis would not include the same kind of details and context as if you were explaining that to an assistant coach, for example.
Which ties perfectly into something that Mitch said. We did not even coordinate this, but I was so happy when I saw this on one of his slides, I even wrote it down. If you can't clearly explain your work, don't expect a decision maker to buy in and use it. Which is really so true. And it is a true essential skill to be able to, again, distill your analysis and either apply or remove context and technical details based on the audience that you're trying to present to. Because without, again, if you can't convince, say, your coach on the results of, you know, your analysis on timeouts, then your analysis is basically pointless.
The R Markdown workflow
Many of us are probably, you've either been at this stage of a workflow or you know someone who's in the stage or hopefully maybe you've helped someone who's in the stage who's using Excel for their data analysis. And I'm never someone who, I guess, kind of, who pooh-poohs like Excel. Like, Excel is a great tool and being proficient in Excel and using it proficiently is a really essential skill, I would think, for a lot of data analysis roles. But a lot of people do take kind of the capabilities of Excel a little too far when I think they could really be better served by using a tool like R, tidyverse, all of its associated packages, etc.
So, hopefully, again, people move to this next step, which is an amazing step. Again, just taking the data analysis and making that more reproducible using R is amazing. I personally stayed at this step for a very long time and, again, this is miles better than this first step I showed. But I think some people can stay in this step too long, as I said I did, where I had really, was really pleased with all the efficiency gains I got in moving my analysis work into R, but I was still handling kind of all the data communication pieces by kind of keeping all my documentation in Word documents and creating all my slide decks to, you know, convey the results of my analysis to, you know, various higher-up audiences in PowerPoint.
R Markdown output formats
So, we talked about how, you know, you can, you need to communicate at different levels, and many of you might be familiar with kind of what I call like a classic R Markdown document, which doesn't take any kind of special package, it outputs an HTML file. These can be really useful, again, to yourself and your teammates, as they really easily incorporate code, and plots, and text. This example is actually a lab assignment from my course, but it's a good example of how you can very easily combine text, and code, and plots into a simple HTML file that's, again, very easy to share, and it really makes you focus on reproducibility. I make my students use R Markdown, because part of the class is learning about how to get used to a reproducible data science workflow, and if I cannot, you know, reproduce your HTML file on my computer, you lose points.
So that is a great step, and then also in terms when you need to, you know, start presenting to some layers of people above you, you're not going to send him, hopefully, maybe there are some coaches out there who would appreciate it, but you're probably not going to send him like your really long standard R Markdown file that has all of your different code decisions in your plots, like that's too much information, but thankfully R Markdown has lots of other options. You can create dashboards with flexdashboard, you can create slides with xaringan. These slides I've created today are created, again, with R Markdown, with xaringan.
And then, lastly, might be more on kind of a personal, if you ever want to communicate to the public, again, there are lots of R Markdown driven packages that do that. Bookdown makes it very easy to create, like, online books. Most of the, kind of, online books you've seen online, the popular ones about R related topics, most of those were built with Bookdown. There are also several different R Markdown driven packages for website development. Distill is a very popular package. I personally have my course website that I created with Distill, and, actually, as I mentioned, everything about my course is completely created in R Markdown, from my lecture slides to, again, assignments, etc. And then there's another popular R Markdown driven package called Blogdown that I personally use for my personal website.
And there are, again, several other even categories that we didn't get into today. There are special packages for creating journal articles. There are packages for creating interactive tutorials, like hosted, again, through HTML online, which is super useful if, like, anything about your job involves any kind of education component.
So I just kind of, you know, spit off the name of, like, half a dozen packages, and you're probably like, oh my god, there are so many different packages to learn, but kind of one of the bright sides is just like how, if you kind of learn the basic syntax of the tidyverse, that kind of allows you to easily extend that into other related packages, like tidymodels and ggplot. The exact same is true for R Markdown, and I would argue it's even actually much easier. Once you have the basic syntax of R Markdown, again, as I mentioned, you can create things in chunks, and there's various options as to what you can do if you want to show or hide code or plots, etc., and once you're familiar with kind of the basic syntax of R Markdown, all you have to do is, again, learn kind of the specific details about the various output formats in the packages, but it's really very easy to just take whatever your analysis work is and make that into different output formats.
And so, I've included a few links here. These slides will be posted, by the way, on my website and my Twitter. This first link here is just a basic R Markdown tutorial on the RStudio website, which is really great for just, again, learning the basic output formats and capabilities of R Markdown. And I've also linked here to the definitive guide of R Markdown, which is an online book, of course, created with Bookdown, that, again, goes into lots of details on all the various output formats. And then lastly, Allison works for RStudio and is, in my mind, at least, the queen of R Markdown and has produced so much great content on so many different aspects of R Markdown.
So, again, I hope this has inspired you to incorporate some more R Markdown into your life, and I wish you luck on that journey. And before I leave, I just wanted to mention, since we have such a great collection of people here who are interested in R, interested in sports analytics, I definitely want to mention that Zelus, which is a company that I work for, that is an industry leader in the sports analytics space, is about to be hiring in several different roles. We're going to be hiring for data scientists and data engineers and analytics engineers, both at junior and senior levels. It's a really fun, really smart team that's doing a lot of cool work across multiple sports.
Q&A with Meghan
Thank you so much, Meghan. If you have questions for Meghan, if you wouldn't mind, when you put them into Slido to put Meghan's name as well, just so it's a little bit easier to sort through them.
And one question that was just shared, Meghan, is can you collect user data via Blogdown? Oh, that's a good question. I actually don't know. I would have to look into it. My personal website, I don't. I don't have any tracking, like I don't have any tracking data or, you know, data on who visits my website, but I assume there are probably options. It might depend on who you host your website through.
Something I wanted to ask you about, Meghan, is I know you commented when people were asking about how to potentially be hired by a sports team, and you said, like, putting your work out there and having public work. Would you be able to speak a little bit more?
Yeah. It's certainly not the only avenue to work in sports, and whether that's working for a team or working for some kind of associated company, like the company I work for is a consulting company, basically, that's hired by teams in various sports. So, as kind of Mitch was mentioning, it does certainly depend on luck. For most sports-related jobs, there tend to be a lot more qualified applicants than positions available, just because it tends to be a popular industry.
So, there are certainly people who just get hired either straight out of school or because you apply. But a lot of people also at least get on people's radar because of public work that they have. So, becoming involved in really any way in the public sports analytics community, again, whether that's, you know, in one sport or in multiple sports, there is some kind of data out there for at least most of the major sports. And so, having some kind of portfolio of public data in some kind of easily available place, again, whether that's GitHub or a super basic website is, again, really helpful.
That's a great point. And I see two other questions for you. One is, how do you think hockey stats compare to other sports for analytics? Is it easier to use or more definitive results, larger or smaller sample sizes?
I think it depends on the type of data that you have access to. The public data for hockey is nice in that it is consistently available, but it is not particularly detailed. I mean, in saying that, I think there are still plenty of insights that have been gained from the data that is publicly available. And I would argue there probably still are insights that can be gained from that data. But probably a lot of the kind of cutting edge, again, more of the tracking data, we know that exists for hockey, but it's not publicly available at the moment. But I do think there is a decent amount of data available for hockey, at least in terms of the NHL. It does vary wildly as you move into either women's hockey or other leagues, junior leagues, etc.
Yes. So everything that I have hosted, again, I have two websites, one that's built through Distill and one that's built through Blogdown. Both of those are hosted through Netlify, which is a pretty common workflow for hosting these Markdown sites because it connects super easily with GitHub. So if you just have a GitHub repo for your site, it connects and it's free. I do have, again, my course website has this netlify.app at the end. You can, of course, pay for a custom domain name. But if you're just a personal R user and you, like, use your website for R related purposes, you can get a, for free, again, you can just Google it, you can get an rbind.io address, which you'll see pretty commonly.
Thank you so much, Meghan. And thank you, Mitch, for your presentation as well. I know there are quite a few questions that are on Slido that may have gone unanswered in the time we had today. So I'll be sure to share all the questions and take them from the Zoom chat to share with the speakers to get answers for you. Thank you so much, Mitch. And thank you, Meghan, for your presentation. Have a good day, everyone.
