Meghan Hall & Mitch Tanney | R in Sports Analytics

Transcript

This transcript was generated automatically and may contain errors.

So, good morning, good afternoon, good evening. Judging by the chat responses, it looks like we have a very diverse crowd from all parts of the world, which is great. So, thank you for joining. The title of my talk today is Moving the Needle Toward Organizational Success. And I'll include just a general framework for decision making. I think it applies in multiple areas, not only in sports, but also in business.

So, just a few disclaimers as I get started. The field of sports analytics is an extremely broad, diverse field. The 15 to 20 minutes that I'm going to talk today is going to have a heavy bias toward American football. Trying to cover the entire field of sports analytics in 15 to 20 minutes just isn't possible. But ideally, the takeaway that I hope for is that you're able to take something pragmatic away from this for solving difficult problems, whether that's in sports, business, or just general life: one general framework that you can apply when you're forced to make a difficult decision, and then obviously using data to help inform that decision.

So, I had to include this in here at the beginning. So, for those of you that are familiar with the movie, this is from Moneyball. It's Jonah Hill's character. And if I had a dollar for every time someone had asked me, "So you're like the guy in Moneyball?", I'd probably be retired somewhere on a beach by now. In my mind, if you approach Moneyball as a general philosophy of using data to help inform decision making, I think that's a really good definition. If you limit Moneyball to just player acquisition, player selection, I think that's a more narrow focus. I think quite often with sports analytics, the first term that everyone throws out is Moneyball. And while player acquisition and how you acquire players using data is certainly a piece of sports analytics, it's not everything.

The dice game: expected value in action

Okay, so if everyone can please take a look at that Slido, we're going to roll some dice here. So this is a Shiny application, and I'm going to go over the rules here just briefly. What I'd ask you to do is please go to that Slido site and vote: yes, you'd want to play this game, or no, you don't want to play this game.

Alright, so here's the concept, we're going to roll, we're going to simulate a dice roll of two dice. And then we're going to take the sum of those two dice. When you take the sum of those, nothing earth shattering here, it's going to yield a sum between two and 12. But there's going to be associated payouts with the results of those dice rolls. So the way this game is currently constructed, if you see here, is that my winning numbers are eight or less. So anytime that the sum of the two dice is eight or less, Rachel is going to owe me $45. On the flip side of it, the second component is that if the sum of the dice is nine or more, I'm going to have to pay Rachel $108.
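The rules just described can be sketched in a few lines of code. This is a minimal Python sketch written from player one's perspective, not the Shiny application used in the talk; the names `payout` and `roll_once` are hypothetical and chosen only for illustration.

```python
import random

WIN_THRESHOLD = 8       # player one wins when the sum is eight or less
PLAYER_ONE_WIN = 45     # dollars player one receives on a winning sum
PLAYER_ONE_LOSS = -108  # dollars player one pays when the sum is nine or more


def payout(die_one, die_two):
    """Return player one's net payout for a single roll of two dice."""
    total = die_one + die_two
    return PLAYER_ONE_WIN if total <= WIN_THRESHOLD else PLAYER_ONE_LOSS


def roll_once(rng):
    """Roll two fair six-sided dice and return (die one, die two, payout)."""
    die_one = rng.randint(1, 6)
    die_two = rng.randint(1, 6)
    return die_one, die_two, payout(die_one, die_two)


# One simulated roll; the result differs from run to run.
print(roll_once(random.Random()))
```

Under these rules a sum of eight pays player one 45, while a sum of nine costs player one 108.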

So as you can see, we just selected one roll to simulate. And what we're going to focus on is the results here, this rolls detail section. So die one was a two, die two was a six. The sum of that is eight. So I win $45. And this is randomly generated. So I'm just going to continue to click "roll them," and you'll see that there's some variance here.

So let's bump this up to 100 and roll this. Okay, now obviously, the payouts are getting a little bit bigger. I just won a little over 3,000, but I had to pay Rachel 2,700, so the net difference is 675. So now as I'm starting to increase the number of rolls to simulate, what I'd ask you to focus on is this section in here, the summary results.

Okay, so we just ran this 10,000 times; we simulated 10,000 dice rolls. The more times we play this, the more I'm winning. Hopefully, that's the takeaway here. Another takeaway is that there's certainly variance associated with it. So let's go ahead and deconstruct what's going on here.
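The experiment of rolling once, then 100 times, then 10,000 times can be reproduced with a short simulation. What follows is a hedged Python sketch of the idea, not the speaker's Shiny code; the function name `simulate`, the dictionary keys, and the seed value are assumptions made for this example.

```python
import random


def simulate(n_rolls, seed=None):
    """Simulate n_rolls of two dice and summarize player one's results."""
    rng = random.Random(seed)
    won = 0   # total dollars player one collects (sum of eight or less)
    paid = 0  # total dollars player one pays out (sum of nine or more)
    for _ in range(n_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total <= 8:
            won += 45
        else:
            paid += 108
    net = won - paid
    return {
        "rolls": n_rolls,
        "won": won,
        "paid": paid,
        "net": net,
        "net_per_roll": net / n_rolls,
    }


# Small samples swing widely; larger samples settle toward a steady per-roll edge.
for n in (1, 100, 10_000):
    print(simulate(n, seed=2021))
```

With only a handful of rolls the net result can easily be negative; as the number of rolls grows, the net per roll tends to stabilize, which is the pattern the summary results in the demo illustrate.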

Expected value framework

So the general framework for this game, and just in general, that I like to apply is that expected value is equal to the sum of the probabilities multiplied by the payouts. And some of you that are in academia might be looking at this saying, hey, wait a second, there's this great summation notation that highlights exactly that. I put this on here in this context for two reasons. One is that as an undergraduate math major, I reached a certain point where I was writing proofs, and I said, you know what, if I have to put up some sort of formula, I'm going to try to deconstruct it into simpler terms. The second reason I put this on here in this context is that when you're presenting to non-technical audiences, whether that's a coach, an executive, or someone in your business, I think sometimes the notation can be a little off-putting.
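For readers who do want the notation, the summation form alluded to here is presumably the standard definition of expected value for a discrete outcome. The symbols below are not from the slides; they are the conventional ones, with p_i the probability of outcome i and x_i its payout.

$$
E[X] = \sum_{i} p_i \, x_i
$$

In words, this is the same statement as above: multiply each payout by its probability, then add the results.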

So when we add the players, you can see the advantage. The advantage is certainly to player one. Player one has more numbers. So there's 26 out of the 36 squares. That's where the advantage is to player one. But recall that player two has a much higher payout. So player two has fewer squares, winning squares, but there's a much higher payout that's associated with player two.
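The count of winning squares can be checked by enumerating all 36 combinations of two dice. This is a small Python sketch under the rules stated earlier (player one wins on a sum of eight or less); the variable names are illustrative only.

```python
from itertools import product

# Every (die one, die two) combination on the 6 x 6 grid.
outcomes = list(product(range(1, 7), repeat=2))

# Split the grid by which player wins each square.
player_one_squares = [pair for pair in outcomes if sum(pair) <= 8]
player_two_squares = [pair for pair in outcomes if sum(pair) >= 9]

print(len(outcomes), len(player_one_squares), len(player_two_squares))
# prints: 36 26 10
```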

So again, going back to that calculation of why this works, why expected value. So when you go through the math, again, it's the sum of the probabilities multiplied by the payouts or the outcomes. For player one, again, it's 26 out of 36 with a payout of $45. Player two wins 10 out of the 36 times, and there's a negative payout of $108 for player one when player two wins. The way the game was constructed, it was not rigged from a variation standpoint. It's just using the sample function within R. But the way that the game was constructed, it was rigged for player one. Every time those dice were rolled, player one had a $2.50 advantage.
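Written out with the numbers from the game, the calculation for player one is:

$$
E[\text{player one}] = \frac{26}{36}(45) + \frac{10}{36}(-108) = 32.50 - 30.00 = 2.50
$$

So player one's expected gain is $2.50 per roll, even though player two's individual payout is larger.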

So from player one's perspective, if somebody offered me that game and we could play that all day, I would play that game happily all day, every day, because I know that the more times I roll those dice, the more I'm going to win. There's a small advantage, and that's going to equate to a significant gain in the long run. Another piece is that there's variation. You've got to be able to withstand the losses.
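To put a number on that variation, one can compare the per-roll edge with the per-roll standard deviation. The sketch below is an illustrative Python calculation based on the game's stated probabilities and payouts; the standard deviation is not a figure from the talk, and the variable names are assumptions for this example.

```python
from fractions import Fraction
from math import sqrt

p_win = Fraction(26, 36)   # probability the sum is eight or less
p_lose = Fraction(10, 36)  # probability the sum is nine or more
win, lose = 45, -108       # player one's payout in each case

# Expected value: probabilities multiplied by payouts, then summed.
mean = p_win * win + p_lose * lose

# Variance: probability-weighted squared distance from the mean.
variance = p_win * (win - mean) ** 2 + p_lose * (lose - mean) ** 2
std_dev = sqrt(float(variance))

print(float(mean))        # prints: 2.5
print(round(std_dev, 2))  # prints: 68.53
```

A single roll swings by roughly $68 around an average gain of $2.50, which is why short runs can lose money even though the long-run expectation favors player one.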

R Markdown is, at least in my mind, a really powerful and undervalued tool for a really powerful and undervalued component of data science, which is communication.

We all know, or most of us probably know, that the tidyverse and all of its related packages make the analysis side of data science much easier. And I'm hoping today to convince you, if you're not already convinced, that R Markdown and all of its associated packages do kind of the same thing in making the communication piece of data science much easier, similar to what, again, the tidyverse does for the technical side. Because I personally think that a lot of data science education falls a little bit short in terms of really emphasizing the communication piece, which, again, is a really essential piece of your general analysis pipeline, workflow, whatever, because without this communication piece, your analysis can kind of exist in a silo.

And this communication can mean communicating with yourself. Probably all of us have had the experience of opening an R script, or whatever language you work in, from like six months ago, and you're like, well, I wish I had commented that better. And you don't remember the decisions that you made. So focusing on communicating clearly with yourself, first of all, is super important and can really save time. And that's also true if you're talking about communicating with your teammates, as in the immediate people who you work closely with on the same projects: being able to really clearly document, again, your decision making process, why you chose these certain elements, maybe why you chose this model, why you chose to visualize this data in a certain way.

And again, kind of most importantly, communicating to people who are above you on the organizational ladder. Because sometimes, or excuse me, most of the time, the person that is doing the analysis is not the person who is actually responsible for making the decisions based on those results. And so, being able to adequately explain your results to different layers of people is key. Again, if you were explaining your analysis to a teammate, it would not include the same kind of details and context as if you were explaining it to an assistant coach, for example.

Which ties perfectly into something that Mitch said. We did not even coordinate this, but I was so happy when I saw this on one of his slides, I even wrote it down. If you can't clearly explain your work, don't expect a decision maker to buy in and use it. Which is really so true. And it is a truly essential skill to be able to, again, distill your analysis and either apply or remove context and technical details based on the audience that you're trying to present to. Because, again, if you can't convince, say, your coach of the results of, you know, your analysis on timeouts, then your analysis is basically pointless.

The R Markdown workflow

Many of you have probably either been at this stage of a workflow, or you know someone who's in this stage, or hopefully maybe you've helped someone who's in this stage, who's using Excel for their data analysis. And I'm never someone who, I guess, pooh-poohs Excel. Like, Excel is a great tool, and being proficient in Excel is a really essential skill, I would think, for a lot of data analysis roles. But a lot of people do take the capabilities of Excel a little too far when I think they could really be better served by using a tool like R, the tidyverse, all of its associated packages, etc.

So, hopefully, again, people move to this next step, which is an amazing step. Again, just taking the data analysis and making that more reproducible using R is amazing. I personally stayed at this step for a very long time and, again, this is miles better than this first step I showed. But I think some people can stay in this step too long, as I said I did, where I was really pleased with all the efficiency gains I got in moving my analysis work into R, but I was still handling all the data communication pieces by keeping all my documentation in Word documents and creating all my slide decks in PowerPoint to, you know, convey the results of my analysis to various higher-up audiences.

And so what I'm hoping to convince people of is that the ideal workflow is actually to substitute R Markdown for all these data science communication needs, because there are so many different output formats that are possible with R Markdown, and keeping everything in the same universe, in the same ecosystem, will kind of streamline the communication side of your data science work, just like using R, the tidyverse, etc., really streamlines the analysis part of your data analysis work.

R Markdown output formats

So, we talked about how, you know, you can, you need to communicate at different levels, and many of you might be familiar with kind of what I call like a classic R Markdown document, which doesn't take any kind of special package, it outputs an HTML file. These can be really useful, again, to yourself and your teammates, as they really easily incorporate code, and plots, and text. This example is actually a lab assignment from my course, but it's a good example of how you can very easily combine text, and code, and plots into a simple HTML file that's, again, very easy to share, and it really makes you focus on reproducibility. I make my students use R Markdown, because part of the class is learning about how to get used to a reproducible data science workflow, and if I cannot, you know, reproduce your HTML file on my computer, you lose points.

So that is a great step, and then also, when you need to, you know, start presenting to some layers of people above you, hopefully, maybe there are some coaches out there who would appreciate it, but you're probably not going to send them your really long standard R Markdown file that has all of your different code decisions and your plots. Like, that's too much information, but thankfully R Markdown has lots of other options. You can create dashboards with flexdashboard, and you can create slides with xaringan. These slides I'm presenting today were created, again, with R Markdown, with xaringan.

And we can move away from the land of having to screenshot all of your plots and then pull those into your PowerPoint. Instead, in R Markdown, you can just refer to chunks that already exist, which is, again, much easier. So when you make a change, either you get new data or you maybe make a change in your model, you don't have to worry about updating, again, all these screenshotted plots in your slides.

And then, lastly, this might be more on the personal side: if you ever want to communicate to the public, again, there are lots of R Markdown driven packages that do that. bookdown makes it very easy to create, like, online books. Most of the online books you've seen, the popular ones about R related topics, were built with bookdown. There are also several different R Markdown driven packages for website development. distill is a very popular package. I personally have my course website that I created with distill, and, actually, as I mentioned, everything about my course is completely created in R Markdown, from my lecture slides to, again, assignments, etc. And then there's another popular R Markdown driven package called blogdown that I use for my personal website.

And, again, these types of packages, again, for website development, being R Markdown driven means you build them with R Markdown documents, build them in RStudio, and even, we're getting a little meta here, but the blog posts themselves, like this one I just chose as an example, is also driven by R Markdown, so, again, it's super easy to combine text and code, which is easy for someone to copy, which is helpful for education purposes, as well as, of course, plots.

And there are, again, even several other categories that we didn't get into today. There are special packages for creating journal articles. There are packages for creating interactive tutorials, hosted, again, through HTML online, which is super useful if anything about your job involves any kind of education component.

So I just kind of, you know, spat out the names of, like, half a dozen packages, and you're probably like, oh my god, there are so many different packages to learn. But one of the bright sides is just like how, if you learn the basic syntax of the tidyverse, that allows you to easily extend it into other related packages, like tidymodels and ggplot2. The exact same is true for R Markdown, and I would argue it's actually even much easier. Once you have the basic syntax of R Markdown, again, as I mentioned, you can create things in chunks, and there are various options as to what you can do if you want to show or hide code or plots, etc. And once you're familiar with the basic syntax of R Markdown, all you have to do is, again, learn the specific details about the various output formats in the packages, and it's really very easy to just take whatever your analysis work is and make that into different output formats.

So, I hope, again, it was difficult to distill all of my love for R Markdown into 10-ish minutes, but I hope that you are either inspired to include R Markdown, to start including R Markdown as a part of your workflow, or if you already use R Markdown a little bit, like maybe you work, you kind of do your personal analysis in R Markdown documents, which is a great start, is that you realize kind of how much knowledge you already have, and you can really easily extend those capabilities into other formats and create slides and websites, et cetera.

And so, I've included a few links here. These slides will be posted, by the way, on my website and my Twitter. This first link here is just a basic R Markdown tutorial on the RStudio website, which is really great for just, again, learning the basic output formats and capabilities of R Markdown. And I've also linked here to R Markdown: The Definitive Guide, which is an online book, of course, created with bookdown, that, again, goes into lots of details on all the various output formats. And then lastly, Allison works for RStudio and is, in my mind, at least, the queen of R Markdown and has produced so much great content on so many different aspects of R Markdown.

So, again, I hope this has inspired you to incorporate some more R Markdown into your life, and I wish you luck on that journey. And before I leave, I just wanted to mention, since we have such a great collection of people here who are interested in R and interested in sports analytics, that Zelus Analytics, the company that I work for, which is an industry leader in the sports analytics space, is about to be hiring in several different roles. We're going to be hiring for data scientists and data engineers and analytics engineers, both at junior and senior levels. It's a really fun, really smart team that's doing a lot of cool work across multiple sports.