Resources

RStudio Sports Analytics Meetup: SportsDataverse Initiative

video
Jun 28, 2022
1:08:26

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

for joining. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey, actually calling in from Connecticut today. We are streaming out to LinkedIn and YouTube Live. If you've just joined now, feel free to introduce yourselves through the chat window and say hello, maybe where you're calling in from. For today's meetup, we're joined by Saiem Gilani, founder of the SportsDataverse. The SportsDataverse is a set of open-source sports data packages that work in harmony because they share common data representations and API design.

Just a few notes during the meetup, you will be able to ask questions. You can either put them into LinkedIn where you're watching or on YouTube. We also have a Slido link that I'll share in the chat so you can ask questions anonymously as well. But just so you know, if you do ask questions, you'll be part of the recording as well. So right when the meetup is over, the recording will be shared to YouTube, which is one of the nice things of doing it on YouTube Live. It's there immediately.

But for anybody who is joining this meetup group for the first time, this is a friendly and open meetup environment for teams to share use cases, teach lessons learned, and just meet each other and ask questions. So thank you all for making this a welcoming community. Together, we're dedicated to providing an inclusive and open environment for everyone. So we want to create spaces where everybody can participate and we can hear from you all, regardless of your level of experience or area of work, too.

But with that, thank you again for joining us. I would love to introduce Saiem and pull him up on stage here virtually. Saiem Gilani is the Director of Data Science and Engineering for the Houston Rockets and the founder of the SportsDataverse. Saiem, thank you so much for joining us.

Absolutely. It's a pleasure to be here. I am very grateful for the opportunity to talk about the SportsDataverse. And we consider it an initiative because everything is in a constant state of work in progress: both making resources exist and then maintaining them to a simple standard.

The topic of the conversation is generally going to be about how the SportsDataverse is trying to develop lasting solutions for accessing sports data, creating analytics based on the open source data we have available, and creating public utilities for the community to both use and enhance as research progresses.

And so the main goals are to create high-standard data resources for the sports analytics community and, in addition, to create pathways to make the sports analytics industry more diverse, inclusive, and accessible. The primary goal here is to lower the learning curve that goes into becoming a competitive candidate in the sports analytics field.

And so some of the solutions we brought about are building an extensive set of open source sports data repositories, creating the packages to load the data from Python, R, and Node (we'll be primarily focusing on the R packages, given that this is an RStudio presentation), and establishing a bench of developers from diverse backgrounds to spearhead projects and make contributions within the packages of the SportsDataverse.

And the second prong of this goal is to bring women's sports data analytics research on par with the level of resources available for men's sports in the public space, and generally to make more strides toward an analytics space that is equitable for both sides.

About Saiem and the origin of the SportsDataverse

So as Rachel mentioned, I'm the lead engineer for the SportsDataverse. I am an ML engineer by trade. I'm currently working for the Houston Rockets as the director of data science and engineering. I previously worked in healthcare and medical malpractice, freight supply, as well as online data science course development. And most recently, I was working for Deloitte as a consultant in cybersecurity.

I think the roundabout way of how I got into sports analytics, the open source side anyway: I've always been into sports analytics since basically I found out that they were tracking numbers in sports. That was my first foray into it. And then as I got into the open source space, I found the need to work with college football data while I was contributing to Tomahawk Nation, the FSU SB Nation site covering the Florida State Seminoles, which is where I'm from. I'm born and raised in Tallahassee.

I bleed garnet and gold, for sure. And so that's how I started contributing to my first open source sports project with Meyappan Subbaiah and Parker Fleming: cfbscrapR, which later became cfbfastR, modeled after nflfastR and the nflverse.

So we call it an initiative because it's more a goal than it is anything else: to bring together an incredibly remarkable set of people who can code at a high level, can follow standard guidelines for good code practice, and can actually create reproducible and durable data pipelines for the entire community to benefit from.

Because all of us in the sports analytics industry, and in any data analytics industry, will have to create data pipelines in order to do modeling. And so that's the first step of our initiative: how do we make getting sports data easy?

And could we get further if we actually built the data infrastructure together? With shared, standard open data sets, we can validate models quicker and create better prototypes that can be easily verified and reproduced.

The SportsDataverse community and data repositories

So the SportsDataverse is several things, but it's basically a catch-all term for the community of people that support it. And so we are nothing without our development team. I'm very grateful for the contributions that everybody makes to the packages, data repositories, and pipelines that really make the entire thing run.

And so the community is the piece that develops the projects, maintains them, and mentors younger developers, who will hopefully in turn become future maintainers and developers of further research and packages within the SportsDataverse and in the sports analytics domain broadly.

And these packages all typically operate from at least the data-fetching side. They typically have corresponding data repositories, which allow for fast loading of our data at basically whatever speed your internet connection and available RAM allow. We've created one of the largest open source sports data resources, at over 250 gigabytes produced in various formats, and that's just from the four or five sports and leagues that I've worked on.

Others have also contributed significant and comprehensive amounts of data. It's incredible to see how much we could fit within GitHub's free limits. It's truly incredible.

And so I think the thing that makes the SportsDataverse relatively more appealing, beyond just having fast pre-scraped data, is that the function names follow a pattern and tell you where the data is coming from. If a function name starts with load, build, or update, it is using the data resources created by the package developers. And if it's directly interacting with a website, the name will give you an indication of the source: ESPN data, the NBA Stats API, the NCAA website. Those are all assumed to be "get" functions.

So we provide access to these functions, but they should be used carefully, with a proper rest in between requests. Be polite when you're scraping with those functions, because you are directly interacting with an open data source.
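As a rough sketch of what being polite looks like in practice, here is a loop that rests between calls to a live-source "get" function. The function `espn_wnba_scoreboard()` and its `season` date argument are taken from wehoop's ESPN-prefixed family of functions; treat the exact signature as an assumption and check the package documentation.

```r
# Sketch: politely looping over several live "get" calls with a rest
# between requests. Assumes wehoop is installed and that
# espn_wnba_scoreboard() accepts a YYYYMMDD date string.
library(wehoop)

dates <- c("20220610", "20220611", "20220612")
scoreboards <- list()
for (d in dates) {
  scoreboards[[d]] <- espn_wnba_scoreboard(season = d)
  Sys.sleep(2)  # rest so we don't hammer the open API
}
```

The `Sys.sleep()` call is the important part: any fixed delay of a second or two between requests keeps the scraper well-behaved against a public endpoint.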

And so what makes the SportsDataverse packages a bit more attractive than most scraper packages is the fact that they are backed by data repositories, which allow loading of pre-scraped data. This allows for much faster access and also standardized pipelines, for both ingest and modeling.

So you can use the package data repositories as your starting point, build pipelines on top of that, submit a pull request to our data repository, and have your models become part of the pipeline and used in summaries. You can basically create verified open source models, which I think is the next step toward the next phase of the SportsDataverse.

Packages overview and naming conventions

And so we can talk about the various packages within the SportsDataverse. I'm only going to focus on a couple, just because they are readily available on CRAN and I also wrote them. I've actually had quite a rough weekend coming back from Houston. We just got done with the NBA draft, and I returned home to find that my home PCs are all fried. And so, where I had a lot of my presentation written but not pushed, I am locked out of my PC, and I'm having to use a different computer to create this presentation.

And so I was not able to follow the guideline I'd set in the abstract of this presentation of going through all the various packages and all the various functions added to wehoop. It just became a full WNBA Stats API scraper: we added 104 functions, which brings it on par with every available function within hoopR, which covers the NBA Stats API. So wherever there's available data for the WNBA, wehoop will have the functions for it.

And so this is just a simple installation from CRAN. And let's demonstrate quickly how the function naming allows for easy transfer of knowledge between each of the different packages.

So once we load the libraries, you'll notice that whenever we use the load functions, the pattern is load, then the sport or league (here, college football), then play-by-play, and then the year. And it can be a vector of years; it does not care. For the NFL, I would highly recommend using the nflverse and nflfastR. We basically took the idea of nflfastR and the nflverse and applied it very successfully to create basically identical frameworks for loading data for various sports.
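In code, the install-and-load pattern described above looks something like this (a minimal sketch; the seasons chosen are arbitrary):

```r
# Install cfbfastR from CRAN, then load several seasons of college
# football play-by-play in one call; `seasons` accepts a vector of years.
install.packages("cfbfastR")
library(cfbfastR)

pbp <- load_cfb_pbp(seasons = 2019:2021)
```

The same call shape (load, sport/league token, pbp, seasons) recurs across the other packages.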

Basically, we took the idea of the nflverse and applied it to college football, the NBA, the WNBA, men's college basketball, women's college basketball, the Premier Hockey Federation, and the NHL. Is that all of them? And with more to follow. And others have followed this example: worldfootballR by Jason Zivkovic, an excellent soccer package within the SportsDataverse. He is also working on something very similar to make loading easy from worldfootballR.

And so for men's college basketball and women's college basketball, it's the same function; you're just changing it to men's basketball or women's basketball, and this is collegiate. It is this simple: it does not take a rocket scientist to figure out this naming convention. I tried to make it as foolproof and easily accessible as possible. If you know how one of these packages works in terms of function naming, you will have a much easier time understanding what is going on in each of the other packages.
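Following that convention, the basketball loaders differ only in the sport/league token. This sketch assumes the published names in hoopR and wehoop:

```r
library(hoopR)   # NBA and men's college basketball
library(wehoop)  # WNBA and women's college basketball

mbb  <- hoopR::load_mbb_pbp(seasons = 2022)    # men's college basketball
wbb  <- wehoop::load_wbb_pbp(seasons = 2022)   # women's college basketball
nba  <- hoopR::load_nba_pbp(seasons = 2022)    # NBA
wnba <- wehoop::load_wnba_pbp(seasons = 2022)  # WNBA
```

Knowing one of these four calls effectively teaches you the other three.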

And it's the same idea for fastRhockey: you can load the NHL play-by-play, and you can load the Premier Hockey Federation play-by-play. And honestly, this is already out of date since I last posted it.
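The hockey loaders follow the same shape (a sketch under the same naming assumptions as above):

```r
library(fastRhockey)

nhl <- load_nhl_pbp(seasons = 2021)  # NHL play-by-play
phf <- load_phf_pbp(seasons = 2021)  # Premier Hockey Federation play-by-play
```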

But basically, these are a subset of the packages within the SportsDataverse. We have, I think, seven packages on CRAN right now: cfbfastR, hoopR, wehoop, baseballr, fastRhockey, worldfootballR, and toRvik, which scrapes Bart Torvik's website. And I haven't even announced it yet, but oddsapiR is also on CRAN. Oh, and sportyR, that's the other one that I forgot.

Community, contributors, and the Game on Paper demo

We basically have an incredible number of people to thank as part of this initiative. The community of developers that help, you know, maintain all these different packages are extremely valuable. And I very much appreciate all their help.

And the front end of the SportsDataverse is eventually going to look something like Game on Paper, which uses the college football endpoints of the various packages to generate entire stats pages for live games like this. We can see win probability, expected points added, and various aggregations at the team level. These are the basic ideas that we can continue to develop with support from the community and driven developers.

So just as an example, I asked somebody to produce a Shiny app using one of our packages, wehoop, because I was short on time and wasn't going to be able to get everything done. And he was able to throw this together in 60 minutes: a pretty incredible, very quick application that can be easily adapted by others.

Q&A

While it's pulling up, just wanted to remind everyone, if you want to ask questions, you can just ask them in LinkedIn or YouTube while you're watching, or you can use the Slido link too.

So I see a question from a good friend. Robert Frey is asking: what currently is holding baseballr back from loading full seasons of play-by-play data?

There is honestly no issue with trying to make it happen other than my own personal availability. That's literally it. I have so much that I'm, like, actively working on. And, well, things will fall short if we don't have enough other committed people to just take the reins and actually make it happen. It's just a matter of getting more people familiar with the infrastructure of how the entire organization works.

We've had a lot of, you know, difficult to navigate changes for anybody who has been new as I've been rapidly setting up all these different repositories and organizations, getting everybody coordinated, figuring out who actually is a GitHub contributor, and then allowing them the space to learn how to do it.

I'm not perfect about being a proactive leader on every single piece of this, but I'm just trying my best, I promise. And it's just, like, it's a challenge. I've been learning all this as I've been going, and it's not always easy because I'm not that good of a programmer, I'll just be honest with you. I didn't really learn R until about grad school, and that was only a few years back. And now I have five, six authorships on CRAN packages. So, it's been a very interesting challenge to get all this worked out.

So, somebody asked, what significant improvement do you see or want to see to the Sports Dataverse accomplishing in the next year? So, this is an excellent question, because it's something I think about a lot and have trouble prioritizing where I want the changes to happen most. Because there's, you know, there's the hard goals, which is, hey, can we make this model happen and make it reproducible and make it run every night so that it's updated daily?

But moreover, I think my biggest goal is, you know, creating the kind of organization, open source organization, that becomes the first stop for teams and companies in the sports analytics space who are looking for developers and analysts and data scientists and machine learning engineers and everything in between. Because you can clearly see that they are making important contributions to the actual community and, you know, being helpful within the community to make it better for everybody.

I just want to say thank you so much for all that you do for the community, and I see people are commenting that in the chat as well. And a lot of love for the Sports Dataverse. I didn't want to interrupt and start asking questions if you are still going through parts of the presentation, but I see there are a lot that are coming in too, so you just let me know. Yeah, I mean, I'm fine to take questions wherever.

I honestly had so much planned, and I have stuff I can show you that just isn't quite ready. But basically, the presentation I was supposed to do today was going to be on building regularized adjusted plus-minus models for the WNBA. And so I would expect that by version 2.0 of wehoop there will be regularized adjusted plus-minus metrics available for players going back to at least 2015, wherever we can get lineup data. The data is already available in the data repository; I just have to make changes to bring it into the package through the package functions. But basically, yeah, the data is all there. We have it all worked out. We just need to make a regularized adjusted plus-minus model, and then we are in business, which is nice. It's a very reasonable standard of a metric that everybody agrees is at least not noise. And so we're excited about that.
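For a sense of what a regularized adjusted plus-minus fit involves, here is a minimal sketch using ridge regression via glmnet. The input format (a stint-level data frame with one indicator column per player, +1 when on offense, -1 when on defense, and a net-rating outcome column) and the column names are assumptions for illustration, not the actual wehoop pipeline.

```r
# Hypothetical RAPM sketch. Assumes `stints` has one +1/-1 indicator
# column per player and a net points-per-100-possessions outcome.
library(glmnet)

fit_rapm <- function(stints, player_cols, y_col = "net_rating") {
  X <- as.matrix(stints[, player_cols])
  y <- stints[[y_col]]
  # Ridge regression (alpha = 0) shrinks noisy per-player coefficients
  # toward zero; cross-validation chooses the penalty lambda.
  cv <- cv.glmnet(X, y, alpha = 0)
  coef(cv, s = "lambda.min")  # regularized per-player estimates
}
```

The shrinkage is the "regularized" part: without it, players who share most of their minutes would get wildly unstable plus-minus estimates.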

I can pull, I'll pull a few of the questions over that are coming over on YouTube as well. But I see Jeff had asked: is there anything special about the relationship between the SportsDataverse in Python and R? Are they effectively the same initiative?

Oh, absolutely. Yeah. I trimmed a bunch of the Python and Node parts out of this presentation. But there are simultaneous efforts being made in both Python and Node: at the very least for the Node version, making the scraper-level functions available as Node functions, and the Python side having the exact same capabilities for both loading and scraping, with the additional benefit of it being only one package, because modules within the package make it a little more convenient.

There's only one package on the SportsDataverse Python side. Well, I shouldn't say that: there is one other package that individual contributors make, and I need to figure out a better way to streamline integrating other people's work on the Python side. That's honestly the biggest hangup keeping it from becoming just as robust in its implementation.

Thank you. I see Keith asked a great question over on Slido about how people can contribute. Keith had said, how about sports outside of North America, like European ice hockey or rugby? How can we contribute here?

Yeah, this is probably my own fault. The people I've brought in are, for the most part, like 95% North Americans and 5% people from around the globe. And that's somewhat by design, somewhat by simple geography and not knowing everybody in the community. So we've definitely been making some outreach. As far as specific examples of non-North American sports, worldfootballR is a pretty vast soccer package.

Though if somebody wanted to make contributions on, say, something like rugby or European ice hockey, I would suggest European ice hockey go with fastRhockey. I'm trying to bring sports into one package each, divided by league ideally, to keep to a minimum the number of package names you have to remember. Like, I only want one hockey package; I only want one basketball package, ideally. But if I were to combine wehoop and hoopR, it would turn into a 350-function monstrosity of a package, and I don't want to anger Brian Ripley, just to be honest.

And it would be incredibly unmaintainable very quickly. That's how things get taken off of CRAN, and it's just wiser to limit the scope as much as possible. But I'm very open to people talking to me about contributing on non-North American sports. I am very into it. I just don't have the personal knowledge of and, like, understanding of the various leagues that exist outside of the United States and North America. So very open to learn, very open to accept contributions.

Like, I, this was just an idea that lots of people in this community have had, and I just tried to execute it. And as I've tried to execute it, I've been like, wait, I need to change how I do this every step of the way. And it's just a constant learning process.

Like, for example, Jacquie Tran made a very wonderful package for the Women's National Basketball League of Australia, and I did not know that league existed. So we have to figure out how to make more data accessible through these packages to give broader coverage, because I really want there to be equitable coverage across women's sports, and not just North American leagues. I focused there because I knew there was a lot of data there, and I also knew what the websites were. That's basically how this whole deal works.

Somebody's like, hey, we should scrape this resource, and somebody does it, and we just figure out how to cobble it all together into a useful format.

Awesome. And, Saiem, just for people who are interested in contributing, is it best to go through the SportsDataverse website or through GitHub, or what's the best way to get in touch with people? The easiest way to get our attention is to message myself on Twitter, Saiem Gilani, or the SportsDataverse Twitter, also at SportsDataverse, or the cfbfastR one if you're looking for something college football specific. As well as anybody you know to be affiliated through the GitHub organization: start talking to us. That's all it takes, and we'll invite you to our community and have a talk about how to bring in whatever idea you have.

I'm not super picky. I just, like, we just have to talk about it, and we have to make it happen. That is just like a, it's a process of talking about it, and bringing together a plan, and making it, making it work.

Tips for breaking into sports analytics

That's great. I see Samra asked a question, which I know comes up in a bunch of data science Hangouts, as well. I want to pursue a career in sports analytics. What resources do you recommend?

Mathletics is an excellent book for modeling. If you haven't read Basketball on Paper by Dean Oliver, that's an excellent introductory tutorial. And a very recent addition to my collection is The Mid-Range Theory by Seth Partnow. I've primarily focused on basketball, but there's an incredible number of resources; football analytics in general has come a long way. I would start by following the introductory tutorials from various people in the SportsDataverse packages: cfbfastR has seven or eight vignettes and a bunch of different examples about how you can get started.

Everybody always puts their work out on Twitter. That's a great way to get exposure. Start putting your work out there. You don't always have to be mind-blowingly brilliant. I just started putting out logo plots, like everybody else, and then I started becoming a package developer, and things just kind of took off from there until I became one of the people who has produced the most open source packages.

It started somewhere. I was nowhere; I started doing open source development in February of 2020. So, two and a half years later, I'm now the director of data science for an NBA team. It's really just a matter of working hard, putting your work out there, having people see it, and having them know you in return. It's one thing who you know; it's another thing who knows you. You have to be able to put yourself out there to get that name recognition: oh yeah, I remember this person because of this analysis they did.

An incredibly valuable way to promote your own work and get feedback on it is using Twitter: make analyses, put a blog post up, get some feedback. There are so many people who are willing to help, especially if you start putting it in front of us. I can't overstate how important that piece of this is; we don't talk about it enough.


I love that. Just start sharing your work.

I see there's quite a few other questions coming in from LinkedIn and YouTube. One is, what are some of the plug-and-play analysis or functions that these packages have?

Oh, so probably the most developed one is cfbfastR, because that was initially modeled after nflscrapR, written by Ron Yurko, Sam Ventura, and Maksim Horowitz, which in turn was part of the nflWAR paper, which I highly recommend reading.

Basically, you can add pipelines to any of the existing play-by-play functions to get various levels of expected points added and win probability metrics. Not all the packages have them. The presentation today was actually supposed to be me developing a new metric for the Women's National Basketball Association, because that's basically the next phase of this. A lot of it is data "get" functions, compiled into a loadable data set, but the goal is for people to create pipelines on top of it, submit those pipelines to us to incorporate into our data repositories, and in turn have them made available to end users through the package functions.

And so that's really where the next piece of this goes. The packages fall into one of three categories, not all of which exist yet. There's data scraping, which is basically all of them right now; there are modeling packages, which may just store models or methods to create models; and there are data visualization packages, like cfbplotR, mlbplotR, or sportyR, which allow you to create very useful visualizations or tables, depending on what your goal is.

But the real thing that I want to see get started in the next phase of the SportsDataverse, beyond just covering more sports and leagues, is taking the next step of incorporating other people's models that they want to contribute to the open source space, so that they can become a standardized method that everybody can reference. Like, hey, this person made an adjusted plus-minus model, and it takes these certain things into account; we can document it, add it to our nightly data repository load for that sport or league, however the pipeline works, and then it's made available for users every time we run the nightly load.

Thank you. I see, I know we touched upon this already, but I see a lot of the questions coming in are focused around, like, tips for shifting into the space of sports analytics as well, but do you have any specific tips for shifting from a different industry into sports?

So, yeah, I guess I should probably tell you a little more about myself. I was in healthcare analytics on the actuarial side, and in medical malpractice briefly, just as an analyst, where I wound up using a lot of Excel, VBA, SQL, and SAS. Learning SQL helped me a great deal. SAS is very useful as well, but given that it's paid software, I think it's not the most useful in this space, given that so many data science solutions are built on open source software.

So I wound up getting some hands-on experience in Python and R modeling while I was working for startups, and that was very useful, and in turn it got me into grad school. I went to Georgia Tech, their online Master of Science in Analytics program: a very exceptional program, very affordable, and I learned a great deal of programming methods, languages, techniques, and frameworks, as well as the math behind it, which was an exceptionally valuable piece of my understanding of how everything works in stats.

Because you have to have both: you need to work on your programming skills and your breadth of understanding of how different pieces in the stack fit together, being able to manipulate data in those frameworks or languages, and then being able to understand the stats and math that are useful for your data sets. Understanding what your data is telling you, what it can do, and what it can't do, as far as building models that are useful, is the bread and butter of how you become a valuable data scientist.

Thank you. I see that there are a lot of people commenting in the chat or asking if other people from the community are interested in working with certain sports data, like volleyball data, for example, and so I did just want to take a second to call out that this channel exists, so they're on the R for Data Science online learning Slack community. There's a channel called Chat Sports Analytics, but I thought it could be a good place for people to connect even after the meetup, too, so just wanted to leave that up there. That's the link to join, just r4ds.io slash join, and the specific channel itself is the one in blue there.

Yes, we actually do have a private Discord that we invite people to, in spite of our, like, you know, open and accessible mantra, we try to keep the conversation to people who are actually trying to help with the packages so we don't have a fully open community as far as, like, talking to us every single day and working on packages directly with us, but it's not super hard to get an invite as long as you're about it. You just have to talk to me about it, show me your GitHub, and that's pretty much it.

Awesome. So, reach out on the Sports Dataverse Twitter, right? Yes. Awesome. This isn't really a question, but I really love this comment, the anonymous comment on Slido was, this is the truest demonstration of you don't need to be the most, like, elite programmer or do it all alone to make a really useful package, so thanks for sharing that, too.

I appreciate that, because a lot of days I'm just like, wow, I can't believe anything I write works, and, like, people find it useful, all right, fantastic, but it's true, you really don't have to be, you know, elite to make a difference, you know, because as long as you try and execute an idea, doesn't matter if your code is always the fastest, it's nice if it is, but if you're just trying to get a job done and make it durable, sometimes that speed is not always the answer.


A few other questions are coming through. One is: are any of these packages providing real-time data, or is it just batch mode after every single game is over?

I believe pretty much all of the R packages have an interface directly to the website source that they pull the loaded data from. Basically, the data repositories work like this: we scrape data from one of the websites covered by the package. I think the minimum standard is that we have an ESPN version of the play-by-play, player box, and team box scores, compiled at the season level. That's usually done through package functions, and we run it every single night to make sure it's updated with the most recent data. So as long as it's available on their website live, it's available through the package live, if that makes sense, because we're almost always directly interacting with their APIs while games are live.
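So both paths exist side by side: nightly-updated repositories for batch loads, and direct site functions for live data. A sketch of the contrast, using wehoop (the scoreboard function's exact signature is an assumption; check the package docs):

```r
library(wehoop)

# Batch: pre-scraped season data from the nightly-updated data repository.
wnba <- load_wnba_pbp(seasons = 2022)

# Live: hit ESPN's API directly for today's games. This is a "get"-style
# function, so be polite about request frequency.
today <- format(Sys.Date(), "%Y%m%d")
live <- espn_wnba_scoreboard(season = today)
```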

Great, thank you. I also just wanted to take a second on this platform to share that the Women in Sports Data Symposium is coming up on August 20th, and I just put the symposium's Twitter account in the chat so you can follow it if you're interested in getting involved as well. We are highly supportive of them. They are doing excellent things, and it's going to be a fantastic conference. Please attend, and we are happy to discuss sponsoring if you need help getting there.

That's great. Yeah, it looks amazing. So, a few other questions. I, myself, am not sure exactly what this question means, but: what would be your opinion about DFT, density functional theory, as an application? I'm not sure if that relates to a specific package or not, so I just wanted to ask it. Okay, we can save that one, and if anyone has thoughts on it, feel free to share them in the chat, too.

So, I see this question from Rodrigo on YouTube: is Sports Dataverse focused only on providing access? Oh, we did cover some of this. It would be helpful if anybody wants to build useful reporting that they would like to have run every night. We are always willing to accept that sort of contribution, especially if it's something that can be updated every night. That'd be good.

Someone had asked about some of those books that you had recommended. Do you know if any of them are available in free online versions as well, kind of like the R for Data Science book? I cannot say that I do. I also really like the people who have made these books, so I would encourage you to support them if you can, but I honestly don't know of any free versions. It's worth checking out Konstantinos Pelechrinis' sports analytics coursework, though, because I'm sure he's included a bunch of excerpts from those books in his course notes and lectures.

Great. Thank you. One other question was, could this package be utilized for prediction purposes?

I'm assuming you're talking about a specific package, but in theory, that's exactly what all of these are for. Right now, they're mostly just pulling in data as it comes from websites, and there's not much being done to enhance it. I'm just trying to get the data ingest going and make it available so that you, the user and potential contributor to the Sports Dataverse, could then build a pipeline and say, hey, if you run these two functions, you can get this set of reporting and modeling done. Here's how you would train that model. Here's a proof of concept of how it works. Please add it. You make a pull request to us, or just talk to us about it, and it would get incorporated into the broader community and package.

That's the next phase. We want that. Please.

I'm muted. Sorry. Eugene had asked, it looks like you have quite the trophy cabinet behind you, which are your sports? Math. Math. Yeah, 100%. Love it.

Another question I realized I had missed from earlier is, what are some newer aspects of data analytics in the NBA that excite you? Well, there are a lot more stats companies now that are providing NBA data. That's always an exciting opportunity to get a check on yourself from various other very smart analysts in the field, which is great. There are also actual opportunities for new data being provided to us from existing providers, whether it be raw tracking data from Second Spectrum or from Zelus using Second Spectrum's data.

There's a bunch of people providing more and more interesting data, and you have to become adept at building these pipelines, because that's the work you have to do. The entire reason I think this is a very good route for people to learn the skills they absolutely need on the job in sports analytics is this: you're going to get new data sources. You're going to have to build data pipelines. You need to be able to do it fast. You've got to make them durable, make sure they don't break, and learn how to make them work no matter what.

That means working with tracking data, whether it be provided to you or creating your own tracking data using computer vision modeling and extracting coordinate systems. Learning how to work with very large data sets is the name of the game, because at the end of the day, you're going to be working with continuously bigger data sets the more data you are provided.

I just want to say, if I have missed anyone's question, feel free to add it into the chat again, just to make sure that I see it here. I know we're getting to the top of the hour, but I want to make sure I didn't miss any. Oh, yeah. Just on the baseballr side, the person who actually taught me the most about both baseball and working with R is Robert Frey, who asked the first question at the top of the hour. His YouTube channel was actually very instrumental for me in learning how to build a scraper package, because he shows you in incredible detail how to go about doing it, and I would highly recommend it.

He posted his YouTube page in the chat. I literally incorporated his work into baseballr, I learned it from watching his YouTube channel, and I became a co-author of baseballr as a result. It's a crazy roundabout way of taking someone else's idea, making it more functional, and being recognized for it: making concrete contributions to the open source community. Little ideas like that will make you beloved, to me at least.

Let's see. Just give me a second here to scroll through and see if there are other questions that we missed as well. I do also want to say, if people are going to be at the RStudio conference later in July, we will have a birds of a feather group for the sports analytics community, so it might be great to meet some people there and connect ahead of that, too.

Let's see. Any other questions you see? Oh, one other one. I see Austin just asked, how do you get your team to align and follow with your metrics? Any challenges there, or are team senior leaders more aligned with data analytics lately?

Well, I work in the best place for that, in the NBA, at least. Analytics and data scientists are highly respected in our organization. They are always asking about modeling, and I think it's incredible that this level of understanding exists throughout the organization. It's not just at the top. Even scouts are asking about the different statistical techniques and modeling, because they're incorporating it as shortcuts to what they know to look for.

You continuously get more and more benefit from having a top-down understanding of analytics and metrics within your organization, like what's useful and what's not. It cuts your search time to insight down by an incredible amount. It's so very useful. Everybody should be doing it, and I'm very grateful to be in a place where it is respected and understood as valuable. You can't exist without it anymore. You're just giving up too much of an advantage.

Okay, I know we're a minute over, so I think we have time for one or two more questions, if that's okay. I also just want to let people know, if you're on and listening, and you're hiring in the sports industry, or actually in any industry, and want to share your role in the chat, feel free to do so. That's perfectly okay as well.

But the other question was from Jeff. Great question. In your opinion, is the sports analytics community stronger in R or Python?

So, the open source community is a lot stronger in R, in part because of the excellent examples set by people like Tan Ho and Sebastian Carl of the nflverse, and Ben Baldwin and Tej Seth, people who are making all these tutorials using nflverse data. As a result, there's an incredible level of R knowledge within the open source community, and it's very public on Twitter. So, that might be part of what's influencing my opinion on this.

But the industry is very much into Python, because more often than not, it already exists within their pipelines, and the people building applications tend to be more familiar with Python than R, since they're usually not building Shiny applications for internal websites where they're displaying analytics. So that usually gives somewhat of a preference to Python, because they already know how to build the APIs using Python. But there's no reason you can't use R, definitely at the team level, or even at the stats company level, as long as it's not being leveraged more than several hundred times a second.

And so, it's very helpful to know both, actually. Part of the reason I was a valuable candidate was that I knew both, and being flexible is a very useful trait as your team gets smaller. Knowing a little bit of everything is very valuable. But if you have a CS background, I would say you probably need a lot less R. If you're coming from any background that isn't purely CS, I would suggest learning R first, because it's a little bit easier.

The learning curve is not that steep, and there are tons of examples out there. That's the other thing: you have loads of open code from useful developers around here who are very generous with their time and effort, and will show you how. The R community is excellent. The people are the most helpful I've ever seen, and you can get to know them very quickly. It's not that big of a group, so you learn who can actually help you with a given problem, and you latch on to those people, like, hey, you are my friend now. This is how it works.

I love that. Well, I have one more question I want to make sure we get to as well. And I see this was asked over on Slido. Keith asked, regarding players or teams following metrics, what is the percentage split of data consumed during the game versus outside the game?

I'm not going to answer that as a percentage, but by far, there's much more done after the game, or outside of the game, because you only have the bandwidth to focus on so many things during the game, especially if you're trying to communicate it to the coaching staff. So, you limit the amount of noise you take in during a game to only the most useful things they can immediately make a decision on or are expecting a response on. Like, hey, should we foul here, yes or no? What's our best strategy given this time and margin? That sort of stuff. But it's not like they're looking at an individual's plus-minus during the game and saying, oh, yeah, we need to get him out of here, he's a minus seven. No, that's not a thing.

Well, thank you so much, Saim, for your presentation and for being so open and honest with everyone and answering all these questions. This was great. I appreciate all the awesome questions in the chat as well, whether you're watching on LinkedIn or YouTube.

I really appreciate everyone's time. I wanted to put the Sports Dataverse Twitter up on the screen one more time if people are interested in collaborating or want to reach out. And I would also recommend checking out our GitHub organization on github.com, because that's where you can see the actual developers who are involved and other people whose work you can check out and think, oh, that's an awesome idea, let me see if I can recreate that. That's the best way I both meet people I want to meet and learn things I didn't know. I religiously follow people's GitHubs.

That is my primary form of social media. I actually don't like Twitter that much, and I don't post that much, but I am definitely following you on GitHub. That's where I learn about so many new things other people are working on. It's a great way to stay up to date on various techniques, methodologies, and just interesting ideas. Please do it.

Good for you. Perfect. I just shared it in the chat as well as on the screen here, too. But I just want to say thank you again, Saim. I know you had a lot of technical stuff going on with the computer this week, too, so thank you for still joining us today. Really. Yeah. I would love to come back and show you all how to do it, or at the very least make a blog post about it, because making some very unique women's basketball models would be an exceptional contribution that I hope we could all work on together. It'd be great. I would very much love to do that. But if not, it has been excellent. Thank you so much for all you do for the sports community, too. Have a great rest of the day, everybody. My pleasure.