Resources

RStudio Sports Analytics Meetup: SportsDataverse Initiative

video
Jun 28, 2022
1:08:26

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

for joining. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey, actually calling in from Connecticut today. We are streaming out to LinkedIn and YouTube Live. If you've just joined now, feel free to introduce yourselves through the chat window and say hello, maybe where you're calling in from. For today's meetup, we're joined by Saiem Gilani, founder of the SportsDataverse. The SportsDataverse is a set of open-source sports data packages that work in harmony because they share common data representations and API design.

Just a few notes during the meetup, you will be able to ask questions. You can either put them into LinkedIn where you're watching or on YouTube. We also have a Slido link that I'll share in the chat so you can ask questions anonymously as well. But just so you know, if you do ask questions, you'll be part of the recording as well. So right when the meetup is over, the recording will be shared to YouTube, which is one of the nice things of doing it on YouTube Live. It's there immediately.

But for anybody who is joining this meetup group for the first time, this is a friendly and open meetup environment for teams to share use cases, teach lessons learned, and just meet each other and ask questions. So thank you all for making this a welcoming community. Together, we're dedicated to providing an inclusive and open environment for everyone. So we want to create spaces where everybody can participate and we can hear from you all, regardless of your level of experience or area of work, too.

But with that, thank you again for joining us. I would love to introduce Saiem and pull him up on stage here virtually. Saiem Gilani is the Director of Data Science and Engineering for the Houston Rockets and the founder of the SportsDataverse. Saiem, thank you so much for joining us.

Absolutely. It's a pleasure to be here. I am very grateful for the opportunity to talk about the SportsDataverse. And we consider it an initiative because everything is in a constant state of work in progress: both making resources exist and then maintaining them to a simple standard.

The topic of the conversation is generally going to be about how the SportsDataverse is trying to develop lasting solutions for accessing sports data, creating analytics based on the open source data we have available, and creating public utilities for the community to both use and enhance as research progresses.

And so the main goals are to create high-standard data resources for the sports analytics community and, in addition, to create pathways to make the sports analytics industry more diverse, inclusive, and accessible. The primary goal here is to lower the learning curve that goes into becoming a competitive candidate in the sports analytics field.

And so some of the solutions we brought about are building an extensive set of open source sports data repositories, creating the packages to load the data from Python, R, and Node (we'll be primarily focusing on the R packages, given that this is an RStudio presentation), and establishing a bench of developers from diverse backgrounds to spearhead projects and make contributions within the packages of the SportsDataverse.

And the second prong of this goal is to bring women's sports data analytics research on par with the level of resources available for men's sports in the public space, and generally to make more strides toward an analytics space that is equitable for both sides.

About Saiem and the origin of the SportsDataverse

So as Rachel mentioned, I'm the lead engineer for the SportsDataverse. I am an ML engineer by trade. I'm currently working for the Houston Rockets as the director of data science and engineering. I previously worked in healthcare and medical malpractice, freight supply, as well as online data science course development. And most recently, I was working for Deloitte as a consultant in cybersecurity.

I think the roundabout way of how I got into sports analytics, the open source side anyway: I've always been into sports analytics since basically I found out that they were tracking numbers in sports. That was my first foray into it. And then as I got into the open source space, I found the need to work with college football data while I was contributing to Tomahawk Nation, the FSU SB Nation site covering the Florida State Seminoles, which is where I'm from. I'm born and raised in Tallahassee.

I bleed garnet and gold, for sure. And so that's how I started contributing to my first open source sports project with Meyappan Subbaiah and Parker Fleming: cfbscrapR, which later became cfbfastR, modeled after nflfastR and the nflverse.

So we call it an initiative because it's more a goal than it is anything else: to bring together an incredibly remarkable set of people who can code at a high level, can follow standard guidelines for good code practice, and can actually create reproducible and durable data pipelines for the entire community to benefit from.

Because all of us in the sports analytics industry, and in any data analytics industry, will have to create data pipelines in order to do modeling. And so that's the first step of our initiative: how do we make getting sports data easy?

And could we get further if we actually built the data infrastructure together? With shared, standard open data sets, we can validate models quicker and create better prototypes that can be easily verified and reproduced.

The SportsDataverse community and data repositories

So the SportsDataverse is several things, but it's basically a catch-all term for the community of people that support it. And so we are nothing without our development team. I'm very grateful for the contributions that everybody makes to the packages, data repositories, and pipelines that really make the entire thing run.

And so the community is the piece that develops the projects, maintains them, and mentors younger developers, who will hopefully in turn become future maintainers and developers of further research and packages within the SportsDataverse and in the sports analytics domain broadly.

And these packages all typically operate from at least the data-fetching side. They typically have corresponding data repositories, which allow for fast loading of our data at basically whatever speed your internet connection and available RAM allow. We've created one of the largest open source sports data resources, at over 250 gigabytes produced in various formats, and that's just from the four or five sports and leagues that I've worked on.

Others have also contributed significant and comprehensive amounts of data. It's incredible to see how much we could fit within GitHub's free limits. It's truly incredible.

And so I think the thing that makes the SportsDataverse relatively more appealing, beyond just having fast pre-scraped data, is that the function names follow a pattern and tell you where the data is coming from. If a function name starts with load, build, or update, it is using the data resources created by the package developers. And if it's directly interacting with a website, the name will give you an indication of the source: ESPN data, the NBA Stats API, the NCAA website. Those are all assumed to be "get" functions.

So we provide access to these functions, but they should be used carefully, with a proper rest in between requests. Be polite when you're scraping with those functions, because you are directly interacting with an open data source.
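As a rough sketch of what being polite looks like in practice, here is a loop that rests between calls to a live-source "get" function. The function `espn_wnba_scoreboard()` and its `season` date argument are taken from wehoop's ESPN-prefixed family of functions; treat the exact signature as an assumption and check the package documentation.

```r
# Sketch: politely looping over several live "get" calls with a rest
# between requests. Assumes wehoop is installed and that
# espn_wnba_scoreboard() accepts a YYYYMMDD date string.
library(wehoop)

dates <- c("20220610", "20220611", "20220612")
scoreboards <- list()
for (d in dates) {
  scoreboards[[d]] <- espn_wnba_scoreboard(season = d)
  Sys.sleep(2)  # rest so we don't hammer the open API
}
```

The `Sys.sleep()` call is the important part: any fixed delay of a second or two between requests keeps the scraper well-behaved against a public endpoint.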

And so what makes the SportsDataverse packages a bit more attractive than most scraper packages is the fact that they are backed by data repositories, which allow loading of pre-scraped data. This allows for much faster access and also standardized pipelines, for both ingest and modeling.

So you can use the package data repositories as your starting point, build pipelines on top of that, submit a pull request to our data repository, and have your models become part of the pipeline and used in summaries. You can basically create verified open source models, which I think is the next step toward the next phase of the SportsDataverse.

Packages overview and naming conventions

And so we can talk about the various packages within the SportsDataverse. I'm only going to focus on a couple, just because they are readily available on CRAN and I also wrote them. I've actually had quite a rough weekend coming back from Houston. We just got done with the NBA draft, and I returned home to find that my home PCs are all fried. And so, where I had a lot of my presentation written but not pushed, I am locked out of my PC, and I'm having to use a different computer to create this presentation.

And so I was not able to follow the guideline I'd set in the abstract of this presentation of going through all the various packages and all the various functions added to wehoop. It just became a full WNBA Stats API scraper: we added 104 functions, which brings it on par with every available function within hoopR, which covers the NBA Stats API. So wherever there's available data for the WNBA, wehoop will have the functions for it.

And so this is just a simple installation from CRAN. And let's demonstrate quickly how the function naming allows for easy transfer of knowledge between each of the different packages.

So once we load the libraries, you'll notice that whenever we use the load functions, the pattern is load, then the sport or league (here, college football), then play-by-play, and then the year. And it can be a vector of years; it does not care. For the NFL, I would highly recommend using the nflverse and nflfastR. We basically took the idea of nflfastR and the nflverse and applied it very successfully to create basically identical frameworks for loading data for various sports.
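In code, the install-and-load pattern described above looks something like this (a minimal sketch; the seasons chosen are arbitrary):

```r
# Install cfbfastR from CRAN, then load several seasons of college
# football play-by-play in one call; `seasons` accepts a vector of years.
install.packages("cfbfastR")
library(cfbfastR)

pbp <- load_cfb_pbp(seasons = 2019:2021)
```

The same call shape (load, sport/league token, pbp, seasons) recurs across the other packages.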

Basically, we took the idea of the nflverse and applied it to college football, the NBA, the WNBA, men's college basketball, women's college basketball, the Premier Hockey Federation, and the NHL. Is that all of them? And with more to follow. And others have followed this example: worldfootballR by Jason Zivkovic, an excellent soccer package within the SportsDataverse. He is also working on something very similar to make loading easy from worldfootballR.

And so for men's college basketball and women's college basketball, it's the same function; you're just changing it to men's basketball or women's basketball, and this is collegiate. It is this simple: it does not take a rocket scientist to figure out this naming convention. I tried to make it as foolproof and easily accessible as possible. If you know how one of these packages works in terms of function naming, you will have a much easier time understanding what is going on in each of the other packages.
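Following that convention, the basketball loaders differ only in the sport/league token. This sketch assumes the published names in hoopR and wehoop:

```r
library(hoopR)   # NBA and men's college basketball
library(wehoop)  # WNBA and women's college basketball

mbb  <- hoopR::load_mbb_pbp(seasons = 2022)    # men's college basketball
wbb  <- wehoop::load_wbb_pbp(seasons = 2022)   # women's college basketball
nba  <- hoopR::load_nba_pbp(seasons = 2022)    # NBA
wnba <- wehoop::load_wnba_pbp(seasons = 2022)  # WNBA
```

Knowing one of these four calls effectively teaches you the other three.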

And it's the same idea for fastRhockey: you can load the NHL play-by-play, and you can load the Premier Hockey Federation play-by-play. And honestly, this is already out of date since I last posted it.
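The hockey loaders follow the same shape (a sketch under the same naming assumptions as above):

```r
library(fastRhockey)

nhl <- load_nhl_pbp(seasons = 2021)  # NHL play-by-play
phf <- load_phf_pbp(seasons = 2021)  # Premier Hockey Federation play-by-play
```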

But basically, these are a subset of the packages within the SportsDataverse. We have, I think, seven packages on CRAN right now: cfbfastR, hoopR, wehoop, baseballr, fastRhockey, worldfootballR, and toRvik, which scrapes Bart Torvik's website. And I haven't even announced it yet, but oddsapiR is also on CRAN. Oh, and sportyR, that's the other one that I forgot.

Community, contributors, and the Game on Paper demo

We basically have an incredible number of people to thank as part of this initiative. The community of developers that help, you know, maintain all these different packages are extremely valuable. And I very much appreciate all their help.

And the front end of the SportsDataverse is eventually going to look something like Game on Paper, which uses the college football endpoints of the various packages to generate entire stats pages for live games like this. We can see win probability, expected points added, and various aggregations at the team level. These are the basic ideas that we can continue to develop with support from the community and driven developers.

So just as an example, I asked somebody to produce a Shiny app using one of our packages, wehoop, because I was short on time and wasn't going to be able to get everything done. And he was able to throw this together in 60 minutes: a pretty incredible, very quick application that can be easily adapted by others.

Q&A

While it's pulling up, just wanted to remind everyone, if you want to ask questions, you can just ask them in LinkedIn or YouTube while you're watching, or you can use the Slido link too.

So I see a question from a good friend. Robert Frey is asking: what currently is holding baseballr back from loading full seasons of play-by-play data?

There is honestly no issue with trying to make it happen other than my own personal availability. That's literally it. I have so much that I'm, like, actively working on. And, well, things will fall short if we don't have enough other committed people to just take the reins and actually make it happen. It's just a matter of getting more people familiar with the infrastructure of how the entire organization works.

We've had a lot of, you know, difficult to navigate changes for anybody who has been new as I've been rapidly setting up all these different repositories and organizations, getting everybody coordinated, figuring out who actually is a GitHub contributor, and then allowing them the space to learn how to do it.

I'm not perfect about being a proactive leader on every single piece of this, but I'm just trying my best, I promise. And it's just, like, it's a challenge. I've been learning all this as I've been going, and it's not always easy because I'm not that good of a programmer, I'll just be honest with you. I didn't really learn R until about grad school, and that was only a few years back. And now I have five, six authorships on CRAN packages. So, it's been a very interesting challenge to get all this worked out.

So, somebody asked, what significant improvement do you see or want to see to the Sports Dataverse accomplishing in the next year? So, this is an excellent question, because it's something I think about a lot and have trouble prioritizing where I want the changes to happen most. Because there's, you know, there's the hard goals, which is, hey, can we make this model happen and make it reproducible and make it run every night so that it's updated daily?

But moreover, I think my biggest goal is, you know, creating the kind of organization, open source organization, that becomes the first stop for teams and companies in the sports analytics space who are looking for developers and analysts and data scientists and machine learning engineers and everything in between. Because you can clearly see that they are making important contributions to the actual community and, you know, being helpful within the community to make it better for everybody.

I just want to say thank you so much for all that you do for the community, and I see people are commenting that in the chat as well. And a lot of love for the Sports Dataverse. I didn't want to interrupt and start asking questions if you are still going through parts of the presentation, but I see there are a lot that are coming in too, so you just let me know. Yeah, I mean, I'm fine to take questions wherever.

I honestly had so much planned, and I have stuff I can show you that just isn't quite ready. But basically, the presentation I was supposed to do today was going to be on building regularized adjusted plus-minus models for the WNBA. And so I would expect that by version 2.0 of wehoop there will be regularized adjusted plus-minus metrics available for players going back to at least 2015, wherever we can get lineup data. The data is already available in the data repository; I just have to make changes to bring it into the package through the package functions. But basically, yeah, the data is all there. We have it all worked out. We just need to make a regularized adjusted plus-minus model, and then we are in business, which is nice. It's a very reasonable standard of a metric that everybody agrees is at least not noise. And so we're excited about that.
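For a sense of what a regularized adjusted plus-minus fit involves, here is a minimal sketch using ridge regression via glmnet. The input format (a stint-level data frame with one indicator column per player, +1 when on offense, -1 when on defense, and a net-rating outcome column) and the column names are assumptions for illustration, not the actual wehoop pipeline.

```r
# Hypothetical RAPM sketch. Assumes `stints` has one +1/-1 indicator
# column per player and a net points-per-100-possessions outcome.
library(glmnet)

fit_rapm <- function(stints, player_cols, y_col = "net_rating") {
  X <- as.matrix(stints[, player_cols])
  y <- stints[[y_col]]
  # Ridge regression (alpha = 0) shrinks noisy per-player coefficients
  # toward zero; cross-validation chooses the penalty lambda.
  cv <- cv.glmnet(X, y, alpha = 0)
  coef(cv, s = "lambda.min")  # regularized per-player estimates
}
```

The shrinkage is the "regularized" part: without it, players who share most of their minutes would get wildly unstable plus-minus estimates.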

I can pull, I'll pull a few of the questions over that are coming over on YouTube as well. But I see Jeff had asked: is there anything special about the relationship between the SportsDataverse in Python and R? Are they effectively the same initiative?

Oh, absolutely. Yeah. I trimmed a bunch of the Python and Node parts out of this presentation. But there are simultaneous efforts being made in both Python and Node: at the very least for the Node version, making the scraper-level functions available as Node functions, and the Python side having the exact same capabilities for both loading and scraping, with the additional benefit of it being only one package, because modules within the package make it a little more convenient.

There's only one package on the SportsDataverse Python side. Well, I shouldn't say that: there is one other package that individual contributors make, and I need to figure out a better way to streamline integrating other people's work on the Python side. That's honestly the biggest hangup keeping it from becoming just as robust in its implementation.

Thank you. I see Keith asked a great question over on Slido about how people can contribute. Keith had said, how about sports outside of North America, like European ice hockey or rugby? How can we contribute here?

Yeah, this is probably my own fault. The people I've brought in are, for the most part, like 95% North Americans and 5% people from around the globe. And that's somewhat by design, somewhat by simple geography and not knowing everybody in the community. So we've definitely been making some outreach. As far as specific examples of non-North American sports, worldfootballR is a pretty vast soccer package.

Though if somebody wanted to make contributions on, say, something like rugby or European ice hockey, I would suggest European ice hockey go with fastRhockey. I'm trying to bring sports into one package each, divided by league ideally, to keep to a minimum the number of package names you have to remember. Like, I only want one hockey package; I only want one basketball package, ideally. But if I were to combine wehoop and hoopR, it would turn into a 350-function monstrosity of a package, and I don't want to anger Brian Ripley, just to be honest.

And it would be incredibly unmaintainable very quickly. That's how things get taken off of CRAN, and it's just wiser to limit the scope as much as possible. But I'm very open to people talking to me about contributing on non-North American sports. I am very into it. I just don't have the personal knowledge of and, like, understanding of the various leagues that exist outside of the United States and North America. So very open to learn, very open to accept contributions.

Like, I, this was just an idea that lots of people in this community have had, and I just tried to execute it. And as I've tried to execute it, I've been like, wait, I need to change how I do this every step of the way. And it's just a constant learning process.

Like, for example, Jacquie Tran made a very wonderful package for the Women's National Basketball League of Australia, and I did not know that league existed. So we have to figure out how to make more data accessible through these packages to give broader coverage, because I really want there to be equitable coverage across women's sports, and not just North American leagues. I focused there because I knew there was a lot of data there, and I also knew what the websites were. That's basically how this whole deal works.

Somebody's like, hey, we should scrape this resource, and somebody does it, and we just figure out how to cobble it all together into a useful format.

Awesome. And, Saiem, just for people who are interested in contributing, is it best to go through the SportsDataverse website or through GitHub, or what's the best way to get in touch with people? The easiest way to get our attention is to message myself on Twitter, Saiem Gilani, or the SportsDataverse Twitter, also at SportsDataverse, or the cfbfastR one if you're looking for something college football specific. As well as anybody you know to be affiliated through the GitHub organization: start talking to us. That's all it takes, and we'll invite you to our community and have a talk about how to bring in whatever idea you have.

I'm not super picky. I just, like, we just have to talk about it, and we have to make it happen. That is just like a, it's a process of talking about it, and bringing together a plan, and making it, making it work.

Tips for breaking into sports analytics

That's great. I see Samra asked a question, which I know comes up in a bunch of data science Hangouts, as well. I want to pursue a career in sports analytics. What resources do you recommend?

Mathletics is an excellent book for modeling. If you haven't read Basketball on Paper by Dean Oliver, that's an excellent introductory tutorial. And a very recent addition to my collection is The Mid-Range Theory by Seth Partnow. I've primarily focused on basketball, but there's an incredible number of resources; football analytics in general has come a long way. I would start by following the introductory tutorials from various people in the SportsDataverse packages: cfbfastR has seven or eight vignettes and a bunch of different examples about how you can get started.

Everybody always puts their work out on Twitter. That's a great way to get exposure. Start putting your work out there. You don't always have to be mind-blowingly brilliant. I just started putting out logo plots, like everybody else, and then I started becoming a package developer, and things just kind of took off from there until I became one of the people who has produced the most open source packages.

It started somewhere. I was nowhere; I started doing open source development in February of 2020. So, two and a half years later, I'm now the director of data science for an NBA team. It's really just a matter of working hard, putting your work out there, having people see it, and having them know you in return. It's one thing who you know; it's another thing who knows you. You have to be able to put yourself out there to get that name recognition: oh yeah, I remember this person because of this analysis they did.

An incredibly valuable way to promote your own work and get feedback on it is using Twitter: make analyses, put a blog post up, get some feedback. There are so many people who are willing to help, especially if you start putting it in front of us. I can't overstate how important that piece of this is; we don't talk about it enough.


I love that. Just start sharing your work.

I see there's quite a few other questions coming in from LinkedIn and YouTube. One is, what are some of the plug-and-play analysis or functions that these packages have?

Oh, so probably the most developed one is cfbfastR, because that was initially modeled after nflscrapR, written by Ron Yurko, Sam Ventura, and Maksim Horowitz, which in turn was part of the nflWAR paper, which I highly recommend reading.

Basically, you can add pipelines to any of the existing play-by-play functions to get various levels of expected points added and win probability metrics. Not all the packages have them. The presentation today was actually supposed to be me developing a new metric for the Women's National Basketball Association, because that's basically the next phase of this. A lot of it is data "get" functions, compiled into a loadable data set, but the goal is for people to create pipelines on top of it, submit those pipelines to us to incorporate into our data repositories, and in turn have them made available to end users through the package functions.

And so that's really where the next piece of this goes. The packages fall into one of three categories, not all of which exist yet. There's data scraping, which is basically all of them right now; there are modeling packages, which may just store models or methods to create models; and there are data visualization packages, like cfbplotR, mlbplotR, or sportyR, which allow you to create very useful visualizations or tables, depending on what your goal is.

But the real thing that I want to see get started in the next phase of the SportsDataverse, beyond just covering more sports and leagues, is taking the next step of incorporating other people's models that they want to contribute to the open source space, so that they can become a standardized method that everybody can reference. Like, hey, this person made an adjusted plus-minus model, and it takes these certain things into account; we can document it, add it to our nightly data repository load for that sport or league, however the pipeline works, and then it's made available for users every time we run the nightly load.

Thank you. I see, I know we touched upon this already, but I see a lot of the questions coming in are focused around, like, tips for shifting into the space of sports analytics as well, but do you have any specific tips for shifting from a different industry into sports?

So, yeah, I guess I should probably tell you a little more about myself. I was in healthcare analytics on the actuarial side, and in medical malpractice briefly, just as an analyst, where I wound up using a lot of Excel, VBA, SQL, and SAS. Learning SQL helped me a great deal. SAS is very useful as well, but given that it's paid software, I think it's not the most useful in this space, given that so many data science solutions are built on open source software.

So I wound up getting some hands-on experience in Python and R modeling while I was working for startups, and that was very useful, and in turn it got me into grad school. I went to Georgia Tech, their online Master of Science in Analytics program: a very exceptional program, very affordable, and I learned a great deal of programming methods, languages, techniques, and frameworks, as well as the math behind it, which was an exceptionally valuable piece of my understanding of how everything works in stats.

Because you have to have both: you need to work on your programming skills and your breadth of understanding of how different pieces in the stack fit together, being able to manipulate data in those frameworks or languages, and then being able to understand the stats and math that are useful for your data sets. Understanding what your data is telling you, what it can do, and what it can't do, as far as building models that are useful, is the bread and butter of how you become a valuable data scientist.

Thank you. I see that there are a lot of people commenting in the chat or asking if other people from the community are interested in working with certain sports data, like volleyball data, for example, and so I did just want to take a second to call out that this channel exists, so they're on the R for Data Science online learning Slack community. There's a channel called Chat Sports Analytics, but I thought it could be a good place for people to connect even after the meetup, too, so just wanted to leave that up there. That's the link to join, just r4ds.io slash join, and the specific channel itself is the one in blue there.

Yes, we actually do have a private Discord that we invite people to, in spite of our, like, you know, open and accessible mantra, we try to keep the conversation to people who are actually trying to help with the packages so we don't have a fully open community as far as, like, talking to us every single day and working on packages directly with us, but it's not super hard to get an invite as long as you're about it. You just have to talk to me about it, show me your GitHub, and that's pretty much it.

Awesome. So, reach out on the Sports Dataverse Twitter, right? Yes. Awesome. This isn't really a question, but I really love this comment, the anonymous comment on Slido was, this is the truest demonstration of you don't need to be the most, like, elite programmer or do it all alone to make a really useful package, so thanks for sharing that, too.

I appreciate that, because a lot of days I'm just like, wow, I can't believe anything I write works, and, like, people find it useful, all right, fantastic, but it's true, you really don't have to be, you know, elite to make a difference, you know, because as long as you try and execute an idea, doesn't matter if your code is always the fastest, it's nice if it is, but if you're just trying to get a job done and make it durable, sometimes that speed is not always the answer.


A few other questions are coming through. One is: are any of these packages providing real-time data, or is it just batch mode after every single game is over?

I believe pretty much all of the R packages have an interface directly to the website source that they pull the loaded data from. Basically, the data repositories work like this: we scrape data from one of the websites covered by the package. I think the minimum standard is that we have an ESPN version of the play-by-play, player box, and team box scores, compiled at the season level. That's usually done through package functions, and we run it every single night to make sure it's updated with the most recent data. So as long as it's available on their website live, it's available through the package live, if that makes sense, because we're almost always directly interacting with their APIs while games are live.
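So both paths exist side by side: nightly-updated repositories for batch loads, and direct site functions for live data. A sketch of the contrast, using wehoop (the scoreboard function's exact signature is an assumption; check the package docs):

```r
library(wehoop)

# Batch: pre-scraped season data from the nightly-updated data repository.
wnba <- load_wnba_pbp(seasons = 2022)

# Live: hit ESPN's API directly for today's games. This is a "get"-style
# function, so be polite about request frequency.
today <- format(Sys.Date(), "%Y%m%d")
live <- espn_wnba_scoreboard(season = today)
```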

Great, thank you. I also just wanted to take a second on this platform to share that the Women in Sports Data Symposium is coming up on August 20th, and I just put the symposium's Twitter account in the chat so you can follow it if you're interested in getting involved as well. We are highly supportive of them. They are doing excellent things, and it's going to be a fantastic conference. Please attend, and we are happy to discuss sponsoring if you need help getting there.

That's great. Yeah, it looks amazing. So, a few other questions. I, myself, am not sure exactly what this question means, but: what would be your opinion about DFT, density functional theory, as an application? I'm not sure if that relates to a specific package or not, so I just wanted to ask it. Okay, we can save that one, and if anyone has thoughts on it, feel free to share them in the chat, too.

So, I see this question from Rodrigo on YouTube: is Sports Dataverse focused only on providing access? Oh, we did cover some of this. It would be helpful if anybody wants to build useful reporting that they would like to have run every night. We are always willing to accept that sort of contribution, especially if it's something that can be updated every night. That'd be good.

Someone had asked about some of those books that you had recommended. Do you know if any of them are available in free online versions as well, kind of like the R for Data Science book? I cannot say that I do. I also really like the people who have made these books, so I would encourage you to support them if you can, but I honestly don't know of any free versions. It's worth checking out Konstantinos Pelechrinis' sports analytics coursework, though, because I'm sure he's included a bunch of excerpts from those books in his course notes and lectures.

Great. Thank you. One other question was, could this package be utilized for prediction purposes?

I'm assuming you're talking about a specific package, but in theory, that's exactly what all of these are for. Right now, they're mostly just pulling in data as it comes from websites, and there's not much being done to enhance it. I'm just trying to get the data ingest going and make it available so that you, the user and potential contributor to the Sports Dataverse, could then build a pipeline and say, hey, if you run these two functions, you can get this set of reporting and modeling done. Here's how you would train that model. Here's a proof of concept of how it works. Please add it. You make a pull request to us, or just talk to us about it, and it would get incorporated into the broader community and package.

That's the next phase. We want that. Please.

I'm muted. Sorry. Eugene had asked, it looks like you have quite the trophy cabinet behind you, which are your sports? Math. Math. Yeah, 100%. Love it.

Another question I realized I had missed from earlier is, what are some newer aspects of data analytics in the NBA that excite you? Well, there are a lot more stats companies now that are providing NBA data. That's always an exciting opportunity to get a check on yourself from various other very smart analysts in the field, which is great. There are also actual opportunities for new data being provided to us from existing providers, whether it be raw tracking data from Second Spectrum or from Zelus using Second Spectrum's data.

There's a bunch of people providing more and more interesting data, and you have to become adept at building these pipelines, because that's the work you have to do. The entire reason I think this is a very good route for people to learn the skills they absolutely need on the job in sports analytics is this: you're going to get new data sources. You're going to have to build data pipelines. You need to be able to do it fast. You've got to make them durable, make sure they don't break, and learn how to make them work no matter what.

That means working with tracking data, whether it be provided to you or creating your own tracking data using computer vision modeling and extracting coordinate systems. Learning how to work with very large data sets is the name of the game, because at the end of the day, you're going to be working with continuously bigger data sets the more data you are provided.

I just want to say, if I have missed anyone's question, feel free to add it into the chat again, just to make sure that I see it here. I know we're getting to the top of the hour, but I want to make sure I didn't miss any. Oh, yeah. Just on the baseballr side, the person who actually taught me the most about both baseball and working with R is Robert Frey, who asked the first question at the top of the hour. His YouTube channel was actually very instrumental for me in learning how to build a scraper package, because he shows you in incredible detail how to go about doing it, and I would highly recommend it.

He posted his YouTube page in the chat. I literally incorporated his work into baseballr, I learned it from watching his YouTube channel, and I became a co-author of baseballr as a result. It's a crazy roundabout way of taking someone else's idea, making it more functional, and being recognized for it: making concrete contributions to the open source community. Little ideas like that will make you beloved, to me at least.

Let's see. Just give me a second here to scroll through and see if there are other questions that we missed as well. I do also want to say, if people are going to be at the RStudio conference later in July, we will have a birds of a feather group for the sports analytics community, so it might be great to meet some people there and connect ahead of that, too.

Let's see. Any other questions you see? Oh, one other one. I see Austin just asked, how do you get your team to align and follow with your metrics? Any challenges there, or are team senior leaders more aligned with data analytics lately?

Well, I work in the best place for that, in the NBA, at least. Analytics and data scientists are highly respected in our organization. They are always asking about modeling, and I think it's incredible that this level of understanding exists throughout the organization. It's not just at the top. Even scouts are asking about the different statistical techniques and modeling, because they're incorporating it as shortcuts to what they know to look for.

You continuously get more and more benefit from having a top-down understanding of analytics and metrics within your organization, like what's useful and what's not. It cuts your search time to insight down by an incredible amount. It's so very useful. Everybody should be doing it, and I'm very grateful to be in a place where it is respected and understood as valuable. You can't exist without it anymore. You're just giving up too much of an advantage.

Okay, I know we're a minute over, so I think we have time for one or two more questions, if that's okay. I also just want to let people know, if you're on and listening, and you're hiring in the sports industry, or actually in any industry, and want to share your role in the chat, feel free to do so. That's perfectly okay as well.

But the other question was from Jeff. Great question. In your opinion, is the sports analytics community stronger in R or Python?

So, the open source community is a lot stronger in R, in part because of the excellent examples set by people like Tan Ho and Sebastian Carl of the nflverse, and Ben Baldwin and Tej Seth, people who are making all these tutorials using nflverse data. As a result, there's an incredible level of R knowledge within the open source community, and it's very public on Twitter. So, that might be part of what's influencing my opinion on this.

But the industry is very much into Python, because more often than not, it already exists within their pipelines, and the people building applications tend to be more familiar with Python than R, since they're usually not building Shiny applications for internal websites where they're displaying analytics. So that usually gives somewhat of a preference to Python, because they already know how to build the APIs using Python. But there's no reason you can't use R, definitely at the team level, or even at the stats company level, as long as it's not being leveraged more than several hundred times a second.

And so, it's very helpful to know both, actually. Part of the reason I was a valuable candidate was that I knew both, and being flexible is a very useful trait as your team gets smaller. Knowing a little bit of everything is very valuable. But if you have a CS background, I would say you probably need a lot less R. If you're coming from any background that isn't purely CS, I would suggest learning R first, because it's a little bit easier.

The learning curve is not that steep, and there are tons of examples out there. That's the other thing: you have loads of open code from useful developers around here who are very generous with their time and effort, and will show you how. The R community is excellent. The people are the most helpful I've ever seen, and you can get to know them very quickly. It's not that big of a group, so you learn who can actually help you with a given problem, and you latch on to those people, like, hey, you are my friend now. This is how it works.

I love that. Well, I have one more question I want to make sure we get to as well. And I see this was asked over on Slido. Keith asked, regarding players or teams following metrics, what is the percentage split of data consumed during the game versus outside the game?

I'm not going to answer that as a percentage, but by far, there's much more done after the game, or outside of the game, because you only have the bandwidth to focus on so many things during the game, especially if you're trying to communicate it to the coaching staff. So, you limit the amount of noise you take in during a game to only the most useful things they can immediately make a decision on or are expecting a response on. Like, hey, should we foul here, yes or no? What's our best strategy given this time and margin? That sort of stuff. But it's not like they're looking at an individual's plus-minus during the game and saying, oh, yeah, we need to get him out of here, he's a minus seven. No, that's not a thing.

Well, thank you so much, Saim, for your presentation and for being so open and honest with everyone and answering all these questions. This was great. I appreciate all the awesome questions in the chat as well, whether you're watching on LinkedIn or YouTube.

I really appreciate everyone's time. I wanted to put the Sports Dataverse Twitter up on the screen one more time if people are interested in collaborating or want to reach out. And I would also recommend checking out our GitHub organization on github.com, because that's where you can see the actual developers who are involved and other people whose work you can check out and think, oh, that's an awesome idea, let me see if I can recreate that. That's the best way I both meet people I want to meet and learn things I didn't know. I religiously follow people's GitHubs.

That is my primary form of social media. I actually don't like Twitter that much, and I don't post that much, but I am definitely following you on GitHub. That's where I learn about so many new things other people are working on. It's a great way to stay up to date on various techniques, methodologies, and just interesting ideas. Please do it.

Good for you. Perfect. I just shared it in the chat as well as on the screen here, too. But I just want to say thank you again, Saim. I know you had a lot of technical stuff going on with the computer this week, too, so thank you for still joining us today. Really. Yeah. I would love to come back and show you all how to do it, or at the very least make a blog post about it, because making some very unique women's basketball models would be an exceptional contribution that I hope we could all work on together. It'd be great. I would very much love to do that. But if not, it has been excellent. Thank you so much for all you do for the sports community, too. Have a great rest of the day, everybody. My pleasure.