Workflow Demo Q&A - Oct 25th
Transcript
This transcript was generated automatically and may contain errors.
Hey everybody, if you just joined us over here, we'll get started in just a second. We're waiting for everybody to come over. I'm just making sure it's pushing everybody over from the demo room. If you just jumped over here, we'll be here in just a second.
I'll give people a second here to continue jumping in. I think it did. I think it looks like there's 52 of us here. We can probably get started.
I actually can't see the number, but we're good. Thank you for the confirmation in the chat that it successfully worked. And let me pull Simon over here with us as well. Thank you so much, Simon, for the awesome demo. And thank you all for joining us over here.
Thank you all for joining us over here for the Q&A too and spending time with us this Wednesday. As a reminder, if you do want to ask any questions anonymously, you can use the link that I've been sharing in the previous chat, but I'll share it again right here. But you can also just ask questions in the YouTube chat here too.
I know Simon showed he was working from Workbench and Connect. So if anybody's interested in doing a trial of the Workbench or Connect shown today, you can use this link to book a call with our team and receive an evaluation environment. They'd love to chat more with you about your use case.
Introductions
But thanks again for joining us. I'd love to just start with some intros here of all of us in the room. I know, Simon, you already introduced yourself, but maybe for the Q&A as well.
My name is Simon. I work on the tidymodels team at Posit, working on open source software for machine learning.
I'm Ryan. I'm a data science advisor here at Posit. So I work really closely with all of our Posit customers and non-customers, just spreading the good word about our open source tools and our professional tools. And so we've been hosting these demos now for quite some time, and there was a pretty impressive attendance today, which was really awesome to see.
And I'm Rachel Dempsey. I lead our customer marketing here at Posit, and I host quite a few of our community events, like these end-to-end workflows and the data science hangout as well. So happy to see you all here and see all of some of these great questions coming in.
LLMs and GPU support
So one of the questions that came in a bit earlier on YouTube was, does tidymodels have a large language model equivalent with fine-tuning a pre-trained model, for example? Also how does one set up GPU support?
So tidymodels is focused on predictive modeling for tabular data, and large language models are out of scope for tidymodels, but there has been quite a bit of effort around integrating large language models generally into the IDE experience in RStudio. For GPU support, though, tidymodels is compatible with any parallel backend that you can register with foreach, which is a package in R. So there are all sorts of different frameworks that you can use to connect to the computational resources you have, and tidymodels from there will handle all of the interfacing with those cores.
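As a rough sketch of what that looks like in practice (the model, data, and cluster size here are placeholders, not from the demo), registering a backend is a couple of lines, and the rest of the tuning code is unchanged:

```r
library(tidymodels)
library(doParallel)  # one of several foreach-compatible backends

# Register a cluster of workers; tidymodels will use whatever is registered
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# A placeholder tunable model; any tune_grid()/fit_resamples() call
# after registration is spread across the workers automatically
spec <- boost_tree(trees = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

res <- tune_grid(
  spec,
  mpg ~ .,
  resamples = vfold_cv(mtcars, v = 5),
  grid = 10
)

stopCluster(cl)
```

Other backends (for example the future-based ones) plug in the same way: register once, and the tidymodels code itself doesn't change.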
Teaching with notebooks
The question was: R Markdown doesn't quite seem suited for this; is there something similar to a Jupyter Notebook for using R in Posit? Is there any tool that you might recommend?
The idea here is: can we pass along some sort of notebook to students where they can actually interface with code inside of the notebook while they're learning? One option for this that I have really enjoyed over the years is the learnr package, where you can put together tutorials in which students can actually run code inside of an R session from their browser or on a hosted server, and you can intersperse your explanations throughout. Also, if you're working in an environment like Posit Workbench, where you can just have students pull the R Markdown document into their own IDE session, R Markdown could do the trick in that instance as well, or Quarto would do the same thing.
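The skeleton of a learnr tutorial is an ordinary R Markdown document with a special output format and exercise chunks; this is a generic template from learnr's documented usage, not from the demo:

````markdown
---
title: "My tutorial"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include = FALSE}
library(learnr)
```

## A first exercise

Edit and run the code below right in your browser:

```{r mean-exercise, exercise = TRUE}
mean(c(1, 2, 3))
```
````

Chunks marked `exercise = TRUE` become editable, runnable code boxes for the student; everything else renders as normal explanatory text.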
R alternatives for MLOps
Another question was: are there other R alternatives for MLOps besides vetiver, pins, and tidymodels?
Yeah, Simon pretty much nailed it, but there are lots of MLOps tools within the Python ecosystem, and there wasn't really one that stood out in the R world. You can certainly build your own way to monitor and deploy models, but there wasn't one single tool for that, and that was really the driving force behind vetiver. So there are probably ways you can home-grow some solutions here, but vetiver was created to be that one solution for MLOps in R.
CI/CD and deployment
So I see there are a few questions related to CI/CD, and one that came in on Slido from Lorenzo was: how about managing deployments via GitLab CI/CD, GitHub Actions, or some other CI/CD tool? And maybe it's good to give an explanation of what CI/CD stands for, for anyone who doesn't know.
Yeah, I can talk a little bit about that, and then Simon, you can certainly round out any edges here. CI/CD is continuous integration and continuous deployment: an almost set-it-and-forget-it method for deploying or monitoring products. For Posit tools, speaking to the point of taking a model or something you've created in R or Python and deploying it to, say, Connect, so that it lives there and can be shared and consumed by people, there are ways you can integrate with things like GitHub Actions to have that deployment process automated via some kind of trigger. Maybe an update to a repository triggers the CI/CD pipeline to automatically deploy your updated model. So that's one way to do it, with GitHub Actions.
There's also a feature within Posit Connect for job scheduling. So if you wanted to, for example, rerun your model every morning at 7am to pull in the latest data, retrain that model, and publish the updated model as a pin, you can certainly do that with Posit Connect as well. So those are the two methods that I would speak to right now.
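The GitHub Actions route could be sketched like this; the file name, script names, secret name, and deployment call below are illustrative assumptions, not a Posit-provided template:

```yaml
# .github/workflows/deploy-model.yml -- hypothetical sketch
on:
  push:
    branches: [main]     # redeploy whenever the repo updates
  schedule:
    - cron: "0 7 * * *"  # or on a schedule, e.g. daily at 7am UTC

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Retrain and deploy to Posit Connect
        run: |
          Rscript -e 'source("train_model.R")'  # refit on latest data
          Rscript -e 'source("deploy.R")'       # e.g. vetiver/rsconnect publishing code
        env:
          CONNECT_API_KEY: ${{ secrets.CONNECT_API_KEY }}
```

The two triggers mirror the two options discussed above: push-driven redeployment and scheduled retraining.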
tidymodels performance
One other question was, how performant or fast is tidymodels compared to other frameworks in R?
So tidymodels doesn't implement the algorithms used to train the models ourselves. We make use of existing packages in R to make that happen. And then we have a layer on top that sort of standardizes the way that you interface with those models. Earlier this year, we spent a good amount of time making sure that the layer on top of those existing R packages that we've written is as performant as possible. So once you're training more complex models or you're working with data that most people will see in an applied context, that overhead is basically negligible. So your modeling pipeline is as fast as is otherwise possible in R.
We also have a few tricks up our sleeves to reduce the number of models that you actually need to train while still being able to evaluate many models at once. And we try to make it very easy to accidentally do the most performant approach to a modeling problem that somebody might be tackling.
Model monitoring and retraining
I think Ryan's answer to the question about continuous integration speaks to this a bit. On the vetiver website, we'll have some resources about how to connect to different continuous integration options.
But how could I use tidymodels to retrain a model after getting new data in order to update it?
So part of this question, I think, comes back to CI/CD. If you're getting new data on a continuous basis and you want to run the whole modeling pipeline all the way through, then you can use the same script that you used to fit the model in the first place and connect it through continuous integration in the same way we discussed before. But I think another part of the question is: is there a way to interface with the model-updating methods that some modeling engines have? The idea here is, if I've trained a model on a million observations and I get 100,000 more, do I have to go back and run the algorithm on all 1,100,000 observations, or is there some way to incorporate the information from those 100,000 into the original model? Only a subset of engines in tidymodels will be able to do that; it just depends on what the model supports. But if that's available through the modeling engine, then you can interface with it in tidymodels as well.
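The first option, rerunning the whole pipeline on the combined data, is just a refit of the same workflow. A minimal sketch, where the workflow, data splits, and model are placeholders standing in for whatever your original training script used:

```r
library(tidymodels)

# Placeholder workflow: the same preprocessing + model spec
# used for the original fit
wf <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(linear_reg())

old_data <- mtcars[1:20, ]   # stand-in for the original training data
new_data <- mtcars[21:32, ]  # stand-in for newly arrived observations

# Refit the unchanged pipeline on everything; this is the step
# a CI/CD job or Connect schedule would rerun
refit <- fit(wf, data = dplyr::bind_rows(old_data, new_data))
```

Because the workflow object bundles preprocessing and the model spec together, the retraining script stays a one-liner no matter how the pipeline grows.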
Favorite ML models
What's your favorite ML model you interact with? Let me go first, because my knowledge of modeling is not to the extent of Simon's. During our Posit conference, I was fortunate to host a workshop where we took in real data from some restaurants around the Chicago area and built a model. That was my first real in-depth usage of an XGBoost model, which I know is a very popular model. So that's now my current favorite.
And my answer is super similar. XGBoost is one option for an engine that you can use to fit a boosted tree model with tidymodels. Another engine that we implemented maybe a year or two ago is called LightGBM. It's another framework for boosted trees that is available in R, and it's very similar to XGBoost in the thing I really like about it: it's quite tunable, so you can find the right configuration pretty easily and end up with a performant model. But the time that it takes to fit the model scales with the inputs somewhat differently than XGBoost, and so in a lot of situations you can achieve similar performance with a quicker fit time.
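In tidymodels, switching between those two engines is mostly a one-line change; the hyperparameter values and data below are placeholders, and the LightGBM engine is registered by loading the bonsai extension package:

```r
library(tidymodels)
library(bonsai)  # registers the "lightgbm" engine for boost_tree()

# Same model definition, two different engines
spec_xgb <- boost_tree(trees = 500, learn_rate = 0.05) |>
  set_engine("xgboost") |>
  set_mode("regression")

spec_lgb <- boost_tree(trees = 500, learn_rate = 0.05) |>
  set_engine("lightgbm") |>
  set_mode("regression")

fit_xgb <- fit(spec_xgb, mpg ~ ., data = mtcars)
fit_lgb <- fit(spec_lgb, mpg ~ ., data = mtcars)
```

This is the standardization layer mentioned earlier: the interface stays identical, so comparing engines is cheap.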
Deploying models to Connect and Shiny
Another question from YouTube was, if I deploy a model to Connect, what is the best practice to bring this into a Shiny app? Do you schedule the model to update and export the data to an RDS in Connect?
Yeah, so when you have a model deployed with vetiver, what vetiver makes really easy is taking your model and serving it as an API. That API can be either a Plumber API in R or a FastAPI in Python. And once it's served as an API, having your Shiny application talk to that API is insanely easy. So that would be our recommendation, mostly because that's what vetiver suggests: it gives you the tools to easily create that API. Even if you've never created an API before, check it out. Go to the vetiver GitHub pages, and they have some examples of how to create a very simple API.
Because if you're like me, when I started kind of dipping my toes into the API world, I thought I was going to be way in over my head because it's like APIs, that's super scary. But it's not. Vetiver makes it insanely easy. And then you can take that API, you can host it on Connect, and it will continue to run there forever. So it's always ready to be interacted with.
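A minimal sketch of both halves in R, using vetiver's documented helpers; the model, model name, port, and URL are placeholders:

```r
library(vetiver)
library(plumber)

# --- Serving side: wrap a fitted model as a Plumber API ---
fit <- parsnip::fit(parsnip::linear_reg(), mpg ~ ., data = mtcars)
v <- vetiver_model(fit, "my-mpg-model")

pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)  # blocks this session; on Connect this is handled for you

# --- Shiny side (run in a separate session / inside server()) ---
endpoint <- vetiver_endpoint("http://localhost:8080/predict")
preds <- predict(endpoint, mtcars[1:3, ])
```

In a real deployment the endpoint URL would be the model's address on Connect, and the `predict()` call slots straight into a Shiny `server()` function.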
Linear regression, logistic regression, and time series
I realized I missed one from a bit earlier: someone was curious how to do linear regression and logistic regression using tidymodels, and also wanted to do time series analysis.
Yeah, so in the demo, the slides where I was showing what the tidymodels code would look like, and then what the code would look like with a bunch of different engines: that code is what we would use to fit a linear regression in tidymodels. And the code looks super similar for logistic regression. You can read more about what models are available in tidymodels at the parsnip website, parsnip.tidymodels.org.
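For reference, the parsnip code for those two models looks like this; the formulas and data are placeholders:

```r
library(tidymodels)

# Linear regression (default engine is "lm")
lin_fit <- linear_reg() |>
  fit(mpg ~ wt + hp, data = mtcars)

# Logistic regression (default engine is "glm");
# the outcome needs to be a factor
cars <- mtcars
cars$am <- factor(cars$am)
log_fit <- logistic_reg() |>
  fit(am ~ wt + hp, data = cars)
```

Swapping in a different engine, say `set_engine("glmnet")` for a regularized fit, leaves the rest of the code unchanged.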
For time series analysis, there's a bit more to this question. I spoke a little bit about how tidymodels is extensible: any developer can add on to tidymodels, and we try to make it really straightforward to incorporate methods that further the ecosystem without us having to give it a thumbs up. There's an ecosystem of packages developed by somebody external to our team called modeltime; it's from Business Science, a user in the community. And there are all sorts of options for working with time series models using tidymodels through that package.
Tuning parameters and the dials package
Could you remind us what the package is that has the tidymodels default tuning parameter recommendations? I can't remember if it's an add-in or if it's an actual package.
So, the package that contains the information that tells the rest of tidymodels how to tune parameters is called dials, and you can learn about it at dials.tidymodels.org. Anywhere else in tidymodels, if you say you want to tune a particular parameter, say the number of trees, tidymodels will ask dials: okay, what do we know about the number of trees as a tuning parameter, and what options should we try out to get it right? Then dials tells tidymodels what values to try. And you can add all sorts of custom configurations to how dials provides those recommendations.
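A small example of what dials knows about one parameter; the exact default range printed may vary by package version:

```r
library(dials)

# dials stores metadata for each tuning parameter,
# including a sensible default range
trees()

# it can generate candidate values from that range...
value_seq(trees(), n = 5)

# ...and the range can be customized when the default
# doesn't fit your problem
trees(range = c(100, 1000))
```

These are the objects that `tune_grid()` and friends consult behind the scenes when you mark an argument with `tune()`.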
What's coming up in tidymodels
But last question, what's coming up soon in tidymodels? And what are you most excited about?
We have two sort of big efforts that the team has been focusing on throughout this year. And both of them are very close to being released, which I'm super excited about. One of them is better tooling for survival analysis. So the censored package in tidymodels has a lot of the bread and butter code for how all that will happen. And we're working on integrating that with the rest of the ecosystem. And so I think in the coming months, hopefully by the new year, we'll have lots of documentation and some release notes about how that came together.
The other thing that I've personally been spending a lot of time on came about after we brought together a reading group within Posit about model fairness. If a model is predicting outcomes that are somehow related to people, we want to make sure that we understand the ways that a model might predict values differently for different subpopulations. There's all sorts of tooling out there for learning about how models understand information about subpopulations, and we've been putting together some functionality so that people can interface with those methods using tidymodels.
Well, thank you so much, Simon, for the amazing demo today and staying on here to answer all of our questions. And thank you, Ryan, too, for jumping in here to help with the Q&A as well. I appreciate all the amazing questions in the chat as well. And again, I'm sorry if anything happened to go unanswered, I'm going to go try and grab all those questions and I'll share them with Simon as well.
But I did put up on the screen, if you're interested in joining future sessions (maybe this was your first one; thank you for joining us), you can add them to your calendar using the short link on the screen: pos.it/team-demo. We have these the last Wednesday of every month with a new data science workflow.
But if you'd also like to spend more time with us this week, you can join us tomorrow at the Data Science Hangout. Julia Silge from Posit will be joining us as our featured leader, and you can just look up Posit Data Science Hangout to find more info about that.
And again, if you're curious now after watching this, whether your team already uses our tools or you just want to connect and chat with us, you can let me know using the short link I shared in the chat: pos.it/connect-us. Also feel free to connect on LinkedIn, too. It's so nice to see so many of you here today. Thank you so much for joining us.
