Running unified attribution at scale | Martin Stein @ Conversion Logix | Data Science Hangout
Transcript
This transcript was generated automatically and may contain errors.
Welcome back to the Data Science Hangout, everybody. If we haven't had the chance to meet before, I'm Rachel, I lead customer marketing at Posit. Posit is the company formerly called RStudio. I like to add that in here now. We build enterprise solutions and open source tools for people who do data science with R and Python, and I'm joined by my co-host here, Libby.
Hey, everybody. I'm Libby, and speaking of R and Python, I am a Posit Academy mentor, so I help people learn R and Python to do more stuff with data in their everyday jobs. And of course, I also work with Rachel here to facilitate our amazing, beautiful community.
We're so happy to have you joining us here today. If it's your first time, the Hangout is our open space to hear what's going on in the world of data across all different industries, chat about data science leadership, and connect with others who are facing similar things as you. We get together here every Thursday at the same time, same place.
Thank you so much to those who have helped make this the friendly and welcoming space that it is today, and we are so proud of that. We're all dedicated to keeping it that way, so if you ever have feedback about your experience that you'd like to share with me, and honestly, good or bad, or maybe suggestions for topics to dive deeper on, I'm going to share a Google Form in the chat with you right now.
But you can always reach out to me directly on LinkedIn as well. Absolutely, and reach out to each other on LinkedIn. We would love everybody to connect with each other. And we love hearing from you. It doesn't matter your years of experience, what industry you're in, what your job title is, what languages you use, we want you to know that you belong here, and we want to hear what you think, and we want to hear your questions.
So I encourage you to introduce yourself in the chat: put your name, your role, where you're from, a link to where somebody can find you, maybe your website, and definitely use the chat. It's fully your space for sharing resources and getting together. There are three ways for you to jump in and actually ask questions today, or just provide some feedback or perspective: you can raise your hand on Zoom and we can call on you to jump in, or you can put your question in the Zoom chat and we will grab it from there.
And then there's also a Slido link where you can ask questions anonymously.
And because I have you all here, I just want to do a quick announcement. I let people know yesterday about this month's workflow demo, but wanted to share here too: on October 30th, which is a Wednesday, Ryan Johnson is going to share how you can save time with dynamic and professional PDFs powered by Typst, Shiny, and Posit. We hold those workflow demos once a month, on the last Wednesday of every month.
With all that, thank you so much for spending time with us today. I'm so excited to be joined by our featured leader and co-host, Martin Stein, Chief Analytics Officer at Conversion Logix. I've known Martin as a customer all the way back to 2018, I believe, then as a colleague, and now as a Hangout featured leader. So Martin, I'd love to have you introduce yourself, share a little bit about your role today, and also something you like to do outside of work.
Hey, Rachel, and hey, Libby, hey, community, it's wonderful to be here. So yeah, I think it was at least 2018, maybe even a little bit earlier. I was probably one of the first commercial Connect customers; I think that might go back to 2016 or so, so quite a while.
So my background is I work for an agency called Conversion Logix, a tech-enabled agency in the space of apartment rentals and senior living. The company provides marketing, media services, and advertising, and of course there's a clear need for data science to optimize all of that, and that's what we're doing.
As for my background: I started my first software company at the age of 18, then studied political science and sociology with statistics, got all into SPSS, and since I was a coder it didn't take too long to start with R. I've been using R probably since 2008 or 2009, so for a long, long time.
Unified attribution and the product launch
So what we set out to do is understand the real value of all of those marketing channels and get attribution in there that's better than the industry standard today, which is usually last-touch attribution and last-touch attribution models. We developed this with the help of the open source community as well; there are packages like ChannelAttribution and ChannelAttribution Pro, which are fantastic, and other packages out there. We built a solution that allows us to do probabilistic attribution, using Markov chains on one side and a reward model inspired by Shapley values on the other side.
I think those really lead to outcomes, and that's the most important part, that are not just last-touch driven, where your marketing dollars all tend to go into one channel. Instead, you get a much, much better overview of where your marketing dollars actually work. And then as the next step, once you know what really works and where you spend your money, you have a much better strategy going forward and you can optimize from there: your campaign settings, the spend, and the pacing.
So that's where we started at Conversion Logix. And this week is actually our launch week: we're just bringing our product to market, and I'm super proud. We did this in a super short amount of time, I think we started in February, and used a whole lot of R and Python. But frankly, without Connect, I don't think we would have been able to do this so fast.
Daily struggles in data science
Yeah, I think all the people who work in the data science capacity or in a data capacity, sometimes it's just like data engineers that come out of data science or the other way around. And then you see people wearing different hats, machine learning engineers, and so on.
So I think all of us who deal with data and produce outcomes, whether it's models or just ETL and more structured data, share the number one issue. About 30%, and here I'm quoting an MLOps community survey from this week that I just read, say they have an issue with finding data when they do their work. I mean, that is a huge, huge issue.
And your data is often locked up in systems. For us as data scientists, it's like, yeah, get me some data and then I start modeling. But that is usually not how it works when you build something that down the road has to run every day, has to be refreshed every day with data, and where you have data validation tools in there. We should talk about this too, because they're really key; that's something we discovered, that we need data validation tools. So I think that's number one, the biggest struggle: access to data.
I think people know how to model. I think people know how to use tidymodels or a Python environment to do that, so that's not really the big struggle. The second part is really data inconsistency. When data is not the way you expect it to be, whatever you build starts choking, and then you have to build code around it. And then, beyond the 20% of the effort you put in to come out with an EDA and a model and something that makes sense, you've got to take care of all of those data issues.
So those are probably the two top issues that affect pretty much all of us in data science and machine learning, and specifically data engineering. And then there are a lot of downstream issues: how do we put things in production? How do we make it repeatable? How do we show and share the value that we create? And we should talk about this too, because I think that is another big issue.
And the third bucket, the last bucket I would add, is teams. How do teams work with each other? Very often you see a lot of issues. Sometimes I wonder, because I'm in those teams, is it me? Then I realize that, no, it's not just me, it's everywhere. Wherever you go, teams sometimes have different goals.
So you have a business team and a data team or data science team, and they might not be aligned on the goal they have to reach. Sometimes they're not clear about who has what responsibilities. Is it your responsibility to take care of data? Is it your responsibility to make sure that what you put out is validated? Should a data scientist go into some aspects of infrastructure? That's a real, real big issue, because who else would do it for you if you're on a small team?
So I think those are the three buckets: getting clear, correct data and making sure you can repeat that same data approach for your ETL; then modeling, which I'm not talking too much about, and once your model is done, putting things into production and making that reliable; and last but not least, how you work with teams. Those, to me, are the three biggest issues.
Understanding last touch attribution
Yeah, you kept mentioning last touch, and I think I know what that means. I think it means like the credit goes to the person who had the last touch or the interaction that was the last touch. But I was wondering if you could explain a little bit about that and why that's a problem that even needs to be solved.
Last touch, if you think about a customer journey, so what do we do when we search for something? So let me give you an example out of my business background here in apartment rentals. So if you're moving from one location to another, you're looking for an apartment, usually what happens is that you potentially go on a search site or you type it into Google, or you go into something we call an internet listing service, like, you know, some that are there for apartment rentals, or you have social media, you see something on TikTok.
What happens is there are so many different marketing channels, and that's what we call them, a marketing channel or a medium. When you as a user click on one of those pieces of information (it could also be organic, a blog post) and then as a next step say, hey, look, I'm actually really interested in learning more about this, the people who put that information out want to know: where did you come from?
I mean, I'm sure when you go into a store, people will ask you, hey, how did you learn about us? That's literally the last-touch question, right? You say, well, a friend told me, so that's the last step. But maybe you had heard about them before the friend, or you had noticed them, but you never really thought about it.
So in marketing, we're really interested in knowing the whole customer journey. How did you become aware of our brand? Why did you come and see this? Was there a specific reason? We would like to know. And the reason is that we don't just want to know that you're walking into, let's say, an apartment building's front leasing desk and, when somebody asks, you say, well, because a friend told me. Ideally you could say, hey, look, those are the five ways I learned about you over the last four weeks.
That would be incredible, but nobody does that. So we have to make sure we can measure it. What we do is basically take the data that we can get out of advertising systems and individual systems like your website, combine that data, and construct that journey. Then we see not only the last touch, the last person who told you or gave you the information to come here, which is what we usually refer to as source/medium, but we know the entire journey.
And so why is that entire journey so important? That is because at the end of the day, that's what brought you here. And between each step in this entire journey, there's a probability that we as marketers, that we have affected you. And we want to know this because it's our marketing dollars going into that.
So let's put it this way. If there was a YouTube campaign and you might have seen this and you didn't click on anything at this point, but you just watched it. And then later on, you walk into that apartment building and visit, we would like to know if you've seen that. So that brings another real huge problem in marketing, which is first party data and third party data and data that we might have or might not have. We're dealing right now with a world that is without cookies. So we get less of the data that tracks you, and we have to deal more with data that is just aggregated.
So what is this aggregated data that I'm talking about? It's like how many clicks and impressions did a campaign have, a YouTube campaign? We don't know who clicked this potentially until they go to your website. But if you just watch it on YouTube and you don't click anything, we just know impressions and clicks. So that could have been influential.
And so the problem we have to solve is to bring this aggregated data together with the individual data that we have, in an anonymous form, from our website, and then build an attribution model out of it. So we build the customer journey, then we use the unified attribution model approach, where we take a reward model on one side and combine it with the customer journeys that come out of Markov chains, and then fuse this together. That's basically how it works.
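The difference between last-touch credit and journey-based credit can be sketched in a few lines. This is a deliberately simplified removal-effect heuristic in the spirit of Markov-chain attribution, not Conversion Logix's production model; the channel names, journeys, and conversion counts are invented:

```python
from collections import Counter

def last_touch(journeys):
    """Last-touch: all credit goes to the final channel before conversion."""
    credit = Counter()
    for path, conversions in journeys:
        credit[path[-1]] += conversions
    return dict(credit)

def removal_effect(journeys):
    """Simplified removal-effect attribution (Markov-chain inspired):
    a channel's credit is proportional to the conversions that would be
    lost if every journey touching that channel stopped converting."""
    total = sum(c for _, c in journeys)
    channels = {ch for path, _ in journeys for ch in path}
    effects = {ch: sum(c for path, c in journeys if ch in path) / total
               for ch in channels}
    norm = sum(effects.values())
    # Scale so the credited conversions add back up to the observed total
    return {ch: total * e / norm for ch, e in effects.items()}

journeys = [
    (["paid_search", "social", "listing"], 120),
    (["social", "listing"], 80),
    (["listing"], 100),
]
print(last_touch(journeys))      # listing gets all 300 conversions
print(removal_effect(journeys))  # credit spread across the journey
```

Under last touch the listing service soaks up every conversion; under the removal heuristic, paid search and social get credit for the journeys they participated in, which is the budget-allocation difference the attribution model is after.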
So to answer your question, if you understand the whole journey, you know how to spend the money as a marketer wisely, and you're more efficient with your marketing spend, and you have less wasted spend. That's the problem we're trying to solve here.
Building the business case internally
So in the past, and even currently, what happens is I have an idea, I propose a project, I pitch it maybe to my manager, and my manager, bless him, is like, hey, I'm behind you. I want to help you pitch this project. However, I need you to help build the business case for it, right? And I can think on those terms, but I'm used to the nitty-gritty and thinking about the data problem, right, not the business problem. And so I wondered, from your experience, do you handle any of that? What's your methodology? If not, who do you hand that off to?
So I think that is a question so many of us face. When somebody comes to us with a clear set of problems to solve and a clear set of goals, and it's already vetted, and you just have to get to it, that is the easier case to deal with, because you know: okay, let me know what the problem is, where the data is, and by when you need something, right?
But then, to get to your point, there's the case where you have to go out and say, hey, look, I'm discovering these issues, and I want to make a case that we have to solve something. I'll give you an example. All companies deal with a customer churn problem to a certain degree. You do really good work, so you might sit in your company and discover, hey, look, we're losing too many customers, and I would like to help address this.
So, that's a classical situation where you, as a data scientist, as in an organization or in a team, have an idea about what to do. Classically, to me, that has been the situation throughout my career. I have been a data scientist and a chief product officer, so I always cover both areas, the problem discovery and the problem solution, and then the technical implementation with my teams and myself.
So, the way to do that is really get very clear about what's the impact that this problem that you want to solve has. If the impact is quantifiable and the impact is easy to understand for your stakeholders, then I think the first thing that I would do is, like, you know, kick off a little bit of a research project for you. It doesn't have to be big, about just getting your stakeholders, the audience that potentially makes your business decision, if you get the time and resources to solve that issue, to get them actually understand what the problem is.
So, that is the hardest part because you might go into the technical details and how to solve it, but it's not about the how at this point. You purely have to focus on why it matters and why it matters to your organization or your team. And so, that's usually very high level and you don't spend a lot of time in 20 different cases. You just bring up the one, two, three biggest problems. And then for the stakeholders, they have to make a trade-off decision. Do we do this or do we do something else? Do you understand that everything they do is a trade-off decision?
The best method to do this is to have conversations with the business owners. That's what I do. I sit down and listen and talk to them very frequently. Try to understand what the issues are. And that's the same what we do as a marketing company with our customers. Maybe today it's not the attribution piece. Maybe today it's like they have a competitor across the street. And then we would like to understand what it is. And once we understand this and contextualize the problem, that's really the most important thing. And then help them make a trade-off decision, second most important thing. Then you have basically the first step done.
Pitching to external clients
So out of my experience, I deal a lot with investors too, and I think one of the toughest situations is when you go to a venture capitalist and make a case for getting $10 million, right? That's a whole lot of money for them to spend, so your case has to be really, really, really good.
So I think to me, it really breaks down, Jared, into three aspects that I follow. Number one is I always open my conversations with framing the situation about who we are, who I am, who the organization is, so they know where I'm coming from, right? So they know, oh, yeah, this person comes out of this direction. So everybody else you're speaking with can contextualize you and know what to ask you and make sure that you're on the same page.
The second piece, and this usually comes very first in an investor presentation, when you go through your startup pitch, is the same as with a client: you present yourself and say, we are Conversion Logix, this is what we do, that's what people love about us, and these are the problems we solve. You've got to break it down. And it's the same thing in a data science environment.
The next step is to give the other side a chance to weigh in. So the main way I would describe that approach is consultative: it's consulting, not selling. It's really you being the doctor who listens to somebody's issue, and then trying to understand through very good, careful listening.
When they have shared their information, then you have to aggregate it. You have to reduce complexity. If you don't reduce complexity in that conversation, if you create more complexity, you've lost it. It's as simple as that. If you just go into one area and blow it up technically, saying, oh, we could do this, people will not know what you mean. They have no idea. You've got to reduce complexity.
You've said what you do. You've listened to what their problem is. And then you've got to bring this together and say: oh, I see, you have an issue here. For example, in our case, your apartment occupancy rates are below 95% for some buildings in a highly competitive environment, let's say Chicago, and we can help with that. Then you go forward: confirm that this is the problem, reduce complexity, focus on one or two things.
And then, classically, that's where trust comes in, because somebody is listening to you. It's really a decision-theory challenge here: they've got to learn to understand that you're talking about the right thing, and then they've got to give you credit, that you have credibility. After you've locked in on "here's the problem you said you have, and we can help you with it," now you bring credibility; now is the time, not before. At this point you say, hey, look, this is what we have done for others, and people understand: oh, same problem, and you have done this. Then, at the very end, you can talk about how to go about it. But that's usually only 10% of the conversation.
A/B testing and causal inference
So I think we see a change, maybe not a big change, toward causal analysis and causal inference. And classically, we did a lot of A/B testing; our organizations did a lot of A/B testing.
And I do think that A/B testing is a classic method, as long as you don't violate it with the peeking problem (we all know what we're talking about in A/B testing) and you do it right. A/B testing in general is still common practice, but I think you need data scientists or very, very knowledgeable data analysts to get it right. I'm not a big fan of those super automated A/B testing tools where you put something forward, you look at what's happening, and you peek the heck out of it. This is not the way to do it.
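For the fixed-horizon case, where you decide the sample size up front and test exactly once with no peeking, the arithmetic is just a two-proportion z-test. A stdlib-only sketch with made-up conversion numbers:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test. Only valid if the sample size
    was fixed in advance; peeking repeatedly inflates the false-positive
    rate, which is exactly the 'peeking problem'."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Hypothetical campaign: 4.0% vs 5.2% conversion on 5,000 visitors each
z, p = two_proportion_ztest(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Automated tools that recompute this every hour and stop as soon as p dips below 0.05 are doing sequential testing without the sequential correction, which is why the peeking warning matters.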
Causal analysis and causal inference, I feel, is a really, really interesting approach. I personally think it allows us to do a whole lot more: understanding, without an A/B test, what has happened over a time series, over a certain amount of time. We're just now at Conversion Logix getting into causal inference and causal analysis. Over the last two to three years it has become a mainstay, and I think most agencies today should probably have a plan for how to leverage causal inference and causal analysis. You can do it in R; you can do it in Python.
One of my favorites, I have it right here and I'm going to share it with you, is the book by Martin Huber, Causal Analysis. You can see it here. That's one of the really, really good ones that I recommend; Martin lays out a really good theoretical context for how to conduct causal analysis. So that is my recommended read for people who have not gotten into this.
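To make the contrast with A/B testing concrete, here is the classic 2x2 difference-in-differences estimator, one of the simplest causal-inference tools for observational time series. This is a minimal sketch assuming parallel trends between the two groups, with invented weekly lead counts:

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Classic 2x2 difference-in-differences: the treated group's change
    minus the control group's change estimates the treatment effect,
    assuming both groups would have trended in parallel otherwise."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical: weekly leads before/after a campaign change at one
# property (treated) vs a comparable property with no change (control)
effect = diff_in_diff(treat_pre=40, treat_post=55, ctrl_pre=42, ctrl_post=45)
print(effect)  # 12 extra leads per week attributed to the change
```

The control group's change (45 - 42 = 3) absorbs the seasonal drift that an A/B test would have randomized away, which is the core idea behind the more elaborate time-series methods in this space.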
Customer lifetime value
So yeah, Peter Fader is a Wharton professor, for those who don't know him. He has published a couple of really, really great books as well. He puts forward his own method and principles about what metrics to follow in an organization.
He's really straightforward about what values to take and how to calculate those things, and then puts this in a context that makes sense. Other people out there, not that Peter Fader is a VC, but VCs out there, have their own systems for how you calculate customer lifetime value and payback periods and so on. So now we're going into a more financial modeling part, right?
When you create those values or KPIs for an organization or a team, you need to understand the context: what is that going to tell you? What comes next? How do we put this in context? And customer lifetime value, for those hearing the term for the first time, is really an approach to understanding the value of a customer you have acquired for your organization. That tells you what to do: how to sell, how to market, how to retain that customer, and how to get new customers, because there's potentially a limited lifetime for each one.
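As a hedged illustration of what an FP&A team might be computing in a spreadsheet, here is the simple textbook contractual-CLV formula with an infinite horizon. This is a generic teaching formula, not necessarily the exact method Peter Fader advocates; the margin, retention, and discount numbers are invented:

```python
def simple_clv(margin, retention, discount):
    """Infinite-horizon contractual CLV:
    CLV = margin * retention / (1 + discount - retention),
    i.e. the discounted sum of future per-period margins, where each
    period the customer survives with probability `retention`."""
    return margin * retention / (1 + discount - retention)

# Hypothetical: $500 annual margin, 80% yearly retention, 10% discount rate
print(round(simple_clv(margin=500, retention=0.8, discount=0.1), 2))  # ~1333.33
```

A calculation like this is exactly the kind of thing you can wrap in a small Shiny app so the finance team can move sliders instead of editing spreadsheet cells.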
So here's the tip I would give you if I were in your organization: I would sit down with my finance team. If you have an FP&A team, a financial planning and analysis team, they are all over this. If you have FP&A in your organization, you sit down with them as a data scientist and ask: what are your challenges?
So you have that finance team, and they have a sub-team called FP&A, financial planning and analysis, and they do CLV calculations. And you say, hey, look, I know you might do this in Excel or Google Sheets; I can potentially help you here. You want them to explain their approach to you. Do they follow a Peter Fader approach or something else? Then you can follow up on the actual metrics, make sure you have the data, and help them.
And to me, the best kind of help is showing something of value. Classically, what I do is put forward a Shiny app that I share on Connect, you know, just model something very quickly. And usually what happens is people get excited when they see that.
Don't do it alone because you go down potentially a rabbit hole of just, like, metrics. What I'm saying is, like, there's a reality in your organization. If you have a big organization like 200 people, most likely you have somebody in finance who actually does the FP&A role. And then you talk to that person and then you say, hey, look, what's your biggest challenge? And that's basically then just listening.
And if you have something like Connect where we as data scientists can show value just, like, immediately, this is the best part of it because you just put something together and your job is to really then get this person excited and help them to do more with their time. So the whole company benefits from your effort and your finance person's effort as well.
Getting started with AI
Best question ever because what we're doing here is the best way to do it. Join a community. That's literally the answer. That is the answer. Honestly, I could not give you a better answer to that. I'm a member of multiple communities. I'm a member of the Machine Learning Hangout in Seattle, ML Ops community, the Data Science Hangout that Rachel started many, many years ago, which is fantastic.
So to me, literally what you should be, if you have time to join those meetings, go and participate in those in communities. Go on to the Slack channels. If you have a local meetup, I mean, go to the local meetup. I mean, nothing beats meeting people, explaining what you're doing, and understanding what they're doing.
So it really comes down to two things. Thing number one is understanding that you're not the only one dealing with an issue, and everybody is learning. We're all continuously learning; you're not the only one. I would bet that out of the 120 or 130 people here, there would be only a handful saying, oh, I know everything, I don't need to learn anymore. So that's why people go there.
I remember one example: when generative AI came around, I was in Seattle, and a ton of people said, let's go to those meetups. People didn't really know what RAGs were, retrieval-augmented generation, and all this stuff. That's when people said, this is what I do. Some brought example applications they had running on their machines, and then they said, here's my repository, and you can go there and look at the code. So it's really that exchange that keeps you going and helps with learning.
Sharing Shiny apps and infrastructure
Yeah, I just wasn't sure how – I haven't used Shiny in a long time, but I wasn't sure how people were sharing it. I remember when I published it, it's on my local host URL, but if I don't have Posit at this particular company, I forget how to share it independent of me. So it's like, hey, here's a tool for finance use, upload Excel, et cetera. I just don't know about the modern stack.
Now, I think there are a couple of solutions for that issue. Posit has a hosted environment. I don't know if it's still free; there was a free tier in the past. Was it shinyapps.io, I think, or something like this? Shinyapps.io, and now there's Connect Cloud as well.
Yeah, so I would go hosted first, and set up my own account there. Then from your IDE: you can use RStudio. I don't know if Positron does it now, I haven't tested Positron on that, but in RStudio, when you publish, you can connect to any target, and the hosted service that Posit offers is such a target. That's, to me, the easiest way, and you don't need to deal with infrastructure, which is really what your whole question is about: how do I manage infrastructure when I just want to share something? I'm not a DevOps engineer, I'm a data scientist.
So there are two stages to infrastructure. Let's put it this way: if you have time, or you have a DevOps engineer around you, you can go look into Docker-based solutions and spin things up on that side. There's definitely a way to do that, and you can host it on Cloud Run in Google Cloud or wherever else you want.
If you're working with business data, you'll run into authentication issues unless you wire up something like Firebase and other services on Google's side, which I've built for authenticating people. But that takes so much time out of you, quite frankly, that you'd rather go make the case and say, hey, team, let's get Connect. That's what I did in my organization: we have Connect running on GCP, and it does what it needs to do.
First, you go to Posit's public service and create an account, and then you have authentication, all of that stuff taken care of. You don't need to do anything. Infrastructure check. And once you grow and have shown this and people like this, you can run your own Connect environment on GCP, and it's super easy to set up. I mean, honestly, I can do it. So it's not that difficult to get all of this stuff going. So that's typically the way that I would suggest.
If you don't want to go the Shiny app route, then there's an easier way: you can use Quarto, including interactive Quarto documents, which is a really cool way to showcase something. That just renders to a document; you don't need to host anything. (We can also get to Shinylive and webR in a second.) Quarto is a really good markdown approach to creating a document with some basic interactivity in there, and I think that's probably the easiest way. So I would do Quarto first. If you're on to developing an app with more interactivity, more app than markdown, then you go to Posit and put it on their servers. And if you're really serious about it and it has funded business cases, then go and get Connect.
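To show what the Quarto route looks like in practice, here is a minimal `.qmd` sketch: the Python cell runs once at render time and the output is a plain HTML file you can email or drop anywhere, no server required. The file name, title, and occupancy numbers are all hypothetical:

````
---
title: "Occupancy report"
format: html
---

```{python}
# Executed once when you run `quarto render report.qmd`;
# the resulting HTML is a static file, nothing to host.
rates = {"Building A": 0.93, "Building B": 0.97}
for name, r in rates.items():
    print(f"{name}: {r:.0%} occupied")
```
````

For real interactivity without a server, the same document can be upgraded with Shinylive or webR cells that run in the reader's browser, which is the path mentioned above.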
Use cases with Connect, pins, and Vetiver
When we started this session, I quoted the number one issue on the data side. Remember the three buckets: data (having data consistently good and knowing where it is), that's use case number one; putting stuff into production, use case number two; and then teams on the other side. Let's set the teams aside and go to use case one, which is: where's my data? The typical use case for us is pins, pins with Connect. I think there's no better solution.
There are probably other solutions out there, but pins is great. You get a repository that understands where your data sets are. It's versioned, and it can be centralized on Connect; you can use it on local machines as well, you just connect to the pins board. That's a typical way we use Connect when we share data: you have controls over who can and cannot see it, and you get versioning with it. To me, that removes 30% of the first problem we talked about. It's just worth it out of the box.
The second use case: once you've done your data work, you've run your tidymodels or your scikit-learn stuff, and you have a Shiny app or Streamlit app or whatever you want to run there, it's really about sharing, and it's the production case. Production cases have two sides, and let's be very clear: not everything you do is an MLOps case where you have to build an enterprise application.
There's a super great framework called vetiver, from Julia Silge, at least I think Julia wrote the R version, and the Posit team. That's another use case: you can store vetiver models on Connect. That's the use case we use, so we understand what happens to the models there. So I use the whole bandwidth, from the very beginning to the very end, and then share things.
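vetiver itself packages a model together with a prototype of its training inputs, so a deployment can validate unseen data before predicting. As a toy illustration of that idea (this is not the vetiver API; every name and number below is invented), you could sketch it like this:

```python
class MiniVetiver:
    """Toy sketch of the vetiver idea: bundle a model with a prototype
    of its training inputs so serving can validate unseen data."""

    def __init__(self, predict_fn, prototype):
        self.predict_fn = predict_fn
        self.prototype = prototype  # field name -> expected type

    def predict(self, row):
        # Reject inputs that don't match what the model was trained on.
        for field, typ in self.prototype.items():
            if field not in row:
                raise ValueError(f"missing field: {field}")
            if not isinstance(row[field], typ):
                raise TypeError(f"{field} should be {typ.__name__}")
        return self.predict_fn(row)


# Hypothetical lead-scoring model: weight clicks and conversions.
model = MiniVetiver(
    lambda r: 0.5 * r["clicks"] + 40.0 * r["conversions"],
    {"clicks": int, "conversions": int},
)
print(model.predict({"clicks": 100, "conversions": 2}))  # 130.0
```

The real vetiver packages (R and Python) additionally version the model on a board such as Connect and generate a serving API, which is what makes the hosted-model workflow described above work.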
Measuring channel performance
That's a really tricky question, because in marketing it's usually not about the success of one channel but the success of your campaign. So you have a campaign that uses multiple channels, right? Classically, what we've done as marketers is look at our impression share, our clicks and click-through rates, and, if you have some kind of conversion measurement, you look at that too. That's the classic thing we do: we look at each channel separately. Sometimes we don't understand how they connect together, and that's what we're working on here at ConversionLogix: how do they support each other?
So now, how do you do that? That's the key question. You can use platforms that aggregate it for you, like TapClicks and others. They go out to your AdWords account, your Meta account, you name it, and put the data into their system. There are a ton of those services out there. Me personally, I'd like to have everything in BigQuery; that's where I put it.
Now, with GA4 (Google Analytics 4), you get a BigQuery export for free. When you go into your GA4 account, you can set it up so that up to 1 million events a day get copied into a BigQuery project. So what is BigQuery? BigQuery is Google's data warehouse. It uses a nested data structure, not just flat tables like we classically know from CSV files, so it's very efficient for storage, and you query it with SQL.
But you don't need to worry about that, because we have dplyr, which can do all of that cool stuff, so we don't need to learn SQL. So what we do here is literally get the data in there, then pull it out and analyze it. Now, how do I look at this? First of all, I compare. Once I connect dplyr to BigQuery, pull it all in, and do my typical modeling, I compare all the stats across all of the channels: impressions, clicks, click-throughs, and conversions. That's number one.
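The speaker does this from R against BigQuery. As a self-contained stand-in (SQLite in place of BigQuery, with made-up numbers), the "pull events out with SQL, then compare impressions, clicks, and conversions per channel" step might look like:

```python
import sqlite3

# Toy stand-in for querying exported event data with SQL.
# In practice these would be GA4 events exported to BigQuery.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (channel TEXT, event_name TEXT, n INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("search", "impression", 10000), ("search", "click", 320), ("search", "conversion", 12),
    ("social", "impression", 25000), ("social", "click", 300), ("social", "conversion", 5),
])

# Pivot event counts into one row of stats per channel.
rows = con.execute("""
    SELECT channel,
           SUM(CASE WHEN event_name = 'impression' THEN n END) AS impressions,
           SUM(CASE WHEN event_name = 'click' THEN n END) AS clicks,
           SUM(CASE WHEN event_name = 'conversion' THEN n END) AS conversions
    FROM events GROUP BY channel ORDER BY channel
""").fetchall()

for channel, impressions, clicks, conversions in rows:
    print(channel,
          f"CTR={clicks / impressions:.2%}",
          f"CVR={conversions / clicks:.2%}")
```

The point of the comparison is visible even in toy numbers: one channel can win on click-through rate while another wins on conversion rate, which is why you line the channels up side by side.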
Model monitoring with Vetiver
So vetiver, to me, is really this: once you build your model and put it out, you can host it, you have versioning, and with Connect the vetiver model and its data are there and versioned. The most important part about vetiver, to me, is comparing the known data that went into my model with how the model reacts to unseen data. That's literally what you want to compare, and that's the general use case for vetiver. You could say it's a classical MLOps use case.
What you want to see is: if your model produces inferences, produces predictions, on data it hasn't seen, is it behaving correctly, or are there signs of drift (data drift, model drift, and so on)? That basically means things are happening in the data that your model was not trained for, and therefore your model isn't doing what you want it to do. You need to know when that happens.
So we aggregate the data that comes out of the predictions, out of the inference, and look at metrics that basically say, well, this has a different distribution now, things are changing. Once we detect that, most likely something has changed, and as a data scientist, at that point, you've got to go back and say, let's take new data and retrain the model.
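A minimal sketch of that kind of drift check, comparing incoming data against the training data with a crude summary statistic (the numbers are invented, and this is not vetiver's actual metric set):

```python
import statistics


def drift_score(train, new):
    """Crude drift signal: shift of the new mean, measured in
    training standard deviations (not a formal statistical test)."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return abs(statistics.mean(new) - mu) / sigma


# Hypothetical daily conversion counts.
train = [20, 22, 19, 21, 20, 23, 21]   # what the model was trained on
stable = [21, 20, 22]                  # new data that looks familiar
shifted = [5, 6, 4]                    # a behavior change, like early 2020

print(drift_score(train, stable) < 1.0)   # True: no alarm
print(drift_score(train, shifted) > 1.0)  # True: flag for retraining
```

Real monitoring would track several metrics over time (distributional distances, prediction accuracy against delayed ground truth) and alert on thresholds, but the shape is the same: compare known data to unseen data and act when they diverge.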
One example. It's hard to even remember pre-COVID times now, but here it is: pre-COVID, we collected data, and then we got into COVID in 2020 and all of a sudden the environment changed. And I can tell you, because I was doing data science and machine learning at the time, most of our models failed at that point. They just didn't work anymore, because behavior completely changed. In marketing it was really traumatic; everything was different.
Here's where vetiver comes into play. When you have your models hosted and vetiver detects that something is changing (the model doesn't detect this itself; the MLOps systems around it do), that's your first hint to go back to the modeling part, take the new data, compare, and start understanding what is different and why.
I know we're getting close to the top of the hour, but that is my number one use case for anything MLOps: you want an accurate model. And there's a legal side to it, too. You want to make sure your model is compliant. If you're in a regulated market, like real estate marketing with the Fair Housing Act, or other markets with their own regulations, you've got to make sure your model is compliant, and these are some of the ways you can do that.
Career advice
Don't think that because you don't know anything yet, you cannot get there. You can get there. Absolutely. Believe in yourself. We all have questions; it's really about finding who can help you. Find a mentor. That's really the one thing. People can take you by the hand from where you are at the moment, and then you pay it back to the community. So my number one is: don't doubt yourself. If you don't know, ask somebody. Get some help, and pay it back.
You know, give interns a chance, if you have the chance, to work in your organization. Take a data intern, a data science intern, somebody, and give them a chance to work on real problems. That's how they learn, and that's how we continue to be this beautiful community that we are.
Next week, Marco Gorelli is going to be joining us as the featured leader. Marco is a core dev of pandas and Polars, so it might be another fun one to share with your team. But thank you all so much. Have a great rest of the day.
