Data Science Hangout | Nate Kratzer, Brown-Forman | Focusing Tools on Adoption, BI Tools & Shiny
Transcript
This transcript was generated automatically and may contain errors.
Welcome, everybody, to the Data Science Hangout. I know that most of you have all been here before, so welcome back. But for anyone who's just joining, this is an open space for current and aspiring data science leaders, everybody across the data science community, to connect and chat about some of the more human-centric questions around data science leadership.
And so we really want to create this space where everybody can participate and we can hear from everyone. So there's a few different ways that you can jump in and ask questions. You could just unmute yourself and ask questions live. You could put questions in the chat. And we also have a Slido link, which Tyler will share in a moment here, where you could ask anonymous questions too.
Also, if you put a question in the Zoom chat and you want me to read it out loud instead of calling on you, you could put a little star at the end of it, and I'll know to just read that out loud. But I just wanted to make a quick note that the session will be recorded and shared to YouTube as well for anybody who missed it. With that, I'm so excited to be joined by my co-host for today, Nate Kratzer, who is a data science manager at Brown-Forman. And Nate, I'd love to just have you introduce yourself and maybe share a little bit about the work that you do on your team today.
Great. Thanks, Rachel. And thanks to everyone for being here. Looking forward to getting to talk to all of you for this lunch hour, or at least lunch hour where I'm located; some of you may be joining from other time zones. So I'm Nate. I'm a data science manager at Brown-Forman.
Brown-Forman is the company that owns Jack Daniel's, Woodford Reserve, and several other liquor and spirits brands. I work with two different teams. One of them works on actually making the liquor, and lots of fun stuff comes out of that: how to produce the barrels, all sorts of actual engineering processes. That team largely uses Python, Tableau, and SQL to get their job done, though they are also heavy users of the RStudio products within their Python workflow. I also work with a team that does pricing, and I came out of that team, doing a lot of pricing work, estimating price elasticities. That team largely builds in R, also uses SQL and Tableau, and is mainly concerned with what happens when we change prices on the shelves: how does that affect sales, profit, and everything else.
Growing data science at Brown-Forman
Awesome. Thanks, Nate. So while we're waiting for some questions to come in from the audience, I'd love to just ask you: maybe what's something that you're most excited about lately with regards to data science? I'm excited about the way data science is slowly growing into our organization, I think. When I started at Brown-Forman, which was about four years ago, we had three data scientists, and it was sort of just this side project that the organization was vaguely considering. Someone said, you should put some money into data science, and there were some analysts already there who were interested, and so they formed this very small data science team. I was maybe the second person hired from the outside, rather than a convert from within. And now it's grown, and we've also grown in a way of having data scientists embedded in other parts of the organization. So it's seeing that there's an actual effect from all of this work. What started as a "well, let's just try it, this is a thing we keep hearing about, there's some hype too" has become: okay, yeah, there are actual decisions being made off of a bunch of the products that the data science team has built.
That's awesome. I'm just curious, what are some of the decisions that have been made, or big impacts to the business from data science? Yeah, so they vary from things that are more visible to you all, like what sizes should we sell on shelves. We've looked at things like how much difference pricing makes and how often we should promote. But there's also a lot of stuff we do on the back end.
One of my favorite things to do is to get to actually run experiments in production. Part of this is normally people asking, okay, if we want to experiment with a different shape of bourbon barrel that will hold a bit more, how many do we need to test to make sure that it doesn't have a negative yield impact, and that we're still getting the same amount? And then what impact might this have on color? How can we track all of those things on the back end?
Cool. So did you end up changing the barrel size or shape? You know, that came to mind because I was recently asked about how many we need. I don't have the results of the experiment yet. I just figured out literally how many of the larger barrels we need to make in order to test it.
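Questions like "how many barrels do we need to test?" are usually answered with a standard power calculation. As a rough illustration (this is not Brown-Forman's actual method, and the numbers are invented), a normal-approximation formula for a two-group comparison looks like:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-sample test.

    effect_size is Cohen's d: (smallest difference worth detecting) divided
    by the standard deviation of the outcome. Uses the standard normal
    approximation, which slightly undercounts for very small n.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# Illustrative: to detect a yield change of half a standard deviation
# (d = 0.5) 80% of the time, you need about 63 barrels of each shape.
print(n_per_group(0.5))  # 63
# Smaller effects need far more barrels:
print(n_per_group(0.2))  # 393
```

The practical takeaway matches the conversation: the experiment's cost is set up front by how small an effect you need to rule out, which is why "how many do we need to make?" has to be answered before any barrels are built.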
Data sources and ingestion
That's cool. Seth, I see you just put a question into the chat if you want to ask live. Yeah, sure. As a recurring question, I'm very curious about, obviously we can't do the data science or analysis without the data. So I'm just curious where you guys generally get your data from and what processes you guys have for kind of ingesting it, prepping it so that you can kind of do the magic.
Yeah. So there's a lot of different sources that we could talk through. For the pricing work that I've spent the most time with, the data is honestly bought from Nielsen within the United States. Liquor has a three-tiered system: we sell to a distributor, who sells to a retailer, who sells to a customer. So I don't have any direct customer data, right? I have no idea who's buying what. So we get some aggregated data from Nielsen, which has bought it from those endpoint retailers, in terms of how much they sold and at what price. They get a lot of it from retailer scans and then sell that back to us.
We have a data engineering team. So the process there looks roughly like this: Nielsen sends a data extract in some form of text file, the engineering team ingests it, they do a few joins for us ahead of time and expose it to us on a SQL server, and then we pull it down.
We've got an R package that works on this. So we've got a script that essentially calls the functions we've written for pulling it from SQL, cleaning the data, transforming it, and running the model. And then post-model, we also prep the data a bit for the things we want to do with the model coefficients, so that we're not just saying, oh, your price elasticity is this, but we're also breaking down how it matters. We also get a lot of data internally, more on the production side, and that's also been transformed rapidly. Ten years ago, our source of data would have been a clipboard that someone was writing on by hand. Actually, even five years ago, that would have been the case in some places.
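As a loose illustration of the modeling step, and not the team's actual R package, a constant-elasticity demand model reduces to regressing log volume on log price, with the slope being the price elasticity. A standard-library-only Python sketch on synthetic scan data:

```python
import math
import random

def fit_elasticity(prices, volumes):
    """OLS slope of log(volume) on log(price): the price elasticity
    under a constant-elasticity (log-log) demand model."""
    lp = [math.log(p) for p in prices]
    lq = [math.log(q) for q in volumes]
    mp = sum(lp) / len(lp)
    mq = sum(lq) / len(lq)
    cov = sum((x - mp) * (y - mq) for x, y in zip(lp, lq))
    var = sum((x - mp) ** 2 for x in lp)
    return cov / var

# Two years of synthetic weekly scan data with a true elasticity of -1.2
random.seed(42)
prices = [20 + random.uniform(-3, 3) for _ in range(104)]
volumes = [1000 * (p / 20) ** -1.2 * math.exp(random.gauss(0, 0.05))
           for p in prices]

print(round(fit_elasticity(prices, volumes), 2))  # close to the true -1.2
```

The real model described in this conversation controls for much more (distribution, category trends, the overall spirits market), which is exactly why a single-variable regression like this is only a toy.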
We've had issues that we've tried to deal with. We want to know how big each barrel is, so we had a machine that rotates the barrel and in theory measures it with lasers. But occasionally this gets miscalibrated because, I don't know, someone hits it with a forklift, or machines just go out of calibration over time. And then we have to figure out, what is the offset for this? What should it be? So that end is, depending on your perspective, really fun in that we get to work with everything: what are the sensors in the warehouse, where are they located, temperature, humidity, all sorts of fun, direct data collection. But it also means there's a lot of data engineering for that team.
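One rough way to handle the miscalibration problem described above (purely a sketch, with made-up numbers): measure a few barrels of known size and take the median residual as the offset, which shrugs off the occasional wild reading.

```python
from statistics import median

def calibration_offset(measured, reference):
    """Estimate a constant sensor offset from paired readings of
    reference barrels with known sizes. The median of the residuals
    is robust to occasional glitched readings."""
    residuals = [m - r for m, r in zip(measured, reference)]
    return median(residuals)

# Hypothetical readings after the machine was knocked out of calibration;
# the last reading is a glitch.
known = [200.0, 200.0, 225.0, 200.0, 225.0]  # true barrel sizes (liters)
read = [203.1, 202.9, 228.2, 203.0, 260.0]   # laser measurements

off = calibration_offset(read, known)
corrected = [m - off for m in read]
print(round(off, 1))  # 3.1
```

A mean would have been dragged upward by the 260.0 outlier; the median keeps the offset estimate near the true drift.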
Getting buy-in from management
Thinking about even five years ago, people using clipboards and tracking the data that way. I see Arifath had a question in the chat and said, curious to know, was there any particular project that helped to get buy-in from top management to expand to the data science route?
Yeah. So I think the first one was actually the pricing. And the reason for this is that the two types of projects that have gotten the most buy-in so far have been ones that directly replaced other things, and therefore it's easier to see what they're doing. One case is replacing an Excel process. It used to be that we might have an analyst spend a week figuring out a price elasticity for one model or for one market, and now, once we have the process in place, we're using the same model in about 20 different countries, and each country has multiple markets within it, et cetera. So there's just the massive speedup of doing the same work with more accuracy, but also with more speed. The other case is when we've been able to replace consultants with data science. It used to be that we would pay an outside analysis firm to do the analysis instead of setting it up internally.
As I was talking through that, there's a third case I forgot to mention that's possibly most relevant here: our production team, when they were first introduced to Shiny apps. They'd been trying to manage everything and sort of slowly going digital, largely going from pen and paper to, like, Google Sheets occasionally. But it was very easy for one of our first data scientists in production to just spin up a Shiny app that would show them the things they needed, because it could pull in data from multiple sources, it could do all the formatting within the app, and it could give them a single place to go. So there was a pretty early transition within production, because Shiny proved out its value almost immediately, especially for the R&D team, in the things they could look at. And I think one of the very first Shiny apps we deployed was just something where they wanted to run PCA analysis on some chemical results. And we're just like, OK, here's the thing where you can upload your CSV, and then all of it will be run and you'll get all the charts you want right away. That just instantly saves hours of people's time. So that got a lot of buy-in as well.
Calculating ROI of data science
That's great. Somebody had just asked me yesterday how to calculate ROI on switching over to data science tools, and they were just asking me if I had examples from different customers. But I feel like I never hear things in terms of dollar amounts of this is the exact ROI and that's so hard to calculate. And I was just curious if you had done that at all. Do you think of what the consultant's hourly cost was or what it costs in Excel?
Yeah. So most formally, we have a project, tracked through a Shiny app of course, in production that specifically tracks process improvement. That's one that's easier to demonstrate, like if we run an experiment and it changes the process, or it allows us to increase the yield coming out of our mills. I should say, I said Brown-Forman is a liquor company; we also own the cooperage that makes the barrels, and we also buy the wood directly. So that whole barrel-making process is within our scope. And then you're also starting to look at warehouse conditions. Climate change has reduced bourbon yields: hotter conditions tend to mean we lose more bourbon, because more of it evaporates while it's sitting in the barrels. So we've looked at things like how we get back to that normal state. So we can put it in yield terms, but we don't have an overall ROI. Occasionally we get estimates of what our work would have cost if it went through a consultant, but we've never actually tried to add all of that up.
Tech stack and computing platforms
Yeah, I don't often see it, but they're asking me for it. So I was looking through meetups and other recordings, trying to see if people had mentioned that. I see Steve had asked on Slido, out of interest, what are the main computing platforms that you use?
Yeah, so you're just looking for where we actually do the math. We have RStudio Workbench on a fairly giant server, which is where we do our development. We are transitioning to also using a Connect server for some of the stuff that's in production, because it will allow us to automate and also to separate production from dev. But maybe I should back up a second and just explain our general tool set.
So we have RStudio Workbench where all of us can log in. It's set up on a Linux server with, like, a terabyte of RAM. We have nothing in the cloud; we do everything in house, so we need a fairly large server to accommodate all the data scientists and any of the work we could perform. And generally the process will be: we have a SQL database, a lot of stuff in Cloudera, although production has some of their own SQL servers that I think are still Microsoft SQL Server. We'll wind up connecting through ODBC, pulling in our data, running the calculations in either R or Python, and then pushing the results back to a database. We'll have our own section of the database that we run as an advanced analytics team, and then we build the front-end app in either Shiny or Tableau off of that transformed data.
I see Arifath, you just asked another question that you mentioned I should read live. So, if possible, could you please shed some more light on the kind of post-model work that you perform? Sure. So for our price elasticities, we build a model of the entire market. It's not just price; it's also distribution, how many stores the product is in, how the category is performing, like how is whiskey doing, how is the overall spirits market doing, et cetera. And one of the things we do is go back through the past year and say, okay, if we compare the last 13 weeks of sales to the same 13-week period a year ago, and we apply our price elasticity to the actual pricing difference that we observed historically, what's the impact in actual case volume? So we're presenting our users with something where they can say, oh, if I'm looking at this most recent week, rolled up to account for some noise, if I'm looking at what's happening now, not just what was the price elasticity, but how did that actually impact me based on what was happening with prices in the market.
And we've also rolled out something where, using Shiny, users can upload "this is my pricing plan for next year" and see what the model predicts is going to happen. So the idea is that we're giving them as much of the model as possible. Part of that is because users would try to do these things with the price elasticity coefficients on their own. And not only does that take a lot more time if a bunch of different people are repeatedly doing the same math in different places, but it also was pretty error-prone, and people did not always understand the model. There are good reasons to use the entire model as a whole and not just look at what would happen if I changed the price, but actually enter a plan for price, competitors' prices, whiskey segments, et cetera.
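To make the post-model step concrete: under a constant-elasticity model, the predicted volume effect of an observed price move is just the price ratio raised to the elasticity. A sketch with purely hypothetical numbers (the prices, base volume, and elasticity here are invented for illustration):

```python
def volume_impact(base_cases, old_price, new_price, elasticity):
    """Predicted change in case volume from a price move, under a
    constant-elasticity demand model: Q1/Q0 = (P1/P0) ** elasticity."""
    ratio = (new_price / old_price) ** elasticity
    return base_cases * (ratio - 1)

# Hypothetical market: 10,000 cases sold over 13 weeks, shelf price moved
# from $22.99 to $24.99, and the model estimated an elasticity of -1.2.
delta = volume_impact(10_000, 22.99, 24.99, -1.2)
print(round(delta))  # roughly -950 cases attributable to the price change
```

This is the kind of "how did the price change actually impact me" number the conversation describes surfacing to users, instead of handing them a raw coefficient and letting everyone redo the arithmetic themselves.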
Scaling the team and hub-and-spoke model
Thank you. I was thinking about what you said, how you were one of the first data scientists on the team onboarded externally, but now you have about 14 data scientists. So I'm wondering what that growth looked like and how you scale data science out across the team.
Yeah. So I mentioned how the team came in initially. I think data science and R came to Brown-Forman from a financial analyst and someone in the production team at about the same time, and they were completely separate, working independently. And then we got a bit more interest in that initial pricing work within finance. I was involved in the R user group in Louisville, and I met someone who was doing that work, who had started in finance and was sort of the internal catalyst there. So I wound up on that team, which slowly expanded. Some of the expansion was combining with other teams that were already interested; we had a team that was doing a bit more visualization work and ultimately combined with them to get this core group of six. And then a big part of the expansion was adding spoke roles. So we run on a model where there's a hub of core data scientists, and they are just data scientists; they report into our advanced analytics team. But we also have folks who are, say, part of our production team, part of our U.S. commercial analytics team, part of our Australia pricing team, wherever there are managers who want to have folks doing data science but also directly reporting to their team. In those cases, what they get is to set the agenda for the projects that are worked on, and what they're provided with is, one, access to the tech stack, but also access to all of the data scientists in the hub, so that they're not this isolated person working alone, recreating the wheel every time.
And that is a pretty big source of growth, because it is taking existing spots and making them data science roles. The third source, I'd say, is that we've been fairly successful with year-long internship programs. We have three or four folks who have joined that way, where we've created new positions after they've done a year-long internship.
Marketing data science to colleagues
So I was just wondering, how much of your role is you going outside of your team saying, this is where you can improve, and trying to persuade them to make that improvement? And how much of it is them going, we've got a problem here; Nate and his team have solved a problem for someone else, they can probably help solve our problem, and then coming to you?
Yeah. So initially, when we first formed an advanced analytics data science team, there was a bit of saying, hey, this is stuff we can do, please come to us with problems like that. And that was our director at the time; I was a senior data scientist then, not really as involved with soliciting work from the rest of the organization. And now we're at a point where we have enough work without soliciting it. There are definitely times where I think it would be nice if the rest of the organization would do things in more of the data science way, of course, but we have enough people who want our help that I'm not going to push data science on people who don't want it yet. So we've been pretty much able to just work with the problems that people have.
The one exception to that, perhaps, is that we've been working on adoption. Within pricing, there's a lot of support from corporate and from some of the high-level folks, but then you get into the question of, well, are we reaching out to salespeople on the ground and to some of the smaller countries and smaller markets, doing some trainings there? That's largely our partners who actually do pricing reaching out to those other folks, but we have been trying to focus a lot of our tools on adoption. That comes both in the occasional training meeting, but also in documentation and the usability of tools. Most of our revisions are now focused not on adding new features, but on making existing features clearer and easier for end users to work with.
Shiny vs. Tableau: when to use which
Nate, I see there's two questions that are kind of similar around Shiny and Tableau. So one said, you've mentioned Shiny and Tableau for your team. What goes into deciding when to use which? And then I see that Ian also added, does it have to do with end user or functionality, for example?
Yeah. So it has to do with a lot of things. One of the first criteria, of course, is if it's not something you can do in Tableau, then we're going to use Shiny for it, and then there's no decision. Another one that sort of shortcuts it is that we think about not necessarily who the end user is, but who the end maintainer is. The reason the end user matters less to us is that we have a reasonably consistent design across Shiny and Tableau. Folks generally know we're going to have some filters on the left, we're going to have a nice header telling you what it does, it's going to incorporate a little bit of our branding, and then we're going to have some graphs in the main panel that react to the filters on the side. So we've never really had trouble with users being confused by Shiny versus Tableau.
And then we have a corporate center that links dashboards, and of course you can also just send people links directly. So it hasn't really been a problem for end users. Where it does matter: sometimes we as a team are responsible for initially building dashboards that we want other folks in the business to maintain, and if we want someone else to maintain it, then we're a lot more likely to use Tableau. Of our team of 14 data scientists, we probably have five who are pretty good at building Shiny apps and can build and maintain them, but that's it for the entire organization. Whereas we probably have about 100 folks who can at least build basic Tableau apps, and some are much more advanced and can build much more complex things. So in terms not of the users of a dashboard, but of the people maintaining dashboards, there's a much bigger base for Tableau. And so that can come into consideration.
The functionality has probably been the biggest thing. Using Shiny, you get version control, you get the ability to be a lot more flexible with your data and your setup, and you get the fact that you can really go end to end within a Shiny app. So in many cases it's a lot faster. If it's a small enough data transformation, I can just work within R the whole time, quickly build a Shiny app, and look at my results right away. I don't have to wait for the step of outputting to a CSV or to a database and then running Tableau on top of it.
So there's, I suppose, a bit less of a consideration, but still a relevant one: what does the person building the dashboard want to use, and what are they fastest at? Tableau can make some data transformations much more difficult than if you're working in R or Python; it is set up as a visualization tool. And I think Tableau is in fact an excellent visualization tool. The problems I observe in organizations are when they try to push tools too far outside of what they're designed for.
And so our most prominent Shiny apps, where we just couldn't do it in Tableau, are our forecasting work, where we want users to be able to interact with the forecast and set some of the predictions. That's where Shiny comes in, and then we run the forecast in the background. And while some of our pricing stuff is in Tableau, anything where we want users to be able to upload data and then interact with models and so forth is going to wind up being in Shiny. We just have a lot of additional flexibility that way.
Would you be able to, just for my understanding, give an example of something that would need to be a Shiny app because of the model, an exact example, versus something that would be in Tableau? The two biggest ones I've seen are, first, needing users to upload data. We tried something with some financial gaming in Tableau where people had to enter by hand, into input boxes, everything they wanted, and it just didn't get used, because it's a pain to enter data that way. Whereas in a Shiny app we've been able to say, here, you can download last year's data as a default, change the things you want to change, upload it, and see what our model says. The other one is connecting directly to outside APIs. Tableau probably has some of this, but in our institutional Tableau we can't add Tableau extensions, so we really are just dealing with base Tableau. For example, we wanted to pull Google Trends into a dashboard, and this is easy enough in R: there's a package for it, we pull it in, we make some nice graphs. That's something we just couldn't do in Tableau.
Thank you. Yeah, I think it's helpful to understand, because both are great tools, when you would use one versus the other, and to be able to communicate that out to the team too. Yeah, and in general, the more calculations you try to put into Tableau, the harder the workbook gets to use and maintain. Especially when people have to use other people's Tableau workbooks, going back through and unlayering and figuring out the dependencies, which calculation relies on which other three calculations, and what the actual math is, becomes fairly tricky. It's not the best tool for all of your data manipulations, even though it can do some of them.
Data science in marketing and causal inference
Hi, Nate. I find this talk super interesting. I actually used to work for Irish Distillers, so I was working on all the Irish whiskeys back in Ireland in the marketing area. And I noticed that they were really trying to make a big effort to bring in AI and ML, particularly around our packaging. I was just wondering, do either of your teams do anything in the marketing and advertising space in relation to data science and analytics?
Yeah, so hopefully I won't get myself in too much trouble with my answer here. I have talked to marketing and analytics, and the truth is, usually when they come to me with a question like, can you tell us what was the effect of this marketing campaign, my answer is no, I can't. I don't just say no up front; every time I've tried, I'm like, OK, what data do we have? Let's take a look, et cetera. But my answer does wind up being no, there's not enough. The short-term effects are usually too small to capture, and with the long-term effects, well, there's just too much going on.
But I also think it's interesting that you brought up where they want to go with AI and machine learning, because that's not at all really where I'm pushing them. I'm pushing them to run experiments. I think they should do A-B testing. I do not think adding AI or ML is going to fundamentally solve the problems we have of not having good data, having too many things going on at once. Yes, you're going to get some sort of pattern, but fundamentally, I think in order to actually evaluate marketing, you need to collect better data, and you need to actually be intentional about causal inference. And the easiest way to do this is going to be some sort of A-B testing.
I'll actually say more broadly, while AI and machine learning get a lot of hype, I know the common joke that 80% of the time the business just needs a SQL query with a group by, but the other 20% of the time they need something more advanced. I think what they need is causal inference. Most of the time, and this does vary, but at least in my line of work, most of the time the business wants to know: did my campaign cause something to happen? Did the price change cause something to happen? There are businesses I know where prediction is fine, like, is this part going to wear out in so many years? And obviously, if you're dealing with image data or text and so forth, AI and machine learning is a huge revelation there. But for a lot of your day-to-day business questions, you need data to be better organized, you need research design, and you need some understanding of causal inference.
That's a great question, Laura. I had a follow-up on that, too. If that is what you need, but you don't have it today, how do you communicate that to the team, or teach them maybe what data they should have? Yeah. We've been working on that, actually, with marketing and analytics, and a lot of this is internal education. We'll see how well it works; we're moving a little at a time. Our basic pitch has been, first of all, to just explain what A/B testing is for folks who aren't familiar. I think it tends to be a good entryway to understanding research design, because as research designs go, it's on the simpler end: we're going to have two groups, we're going to compare them, and you're going to need us to help you set up the groups. I'll also say you need to stress that, because there have been occasions in the past where folks did think they were doing some sort of test, and what they did was say, we're going to run a huge media campaign in the markets that are performing best right now and not in the other ones. They were doing this for strategic reasons, and it makes sense perhaps as a strategy, but it also completely invalidates any sort of test if you're shifting your funding based on which markets are performing well, rather than as a way to test the actual effect of expenditures.
Then we've also been trying to get them to just start small and on things that are less threatening. Any sort of test that could result in funding being cut, there's always going to be pushback on that, right? Right now, they're getting the funding. They're doing the marketing. Things are going well. Why do a test when that can pretty much only result in bad news for them? We're trying to introduce this as a way to just at first test two different creatives against each other. You're still going to run something, but why don't we run the more effective of the two creative materials? The hope is that that's going to get some research design involved, but this is stuff we're working on now, so you'll have to ask me in a year or two if this actually worked at all or if it just ran into a dead end.
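For reference, the simplest version of the creative-versus-creative test described above is a two-proportion z-test. A small sketch with invented conversion counts (the numbers are not from this conversation):

```python
import math
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between
    two creatives. Returns (z statistic, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical test: creative A converts 200 of 10,000 impressions,
# creative B converts 260 of 10,000.
z, p = two_proportion_z(200, 10_000, 260, 10_000)
print(round(z, 2), round(p, 4))  # B looks better, and p < 0.01
```

The key point from the conversation still holds: the math is the easy part. The test is only valid if the two groups were assigned comparably, which is exactly what goes wrong when funding is steered toward already-strong markets.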
Building solutions that scale across problems
Thanks, Nate. That's really helpful. Frank, I see you just put a question in the chat if you want to ask that one live. I do. Thanks, Rachel. Nate, I'm curious: is all the work that your teams do, as I'd put it, one problem to one solution? Meaning, you get with your stakeholders or your users and say, hey, what's the problem you need to solve? Okay, great, understand the problem, let's build a solution. Maybe you maintain that, maybe it's a one-time analysis. Or is there anything that you and your teams have worked on where you've been able to say, oh, there's a problem here, here, and here; they have very similar characteristics; can we build something that can serve all three of those groups?
So I'm wondering if anything pops to mind. Yeah. So certainly we have the same thing with stakeholders, where they have one problem in mind that they want us to solve. But I do think our price elasticity work falls into this category. One of the things is we initially insisted, well, we're going to need a bunch of data, because if we're going to understand the effect of price, we need to simultaneously understand the effect of other things that are going on. So as it turns out, in our quest to just model price elasticities, we also have a way to think about competitive sets, like who's actually competing with you on price. We have a way to think about distribution. We've also used it for bottle size changes.
And then one other issue that I'm trying to solve with modifications of the model, and this may sound weird since not all of you are in this industry, is that defining what a product is, is a non-trivial task. That's why I said it would sound weird. You think, oh, it's 750 milliliters of Jack Daniels. But if we change the design, is it the same? If we change the color just a little, is it the same or different? If we pair it with a shot glass, is it the same or different? Well, clearly it's different. And clearly it also is going to take away sales. So what do we do about the gift packaging, essentially, that we model all of the time? Is that a separate problem, or can we combine it with our price elasticity models?
So what we have tried to do, I think, in the spirit of this is we really are trying to get to a place where we have a model of the market and it answers a bunch of questions because it's taking a fairly holistic view of the market. So it can help us answer out-of-stock questions. It can help us answer questions about gift, about promotions, about all of this. And it's also one model. So from like zero to a hundred percent, where do you think you are from where you want to be? I'll go with 60 maybe. We actually have, I mean, most of this stuff works. We still have a bit of an issue there, but we are able to use it to look at counterfactual situations, look at actual situations. We have gotten fairly good adoption of let's look at the market this way. Right on. Cool. Thank you.
Team structure and specialization
Thanks, Nate. Arafat asked a question earlier: how do you divide up different kinds of work, for example data collection and processing, model building, and API development, within your team? Do you have dedicated people for certain tasks, or is it more like everyone does some of everything?
Yeah. So this is a great question because I know it's a very ongoing debate in like how to organize data science teams and if we should have specialists. And the answer within our team is some of both. We do want people to be able to go fairly end-to-end on data science products and also to get a chance to do different things. Like people stuck only developing front-end dashboards are people who are going to get tired of developing front-end dashboards eventually almost no matter how much they like it. So we do want people to get to rotate. Probably our biggest restriction is like we've got three people on our team who can actually push stuff to the database. So like in terms of administration, we do have some stuff there. We do have some folks who are a bit more on the data engineering side. And while we encourage people to develop skills across all things, certainly like not everyone is equally proficient across R, Python, Tableau, and SQL. And so what you have will depend on what people's skills are and what they're trying to develop.
But even across the tech stack, and it looks like the question also covers process and collection, model building, experimentation, maintenance, and so on, for the most part we tend to divide that by business area rather than by technology. So someone in production will actually be out at the production site thinking about the sensors and how that data gets into the database, and then they're also going to build the model based on the temperature data collected from those sensors. That doesn't mean they're on their own: within the hub we have some people who specialize more in modeling, and they'll talk to those people about what's an appropriate statistical model here. But they're going to code it, they're going to own it, and they're usually going to be responsible for maintaining what they build.
Marketing experiments and choosing between R and Python
Awesome, thank you. I love thinking about how like your data science team is actually like right there where they're making the whiskey too. So there's a few other anonymous questions too that came in earlier. And one was when you were talking a bit about marketing experiments, someone said could you please share any actual examples of the experiments run? I would love to because I would love to get marketing to run an actual experiment.
So no, this is still something we're trying to get them to do. I think one of the issues here is that I've said, hey, I've worked through this data, I've built out the model, and when I do all the reasonable things you can do to the data, I just don't see this effect. I can't detect it. That doesn't mean it's not there per se. But while I can say that internally, a consultant does not have to say that. They will tell you they can, in fact, see the effect of your marketing campaign. I had a lengthy back and forth with the consultant about their methodology, and essentially they're overfitting the data. They're getting there by pretty blatantly overfitting the data. But then we're in a situation where the consultant is giving a report that says, hey, what you're doing works, and I'm giving a report that says, hey, I don't know, you need better research design. And in that situation, you can see how the consultant's message is a bit more appealing. I think there's a chance we are eventually going to get to some internal research design, but we are just not there yet.
Thank you for the openness there too. Yeah, Zach, I see you're asking a bit about R and Python together. Would you mind asking that question live? Yep, sure. So a bit of background on me: I've only started coding in the last year, just over a year now, so I focus a lot on R. I can build quite efficient Shiny apps and do quite a lot in R, but I've not yet expanded to Python. Since you were just talking about Python and R together on the same team, would it benefit me, as someone looking to go into this industry, to learn Python as well as R, or do I just specialize in R as much as possible and get the skills required there?
Yeah, so very directly, I will say specialize in one or the other, and my personal bias would be R, unless you are very much into machine learning and AI. If you want to do data work, do statistics work, wrangle data, and be able to present it to stakeholders, then Shiny, R Markdown, and dplyr are things I don't think are quite matched in the Python ecosystem yet. And while the Python ecosystem has all sorts of great things for web scraping, machine learning, and so on, mostly you can do both in either language. But I do think there are still enough differences that I'll call those out as perhaps a reason to choose one over the other.
So the main reason we support both R and Python is recruiting: being able to hire data scientists whether what they know is Python or what they know is R. We do occasionally use both. One example is that we wanted to automate the scheduling of the tanks that hold the liquid before it's dumped into bottles, and it turns out Python had a pretty good package for that sort of automated scheduling that R did not have, but we wanted to deliver this as a Shiny app. So we used the reticulate package: we get the data in through the Shiny app in R, then a Python function does the actual scheduling on the back end and returns the results through R. A lot of our Python folks have wound up learning a bit of R, largely to present their final results, because we use R Markdown for reports pretty heavily and we also use Shiny apps, and there is a fluidity to that even for the Python folks. Sure, they could maybe do everything in Python and then present purely in Tableau, but that doesn't mix as well for a formal report if you also want the sort of writing you have in R Markdown. So they'll drop Python chunks in, but here again, this is mostly because that's what they're familiar with. R is really forced by the communication piece, and Python can be forced if there's a package you really want to use, but also just if that's what you're already familiar with.
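As a rough illustration of the pattern Nate describes, here is a hypothetical sketch of the kind of Python scheduling helper that an R Shiny app could call through reticulate. The function name, data shapes, and greedy heuristic are all invented for this example; Nate's team used an existing Python scheduling package rather than hand-rolled logic like this.

```python
# Hypothetical sketch of a Python scheduling helper that an R Shiny app
# could call via reticulate. Greedy earliest-available-tank assignment;
# names and data shapes are invented for illustration.

def schedule_batches(batches, tank_ids):
    """Assign each (batch_id, duration_hours) job to the tank that frees up first.

    batches  -- list of (batch_id, duration_hours) tuples
    tank_ids -- list of tank identifiers
    Returns a list of dicts: batch, tank, start, end (hours from now).
    """
    free_at = {tank: 0.0 for tank in tank_ids}   # when each tank is next free
    plan = []
    # Scheduling longer batches first tends to balance load under greedy assignment.
    for batch_id, hours in sorted(batches, key=lambda b: -b[1]):
        tank = min(free_at, key=free_at.get)     # earliest-available tank
        start = free_at[tank]
        free_at[tank] = start + hours
        plan.append({"batch": batch_id, "tank": tank,
                     "start": start, "end": start + hours})
    return plan

plan = schedule_batches([("A", 4), ("B", 2), ("C", 6), ("D", 3)],
                        ["tank1", "tank2"])
for job in plan:
    print(job)
```

On the R side, something like `reticulate::source_python("schedule.py")` would expose `schedule_batches()` to the Shiny server function, which could then render the returned plan as a table.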
Sysadmin relationships and infrastructure
So how do you cope as a data user? There are presumably also people looking after your big server who are doing updates and major upgrades to RStudio. How do you cope with that? Do you let them change the system, or do you have some way of running things in parallel? What's your relationship with your sysadmins? I'm speaking as a sysadmin. So we've got a guy who is supposed to be running SAP, but he did a lot of work with Linux servers in college and was willing to help us. So we have a part-time sysadmin who runs all of our stuff. We work together pretty closely, he's always very helpful, and we probably do not have everything locked down as well as we potentially should.
A couple weeks ago we ran into an issue with our R versions, between 3.6 and 4.0, where there was a miscommunication about which one we should be running on. I briefly mentioned one reason we want to switch our production work to RStudio Connect instead of RStudio Workbench: Connect is going to, by default, produce an isolated environment, and it's going to give the user a lot more control over what's in that environment. Within Workbench, yes, we can use tools like renv as a user to have our own libraries and our own systems in there, but there is also a base library, and right now that base library is what's used by the automated jobs. We need all of our automated jobs to be able to run on the same set of packages right now, so there's a reason we're going to move away from that as the number of automated jobs increases. But generally, yeah, we just have a really helpful sysadmin who's doing this as part of his job, and we could not do it without him.
So yes, thank you, and on behalf of data scientists everywhere: we appreciate sysadmins.
Love that. So taking that a bit further, if somebody within their team is trying to work out who should be managing that toolset, or communicating with sysadmins, Steve or Nate, do you have any advice for those listening in? So I'll describe what we're trying to do. We have Microsoft Teams channels where we're actually managing that. We'll have a whole group where we post announcement messages about major services and things, but then we'll have a support group where people ask questions. Rather than a ticketing system, which is sort of private between the person raising the ticket and the person handling the issue, what we try to do is get people to ask their questions publicly, usually in the hope that somebody else will answer the question for us and thus offload some of the work. And also, within Teams again, we'll have a little management channel where we discuss things like, do we really want to install this new version of R, or what do we do about library versions that are incompatible with something.
And what we're always trying to do is look at our infrastructure and ask whether we've got the right mix of machines. We're currently looking at migrating over to the cloud and more automation for building RStudio, using Terraform and Ansible, and trying to work out how to do all that. So it's a continuous journey. Yeah, that answer makes me happy, because it's the same for us, Google instead of Microsoft, but we have a server-use channel, and that is how we coordinate with our team and our sysadmin.
Evaluating forecasts and recommended resources
That's great, thank you. I know we're just at the top of the hour here; do you have time for one more question, Nate? Yeah. Okay, awesome. Shannon, I see you asked a question a little bit earlier, can I turn that over to you? I could also read it, or maybe Shannon had to drop, but the question was: do you ever go back to evaluate your predictions or forecasts against actuals? Yes, we do, and we do that pretty consistently. Our most formal process for that launched in May, so that doesn't have a ton of history yet, but when we build out our forecasting in the first place, we go back and look at what this method would have forecasted if we had done it, say, 10 years ago, rolling the window forward a year or so at a time. And if you look at Forecasting: Principles and Practice, I see someone has dropped a link there; there's also a third edition, you just need to change the two at the end of the URL to a three. It talks about how you can essentially cross-validate your forecast by rolling through what it would have been. We've also compared it to the forecasts that internal teams made. We were able to get hold of some of those and compare the model forecast to the human forecast as well.
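The rolling evaluation Nate mentions, also called rolling-origin or time series cross-validation in Forecasting: Principles and Practice, can be sketched like this. The naive mean forecast and the sales numbers below are stand-ins for illustration only; a real backtest would refit an actual forecasting model at each origin.

```python
# Rolling-origin (time series) cross-validation: refit on an expanding
# training window, forecast one step ahead, and score against the actual.
# The "model" here is a naive mean forecast, used only as a placeholder.

def rolling_origin_errors(series, min_train=5):
    """One-step-ahead absolute errors from an expanding training window."""
    errors = []
    for t in range(min_train, len(series)):
        train = series[:t]
        forecast = sum(train) / len(train)   # stand-in for a real model
        errors.append(abs(series[t] - forecast))
    return errors

# Made-up weekly sales figures
sales = [100, 104, 98, 107, 111, 109, 115, 112, 118, 121]
errs = rolling_origin_errors(sales)
mae = sum(errs) / len(errs)
print(f"one-step-ahead MAE: {mae:.1f}")
```

Because every forecast uses only data available at that point in time, this gives an honest estimate of how the method would have performed historically, which is also the basis for comparing model forecasts against the human forecasts Nate mentions.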
Great, thank you. So aside from that forecasting book, are there other resources you'd recommend we take a look at? I like reading R Weekly and Data Elixir as newsletters, and of course the RStudio blog to keep up with some of the new open source stuff coming out there. I think so many of my actual recommendations, though, would be scattered: what would I recommend to a new user, what would I recommend for forecasting, what would I recommend for causal inference or price elasticities, and so on.
So yeah, that makes sense. Well, thank you so much, Nate. I think we've answered all the questions from everyone; unless there's anything I missed, please stop me now. Thank you so much for sharing your insights with us and answering everybody's questions. The last thing I'd like to ask: if anybody has follow-up questions for you, what's the best way to get in touch, LinkedIn or Twitter? Probably Twitter, I guess. I can just put my handle in the chat if that works. Perfect. Well, thank you so much, Nate, really appreciate it, and have a great rest of the day, everyone.
