Data Science Hangout | Tanya Cashorali, TCB Analytics | Saving millions with a Shiny app

Jun 22, 2022
1:03:46

Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, welcome to the Data Science Hangout. If you're joining for the first time, I'm Rachel, and it's great to meet you. If you've been here before, you've heard this spiel a bunch of times before. But if this is your first Hangout, this is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, and what's really going on in the world of data science.

So the sessions are recorded and shared to the RStudio YouTube, as well as the Data Science Hangout site. So you can always go back and rewatch and find helpful resources too. I'll say it up front as well, because I always forget, we do have a LinkedIn group as well for the Hangout. So if you ever want to connect with people there or continue a certain discussion, you can use that too.

We always want to create spaces where everybody can participate and we can hear from everyone. So there's three ways that you can ask questions today. All this is audience led. You can jump in by raising your hand on Zoom. You can put questions into the Zoom chat. And feel free to just put a little star next to it if you want me to read it out loud. Or I could just call on you to introduce yourself and add some context. And then also we have a Slido link where you can ask questions anonymously too.

But with all that, today I'm so happy to be joined by my co-host. And I'll say, so excited that I even announced on Twitter a day ahead that I thought yesterday was Thursday. But so excited to be joined by my co-host for today, Tanya Cashorali. Tanya is CEO and founder of TCB Analytics. And Tanya, I like to kick these off with just having you introduce yourself and sharing a little bit about the work that you do.

Tanya's background and career path

I got my career started, I dual majored in computer science and biology. So I started in the bioinformatics world. So probably unsurprisingly, that's how I got started using R. It was about 2005 or so, believe it or not. And I was working at a children's hospital informatics program for a bunch of really smart PhDs and had no clue what R was. He told me, go home this summer and read about R. So I actually printed like the entirety of the CRAN documentation and took it home and thought, oh man, I was in over my head, but yeah, I had some really great mentors and learned a lot.

And then I worked in biotech for a few years, a bunch of startups, and then I did some non-healthcare stuff, a telecom startup, found myself back in healthcare at Biogen, which was my first big company that I worked at, and always knew I wanted to consult and kind of do stuff on the side. So I was doing that and only lasted there nine months, big company, just like wasn't for me, even though I learned a lot and had a great time, met a lot of cool people. And then I started TCB Analytics in 2015 or so. And we have clients in kind of all industries, pharma, we work with financial trading companies. We worked with the Department of Defense and we did some pretty cool AI machine learning stuff for the Army. So really all over the map and just having fun because we get to kind of pick and choose our projects and I get to work with friends.

Yeah, I, you know, I started, I was thinking about like what tools and stuff are exciting, but I don't even think it's tooling or anything like that. It's really that it's becoming more widespread and known and I feel like it's getting evangelized more. And the more people I talk to, a lot of you have probably had this experience before, it's like, what the heck is data science? I have no idea what you're talking about. And usually I use the sports example because everyone kind of knows the movie Moneyball and fantasy sports.

Like I thought years ago that CEOs at every company should be familiar enough with data and the basics of data analysis, the same way that they were expected to be familiar with email, like 30 years ago, everyone started using email. I think data is getting to that same point.

Consulting challenges and working with clients

It's always people. It's never R, it does everything. I can make R and Shiny do anything, but it comes down to, I think, I could tell when a client kind of knows exactly what they want. And most of our prospects and customers come to us saying, we know we need this. We love Shiny. We want to do things with Shiny and data. But I think the hardest part is getting people to understand why they're doing what they're doing. There's so many times where people just want to put all the data into a data lake or let's join all these tables and I'm like, why?

So I'm always asking, well, what questions are you trying to answer? And more importantly, what action will be taken? And this sounds probably like a trope to everyone on this call, but it's amazing how many people and clients just, they want to check a box or they want to do the fun data thing, but they don't have any valuable business reasons for doing it.

And then another thing, it's always scoping. I feel like with data products, it's much different from a web application because there's so many possibilities. As we're building things, we constantly run into situations and scenarios we could not have predicted until we start building it. So that's always tough, but we try to mitigate that by doing either retainer-based contracts or being really conservative with the fixed price and really scoping it out well.

Managing scope creep

So it comes down to, I think if you can itemize everything, you know, whether it's in GitHub issues or just JIRA, anywhere, and then really prioritize that list. So I find it's really hard to do in a phase one because you're learning so much. So we try to estimate in like an exploratory phase, sometimes one month long even, to properly scope. And that always, you know, it's never perfect, but I find for the second phase, and we just did this with a project, we know a lot more, and they started asking for so many things. We kept a backlog and then we have our consultants go in and estimate.

And we always pretty much, I wouldn't say double, but we add some padding onto that for uncertainties. And then you just throw it back to the client and say, look, you could purchase, you know, block of hours, 50 hours, a hundred hours, but here's the list. And now you put the onus on them to prioritize and say, okay, well, we definitely want this and this. And then even on top of that, I typically recommend another block of miscellaneous hours for like unknowns that come up.

So we try to balance that with every project and I've just learned more and more over the years of doing it. It really just has taken a lot of trial and error and I've messed up for sure and learned lessons from it.

So one thing I like to do is, again, I always say, I try to say like what are people going to pay for? Like what are you building that your clients actually will put money towards? Number two is, again, what will they do with this information? Can they take action from it? An example I can think of is like when you log into E-Trade or Vanguard and you look at your stocks and it says that you've got a, you know, 100% gain on this stock, well, what is that? What do I want to do about that? I want to sell it. So typically there's a sell button like right there. You never have to like look and browse through five more pages to get to the sell button.

So I think that's a couple of questions I try to ask consistently is, you know, what will your clients pay for and what decisions are they trying to make? And then I'll do that same exercise with the people that are really like driving product management or the data analysis. So we'll actually sit with a mock-up, Balsamiq or, you know, name your favorite mock-up tool and we'll talk through it and we'll lay it out. And it's amazing the questions that come up and once people start thinking about it and seeing it and they think, oh yeah, actually we need a filter here because I didn't think that, you know, the date range would matter.

We need a minimum threshold filter because we have so few data points in this state. But those are things that I learned as I build things in Shiny, to be honest. Just building it and showing it.

Handling difficult client scenarios

Yeah, that's like kind of where the art comes in, I think, and also getting the right people involved on the client side. So if there's a subject matter expert, I always try to work with them. Or if there's someone that I know is a decision maker, because, yeah, it can be tough.

I think it's important not to also rush to building something. There's definitely rapid prototyping, but then there's like, well, let's actually take a minute and think. I try to list out all the most common questions and use cases and then also understand who are the end users. Because if this is a CMO using it versus a data analyst, one's going to be a very exploratory design or dashboard, and another one's going to be explanatory. Very simple, high-level KPIs, maybe some drill downs, right?

That's why I like the Shiny prototyping, actually. Because you can kind of iterate until they say, OK, this is right, versus like a full software build that could take months. And then you're kind of stuck with this big React thing and custom D3, and it's not the right thing.

Yeah, I'd say it really comes down to I keep revisiting the SOW and the scope and saying I literally will map it out. Here's what we said the deliverable was. Here's what we did. Here's a deliverable. Here's what we did. Here are some things we even did that were out of scope. So hopefully, you can get to a point where you can justify the end phase by using that SOW.

Now, the problem that we've had is it actually wasn't that clear. The SOW is very high level and vague. So then you run into a scenario where you're doing a whole bunch of extra work, and you're cutting into margins, and that sucks. So you try to wrap it up as quickly as possible with the minimum viable product or solution, and then go on to the next phase.

I always try to tell people, they're paying for your experience. Just because it takes you a couple of days, it would take their team maybe never, or a few weeks or months. So always value-based price as much as possible and not just your time. Because you've been learning, we've been learning this stuff and researching and doing it for years and years and years, right? That's what they're paying for.

Sharing apps and biggest client wins

I'm a huge fan of shinyapps.io. So I have probably 10 client apps on shinyapps.io. We've been totally good with just the, I think, $100 a month package that offers a certain number of hours and authentication. So it's a matter of literally inviting someone via their work email address to the Shiny app. And we've had as many as, like, probably 30 people on shinyapps.io and it totally handled everything great for one app.

And then some other client apps where there's only a few people, but they're power users and they're constantly using it. So shinyapps.io has been, I haven't needed RStudio Connect yet, but we do help clients that have their own RStudio Connect internally. And then we just work with their internal RStudio Connect team.

I think that there's been a couple, one of them, it doesn't seem as, it doesn't sound super cool, but there was a small pharma company that was analyzing their clinical trial data and trying to understand what's called protocol deviations. So when a patient either disqualifies or doesn't get enrolled and something went wrong in the trial, whether it's like wrong inclusion criteria, like they had the wrong cholesterol measurement and they were accepted or something like that, they were doing all this super manually generating reports in Excel and putting together a PowerPoint weekly. And it probably took like 20 hours of multiple people's time.

We made a Shiny app and turned that into like pretty much no time at all. They sent us the data, it was updated monthly and they were able to take that to, you know, senior management and clinical trial managers and make decisions based off that data super quickly and understand which clinical trial sites were having problems, what were the biggest problems. And that is just, I think, huge versus like hand crafting all these charts over and over and over and wrangling the data. So we've had a lot of wins just from moving someone off Excel into a reproducible like R pipeline and then displaying all the results in Shiny.

Another similar one I wanted to mention is another pharmaceutical company. In drug manufacturing, there are a lot of things that can go wrong. And when there's a contaminant or something in a batch, it can basically, it's millions and millions of dollars of company loss because they shut down manufacturing completely until they identify the problem. And that involves going up and downstream of these different kind of drug products. And, you know, two of these molecules make something and then it goes into something else and something else. So it was taking a team of like five to 10 people, sometimes six months to identify the problem, meaning you're not manufacturing drugs at that time.

We built a Shiny app that built out a kind of D3 directed graph. So it enabled one person to go and type in the drug compound, see everything up and downstream. And it took one person now maybe a week or several days to identify that problem. So that's, you know, that's an instance where you could say this is literally millions of dollars of savings from a Shiny app, right? So that to me is cool. And that just shows the power of R and the ability to really streamline manual processes like that.
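The up-and-downstream lookup described here can be sketched as a traversal over a directed graph. This is a minimal illustration in Python rather than the R, Shiny, and D3 stack the client app used; the compound names, the `EDGES` list, and the `trace` helper are all hypothetical and only show the idea of typing in one compound and getting back everything that feeds into it and everything it feeds.

```python
from collections import defaultdict, deque

# Hypothetical edges: (input material, product it goes into).
EDGES = [
    ("molecule_a", "intermediate_x"),
    ("molecule_b", "intermediate_x"),
    ("intermediate_x", "drug_product_1"),
    ("intermediate_x", "drug_product_2"),
    ("molecule_c", "drug_product_2"),
]


def build_graph(edges):
    """Return (downstream, upstream) adjacency maps for the edge list."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(adjacency, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only relevant if the graph contains a cycle
    return seen


def trace(edges, compound):
    """Everything upstream and downstream of one compound."""
    downstream, upstream = build_graph(edges)
    return {
        "upstream": sorted(reachable(upstream, compound)),
        "downstream": sorted(reachable(downstream, compound)),
    }


if __name__ == "__main__":
    print(trace(EDGES, "intermediate_x"))
```

In an app, the result of `trace` would drive which nodes are highlighted in the graph view, so one person can narrow down where a contaminated batch could have come from and where it could have gone.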

Types of work and the AI vs. machine learning distinction

So like everything, basically. I always tell people, I would say we really only do machine learning, statistical modeling, maybe 20% of the time. A lot of work we do is, like you said, data summaries, visualization, automating manual processes and moving them out of Excel, a lot of Shiny app, Shiny development work.

I could give an example about the Department of Defense. They had budget for a big AI machine learning effort. We did actually end up doing some pretty cool machine learning for them. But like you said, a lot of companies think they need machine learning or AI, and they really don't, like they just need some summary statistics. And we can talk them off the ledge with that by usually just showing them what can be done. And then they see it and they're happy. They realize, okay, maybe that was, maybe that's down the road.

But for the Department of Defense, they had an inventory problem. So every time anything is purchased in the Army, anything from back to front, anything from batteries to laptops to Apache helicopter parts, it's all entered into a database. And it's entered pretty much manually in the field from a laptop. So there's a lot of typos and fat fingering, zeros that are O's, ones that are L's. And we're talking like 1.2, I want to say it was billion records or something. So there's no easy way to go in and just fuzzy match or deduplicate. So if the Army doesn't know what's readily on hand in the inventory, how can they be ready for certain situations?

So we were brought in to do that, did some really cool, we used some genetic algorithms. I don't remember off the top of my head exactly which ones, but ways of comparing different fields and then quantifying things like the keyboard distance between different keys, common typos. And then using Levenshtein distance for fuzzy matching and combining all these things into a very custom machine learning solution for them. And then using a Shiny app to expose the results. So we found a whole bunch of items that were clustered together because they were very similar, but they were duplicates. And so then we worked on scaling that with them and helping them find all those.
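One way to picture combining keyboard distance and common typos with Levenshtein distance is an edit distance where substitutions are cheaper when the two characters are easy to confuse. The sketch below is only an assumption about how such a score could look: it is not the project's actual method (the genetic algorithms mentioned above are not shown), the weights 0.2 and 0.5 are made up, and the keyboard is treated as a plain grid.

```python
# Keyboard rows treated as a simple grid (an assumption; real keys are staggered).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

# Character pairs that are easy to confuse when typing part numbers.
LOOKALIKES = {frozenset("0o"), frozenset("1l")}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing a with b; the weights are illustrative, not tuned."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if frozenset((a, b)) in LOOKALIKES:
        return 0.2  # zero typed as O, one typed as L
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5  # neighbouring keys, a likely fat-finger typo
    return 1.0


def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with typo-aware substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Scale the distance by the longer string's length; 1.0 means no edits."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(s, t) / longest


if __name__ == "__main__":
    for a, b in [("BATTERY 100", "BATTERY 1OO"), ("laptop", "lsptop"), ("laptop", "lpptop")]:
        print(a, "|", b, "|", round(similarity(a, b), 3))
```

Pairs that score above some threshold would be candidates for the same cluster, which a person then confirms or rejects in the app. Comparing every pair does not scale to a billion records, so a real pipeline would also need some way of limiting which records get compared.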

Now the AI part, we kept explaining to them that's way off in the distance, right? You have to do the machine learning first, have a human in the loop. That's what the Shiny app did. It helped us then visualize the results and then agree with them and say, this is right, this is not. And we kept explaining to them, like, that builds up your training set. So have someone come in, there is manual work involved. This isn't a magic button where a robot comes in and just does it all. So I think it's the hump between machine learning and AI that's really complicated because there's no way you turn on a switch and just let the machine automatically do this stuff, especially if it's very important, let's say healthcare related data.

Sensitive data and HIPAA compliance

We've struggled in the past with, you know, especially with the drug manufacturing work we're doing. I had to do some research on making sure it was all, I can't remember the word now. It's not, well, FDA, you know, there's FDA regulations and there's all kinds of things you have to comply with. But for the most part, we haven't, you know, run into any issues where we just literally can't do something. People are typically trustworthy now or trusting of AWS and they have HIPAA compliant clusters. There's also a GovCloud. So we use that for the DoD. Luckily, the data we're working with, nothing was classified. So that helped.

But yeah, I think there's a lot of stuff out there now to support it. R can be validated in a clinical environment as well. And usually it involves just writing proper unit tests and making sure things are reproducible. So we've had the discussions, but it's never been like a showstopper. And usually it comes down to, well, we need everything in our cloud. So they have their own RStudio Connect and it's behind their VPN and firewall. But yeah, the AWS HIPAA clusters we've used.

Package management and Shiny deployment

So actually, most recently, we started using golem, which helps you build production-grade Shiny applications and gives you a whole bunch of stuff like testing and documentation. But we didn't have a problem. We built our own package that did some custom D3 visualization for a pharma company. And we didn't deploy it on shinyapps.io, but they deployed it internally to their RStudio Connect server, and there were no problems there. So I haven't had any issues yet, or I haven't deployed my own package to shinyapps.io. But using golem, we had no problems deploying our own package to RStudio Connect.

Visiting client sites and contextual knowledge

Well, during a pandemic, of course, it was all remote. But yeah, we're a remote-based company. However, yeah, I used to go into client sites probably at least a few times a week. And oftentimes, yeah, I do think it's helpful. Not only can you understand how the data is being generated sometimes, but understanding the politics of situations too, where people work, how they work together, how often they're talking. There's just a lot of things you can pick up on when you are on site.

And of course, whiteboarding things is really nice. But one example I can think of that was pretty cool was we work for a health care devices company that creates the... They manufacture the rapid blood testing kits in hospitals. So if you're at a hospital, hopefully you're not, you get your blood drawn and they have the results very rapidly. That's the company that we do work for. And their problem was that they had all these faulty devices going out into the field at times. Things happen with hardware, it fails. But there's a ton of sensor data coming off of those machines. And so we were brought in to do some anomaly detection, identify some spikes in particular proteins or problems before that hardware, that blood testing machine, gets out into the field at the various hospitals.

And I got to tour that lab and it was a massive place with just all these machines. And they showed me where the data's coming off of and how it gets put into a database. And I think just having that contextual knowledge and meeting with the people you're working with is invaluable. And I was definitely a person that did not enjoy going to an office every single day, but I really loved being able to go to different client sites a few times a week. So I do miss it after this pandemic. Now I think I'll appreciate it even more, but yeah, it's super valuable.

Starting your own company

Well, there's been challenges. The first, when I gave my two weeks, there's obviously a huge panic. Like, wow, the checks are going to stop coming in. And there's that panic that sets in like, OK, this is really all up to me now. Now, if you have a spouse, of course, that helps. Like I had health insurance because I was able to go on my wife's insurance. And obviously, that helps. And I saved a bunch of money before making the leap. I had an emergency fund and all that. But that's scary. I would say that's the hardest part, really just making the leap. But two weeks into it, I never looked back. I was like, this is it. This is awesome. It's working out.

And you just got to keep going. And it's nice to know it's all kind of on you. And you can't really blame other people now if things fail. But it's nice to know that you're kind of in charge of if you want to make more money one year, you work a lot more. If you don't care and you're OK, then that's fine. And I'm definitely a person, I'm enjoying work-life balance a lot more now. So I try not to overload myself.

But in consulting, it's also waves. It totally always happens that suddenly we have like 10 leads come in. And sometimes I have a hard time saying no because it's such an interesting project. And I want to work with these people. So I'll pull in more consultants to help if I need to.

And then also not having stuff like this, which is why I'm so glad that you do these Hangouts. You don't interact with people every day in an office, obviously. Sometimes it's nice to just talk to other data nerds. So I have some friends that I talk to that also consult, and we'll kind of commiserate about problems or ask for advice. And the other thing I did was I started a Slack community of data scientists. So when I went out on my own, I knew I wanted to keep in touch with all these awesome people that I met throughout the years, whether at conferences or at jobs, college friends even. And so I invited them to a Slack community. It's kind of grown organically over the years. It's called Friendly Tech Space. And you can feel free to message me if you'd like an invite.

Evangelizing R and imposter syndrome

I think Shiny is always the game changer. People see that. And they're like, ooh, you know. I'd love to hear from someone else how you've evangelized R, what you've done being the solo R person.

Yeah, and I think we've taught a lot of R, too. And one thing that makes people kind of get it is we do this exercise where you give half the class a task to do in Excel, and half the class does it in R. And then we just change a parameter. We say, OK, now do the same thing, but for this baseball player instead of this one. And the R side is done in like two seconds. And they just change one thing at the top of the script. The Excel people are like going to get the data again, and having to wrangle it and do things. And it really gets people, I think, to like, oh, I get what scripting is now and why it's fast and reproducible.
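The "change one thing at the top of the script" idea from this exercise can be sketched as follows. It is shown in Python here rather than R, and the player labels and numbers are made-up placeholders, not real statistics; the only point is that the whole analysis reruns for a different player by editing the single `PLAYER` line.

```python
# The one thing to change between runs (hypothetical player label).
PLAYER = "Player A"

# Made-up rows standing in for a real data pull: (player, at_bats, hits).
GAMES = [
    ("Player A", 4, 2),
    ("Player A", 5, 1),
    ("Player B", 3, 0),
    ("Player B", 4, 3),
]


def batting_average(games, player):
    """Total hits divided by total at-bats for one player."""
    at_bats = sum(ab for name, ab, _ in games if name == player)
    hits = sum(h for name, _, h in games if name == player)
    if at_bats == 0:
        raise ValueError(f"no at-bats recorded for {player!r}")
    return hits / at_bats


if __name__ == "__main__":
    print(f"{PLAYER}: {batting_average(GAMES, PLAYER):.3f}")
```

The spreadsheet version of the same task means fetching and wrangling the data again by hand, which is the contrast the exercise is designed to show.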

Yeah, I think if anyone has an experience that they're doing something wrong, like, because, I mean, there's so many talented people out there. And, you know, at my first job especially, like I said, it was all PhD-level statisticians from Cornell. And I'd be in three-hour-long conversations with them debating the use of a certain prior for a Bayesian algorithm. Anyway, yeah, of course. And I think what you have to realize too is what people are putting out there, and people on Twitter, oftentimes they're focused very much on one thing, or it's their entire career and their job is based around it.

And I've learned, like, I don't have to be the best at any one thing, but I can focus on what I am good at. And if that's helping, you know, with the technical communication between the business and the tech people, then I'll focus there. Like, I don't have to be the best coder. I can bring in someone that I know is a D3 wizard, or I can bring in someone that I know is great at optimizing code. I've always felt like I know enough to be dangerous. But yeah, I think everyone deals with it. And at the end of the day, you know, there's way too much to know to absolutely feel like you're an expert at almost anything.

Like, when I do time series forecasting for a client, I use packages that are available. I try to, you know, if I need to consult with a statistician, one of my good friends that I trust, I will. But there are people that spend their entire lives researching just one segment of time series forecasting, right? So, yeah, just keep learning, and it's impossible to know it all. And remember to keep the confidence that you should have, because if you're using R at all, you're already way ahead of, you know, most of the population.

This was great. Thank you for having me. The questions are awesome. And I could talk about this stuff all day. So I didn't even know about this event for some reason. So I'm going to probably try to join on Thursdays now when I can.