The importance of relevant analytics | Nechama Katan @ Pfizer | Data Science Hangout

video
Oct 22, 2024
57:28

Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout. If we haven't had the chance to meet before, I'm Rachel, I lead Customer Marketing at Posit, and I've learned from a friend at the Hangout that it's helpful for me to actually let people know that Posit is the company formerly called RStudio, and we build enterprise solutions and open source tools for people who do data science with R and Python.

And Libby, I'll have you introduce yourself too.

I am Libby. I help out with customer community in the Hangout over here at Posit, and I'm also a Posit Academy mentor. I help people learn R and Python to help them work with data better in their jobs. And I am an avid sewer and reader and watcher of science fiction things.

Happy to have you all joining us today. The Hangout is our open space to hear what's going on in the world of data across all different industries to chat about data science leadership and connect with others facing similar things as you. And we get together here every Thursday at the same time, same place.

At the Hangout, we love hearing from you no matter your years of experience, your titles, industry, or languages that you work in. And we really encourage you to connect with other attendees here at the Hangout in the chat.

And there are three ways to jump in and ask questions today because this is a community question-powered conversation. We really, really want you to ask questions. You can raise your hand on Zoom. We will call on you to jump in live. You can put questions in the Zoom chat, and if you can't ask them yourself, if maybe your mic doesn't work or you're somewhere loud, feel free to just put an asterisk next to it so that we will read it for you. And then we also have a Slido link where you can ask questions anonymously.

And with that, I am so excited today to introduce our co-host, which is Nechama Katan from Pfizer. She's the Director of Innovative Data Analytics. And Nechama, I would love for you to get started by just telling us a little bit about you, what you do, and what you like to do outside of work.

Introducing Nechama Katan

So first of all, thank you for having me. I've known about Disconnect for a while, and I'm excited to talk to all of you and to get your feedback as well on what works. So I work in the clinical trial data space, and my role is coming up with technical solutions and mentoring people around data cleaning, particularly for space monitoring, which is a statistical process to control data coming out of clinical trials.

Outside of work, I have a number of hats I wear. I have way too many children, literally. And I also do a lot of creative type of stuff. So woodworking, fixing up houses, remodeling kitchens. This week, I'm crocheting hats. I have a friend who unfortunately passed away of cancer recently, and she used to crochet hats to donate to homeless shelters and other types of support groups. And so I'm making hats in her memory. It's a Jewish tradition to do things to keep people's memories alive. So the hats are all for Cindy Bailey.

So that's kind of who I am, where I came from. I have an undergraduate degree in math and graduate degrees in math and statistics from New York schools, and then ended up at Intel, because I was living on the West Coast, where I did process control in their factories and performance analysis. I bounced around a number of different industries, but kind of my common theme in my career has been helping people use their data, be it manufacturing data, banking data, or pharmaceutical data in my case.

I'm fascinated by why it is that people have data and then don't use it, myself included. We all know how to be rich and thin, and we don't do it, right? So I'm as interested in the analysis as I am interested in how do you present analysis in a way that action will actually happen? So I spend a lot of time trying to think about that, and why is it that people are so averse to looking at data and understanding it?

The path into data

So let me start with how I ended up as a math major. My mother was a philosopher. I didn't want to do the hard sciences. I was told I couldn't write because I had a learning disability, so there went all the humanities, and I didn't like physics. And computer science, because of the learning disabilities, was outside of my scope. Prior to modern computer environments where everything is color-coded, if you don't read syntax, you can't code. So I'm old enough that the vi editor and I were not friends. So I ended up in math by default.

I think I'm the only person in the world who ended up with a theoretical math degree because it was the only thing I could figure out how I could get through. I got kicked out of graduate school because of those children I mentioned. They told me I wasn't sufficiently focused. And in Portland, where I landed, Intel was the largest employer. In graduate school, because I went to Columbia, I actually got a statistics degree and had seen one data set with 20 rows of data in it. So I learned data on the fly because I needed a job.

So I called up hiring managers. I called up three hiring managers a week off an Intel listing that I had gotten from an internal source and talked until someone said, oh, you can do statistics, come do process control. And I said, okay. So lots of just kind of what happened. And then I've stayed in data because it's fascinating to me. It speaks to me and I can do it.

Relevant analytics: starting with the business question

So it's easy to do analytics, right? You take a bunch of data and we've all done that. You open up a data set and you're like, oh, cool. I can take a standard deviation. I can take a descriptive statistic. I can plot it. I can do all sorts of stuff. And then the question is, okay, great. Now what?

I'll give an example from Intel. Years ago, I did a cool 3D model of transistor density on Intel and competitive chips. Looked beautiful. It was in math, in a cool graphing software. It looked like a city. And someone looked at me and said, so what am I going to do with this? I said, I don't know. And a lot of analytical people say that. They say, well, I don't know. It's data. I'm not going to do anything with it.

I am really focused now on kind of the concept that technology on its own doesn't do you any good. I hate to say it. You guys can write the best app on the planet. It can make coffee. It can do everything anyone ever wants. But if it doesn't answer an actual business question, it's not going to help.

So what typically happens in most data spaces is that people spend all this money on this technology. Nobody knows how to use it. And then the fact that no one uses it gets blamed on bad change management. And well, the users didn't want to change, right? You've heard this, right? If the users would just use it, it would have fixed all their problems. And the users won't use it because it's about 80% correct. In most cases, if you're really, really, really good and you're not talking to end users every single day, you're going to be at best 80% relevant. And that's not enough for them to make business decisions.

So I'm talking about a model nowadays of start with business context. So what are the business questions that the business needs to answer? And then what is the data that actually supports those questions? Not the data that you can get.

So let me give an example from a conversation I had with someone outside of work. A call center asked a question: how many concurrent calls do you have? How many calls do you have at the same time? Because they needed to know how many call center employees to have. So the data people gave them what? They gave them the average duration of a call. Actually, I think what they were given was how many calls come in in 15 minutes, because that was what was easy to pull out of the data warehouse. But the number of calls that come in in 15 minutes at a call center didn't capture the long tail, the extremely long calls that were for customers who were stuck someplace and desperately needed to talk to someone. So they hired the wrong number of call center people. And the business now had a problem because they couldn't meet their customer needs, right? They had a great answer to the wrong problem.
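The gap between the two questions in the call center story can be made concrete with a small sketch. The call log below is invented for illustration (start minute and duration in minutes), and the function names are hypothetical. It compares "calls arriving per 15 minutes" with "calls in progress at the same time".

```python
from collections import Counter

# Hypothetical call log: (start_minute, duration_minutes). Invented data.
calls = [
    (0, 3), (2, 4), (5, 2), (8, 3),
    (10, 90),  # one long-tail call from a stuck customer
    (16, 3), (20, 5), (22, 4), (31, 3), (33, 2),
]


def arrivals_per_window(calls, window=15):
    """Count how many calls START in each window (the easy warehouse query)."""
    return dict(Counter(start // window for start, _ in calls))


def peak_concurrent(calls):
    """Largest number of calls in progress at the same moment (sweep line)."""
    events = []
    for start, duration in calls:
        events.append((start, 1))              # call begins
        events.append((start + duration, -1))  # call ends
    # Sorting tuples puts -1 before +1 at equal times, so a call that ends
    # exactly when another starts is not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


arrivals = arrivals_per_window(calls)
peak_all = peak_concurrent(calls)
# Same log with the long-tail call removed, for comparison.
peak_without_long = peak_concurrent([c for c in calls if c[1] < 60])

print("arrivals per 15-minute window:", arrivals)
print("peak concurrent calls:", peak_all)
print("peak concurrent calls without the long call:", peak_without_long)
```

In this toy log the arrival counts fall from window to window, yet the peak overlap happens in the later, quieter windows because the long call is still holding a line. Arrival counts alone would not show that.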

So you need to understand what the questions are and prioritize them. Not all questions are as important. You need to understand why that question was important as well. So you have to know the priority and know what your data is. Once you've got your highest priority questions, you answer the highest priority question for which you have data first, and then you build a data pipeline to put in all the other data that you need.

Business people vs. programmers

So I love computer science programmers who can program. The challenge is that if I take a consultant, let's take a consultant rather than a generic full-time programmer. So if I take a consultant who's working in my environment and I throw a problem at them, I need to give them every single little detail. I was on a call earlier today. Someone was like, oh, I've got three data files, and they asked the data guy for access to the data and the data files, and the guy's like, I don't know what you're talking about. These are end system files. Some of them were manually made up by the business. So he's off asking questions that don't make any sense.

In my experience, if I have a business person or a business curious technologist, that's the same thing, right? So someone who's curious about technology, and they are curious about the business. Someone who's curious about the business and knows some technology, what they are able to do is program what needs to be programmed, and they don't need to have a full-time business person babysitting them. And that full-time business person doesn't understand the technology, so then the relationship gets really frustrating, right?

You've got a person who's using Excel, so think horses and buggies on a country road, and you've got a guy driving a Porsche, and the guy in the Porsche is saying, well, where do you want the radio? And the guy in the horse and buggy is saying, radio, what's a radio? And the third time they say, what's a radio, and the guy says, oh, don't worry about it, I'll just put it to the right of the dashboard. They stop asking the question, and so now you've got this car that makes no sense, but it's the requirements, but you gave me requirements. I didn't know what requirements I was giving you.

And so in my experience, a person who's writing code 30% of their time, who's willing to write code, will outperform a full-time programmer every single time.

Teaching statistics

I think that for each statistical concept, you need to have a very simple canonical example. So let me take an example. What's the average, right? So what's an average? Well, it's a physical measurement. I'm looking at my crochet hook, and if I put my finger here, that's the average, right? There's half the amount of weight is on each side of the crochet hook. So now I know what an average is.
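One way to make the balance-point picture concrete is a short sketch with made-up numbers. It assumes equal weights placed at hypothetical positions along a rod. Read strictly, the balance point is where the pulls (distance from the finger, left versus right) cancel out, which is not always the same as having half the points on each side; that second idea is the median.

```python
import statistics

# Hypothetical positions of equal weights along a rod (invented numbers).
positions = [1, 2, 3, 10]

mean = statistics.mean(positions)      # the balance point
median = statistics.median(positions)  # half the points on each side

# Signed distances from the mean: points to the left pull negative,
# points to the right pull positive.
deviations = [x - mean for x in positions]
net_pull = sum(deviations)

print("mean (balance point):", mean)
print("median (half on each side):", median)
print("net pull around the mean:", net_pull)
```

The net pull around the mean comes out to zero, which is the "finger under the crochet hook" idea in numbers.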

But you really need to bring in data and then exercise it and plot it. There's no statistical analysis without a data plot. And if you plot all of your data and then run your statistical analysis, the data will make sense. Or if it's wrong, you'll see it. So start with a simple example. Always have that simple example in your head, right?

And then find data that means something to you, right? If you're into sports, do sports. If you're into woodworking, do woodworking. Find something that you can understand the data, because the problem with statistics is not the computation, it's understanding the underlying data. And if it's something you care about, then you'll know about it.

Acquiring business background

So the first one is be curious, be curious, be curious, be curious. Most of you guys, I am guessing, because most smart people were like this, asked a ton of questions when they were kids, right? And around five or six, you were told to stop asking questions. You were in school and there was no time to ask your teacher a million questions.

So go back to your five or six year old brain and ask questions and ask questions and ask questions. Go find someone on your business calls who seems to be interested in the data and say, hey, can you explain this to me? Can you tell me what's going on? Would you walk through it with me? It turns out that business professionals are the same as data professionals. Almost anyone will do anything for a free cup of coffee. And you can do that over Zoom. Send them a Starbucks card over Zoom and say, hey, can I pick your brain for a cup of coffee?

Can I buy you a cup of coffee and pick your brain and ask questions? And then at the end of that, ask who else you should talk to because they'll give you a list of other people to talk to. So just build a tree of people to talk to. But the most, most important thing to do is after you finish taking someone's 20 or 30 minutes, and decide how long it's going to be and end the call at that time, ask them at the end of that call if there's anything you can do for them. Because they almost always say no, so it's a cheap question to ask. And if they don't say no, then you're helpful to them, right? And they will never forget you because almost nobody asks them if there's anything that you could do for them. And then send them a thank you note.

Stimulating creativity and productivity

The first thing I'm going to tell you guys is get eight hours of sleep at night, preferably consecutively, get some exercise and do not work more than eight hours a day on average. You will not be creative if you're constantly working. And the way you get creativity is to stop, stop and do something else.

So I have solved more math problems or other types of problems, learning woodworking. Over COVID, I learned woodworking. And one of my biggest first insights was my daughter, seven years old at the time, looks at me and says, can I use the power tools? She had brothers, for some reason, no one had offered to let her use a power drill before. And I was like, whoa, if my daughter doesn't know how to ask for something, I wonder who on my team isn't asking for anything. I went back and turns out no one on my team was asking for anything.

But the biggest thing with productivity is to stop, stop, take breaks, don't do anything. That does not mean doom scroll and scroll on your phone, like shut your brain down. If you're scrolling on your phone, you're not stopping, you're running on idle and getting nowhere. You're just burning gas, creating fumes and getting absolutely nowhere. So you need to just stop. At some point in your day, you have to say, I'm done, I'm done, I'm stopped. Read a book, crochet, talk to a kid, play pickleball, whatever it is that you do, walk around the block. You'll be way more creative the next day than if you work through a problem.

Let me give one more thing to do, though: pick your optimal time of the day. Pick an hour and a half, get on a tool like Focusmate, decide what you're going to do, and think for an hour and a half. Most managers do less than 11 minutes of work an hour. So someone's done the math: you can do an entire month's worth of work in one day with these power psych focus sessions. And it's 75 minutes. No one can focus for more than about 90, so a 65, 75 minute power session of: I know exactly what I'm going to do, shut off everything on your screen except for that, and focus on it. Then you can go do the meetings because you've already done your day.

Making data science work visible and impactful

So a 20-minute presentation should have no more than four slides and shouldn't have any Greek on it at all. It probably shouldn't have any algorithm names either. I hate to say that, right? I've sat through a lot of presentations where people's goal is to show how smart they are. Your goal is to show how relevant you are. So I talked about business context. There's business context, there's this cool data integration stack, right, which Posit sits into in my environment, and then there is seamless integration. So you need to take your business problem, your cool tech that you've solved, and then show end-to-end how it changes how the business does its work and how it has impact.

And you need to go way to the beginning and way to the end, and you need to talk in terms that senior leaders are going to understand, cost, time, money, quality. And that's it. Nothing else, right? Not cool AI models, not whatever. You can throw in AI every so often because everyone wants to hear AI, but actually at this point, anything that involves a mean computation has been called AI.

But you need to frame it in that larger business context of what is the impact, and then how does it fit into a workflow? Otherwise, you will never have any impact, because there's so much friction that nobody's ever going to turn their back and go off in this tiny little corner and get it back to the bigger picture. So make sure that your work is relevant.

So when I take over a new data science team, I go shopping. I talk to all my friends and say, what do you need help with? And then I find the project that has the most impact with the easiest to get to data and the person who is going to work with me the most. So there's a great book called Cascades on how to drive change. And what he says is, look, don't go worry about changing people who disagree with you. Go find your closest five friends. You get 10% of the organization to agree with you, 10 to 20%, and everyone's going to have to follow.

Data is not true by itself. Everyone's like, oh, the data will speak for itself. It's not true. They're lying to you. Data is only useful with respect to a question. If you ask a question and then go look for data, then you'll find something meaningful. If you just open up data and run an unsupervised learning thing, sure, you'll find something. But what you're going to find is that you're going to find correlations that don't make any sense.

You're going to find that meat eaters have fewer children. That's how I always teach correlation: go look it up, by country, meat consumption inversely correlates with the number of children people have. That's because the minute you educate women, with the exception of yours truly, they stop having children. And meat consumption is a sign of wealth. Wealthier countries have fewer children per capita. But that does not mean that meat consumption is a contraceptive. So now you'll never forget what correlation means.
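The meat-and-children story can be sketched with simulated data. Everything below is synthetic: the "countries", the wealth scale, and the coefficients are made-up assumptions, not real statistics. Wealth drives both variables, and meat consumption has no direct effect on children in the simulation.

```python
import random

random.seed(0)  # fixed seed so the synthetic example is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Synthetic "countries": wealth is the hidden common cause (assumed model).
n = 500
wealth = [random.random() for _ in range(n)]
meat = [100 * w + random.gauss(0, 5) for w in wealth]          # rises with wealth
children = [6 - 4 * w + random.gauss(0, 0.3) for w in wealth]  # falls with wealth

raw = pearson(meat, children)
# Correlation between what wealth does NOT explain in each variable.
adjusted = pearson(residuals(meat, wealth), residuals(children, wealth))

print("raw correlation, meat vs children:", round(raw, 2))
print("correlation after removing wealth:", round(adjusted, 2))
```

The raw correlation is strongly negative, while the correlation after accounting for wealth is close to zero. The data alone would happily report the first number; it takes the question "what else drives both?" to get to the second.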

Breaking down silos

Reach out to the most junior people and talk to them. No one wants to talk to the senior important person, but talk to the not important person, talk to the person who doesn't say anything in the meetings, or who says something and gets shut down, be kind, come up with ways to do favors, buy people coffee, send them Bravo cards or whatever it is that you do within your organization, like be human, right? Most people have stopped being human with the hybrid workplaces. We don't have to be face to face with anyone anymore.

And just be nice to people. That's the first thing. And then be curious and ask questions and ask questions and ask questions. But if you go to someone and you see they're struggling in another silo, say, hey, you know what I saw? You really didn't understand this in a meeting. Let me help you. Can I help you understand where to find the data or how to do something? If you lead with an offer of help, almost no one says no.

And really, at the end of the day, I'm successful at Pfizer and any senior person is successful at Pfizer because of my ability to get things done outside of the official process. It's all about the relationships. It's all about the unofficial relationships. It's not about your title. The person who can help you the most if you need to influence a VP is who? His or her admin. They will stop you from being able to see that person. If you piss them off, forget it.

The tool stack

So my team is sitting on top of a data lake that I believe is Spark, or Snowflake, but I don't actually care where the data lake is because that original data source has been morphed. And we're sitting in Databricks. So we use Databricks to control our data, and then we attach it to Posit Workbench and Posit Connect with GitHub, Jira, Confluence, et cetera, around the outside. And that allows us a couple things. The first of which is the Databricks environment gives us access both to the very controlled, read-only data and to a read-write space that's a common shared space, which is incredibly important because we will never have all of our data inside of the data lake, the official data lake.

We use Posit because it's qualified and validated in our environment. And because I work in a large company, I have to partner with my IT department, and I actually have my environment called SPAR, S-P-A-R. And so Streamlined Platform for Analytics and Reporting is, I think, what SPAR stands for. And I convinced my IT department to own and support that platform. So they own the Databricks workspace, the Posit connections, and the GitHub connections to Posit. And then I use a modified version of their SDLC validation process. So I validate my own code, which is a tenth of the cost and a hundredth of the time of having someone else do it.

And now I've got a fully business-led, business-driven environment that's completely supported by IT, which is kind of like the Holy Grail. Before this, what we would have had to do is go to an IT department and pay them $100,000 per app to push it through a validation process and add no value. And so that meant that we couldn't innovate anything because an app that took four weeks to write cost three employees in India to validate. And I couldn't justify it. So we never validated anything. And now we're validating things left and right.

AI, LLMs, and the future of data science roles

Right now I have business users who I put in front of Databricks. Now I don't put them in front of ChatGPT. I put them in front of the Databricks AI assistant because it's tied to Unity Catalog. So it understands the data and it only writes SQL and Python. But it basically writes SQL and Python for us and they can go and do exploratory data analysis.

So the answer is the business people can go in and do it. The good news is that because the business people can go in and do that, they are able to frame what they need much better. So they can go in and build a proof of concept, a strawman MVP that gets them 80% of what they want, and it's not robust. Then I hand it to my programmers in India and say, here is my proof of concept MVP. Can you please show them how to write functional code, test-driven development, put unit tests, all that type of stuff into it.

The people who were business last are going to have to become business first or they're going to have to become deeply technical. Just people to spit out SQL, no. And the good news, the good side of all of that, you guys, is that just spitting out SQL or Python is boring.

It's fun as an intern, maybe, because, like, I got to do something. But as you get more senior, it's mind numbing. So if you can figure out to Curtis's question of what are the business questions that the business needs to solve and help the business solve those questions and throw a little technology around it, because guess what? That AI assistant that helps my business user helps you, too. Nobody writes code as fast as an AI assistant.

Because no one can keep track of 17 different programming syntax systems and date fields and everything else. So use the AI assistants, embrace them, but embrace the business, because you know what is possible. So you're driving a sports car and you're talking to people in a horse and buggy. You need to explain why they need that sports car. You don't need to drive the sports car. So that's what you can do is you can say, hey, look, I can get you someplace so much faster.

Yes, I want to scare you because coding is no longer a career. It never really was, and it no longer is. As soon as they started outsourcing coding to cheaper geographies, it's not been a career for Americans. What Americans can do and people in the West can do is solve broad problems or wicked problems. I brand myself as a wicked problem solver. I can solve socially complicated problems where no one quite knows what they want, and no one can write a spec, but you can put the pieces together and solve a problem. That's what we can do.

I can outperform 20 programmers because I can tell one person what actually has to be done, while the other 20 are still off running around in circles trying to get meaning out of data that they don't understand.

Validation and open source in pharma

So there is risk-based validation: validating against a particular risk, right? Validating output from an app against a regulatory submission is something Mike Smith and I actually solved like two years ago now. So there's a methodology for that. Call us up. We can tell you what to do. It's not proprietary.

The principles of validation are, say what you're going to do, do what you said, prove you did what you said you were going to do. That's it. Pharma makes validation complicated. And by the way, a manual process is not validated either.
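The three principles map loosely onto a spec, an implementation, and a recorded test run. The sketch below is only an illustration of that shape, not the methodology described in the talk or any regulatory procedure; the function, the spec entries, and the evidence format are all hypothetical.

```python
# 1. Say what you're going to do: a written, checkable specification.
SPEC = [
    # (description, input values, expected result) -- invented examples
    ("mean of a typical list", [2, 4, 6], 4.0),
    ("single value is its own mean", [7], 7.0),
]


# 2. Do what you said: the code under validation (hypothetical function).
def mean_value(values):
    return sum(values) / len(values)


# 3. Prove you did what you said: run every spec item and keep the evidence.
def run_validation(spec, func):
    evidence = []
    for description, inputs, expected in spec:
        actual = func(inputs)
        evidence.append({
            "check": description,
            "expected": expected,
            "actual": actual,
            "passed": actual == expected,
        })
    return evidence


evidence = run_validation(SPEC, mean_value)
all_passed = all(row["passed"] for row in evidence)

for row in evidence:
    print(row)
print("all checks passed:", all_passed)
```

The point of keeping the evidence rows, rather than just asserting, is that the record of expected versus actual is the "prove it" part.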

And then become good, good friends with your process and quality people, because that's what I did. So you get agreement from process and quality, and then you get agreement from your IT department, and you build the relationships. And in my case, what I had to do was agree to do all of my data in a space where the data could be shared. So none of my data goes to S3 buckets. My data is all inside of Databricks. And they're like, look, you play where people can reuse what you're doing, we'll help you stand up a validation process. So it's a negotiation. You need to believe it's possible, and then you need to build the relationships.

And what you guys have to know is that Large Pharma is using open source, period. End of story, done. So it's just a matter of finding your senior leaders to talk to the senior leaders of the companies that are doing it. Get everyone to calm down.

Career advice

Okay, I'm going to give you heartfelt career advice. If you want to get promoted, manage up. If, on the other hand, you actually want to solve important problems, manage down. And the reason I say that is that your ability to solve problems is your ability to go get data from your peers. So managing down assumes your peer level. So take care of the people at your level, take care of the people below you, and they will take care of you back. It might not get you promoted, but it just makes for a better work environment. And I think more impact long term.

And really, it's just a call to action. Learn the business, learn the business, learn the business. Once you've learned one business, it's like a programming language. Learning a second business is the same. They all have the same problems. They just change the nouns and a couple of verbs. It's the same sentence structure. I've worked in everything except for supply chain in my career. And I think it's well worth coming up to speed in the business. Don't think of it as a waste of time.