Data Science Hangout | Rami Krispin, Apple | Building your Personal Brand in Data Science

Transcript

This transcript was generated automatically and may contain errors.

Welcome, everyone, to the Data Science Hangout. This is an open space for current and aspiring data science leaders to connect and chat about some of the more human-centric questions that you all have around leadership. And we'll focus on questions that are most important to you all. So I just want to say there's three ways that you can ask questions. You can jump in live. You can also put a question in the Zoom chat. And I can call on you and pass the mic over to you. But we also have a Slido link as well, which Robert will share in just a second here. And there you can ask anonymous questions as well. But just a quick note that the session will be recorded and shared up to YouTube for anyone who's missed it as well.

And I am so excited to be joined by my co-host for today, Rami Krispin. I am very grateful to Rami, who jumped in here at the last minute. He manages a data science team focused on cost optimization and capacity planning. He's also the author of a hands-on time series analysis book and several R packages. I think many of you may have seen Rami on LinkedIn and Twitter and from various presentations that he's given to the community as well. But I'd love to turn it over to you, Rami, and have you introduce yourself and chat a bit about some of the work that you do.

Thank you, Rachel. Thank you for the invite, and thank you, everyone, for joining. So I'm Rami Krispin. I'm a data science manager at Apple. I'm not representing Apple in this talk, and I cannot talk about my work at Apple. Unfortunately, due to the short notice, I didn't have the time to go through our legal approval. But I will be happy to answer any general questions about data science and, you know, the workflow and life of a data scientist. As Rachel mentioned, I'm really active in the open source community, and since COVID started, mainly in creating packages that are related to COVID-19 data. Okay, and thank you for the invite. Happy to be here.

Definitely. Thank you, Rami. And I know we had just a few minutes to connect before this as well, but would really love to kick off the discussion with asking what is something that you're most excited about in data science right now?

I would say probably the boring stuff that's going on in data science: the tools that are coming that are related to production. Like, you know, Docker and GitHub Actions, and the combination with R. And it's amazing how simple it is: today you can take those free and open source tools and put your R code in production. So, I'm really excited that those tools are just coming. Yeah, I'm less excited about, you know, the other stuff, the more sexy stuff, the deep learning. I think it's also important. But the productionizing of your code, I think that's what I'm really excited about.

If you have certain skills, but you're not rebranding yourself, you can easily get lost. So, I think that the rebranding is super important.

That's one of the things that drove me to start to go to open source, and then I also enjoyed it. So, yes, I have a few people that I'm mentoring, and I always advocate, you know, make sure that you are showing the world your skills. You know, you can be very talented, but if you're not telling it and you're not sharing it, it's very hard for people to recognize it.

Do you think one of the best ways of doing that is also contributing to R packages, or what way was most successful for you?

So, I think the person that hired me is on this call, and I remember when I joined my role as a data scientist at Apple, the first thing he told me was that, you know, 1,000 people applied for this role, but you were the only one where we could go and see on your GitHub that you know time series and forecasting. So, I think it's the visibility; it provides great visibility. If you are interested in going into some domain, let's say that you are interested in classification or regression, then write about it. You don't need to start by developing packages. You can start with, you know, there is the daily coding on LinkedIn, which is great. I enjoy seeing what those junior people, who are either in school or right after school, are doing. Some of them are very creative, and it provides a lot of visibility, and it also opens eyes about their work, and it also challenges you to learn new stuff. So, yes, it is one way, but it's not the only way; you know, there's writing articles. So, I think that's what worked for me. You know, certain people have certain talents, so you should feel what works for you and then go with this route. But I think in any way, creating the visibility is important.

I'll jump in if that's all right. I think I 100% agree. I think, I mean, Rami's spot on with this. I've certainly found like in, for my career, that being in like healthcare and like working with like sensitive data, that ends up being a little bit difficult with like what people can kind of expose on GitHub. Because a lot of my contributions have always been behind the corporate GitHub and in private repositories for obvious security reasons, which makes it kind of difficult. That isn't to say I couldn't like do something and be public about it. And I think that that would be the advice I would give to people. I always like to see, as Rami says, people's kind of like pet projects, no matter how big or how small. But I wanted to just add that nuance that sometimes people come from industries that are a little bit more difficult, like the data might not be so readily available and stuff like that.

Yeah. And this is sometimes challenging to find people that are in this world that they cannot share their code, or they don't have the time just to go and, you know, create their own GitHub, fancy GitHub. So, you know, when you're reading resumes, you should also not bias yourself only by looking at the GitHub, but also try to find between the lines if there are other stuff that those people are good at, or you can identify the skills that you're looking for. But definitely having a GitHub, it's very... I think GitHub is the new resume.

I think that schools should encourage students to use GitHub and those tools more, because this is the future. It's not just the present. So, definitely. And one of the things, you know, I think that beyond the selfish side of contributing to open source, because you want to build your stuff, it's also very noble to help others by building stuff. I really believe in open source. I think it's a great thing to be part of this community, no matter the language, and in particular, the R language. Being part of the open source community, I think it's something I found very noble.

One anonymous question was, how do you get more comfortable with sharing your work and opinions on social media? It seems pretty important in today's world.

So, first, I keep a separation between my work and my open source contributions. And being part of a big corporation, anything that I'm putting on open source, I need to get legal approval for from my company. But going back to the question, I'm doing stuff that I find interesting. And I'm also trying not to do stuff that might be controversial. Like, for example, when I started to do the COVID-19 package, I just started with a time series. It was February 2020, when it was mainly in China. And I thought, okay, I enjoy doing data packaging. It's very common in R. I had some other packages for electricity in the U.S. and U.K., and this seemed to be another time series. And maybe we can do some forecasts. But then, after a few days, I started to realize that it's something I don't want to forecast, you know, because it could be sending the wrong message. And I should be careful, because a pandemic is different from electricity. And I don't want to mislead people, where people will say, oh, this forecast is going in this direction, and use it. So it just stayed in the domain of, this is just data. Each one can do whatever they want. I'm not going to do any prediction. It's not my role. I realized that there are other folks for whom this is their profession, how to predict a pandemic. So, you know, it's the combination of finding a way to do whatever you think is right to do, but avoiding the stuff that you don't know if you have a good answer to. You know, use the data correctly. Like, when I see in the news people using data the wrong way, it's kind of an example of what we as data scientists, or people that work with data, should avoid.

Communicating forecasts and model limitations to stakeholders

Yeah, Rami, thanks for your insights here. One of the things you mentioned, and I think I got this right. And if not, please correct me. But you were saying you create forecasts. And then sometimes the users will say, well, what's going to happen in the future? Can you make tweaks to your forecast? Because I want to consider this in there, too. And, right, I think what you said is your forecast is based on historical data and history. And there's this methodology. And you don't want to go in and manually really dance around too much, because then it breaks your methodology. And if that's what you said, do you have any tips or tricks for us when we're communicating to folks the difference between like, hey, I have this model, I have this method, it works well, we've backtested. And if I start going and taking your intuition and assumptions, user, and like messing around with that, it's going to kind of break how it could be used by multiple parties.

I think that the first thing is that you need to be honest about what the forecast is about, what the limitations are, and where it's going to fail. For example, a forecast is looking at the historical data and using it to predict. But sometimes, you know, like we saw with COVID, right, if you look at some of the economic indicators that either went really up, like unemployment, or went really down, like flights, or stuff like this, no matter what, your historical data cannot help you. So I think, in the first place, setting the expectation of what the forecast can do is important. Definitely make sure that you're also providing confidence intervals. On top of that, going back to your question, I would say that typically, as a data scientist, you want to work with some business analysts that have context about the future, because as data scientists we are only looking at the past. We don't have this context. Let's say you're working in sales on some product, and you know that in the future there's going to be some campaign, or a new product is going to affect the sales of your specific product. This is where they will do the manual adjustment. And, you know, they will have to justify to the business why they made those adjustments. I think when there is big uncertainty, it's good to have more than one option. So you can also incorporate this into some regression model by adding some flags, and then say, what if I move this flag up a little bit or down. Think about it like some kind of spline, where you can make the edge longer or shorter to describe the different scenarios. So I think having a kind of what-if analysis, when the historical data is not enough, is probably my favorite option. But I think you also need to be honest. And be candid about the limitations of the forecast, because, you know, some people think that science is magic. And I think it's important to set the expectation that it's not precise; it is science, and there are limitations, right? It can tell us some story, but, you know, the future is sometimes different.
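As a rough illustration of the what-if idea described above, here is a minimal Python sketch. It is not Rami's actual method or any code from the Hangout. It assumes synthetic monthly sales data, a hypothetical campaign flag as the regression input, and a simple approximate 95% band built from the residual standard deviation (which ignores parameter uncertainty). All names here (`fit_ols`, `design`, the scenario labels) are made up for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: return coefficients and residual std. dev."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma = np.sqrt(resid @ resid / dof)
    return beta, sigma

def design(t, flag):
    """Design matrix: intercept, linear trend, campaign flag (hypothetical)."""
    t = np.asarray(t, dtype=float)
    flag = np.asarray(flag, dtype=float)
    return np.column_stack([np.ones_like(t), t, flag])

# Synthetic history: 36 months of sales with three past campaign months.
rng = np.random.default_rng(0)
t_hist = np.arange(36)
flag_hist = np.zeros(36)
flag_hist[[5, 17, 29]] = 1.0
y_hist = 100 + 2.0 * t_hist + 15.0 * flag_hist + rng.normal(0, 3, 36)

beta, sigma = fit_ols(design(t_hist, flag_hist), y_hist)

# What-if scenarios for the next 6 months: the analyst supplies the flags.
t_future = np.arange(36, 42)
scenarios = {
    "no_campaign": np.zeros(6),
    "campaign_in_month_3": np.array([0, 0, 1, 0, 0, 0], dtype=float),
    "campaign_every_month": np.ones(6),
}

forecasts = {}
for name, flag in scenarios.items():
    point = design(t_future, flag) @ beta
    # Approximate 95% band from residual spread only (an assumption).
    forecasts[name] = {
        "point": point,
        "lower": point - 1.96 * sigma,
        "upper": point + 1.96 * sigma,
    }

for name, f in forecasts.items():
    print(name, np.round(f["point"], 1))
```

The point of the design is that the model stays fixed while the business analyst changes only the future flag values, so each scenario is a transparent input choice rather than a manual edit of the forecast itself.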

Yeah. I love that. I think a mistake we often make is that we go through classes and we go through courses. And we believe that there's a problem out there. And then there's one model to solve that problem. Like, what technique do I know that I can apply to this problem? And then like, we'll call it a day, we solved it. But what you said reminds me of the idea of, like, we talk a lot about ensemble models, but there's also like the idea of ensemble perspectives. So if we're working with users and we like, Hey, yeah, so there's a forecast model here. It has its limitations. Let's be honest about that. Like you said, but then here's another way to look at it. And here's another way to look at it. I think that is where us as a group on this call can really turn the tide and get users on board.

And I think typically you see the data scientists work better when they have some business analysts that help them to articulate the problem and guide them with, well, some of the insights about the business that just looking at it, you might not have. So I think it's important to combine and get this help with someone from the business.

Explainability of models

Yeah, absolutely. First of all, thank you for your fantastic insights, Rami. My name is Olli. I'm working with data at the Institute of Reykjavik. I had a question regarding when you operationalize models. I face that often when we have to have a discussion with our stakeholders. What are these models telling us? Like the explainability of models, like inferring how the predictors are actually affecting your response. And I find that to be a fascinating subject. Like when you need to work with stakeholders, you need to have a discussion on, hey, did you know that this predictor is influencing your response in this way? And I can show you in a graph, like if it would be a linear model, it's easy to explain. But I'm curious about this in general, and especially in time series modeling with what you're doing. Like, do you have to have these discussions with your stakeholders at Apple? Do you need to present the outputs of the models in a way that shows how the predictors are influencing the response? Do you show that graphically or in a story?

That's a good question. Without referring to my work or anything like this, generally, I think there are going to be two types of stakeholders. The first type doesn't necessarily care about the type of model. They care about the results, and they want to see something like a forecast that they can use. The other type of stakeholders actually want to know more about the models. They may not be statisticians or have a background in modeling, but they either need to articulate it to their managers or they need to present it and use it, so they want to make sure that they understand the model and the methodology. So I think in this case, it's nice to take the time and have a meeting to explain it to them: I'm using three types of models. For example, I'm using ARIMA; this is how an ARIMA model works, or this is how a linear regression is used. If you are an R user, it's like coming with a Shiny example where you can say, if I now change this parameter, this is how it affects the output. This is how it calculates seasonality. I think that the best tool for explaining things to people is data visualization, combined with some Shiny application or something similar where they can see the interaction of the model with the data. And when you change some parameter, they see how it affects the output. I think then it's easy to explain and get them on board.

And it presents you with an opportunity to have a discussion with your stakeholders in a way that you can both relate to. And because in my, at least in my work, I'm not an expert in the business domains of my stakeholders, but I'm an expert in data. So we need to have this common ground to have a conversation. And I feel like this is the way in. And not to be overly technical, but presented in a graph and tuning things, it can lead to further insights into the underlying processes in the data, like physical processes.

Scoping data science projects

And congrats on the new role as well. Thank you.

Yes, so if I understand correctly, the question is, what is, you know, the transition to management? What are the...

So, you know, if you are carrying with you the projects that you managed, where you were part of the team that did those projects before as an individual contributor, I think you should make sure that you find someone who will continue them. This is where I found myself continuing to do what I did before, carrying my projects, but also starting to be a full-time manager. And then, in terms of time management, it was very challenging. So I think thinking about the transition of your past projects, assuming that you still maintain them, is important. And think about your time, right? Maybe, at least at the beginning, you're trying to do both your previous tasks and also the new tasks. And there is a limit to how much you can work crazy hours, and over a long time you will burn yourself out. So that's, I feel, my main advice, right: make sure there is a good transition, and focus on the important stuff as a manager, like, you know, making sure that your team has whatever they need to succeed.

Did you always have those managing skills or people skills or did you have to learn them?

I think I had some of them, and, you know, you're always learning new stuff. Before my life as a data scientist, I did some management roles. So working with people and, you know, the ability to manage a team is something that I think I had before, but data science is also different. So it was also a learning experience on my end. I think, for example, scoping a data science project was something that I learned. So there are some skills that I came with and some skills that I learned through the process, mainly related to scoping.

What does scoping a data science project look like?

I think the first thing is to understand the business problem and make sure that the business problem is defined correctly. And going back to, you know, the question from Frankie about working with forecasts, you need to have some business analyst, or someone, you know, whatever the term, business analyst or data analyst, that defines what the problem is that we are trying to solve. Without it, it's easy to get lost. So, you know, scoping is translating the business problem into data science features. And from my perspective, there are two challenges in a data science project. One is to define the business problem. The second one, which is more challenging, is to find the data that is required to solve the business problem. If you are able to solve those two, the modeling part is, you know, the easy part, and it's like a 70/30 or 80/20 time allocation between the data and the modeling. That's, I think, the reality. So if you don't articulate the business problem correctly, it's easy to get lost. And I think that's kind of the most important part when you're starting a project.

Simplicity and putting models into production

Not necessarily causal inference, but more, you know, the users that try to understand the mechanism of the model. And this is where I would go and sit with that stakeholder and explain my methodology. I will try to make it as simple as possible. I always advocate, you know, if you can use something simple, use it. It's not necessary to be fancy. More fancy does not necessarily mean better results. So, yeah, trying to explain how regression works and how we use it to solve some problem, or a different model, is something that I would do. And again, the best method is to create an R Markdown document with the data, going through the process, and use some Shiny components.

So, you know, usually when I'm talking at meetups on production, which I did, I had two of those in the last two weeks, I would say that if someone comes to me and says, I can create fancy deep learning models or some machine learning models, that's great. But if someone comes to me and says, I can deploy my linear regression in production, this is gold. So I think that, you know, in the data science world today, at least, it's more important that you are able to take your work into production than to do fancy stuff. If I look at my progress as a data scientist, I feel that I had a normal curve, like any other data scientist in the room here: when I was just starting as a really junior, I was into deep learning and machine learning, and I was curious about it. And over time, I realized that if I need to run a model for one hour or two hours to optimize it, and probably get the same results as I would by running a linear regression, it's probably not worth it just for the sake of saying that I did something fancy. And if I cannot communicate it to others, like, you know, deploy it in a dashboard or deploy it somewhere, it's worthless, because, you know, you're not getting the recognition. So today I think being a full stack data scientist who can take their work into production is super important.
