Data Science Hangout | Rami Krispin, Apple | Building your Personal Brand in Data Science
Transcript
This transcript was generated automatically and may contain errors.
Welcome, everyone, to the Data Science Hangout. This is an open space for current and aspiring data science leaders to connect and chat about some of the more human-centric questions that you all have around leadership. And we'll focus on questions that are most important to you all. So I just want to say there's three ways that you can ask questions. You can jump in live. You can also put a question in the Zoom chat. And I can call on you and pass the mic over to you. But we also have a Slido link as well, which Robert will share in just a second here. And there you can ask anonymous questions as well. But just a quick note that the session will be recorded and shared up to YouTube for anyone who's missed it as well.
And I am so excited to be joined by my co-host for today, Rami Krispin. I am very grateful to Rami, who jumped in here at the last minute. He manages a data science team focused on cost optimization and capacity planning. He's also the author of a hands-on time series analysis book and several R packages. I think many of you may have seen Rami on LinkedIn and Twitter and from various presentations that he's given to the community as well. But I'd love to turn it over to you, Rami, and have you introduce yourself and chat a bit about some of the work that you do.
Thank you, Rachel. Thank you for the invite and thank you everyone for joining. So I'm Rami Krispin. I'm a data science manager at Apple. I'm not representing Apple in this talk, and I cannot talk about my work at Apple. Unfortunately, due to the short notice, I didn't have the time to go through our legal approval. But I will be happy to answer any general questions about data science and, you know, the workflow and life of a data scientist. As Rachel mentioned, I'm really active in the open source community and, since COVID started, mainly on creating packages that are related to COVID-19 data. Okay, and thank you for the invite. Happy to be here.
I would say probably the boring stuff that's going on in data science, the tools that are coming that are related to production. Like, you know, Docker and GitHub Actions, in combination with R. And it's amazing how simply today you can take those free and open source tools and put your R code in production. So, I'm really excited that those tools are just coming. Yeah, I'm less excited about, you know, the other stuff, the more sexy stuff, the deep learning. I think it's also important. But productionizing your code, I think that's what I'm really excited about.
Transitioning from individual contributor to manager
While we're waiting for a few questions to come in from the audience, I know something that's come up quite a bit on these calls is how leaders handle that transition from an individual contributor to a manager role. And I know this is something that you've experienced as well. And I'm curious to chat more about this and what it's been like for you.
Yeah, so I think, you know, the starting point is that I love to code. And as an individual contributor, you do a lot of coding, especially in R. And when I started the transition to manager, I thought that I could continue to code, not as much as before, but at least, you know, half of my time. And then during the transition, I realized that you need to give away some of your coding. And, you know, for someone that has a passion for coding, that trade-off took me a while to recognize: I just need to do less of the coding and focus on the important stuff. You don't want to be the bottleneck where people are waiting for PRs and you're just submitting all day. And, you know, so that was the tough part. But then, as a manager, I think your role is to help the other data scientists do their work. So I think it's also important. So that's, I think, the trade-off when you're moving to do data science scoping and general project management. But over time you are doing less coding, in my experience.
That was, you know, my free time. Yeah, I started to do more coding in my free time. It's like you're afraid to lose the connection with the ground, right? You are starting to do other stuff and you're disconnecting from what's happening on, you know, the front line, where the important stuff is happening. I enjoy it. So it's the combination of the two. But I think over time it will reduce. That's life, right? You need to recognize that, if you want to go to management, that's the cost that you need to pay.
I knew from the first place that, you know, from a career perspective, this is where I want to go. But I think I didn't realize it. You know, when I saw my manager, who is not a data scientist, in meetings all day, I thought that's because he's more from the business side, but that as a data science manager, you're still going to spend part of the time coding and be involved. But then in a really short time, I realized that I spent most of my time in meetings. So, for me, it was a process of one year doing this transition until, at some point, I told my manager, I cannot continue to maintain the projects that I maintained before. I need to be a full-time manager so as not to be the bottleneck for my team. So I feel that's kind of the year of transition that I had until, you know, I can say that now I'm fully a data science manager, doing less coding.
What would you say is one of the most difficult things then of, of being a manager?
I think that you need to understand the business better. As a manager, more is expected of you than before as an IC. So it's another transition. And yeah, I think that's kind of, you know, additional: to understand the bigger picture. It was also something that needed to be done.
Hiring and evaluating candidates
So generally, you know, that's my opinion. When you have a team of R users, at a certain point you want to hire people, and during the recruiting you don't want to spend the time evaluating the skills. You want to know from the start of the interview process that those people are good R programmers, so you can focus on the important stuff, right, the fit for the team and how people take a business problem and translate it into a data science solution. So I feel like it's important, when you are hiring people, to be able to articulate that, starting with the job description: to write a very clear job description of what you are looking for, for example, R users, and what the skills are.
And then, you know, I'm looking at their GitHub, if there is one. If not, I'm trying to find good evidence about their R background. And this is one of the challenging things, that not all people have the time to go and maintain a GitHub. But having a GitHub provides good visibility about some skills. And the other thing is, after we leave the R skills and the statistical skills aside, the first question, the way I think of it, is: is this person someone I want to sit with in the morning and have coffee? It is the general fit for the team, right? Then the ability to understand the business problem is important, and being able to be independent. So, those are the things that, no matter the role, I think are important and that you should look at.
Building your personal brand in data science
Yeah. So, you know, we talked about it before. So, with a bit of background, when I graduated with my master's, I was originally planning to go for a PhD. And at the last moment, I decided not to continue. And then I was, oh my God, I need to find a job. And it was kind of short notice. And I found that, you know, I wasn't prepared. So, I realized that when you're applying to tons of jobs and, you know, barely getting a response, the way to get visibility is to rebrand myself. If you have certain skills, but you're not rebranding yourself, you can easily get lost. So, I think that the rebranding is super important. You know, it could be through open source, social media, and other platforms.
That's what drove me, or one of the things that drove me, to start to go to open source and then also enjoy it. So, yes, I have a few people that I'm mentoring and I always advocate, you know, make sure that you are showing the world your skills. You can be very talented, but if you're not telling it and you're not sharing it, it's very hard for people to recognize it.
So, I think the person that hired me is on this call, and I remember when I joined my role as a data scientist at Apple, the first thing he told me was that, you know, 1,000 people applied for this role, but you were the only one where we could go and see on your GitHub, you know, time series and forecasting. So, I think it provides great visibility. Let's say that you are interested in some domain, classification or regression. You can write about it; you don't need to start by developing packages. You can start with, you know, there is the daily coding on LinkedIn, which is great. I enjoy seeing what those junior people, either in school or right after school, are doing. Some of them are very creative, and it provides a lot of visibility, it opens eyes about their work, and it also challenges you to learn new stuff. So, yes, it is one way, but it's not the only way; you know, there is writing articles. So, I think that's what worked for me. You know, certain people have certain talents, so you should feel what works for you and then go with this route. But I think in any way, creating the visibility is important.
I'll jump in if that's all right. I think I 100% agree. I think, I mean, Rami's spot on with this. I've certainly found like in, for my career, that being in like healthcare and like working with like sensitive data, that ends up being a little bit difficult with like what people can kind of expose on GitHub. Because a lot of my contributions have always been behind the corporate GitHub and in private repositories for obvious security reasons, which makes it kind of difficult. That isn't to say I couldn't like do something and be public about it. And I think that that would be the advice I would give to people. I always like to see, as Rami says, people's kind of like pet projects, no matter how big or how small. But I wanted to just add that nuance that sometimes people come from industries that are a little bit more difficult, like the data might not be so readily available and stuff like that.
Yeah. And this is sometimes challenging to find people that are in this world that they cannot share their code, or they don't have the time just to go and, you know, create their own GitHub, fancy GitHub. So, you know, when you're reading resumes, you should also not bias yourself only by looking at the GitHub, but also try to find between the lines if there are other stuff that those people are good at, or you can identify the skills that you're looking for. But definitely having a GitHub, it's very... I think GitHub is the new resume.
I think that in schools, they should encourage the students to use GitHub and those tools more, because this is the future. It's not just the present. So, definitely. And one of the things, you know, I think that beyond the selfish side of contributing to open source, because you want to build your stuff, it's also very noble to help others by building stuff. I really believe in open source. I think it's a great thing to be part of this community, no matter the language, and in particular, the R language. Being part of the open source community is something I find very noble.
Sharing work and opinions on social media
So, first, I'm doing a separation between my work and my open source contributions. And being part of a big corporation, anything that I'm putting on open source, I need to get legal approval for from my company. But going back to the question, I'm doing stuff that I find interesting. And I'm also trying not to do stuff that might be controversial. Like, for example, when I started to do the COVID-19 data, I just started with a time series. It was February 2020, when it was mainly in China. And I thought, okay, I'm enjoying doing data packaging. It's very common in R. I had some other packages for electricity in the U.S. and U.K., and this seemed to be another time series. And maybe we can do some forecasts. But then, after a few days, I started to realize that it's something I don't want to forecast, because it could be sending the wrong message. And I should be careful, because a pandemic is different from electricity. And I don't want to mislead people, where people will say, oh, this forecast is going in this direction, and use it. So it just stayed in the domain of, this is just data. Each one can do whatever they want. I'm not going to do any prediction. It's not my role. I realized that there are other folks whose profession this is, how to predict a pandemic. So, you know, it's the combination of finding a way to do whatever you think is right to do, but avoiding the stuff that I don't know if I have a good answer to. You know, use the data correctly. When I see in the news people using data the wrong way, it's kind of an example of what we as data scientists, or people that work with data, should avoid.
Communicating forecasts and model limitations to stakeholders
Yeah, Rami, thanks for your insights here. One of the things you mentioned, and I think I got this right. And if not, please correct me. But you were saying you create forecasts. And then sometimes the users will say, well, what's going to happen in the future? Can you make tweaks to your forecast? Because I want to consider this in there, too. And, right, I think what you said is your forecast is based on historical data and history. And there's this methodology. And you don't want to go in and manually really dance around too much, because then it breaks your methodology. And if that's what you said, do you have any tips or tricks for us when we're communicating to folks the difference between like, hey, I have this model, I have this method, it works well, we've backtested. And if I start going and taking your intuition and assumptions, user, and like messing around with that, it's going to kind of break how it could be used by multiple parties.
I think that the first thing is that you need to be honest about what the forecast is about, what the limitations are, where it's going to fail. For example, a forecast is looking at the historical data and using it to predict. But sometimes, you know, like we saw with COVID, right, if you look at some of the economic indicators that either went really up, like unemployment, or went really down, like flights, or stuff like this, no matter what, your historical data cannot help you. So I think, in the first place, setting the expectation of what the forecast can do is important. Definitely make sure that you're also providing confidence intervals. On top of it, back to your question, I would say that typically, as a data scientist, you want to work with some business analysts that have context about the future, because as data scientists we are only looking at the past. We don't have this context. Let's say you're working in sales on some product, and they know that in the future there is going to be some campaign, or a new product is going to affect the sales of your specific product. This is where they will do the manual adjustment. And, you know, they will have to advocate to the business why they made those adjustments. I think when there is big uncertainty, it's good to have more than one option. So you can also incorporate it into some regression model by adding some flags, and then say, if I move this flag up a little bit or down. Think about some kind of spline, where you can say the spline is going to be longer, you know, you're creating a longer edge or a shorter edge, and lay out the different scenarios. So I think having a kind of what-if analysis, when the historical data is not enough, is probably my favorite option. But I think you also need to be honest.
And be cautious about what the limitations of the forecast are, because, you know, some people think that science is magic. And I think it's important to set the expectation that it's not precise. It is science, and there are limitations, right? It can tell us some story, but, you know, the future sometimes is different.
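The idea of adding flags to a regression model and comparing scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's actual method: the sales history is made up, the helper names (`solve_linear_system`, `fit_ols`, `forecast`) are hypothetical, and the model is a plain least-squares fit of a linear trend plus a single campaign flag.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations X'X beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def forecast(beta, start, flags):
    """Project intercept + trend + flag effect for each future period."""
    return [beta[0] + beta[1] * (start + i) + beta[2] * f
            for i, f in enumerate(flags)]


# Made-up history: 24 months of sales with a campaign in four of them.
campaign_months = {5, 6, 17, 18}
history_t = list(range(24))
history_flag = [1 if t in campaign_months else 0 for t in history_t]
history_sales = [100 + 2 * t + 15 * f for t, f in zip(history_t, history_flag)]

rows = [[1.0, float(t), float(f)] for t, f in zip(history_t, history_flag)]
beta = fit_ols(rows, history_sales)  # [intercept, trend, campaign effect]

# What-if scenarios supplied by the business: no, short, or long campaign.
scenarios = {
    "no_campaign": [0, 0, 0, 0, 0, 0],
    "short_campaign": [1, 1, 0, 0, 0, 0],
    "long_campaign": [1, 1, 1, 1, 0, 0],
}
forecasts = {name: forecast(beta, 24, flags) for name, flags in scenarios.items()}

for name, values in forecasts.items():
    print(name, [round(v, 1) for v in values])
```

The model itself is fit once on history; only the future flag values change between scenarios, so the business context enters as an explicit, documented input rather than as a manual edit to the forecast numbers.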
Yeah. I love that. I think a mistake we often make is that we go through classes and we go through courses. And we believe that there's a problem out there. And then there's one model to solve that problem. Like, what technique do I know that I can apply to this problem? And then like, we'll call it a day, we solved it. But what you said reminds me of the idea of, like, we talk a lot about ensemble models, but there's also like the idea of ensemble perspectives. So if we're working with users and we like, Hey, yeah, so there's a forecast model here. It has its limitations. Let's be honest about that. Like you said, but then here's another way to look at it. And here's another way to look at it. I think that is where us as a group on this call can really turn the tide and get users on board.
And I think typically you see the data scientists work better when they have some business analysts that help them to articulate the problem and guide them with, well, some of the insights about the business that just looking at it, you might not have. So I think it's important to combine and get this help with someone from the business.
Explainability of models
Yeah, absolutely. First of all, thank you for your fantastic insights, Rami. My name is Olli. I'm working with data at the Institute of Reykjavik. I had a question regarding when you operationalize models. And so I face that often when we have to have a discussion with our stakeholders. What are these models telling us? Like the explainability of models, like inferring how the predictors are actually affecting your response. And I find that to be a fascinating subject. Like when you need to work with stakeholders, you need to have a discussion on, hey, did you know that this predictor is influencing your response in this way? And I can show you in a graph, like if it would be a linear model, it's easy to explain. But I'm curious about this in general, and especially in time series modeling with what you're doing. Like, do you have to have these discussions with your stakeholders at Apple? Do you need to exhibit the outputs or the models in a way that shows how the predictors are influencing the response? Do you show that graphically or in a story?
That's a good question. I'll answer without referring to my work or anything like this. But generally, I think there are going to be two types of stakeholders. The first type doesn't necessarily care about the type of models. They care about the results and they want to see something, like some forecast, that they can use. The other type of stakeholders actually want to know more about the models. They may not be statisticians or have a background in modeling, but they either need to articulate it to their managers or they need to present it and use it, so they want to make sure that they understand the model and the methodology. So I think in this case, it's nice to take the time and have a meeting to explain it to them: I'm using these types of models. So, for example, I'm using ARIMA; this is how an ARIMA model works, or this is how a linear regression works. If you are R users, come with a Shiny example where you can say, if I'm now changing this parameter, this is how it affects the output. This is how it calculates seasonality. I think that the best tool to explain things to people is data visualization, if you combine it with some Shiny application or something similar, so that they can see the interaction of the model with the data. And when you change some parameter, they see it affect the output. I think it makes it easy to explain and get them on board.
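As a rough stand-in for that kind of interactive demo, the sketch below shows how changing one parameter changes a forecast. It is an assumption-laden illustration, not anything from the speaker's work: it uses a first-order autoregressive model, AR(1), which is the simplest autoregressive piece of the ARIMA family, and the function name `ar1_forecast` and the numbers are hypothetical. In a Shiny-style app, `phi` would be the slider.

```python
def ar1_forecast(last_value, mean, phi, horizon):
    """Forecast an AR(1) process h steps ahead.

    Each step pulls the forecast toward the long-run mean; phi controls
    how much of the last deviation from the mean is carried forward.
    """
    return [mean + (phi ** h) * (last_value - mean)
            for h in range(1, horizon + 1)]


# Same data, three parameter values: the forecast path changes shape.
for phi in (0.0, 0.5, 0.9):
    path = ar1_forecast(last_value=120.0, mean=100.0, phi=phi, horizon=5)
    print(f"phi={phi}: " + ", ".join(f"{v:.1f}" for v in path))
```

With `phi` near 0 the forecast jumps straight back to the mean, and with `phi` near 1 it stays close to the last observed value, which is the kind of behavior a stakeholder can see directly when the parameter is exposed as a control.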
And it presents you with an opportunity to have a discussion with your stakeholders in a way that you can both relate to. And because in my, at least in my work, I'm not an expert in the business domains of my stakeholders, but I'm an expert in data. So we need to have this common ground to have a conversation. And I feel like this is the way in. And not to be overly technical, but presented in a graph and tuning things, it can lead to further insights into the underlying processes in the data, like physical processes.
Scoping data science projects
And congrats on the new role as well. Thank you.
So, you know, if you are carrying with you the projects that you managed, where you were one of the people working on those projects before as an individual contributor, I think you should make sure that you find someone that will continue them. This is where I found myself continuing to do what I did before, carrying my projects, but also starting to be a full-time manager. And then, in terms of time management, it was very challenging. So I think thinking about the transition of your past projects, assuming that you still maintain them, is important. And think about your time, right? Maybe, at least at the beginning, you're trying to do both your previous tasks and also the new tasks. And there is a limit to how much you can work crazy hours, and over a long time you will burn yourself out. So that's, I feel, my main advice, right: make sure there is a good transition and focus on the important stuff as a manager, like, you know, making sure that your team has whatever they need to succeed.
Did you always have those managing skills or people skills or did you have to learn them?
I think I had some of them and, you know, you're always learning new stuff. Before my life as a data scientist, I did some management roles. So working with people and, you know, the ability to manage a team is something that I think I had before, but data science is also different. So it was also a learning experience on my end. For example, scoping a data science project was something that I learned. So there are some skills that I came with and some skills that I learned through the process, mainly related to scoping.
What does that look like, scoping a data science project?
I think the first thing is to understand the business problem and make sure that the business problem is defined correctly. And going back to the question from Frankie about, you know, working with forecasts, you need to have some business analyst, or someone, you know, whether the term is business analyst or data analyst, that defines what the problem is that we are trying to solve. Without it, it's easy to get lost. So, you know, scoping is translating the business problem into data science features. And from my perspective, there are two challenges in a data science project. One is to define the business problem. The second one, which is more challenging, is to find the data that is required to solve the business problem. If you are able to solve those two, the modeling part is, you know, the easy part, and it's like a 70/30 or 80/20 time allocation between the data and the modeling. That's, I think, the reality. So if you don't articulate the business problem correctly, it's easy to get lost. And I think that's kind of the most important part when you're starting a project.
Simplicity and putting models into production
Not necessarily causal inference, but more, you know, the users that try to understand the mechanism of the model. And this is where I would go and sit with that stakeholder and explain my methodology. I'll try to make it as simple as possible. I always advocate, you know, if you can use something simple, use it. It's not necessary to be fancy. More fancy does not necessarily mean better results. So, yeah, trying to explain how regression works and how we use it to solve some problem, or a different model, is something that I would do. And again, the best method is to create an R Markdown document with the data, going through the process, and use some Shiny components.
So, you know, usually when I'm talking at meetups on production, and I had two of those in the last two weeks, I would say that if someone comes to me and says, I can create fancy deep learning models or some machine learning models, that's great. But if someone comes to me and says, I can deploy my linear regression in production, this is gold. So I think that, you know, in the data science world today, at least, it's more important to be able to take your work into production than to do fancy stuff. If I look at my progress as a data scientist, I feel that I had a normal curve, like any other data scientist in the room here. When I was just starting as a really junior, I was into this deep learning and machine learning and I was curious about it. And over time, I realized that if I need to run a model for one hour, two hours to optimize it and probably get the same results as I would running a linear regression, it's probably not worth it just for the sake of saying that I did something fancy. And if I cannot communicate it to others, like, you know, deploy it in a dashboard or deploy it somewhere, it's worthless because, you know, you're not getting the recognition. So today I think being a full stack data scientist who can take their work into production is super important.
Prioritizing problems and iterating
Just talking to lots of data scientists and organizations, it seems that often, not always, but often the highest value problems tend to be really complex and then sort of the dichotomy of the simpler problems, but they tend to be lower value. So I'm just curious how you navigate that. How do you make those priorities to make sure that you tackle, because you touched on it before, that's what made me think of it, you talked about the high, sort of the big problems. How do you make sure that the organization is actually tackling those?
I would say that, you know, my approach is always to approach stuff with the most simple and approachable solution. And, you know, I'm always looking at the low-hanging fruit. If I look at all the problems that the business has, and usually there are more problems than we can actually solve, because we're always bounded by resources, I always love to go first with the low-hanging fruit and start with those, because this is where I get my quick wins, and then move to the complex. It's better to have something than nothing. It's another common phrase that our team is using. And iteration. So you don't need to solve the problem completely, you know, to spend one year just to solve a problem. If it's complex, start with something simple, you know, deliver after one or two and keep iterating and improving it. At least you deliver something, and, you know, the business starts to see some value, rather than just doing long research and maybe getting something after a year. I think that there is a trade-off, right? As long as you can articulate that it's not perfect and you need more time, but you're going to iterate and it's a process, and the business understands, I think that's what I like to do.
Sharing skills and techniques with team members
So, typically they teach me new stuff. I'm fortunate that I am surrounded by very smart people and, you know, we have lunch and learns. Actually, at 10 o'clock, I'm jumping to an internal R meetup that we have. So, we're trying to do as many of those internal lunch and learns as we can. If someone comes and says, oh, I figured out, you know, how to create a pipeline with that tool, or I learned how to build this with that tool, or something like this, the immediate response in our Slack is: lunch and learn. And so, we're trying to create an environment where we share the knowledge, and documentation is very important. We have a big bookdown where, for any new tools that we're using, we document them, or at least try to document them, not always doing a good job. I'm probably the worst at documenting, but that's the culture that we're trying to create, to make sure that we are sharing.
It should be, as long as you're open and, you know, let's say that tomorrow I need to manage a team of Python users, probably I should learn some Python. Because, as a manager, as I mentioned before, you are doing less coding, but I do review code. You know, when we have a code review, you still want to make sure that you understand what other folks are doing. So, I think if you're at that management level, you should probably be open to learning the other language, whether it's Python or any other language, to be involved and understand what the folks are doing.
So, most of the time I'm using R because, you know, when you're loading data, when you're pulling the data from APIs, the easy way is to pull it and process it with dplyr and all the other great tools. As for the cases where I will use something else, for example, I'm using a lot of Bash, thanks to Denton, who is also here on this call. When you try to automate R code in some command line environment, this is where, you know, it's easier to do automation and call different R files. Or if you have some YAML file with a manifest of some job, instead of doing multiple steps, we are just calling a Bash script that does those steps. Or in Docker, when I'm building a Docker image, I also like to use Bash to automate. So, that's an example. I had another example, like, if you need to pull some S3 objects. In the past, I used a Python library called boto3. It was kind of the native library to pull S3 objects. So, I used this. But mostly, I'm using R.
I know we're getting to the top of the hour here, and you may have a lunch and learn to run to. You're hosting it. Okay. So, real quickly, if people have follow up questions, what's the best way to get in contact with you or to network with you? Is it LinkedIn? Probably, yeah. I'm on LinkedIn, Twitter. Yeah. LinkedIn is probably the platform that I'm most active. And I could share your LinkedIn here really quickly, but you are also very easy to find there. If people just search your name. But thank you so much, Rami. I really appreciate it. I know you have to run, but great insights and thank you for jumping on and joining us. Looking forward to next week as well.
