Data Science Hangout | Stephen Bailey, Whatnot | From Academia to Industry

Transcript#

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout, everyone. If you're joining for the first time, it's great to meet you. I'm Rachel, I'm the host of the Data Science Hangout.

As I mentioned at the meetup yesterday, if you were there, I do want to take a moment to say it's nice to be able to share some space with everyone right now. What we do at RStudio is only made possible because of the community. And we're all beneficiaries of so many amazing community members, many of whom are affected by the war in Ukraine right now. So we also want to use this opportunity to support them back in any way that we can.

And for anybody joining for the first time, the Data Science Hangout is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing and what's really going on in the world of data science. We really want this to be a space where everybody can participate. And we can hear from everyone. So there's three ways to ask questions. You can always jump in live and maybe raise your hand on Zoom might be the best way to do that. You can put questions in the Zoom chat and just put a little star next to it if you want me to read it or else I can just call on you to bring into the conversation too. And then lastly, we do also have a Slido link where you can ask questions anonymously too.

Just like to reiterate, we love to hear from everyone, no matter your level of experience or area of work too. But for today, I'm so happy to be joined by my co-host, Stephen Bailey. Stephen's a data engineer at Whatnot. And Stephen, I'd love to turn it over to you to introduce yourself and maybe share a bit about the work that you do.

And the second win is it creates a channel for communication and collaboration. That is really hard to, to create if you're not using something like GitHub.

For me, I'm a huge, like, for me, I get a lot of reward out of doing things that make, like, that I can see an impact on the way people work or like, like the personal sort of like, of collaborating with people. And so the types of things that I really enjoyed at Immuta were when I could sit down with a business stakeholder and like, say like, hey, like, what's your problem and really understand, like, the sorts of tedious tasks that they're doing that we could automate. I could build, like, oftentimes it's pretty easy for a data person to build something that's going to, you know, automatically pull those numbers and sort them in the right way that's going to help them help a business user do actions better.

At Whatnot, I'm kind of doing similar things where I can build pipes and build systems that make it easier for us to understand our customers or, you know, build workflows that are more efficient. But I do miss, I do miss the domain, like, the healthcare domain, the image processing stuff, thinking about the sort of bigger picture of science and general, generalization. I think it's offset a little bit though, because the data world is so deep and the technology is always changing. There's a lot of, you know, conversations around what's the best way to do this or, like, what's the best way to organize the teams and all of this stuff. So I find that even though I miss some of the healthcare type of learning and thinking very deeply about specific problems, there's more than enough exciting stuff for me to learn and dig into that that keeps me satisfied.

Privacy-enhancing technologies and data compliance

So zero-knowledge proof is basically a concept around cybersecurity where you can validate stuff without actually knowing the actual answers. So I was just wondering if there's any applications around personal data that help making working with easier, because, you know, I'm in Europe and GDPR is quite a big thing over here. And, you know, there's a lot of concerns around privacy.

Yeah, so during, I can't speak too knowledgeably about this specific question. I do have some experience with like implementing privacy, privacy enhancing technologies in organizations. So at Immuta, we, the product would scan a data warehouse for sensitive data and tag it. So like it would say, all right, this looks like you have social security numbers, it looks like you have addresses, it looks like you have names and things like that. And then there were a couple of approaches that we could automatically implement if you had a policy that wanted to mitigate privacy concerns. The ones that we had were masking technology, masking methods, like hashing, redaction, replacement with string, rounding, etc. We had one that was called k-anonymization and we had one that was called differential privacy.

What I'll say is that it's for a lot of organizations, just implementing like the basics is very, very, the basics at scale is very, very challenging because you have to have a lot of metadata, essentially, a lot of high-quality metadata to do privacy management very, very well. So you have to know, you have to have a language around what is sensitive data, you have to have metadata on the data itself that says this column is sensitive. You have to have policies in place in your organization. So you have to know, like actually, you actually have to have someone who's translating that into an actionable policy of like, we need to mask this type of data for these types of users. You have to have high-quality user data to know like who is accessible.

So it's extremely challenging to get all those things right. I think what we found, like what we spent a lot of our time doing at Immuta was trying to help organizations build a language that would allow them to address those sorts of questions at scale. So all of that to say, I think what we found oftentimes was that people would come in and they'd say like, we want to do differential privacy, which is essentially like randomizing, injecting randomization into your data set to provide a level of privacy guarantees. But then they'd always end up going down to like, let's just mask the data, like let's just mask the data and like get it, like get started using it rather than like trying to do the most private thing first off.

Transitioning from data scientist to data engineer

I actually wanted to ask you about your transition from being a data scientist to data engineer. So how is the transition itself? And what would be some of the things that you think that really helped you in your current job from your data science experience? And what are the weaknesses or things that you had to learn immediately?

So for me, I love the systems building part of data engineering. So I love being able to think big picture about what are the patterns that we're implementing in these systems and how do we make them very high quality and efficient, like implement checks. And that was true even when I was doing my PhD, I often found myself gravitating towards the methods and rerunning pipelines and trying to make pipelines more efficient, even when that probably wasn't the most, the best thing for me to be doing.

The challenge that becoming more of a data engineering role, like moving more into the data engineering role has presented is you have to know a lot about like the technologies and you have to learn a lot of software engineering patterns that I was never taught. And so, you know, things like domain driven design and like testing, like how do you design a Python package? How do you implement good testing? How do you implement observability and like log tracing across multiple systems, networking concerns. There's a lot of technology, like I would say the technology side of things is much more important in the data engineering world than it is on the data science side.

I think one thing that I bring to a data engineering role that I wouldn't have if I hadn't spent time as a data scientist is understanding use cases for data. A good data engineer, I think has a lot of leverage over helping the organization, like get data, not just to places on the right time, but in the right way with the right metadata that makes it useful, like do some pre-processing of data to make it really easy for data scientists to work on it. And so I think I bring a lot of that like context, like contextual information about how data is going to be used, which makes me better at building the systems for the users.

Being the first data scientist vs. joining an existing team

That's a great question. I think there's pros and cons, and you do have to ask yourself, like, I think a lot of self-knowledge comes into play here. What's great about being the first data scientist at a company is you get to write the, you get to build the system, you get to understand, you get to chart the territory. You're, like, kind of, like, you know, you're the pioneer charting the uncharted territory. So that's very, that can be very exciting for people. It can also be very overwhelming and lonely, because you don't know if your decisions are going to be right. You don't know if, like, you're going to have to rebuild the system in the future. You might not even know what you're doing, really.

When I left Immuta, I knew for sure that I wanted to join an existing data team, because I wanted to learn from others. I think that's the, that's one of the differences, is if you're doing it, if you're charting your own course, there are great communities out there, there's great learning materials out there, but you're still going to be ultimately alone to some extent. I knew I wanted to join an existing team, because I wanted to learn from other people. I wanted to see what other people were doing, how they were doing it, what kind of dashboards were they building, what kind of conversations were they having, what, like, opportunities they see. And it's been very, very rewarding to be in that sort of position for this at Whatnot. So it's just, like anything, it's just got trade-offs. I would say, if you want to be the first data hire at a startup, go for it, but just make sure you know kind of what you're getting into, and are prepared for that.

Working with real-time data

You mentioned earlier that you went through, when you started at Whatnot, working with real-time data in comparison to batch data in the past, and I'm curious what unique and specific challenges that you ran into when navigating that switch.

Yeah, so, I mean, this is very much an active area of work for our company, but it's a, what I would say is that the technical side of things matters a lot more. With batch data, you can kind of, you know, just shove stuff into an S3 bucket, or like, you know, not think too much about schedules, or efficiency of processing, and things like that. If you're trying to deliver real-time analytics to an application, like a mobile application, there's almost no room for error. Like, you have to be thinking, because you're trying to do it, whatever your SLA is, it might be five seconds, right? So you think, like, after you click a button, you know, that event has to go somewhere, and then it has to have logic applied to it. Maybe it has to be joined with other data, or a model has to run on it, and then it has to be sent back to your mobile device. And if anything takes, if it takes 20 seconds, like, as a mobile user, that's an eternity.

So the stakes are just so much higher, because you are trying to do something much more, you know, that affects the user much, much more closely, and it's much more interactive. So that's one of the biggest shifts, I think, just from a sort of emotional, like, almost like an emotional standpoint of, like, people are going to be using this, this is going to affect the user experience. If this thing doesn't work, then, like, maybe a show doesn't show up on someone's feed. And that means that seller doesn't get featured as much, which changes, like, the seller's perception of their experience on the app.

R vs. Python for data engineering

I see Ian had asked earlier, what's your opinion of R as a data engineering tool versus other languages like Python or C, for example?

I'm not a, I'm not an expert in R. I used it during my PhD, a number of, like, for a number of use cases. I love R Shiny . R is, like, so much more elegant for a lot of data science work than anything in Python. I think that's my, after using, after, but I made a deliberate switch to learn Python for two reasons. One was because a lot of my work was using image processing libraries that required me to do script, like, use things like OpenCV or SciPy. There were just more libraries out there for image processing work.

As I've, and I've, my sense is that there are more out of the box solutions for Python. Like, it's much more of a lingua franca in the data engineering world than R is. Just for, like, thinking about, like, what AWS provides, like, AWS provides a CLI for, or not a CLI, but a library for Python users to interact with resources in AWS. Maybe there's something like a Boto3 for R, I'm not sure. But I would say that outside of, if, like, to the extent that you have to do things outside of a, like, data management and processing flow, like, like, pull, like, ingest data, move data from a source system into a database, but not, like, doing the processing on the fly, my guess is that Python has a little bit more, more libraries out there.

The data engineer's role and organizational thinking

Yeah, so one of the, I think one of the fundamental things that most data engineers are responsible for at some point is, like, the ingestion pipelines that, you know, either ETL or ELT is kind of a new pattern of just, like, extract and loads or replicate the data into your warehouse and then do the transformation in the warehouse or in the data lake. So you spend a lot of time, like, building, setting that up. But there are some good tools out there now, I'd say in the last five years, tools like Fivetran, Stitch, Meltano, Airbyte is another one now, that have essentially made it a commodity, like data replication a commodity.

But one of the areas that I think is newer, I think something that is emerging in the data engineering world is the data engineer as not just a technical person, but as a systems builder and a systems thinker. Some larger organizations have data architects, but this sort of idea that we have, like, across the whole company, you have information flowing around, and we need to be able to have a, we want to avoid a situation where, as the company grows, you start having all of these data silos, and everyone's, like, getting their own sources of truth and, like, creating their own metrics, and none of the metrics agree. The data engineer is really well-positioned to think about, like, how does data move throughout the organization? What sort of conventions do we want to put around publishing data products? What sort of guarantees do we want to make? What kind of language do we want to use? What compliance patterns do we need to implement to make sure that our data is being used correctly?

Towards the end: moving from academia to industry

Thanks, Steven. I see Tatsu had a question, as well. He said, as a fellow recovering academic, if you want to jump in, Tatsu.

Yeah, sure thing. Hey, Steven. Great to see another post-PhD doing well in industry. So, actually, it's pretty funny, because my background is very similar to yours. I think you said you were imaging while you had kids reading, or something like that. Mine is basically sub-EEG for the imaging medium, and then we had kids exercising. So, a lot of similarities there. I ended up landing in the customer success space. I work at RStudio. And, of course, I have a background in using RStudio, and here I am, right?

But, what I found interesting, so far, from, you know, here and there, as you're answering questions, right, is I'm very curious on your perspective on, you know, that transition and, like, how easy or hard it is, right? For me, I think it was very difficult. You know, I come from psychology, which traditionally, kind of, all they teach you to do is become an academic, right? There isn't a whole lot of help for you as the student who's, you know, considering a career and, you know, maybe a data-related role, right, which lends really well to the skill set that you're taught. But, there isn't any course that teaches you how to do that transition. There isn't a whole lot of networking events that lends well to that.

And, at the same time, right, like, I think that, from the industry side, I think people are starting to recognize that, oh, wow, like, people that come through the PhD pipeline, they have a lot of what we need. But, the issue then becomes, right, from our perspective, we don't know how to phrase those skill sets, right, to be able to apply it in an interview setting or anything like that. So, I just like to hear your, kind of, thoughts on that. Is it, kind of, a failure of the higher education system, or is it something more that, you know, industry folks could be doing better to identify?

That's a, that's such a great question. I think my brain is scrambled on it a little bit. How do I say, like, it's such a, it's almost like a relationship. When you go with a PhD program, you do feel like it's, it's, it's both, it's not, it's a job, but it's, like, more than a job, too. There's, like, it's almost like you have a relationship with the field, and you're very invested in the research, and you're, like, it's, you're being, you're part of the scientific community. And I think one of the things that's very jarring about leaving a field is you lose that, it's all, you feel like you're getting divorced, or something like that, like, you're, like, you're severing that relationship, and you're severing that community, because it is, it is almost two different worlds.

For me, I identified pretty early on that I wanted to move out, and that helped me, because I was able to spend a lot of time talking with people who weren't in the university, and building relationships, so that when I did transition, when I graduated, I already knew, like, I knew people who were on the other side, so to speak. And so I didn't, even though I felt lonely, and I was the first data science hire, and, like, no one knew what a data scientist did, I also felt like I had a pretty good understanding of business, of, like, this is what a business, this is what doing data science and business, doing data in a business looks like.

So I think that if anyone's thinking about transitioning, or doing something similar, I think that's a very, like, talking to people, and building relationships before that transition can really help make it softer.

Yeah, I had an interview, you know, one of the things that kind of surprised me is I was this biomedical imaging PhD at Vanderbilt in Nashville, and Nashville is, like, a huge healthcare hub, and I couldn't find a job. I, like, it looked for quite a while, and I couldn't find a job, and I met skepticism. I met, I had one interview where they're kind of like, you sound great, but why would you want to work here?

And then I had another job that was like, well, we don't really do any imaging stuff, and, you know, I was kind of like, I'm like, I don't, I, it's about the learning, it's about the impact, it's about, like, it's not about the imaging, you know, it's just one example. This is just one subset of problems that I'm interested in, and so you do, you do have to go through, like, and intentionally think about branding yourself. And I think the biggest thing in industry is you can't get, you can't get obsessed with the problem. You can't fall, you can't fall in love with a problem. It's much, it makes your job of finding a job and, like, fitting in much harder. You have to fall in love with the, your position in some way, like, like, your, the way, like, how you're solving problems, more of, like, applying the mindset of a scientist into this business context.

I think the biggest thing in industry is you can't get, you can't get obsessed with the problem. You can't fall, you can't fall in love with a problem. You have to fall in love with the, your position in some way, like, like, your, the way, like, how you're solving problems, more of, like, applying the mindset of a scientist into this business context.

Yeah, I haven't spoken to a PhD where, you know, things like this don't resonate, right, and I've certainly struggled through that myself. It does, you, everyone carves their own path, I think, as Tatsu was saying, you just kind of, you learn to learn, you kind of figure it out, and I think that the skill sets that everyone's learning in graduate school, no matter what the subject area is, are very helpful in being able to help address that. But, you know, my biggest thing, my biggest advice would be don't be afraid to reach out to people. Most people are very willing to, to talk to you about what their journey was, no matter where they're at in, in their career, whether they have a PhD, whether they don't have a PhD. And once you kind of see that and you start to talk to people, you realize that your, your, your room is much bigger, right? There's, there's a lot of people from all walks of life that have moved multiple careers or multiple industries.