Data Science Hangout | Sep Dadsetan at ConcertAI | Infrastructure that Encourages Reproducibility

video
Sep 15, 2021
1:08:58

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the Data Science Hangout. Thanks, Sep, for jumping on and co-hosting with me today. So welcome back to all the familiar faces and to those joining for the first time. It's really fun to see this like recurring group starting to form. But for anyone new today, these discussions cover a variety of different topics on data science leadership in the enterprise, but really focus on questions that are most important to you all. So again, no agenda and everyone is welcome to join in live or put any questions that you have in the chat.

That said, I'd like to just jump in here and introduce Sep Dadsetan, Executive Director, RWE Analytics at ConcertAI. And I am checking with you that RWE is real world evidence. Sep is passionate about learning and exploring new concepts and ideas and bringing them to life. I found that on your website somewhere. And I will say, Sep, I did learn a few random things about you, like that you're into DJing and footgolf. But would you be able to kick things off by introducing yourself and sharing a bit about the work you do?

Yeah, happy to. I'm glad to be here. I've already been part of several weeks of discussion and it's highly valuable. So I definitely encourage you guys to share your perspectives. I think this is a great opportunity to just learn from everyone. So my name is Sep Dadsetan. My formal training and education is actually in molecular physiology. So I studied immune cells and T-cells. And then I did a brief postdoc at Genentech studying B-cells, another immune cell. And that's when I said I was going to leave academics and join a startup company, which was Syapse, where we worked with real world data and real world evidence. So that was most of my tenure, seven and a half years there. Then a brief stint at Kite for a couple of years doing IT for their R&D group. And then I recently joined ConcertAI, back in the real world data space, about six months ago now.

So that's kind of the brief tour there. With respect to what I do at Concert, I'm mainly focused on achieving scalability and reproducibility, doing some training, and, at this point, really focusing on architecture and trying to gather our teams and smooth some edges, so to say.

What's exciting in data science right now

I think broadly speaking, I'm just excited about the level of adoption businesses are having around data in general and how it can impact their business. So I remember probably early 2010s, data science is this kind of term that's popping up. People are kind of, what is it? What is it? It's probably still today we ask what that is, right? But a lot of businesses really weren't really sure what that meant and how it could benefit them. But I think now, like a decade later, we're seeing a lot of companies IPO-ing that are in the data space. We're seeing increased adoption of cloud computing. We're seeing a lot of businesses really shift towards data, which really kind of, I think, solidifies that data is here to stay. All data science, data analytics, and data-adjacent roles are here to stay and that there's a lot of value to be provided.

And we're still kind of obviously shaking a lot of those things out to figure out what that means on a broad level. Yeah, I think what's interesting probably on the forefront is also synthetic data. I haven't played a whole lot in that space, but that's been pretty interesting to see how, yeah, we generate a lot of data in general, but not a lot of that data is necessarily usable. Or if it is usable, it may be sensitive, in which case we have different strategies to allow technologies like AI and ML to kind of proceed in a more safe manner. So that's kind of exciting.

Gaps in the data science space

I think one of the interesting aspects, or I guess gaps, might be how people come into data science and the varied directions that they come from. When we talk about training and trying to get everybody on the same page, it's a little bit more difficult than if you were to, let's say, have a standard path to become a data scientist, so to say. So I think that's one of the challenges currently in the field. That's at that low level that I was referring to. At the higher level, I think we're in a bit of a transition period as businesses are starting to realize the value that data might provide them. Not all leadership, or not a lot of businesses, are data savvy. And I think, again, that goes back to this training aspect where you need to translate your work for a business leader or someone who might not understand it. That still exists, and it will always shape how, let's say, the data science groups get resources and funding. And it will also shape how businesses will move forward and plot their course.

How to show key stakeholders the value of data science

Yeah, I mean, that's a great question. And I think that that generally applies to not only data science, many other teams as well. I mean, even engineering teams, sometimes you have one person that's like the front end developer, right?

You know, what I've found success with, and it's hit or miss sometimes, right? Not everything has a single solution or a single path to a solution. What I've found success with is just trying to be as close to the business as possible and understanding what their needs are. Sometimes you find that they just move too quickly to even know what their needs are. And so you have to help guide them. And in doing that, you build this kind of partnership where you then have more value and understanding of how you can orient your team to provide that value. And then they get closer to understanding it.

But it's when you're building the infrastructure in a way that makes that delivery, let's say, reproducible, or makes that infrastructure reproducible, and they see the speed at which you can turn things around in the future. Because inevitably, and we've all probably experienced this, someone's going to ask a question or want a delivery. And then they want some variant of that delivery three months from now, right? And so you build that first iteration, and maybe it takes you a week longer to do it. But if you do it right, the next time they come around, it turns around in a day, and they fall out of their chairs, right? And I think that's where, all of a sudden, you're showing that, look, this is the value of doing things in a manner that's reproducible and scalable, because we're saving costs and we're turning things around at a speed that impresses the customer.

And so kind of scale and all those things are kind of an interest for me, but I don't know, now I'm just probably blabbering, I'd love to hear other people's take.

I think, you know, the adage that 80% of a data scientist's time is spent cleaning and organizing data might be appropriate. I don't know, I've never really thought about it. To be honest, I just get it done and try to do it in a manner where I don't have to do it repetitively, and try to build on knowledge.

Yeah, I mean, I think that goes to, I guess, to go back to Rachel's question about what gets me excited, right? So what gets me excited is that great, all these businesses are very interested in data, we're here to stay, right? But there's a flip side to that is that data literacy also is required. And so there are various levels. So some people are happy to understand the nuance and the nitty gritty of why it takes so long to kind of stitch these things together. And then there are others that are saying, look, we need to deliver to the customer, this is what we want. And I get that it takes time, but hurry up, right?

And so, yeah, I always try to simplify and really point things out and give status updates. I mean, transparency for me is really important. If you're transparent with your stakeholders, they recognize what the blockages are, and I think everybody that you engage with is going to want to know if there's something that they might be able to help out with. I think that, again, helps with establishing that relationship and having them get a little bit closer to the data and understand some of the hurdles that you might face. But yeah, sometimes patience is perhaps thin and they don't quite get it. But that comes with this other side. If we're going to be more of a data company, we're going to have to have more data literacy, and ask what better methods there are of working with data and how it can enhance us. Rather than, let's say, throwing people at a problem, do it intelligently from an engineering standpoint, so that you can scale your operations.

Setting up the infrastructure to encourage reproducibility

Yeah, great question. So yes, I mean, reproducibility and scalability all, I mean, it relies on the success of training, making sure everybody's doing things and the processes are all the same. And it also is related to this partnership and relationship.

So part of reproducibility for me is, I mean, it really involves everything. So making sure everyone's using, let's say, version control the same way, that naming conventions are the same. We built, basically, an R package, because we use RStudio, so that when you create a new RStudio project, that project will always be the same. It creates the same folders, the same setup. So when I'm talking about infrastructure to make it reproducible, it's everything from how you use code, how you name your code, how you name your variables, how your projects are set up, how your files are named, all of that stuff. They're really, really annoying little things, but they bubble up to be so much more, because when you create a project, for example an analytical project, it becomes a transposable unit. That means I can take my project, I go on vacation for a year, someone has an emergency request, and someone else can pick it up and know exactly where the data is, know exactly where the scripts are, et cetera. And they should be able to pick it up with a lot less overhead from having to figure out, oh, well, how did Sep organize this stuff?
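As an illustration of the "same folders, same setup" idea described above, here is a minimal sketch in Python rather than R. It is not the package discussed in the talk: the function name `create_project` and the folder names in `STANDARD_DIRS` are hypothetical, chosen only to show how a shared template can make every project look the same.

```python
from pathlib import Path

# Hypothetical standard layout. The folder names are illustrative assumptions,
# not the template used by the speaker's team.
STANDARD_DIRS = ("data/raw", "data/processed", "scripts", "output", "docs")


def create_project(root, name):
    """Create a project skeleton with the same folders every time.

    Safe to call again on an existing project: existing folders and an
    existing README are left untouched.
    """
    project = Path(root) / name
    for sub in STANDARD_DIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)

    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n\nCreated from the shared project template.\n")
    return project
```

Calling `create_project("/some/workspace", "demo_analysis")` would produce the same tree for every analyst, which is what lets a colleague pick up a project and know where the data and scripts live.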

I think a big part of it actually, to be honest, is setting up a lot of that real basic stuff. And then it also makes it much smoother, so that every time you set up a project, you don't have to think about those things. You just go, this is where you write your stuff and whatever. And then we happen to use RStudio Connect as well. I know that there are other solutions, but that allows us a feedback loop. I think one of the aspects that we ended up talking a lot about in these weeks is communication and how you communicate with business stakeholders. And so, if we're going to set up that RStudio project, we want to then be able to publish whatever that output is, whether it's an API, a product, a report, an application, and publish it in a central location, so people aren't trying to search for it everywhere. And then we have a URL, we can share that URL, and we get this feedback loop. It's not sitting in somebody's email as part of a chain that you then have to go through. And so we've kind of streamlined it. And that allows us to basically hit a conveyor belt. If there are iterations or changes, we can do that. And then it's modular, so other people can pick it up. So that's the kind of reproducibility and scalability that I'm referring to.

The importance of architecture and data quality

I mean, generally you have to partner pretty closely with your data engineering team. It kind of depends. Sometimes data engineers themselves have a little bit of an analytical bent to them, so they're constantly checking. Other times there's a dedicated QC team that's constantly evaluating and sampling as data moves from place to place, just to make sure that you are collecting the right information and that all the transformations are going accordingly. Other times, and again, it depends on the size of the company and how roles are broken up, the data science team should have someone that is actually evaluating, or at least generating some sort of report out of, wherever the staging areas of your existing data are sitting, to hopefully prevent the situation where you do all this work and eventually you get it and you're like, oh boy, this is not what we wanted at all.
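The kind of report generated from a staging area, as mentioned above, could be sketched roughly as follows. This is only an assumption about what such a check might cover (row count, missing required values, exact duplicate rows); the function name `quality_report` and the field names in the usage note are hypothetical.

```python
from collections import Counter


def quality_report(rows, required_fields):
    """Summarize a batch of staged records (a list of dicts).

    Reports the row count, how many rows are missing each required field
    (None or empty string), and how many rows exactly repeat an earlier row.
    """
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
        # Order-independent fingerprint of the row; values must be hashable.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}
```

For example, `quality_report(staged_rows, ["patient_id", "diagnosis_date"])` could be run each time data lands in staging, so that gaps show up before any analysis is built on top of it.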

Yeah, I don't know. I think Sep spoke pretty well to it. I just find that communicating upfront matters, and I think design kind of does this intrinsically. When you spend a lot of time on designing what the output solution is going to be, it requires you to bring all the stakeholders to the table upfront: the owners of the data, the owners of the platform, the people who are going to be using your solution on the ground day in and day out, the technical folks in your analytics team, and then also the IT team that might be enabling the solution to even work, security, infrastructure, stuff like that. I think when you attack the project with a design mindset up front and really focus on nailing down what the end-to-end process is going to look like, it pays off quite a bit in the end in ensuring that whatever you put together is well architected. I find myself nowadays probably spending as much time drawing architecture diagrams as I am writing R code, which is fun if you're into that, but I think it's an important step to take in any project, and I think it's a good team mindset to have as well.

Balancing formality and speed on teams

Yeah, that's a great question, and it is a common situation, for better or for worse. Again, it depends on the culture of the team and what the business is really after, I think, and it depends on what relationships are like. I like to try and lead by example. So if I'm finding that it's difficult, say it's a large team, or everyone's just doing their thing in their own lanes, at the very least I try to simplify it for myself and try to make my work more reproducible. And in many cases, what ends up happening is, whether there's code review or whether there are future fire drills or whatever, that process will always win out, right. And when someone else sees it, they're like, oh, maybe that's cool, maybe I want to try that, how did you do that, right.

And so, yeah, sometimes it's not so easy just to kind of, you know, go dictate to people and say, look, like this is how you should do it, until they kind of come to the realization themselves. And so that was, you know, that's one method that might work, where it's just, at the very least, you could do it for yourself, it does take time, but even small wins are important, right. So I've been in situations where, I mean, especially in a startup, where it's like, okay, we need something in two weeks that we don't have anything, like nothing exists right now, but we need to deliver in two weeks, right. So it just ends up being, okay, just dive onto the keyboard, but even in those cases, wherever I can have kind of an engineering mindset and modularize what I can, I know that those pieces can be used either on other projects, or this project becomes much easier with future revisions.

Yeah, so just to add a bit there, so one of the selling points for us when we talk about data science, and about using tools like R, is about efficiency, right? The ability to produce reproducible code, or results, and efficiency. So even though we may think documentation and structure is boring, and it is, we know if it's not in place what the results are, right? So essentially, something that will take you two minutes, may take you an hour, right? And in terms of even communication, communication with members on your team, or even outside, when there is no commonality, you're talking about, in Jamaica, we call it Chinese telephone, where you're trying to get to someone, you can't get to them because the language is different. So, you know, what is so interesting about this conversation now, is that we're talking about infrastructure, we're talking about communication, and none of this is really about data analysis itself. But how important all the tenets are, all the connecting points around the data analysis itself is extremely important.

Yeah, yeah. I mean, again, it's like showing the art of the possible. And I think when they realize that if you do things in a smart way, then you can gain speed, they're more than happy to then support that, right? So, if a new request comes that's outside of the scope of the original request, they then have an understanding of what it took to make that original one, and then they'll want it again. Because, again, it's building knowledge, and it's building scale, right? You're not going to get everything in one go, but if you do things mostly the right way the first time, over time you're going to have a bolus of knowledge with which you're enabling the remainder of the business, whether that's other groups, whether that's marketing, whatever, to be able to do what they do.

Organizational habits for leading teams

I think one of the questions that was discussed perhaps in the past was about people that are interested in getting into leadership, right? There's naturally going to be a transition point where your hands aren't on the keyboard as often, where you're not doing the analysis per se, but now you have to manage a team and grow that team. And with that come different responsibilities. Some of those responsibilities are just evangelizing your team and being able to communicate and work with other business partners. I've been now in, I think, two situations where I have either started a team from scratch or the team is basically pretty new. And I think the most important thing is, obviously, you get your bearings first, but then basically going on a tour, right? Speaking to all the business partners, different stakeholders around the company, and trying to understand what it is they do, how they engage with data, how they envision working with that team, and just maintaining those relationships, constantly having conversations and not being afraid to help out when you can, right? Sometimes some of these groups might be resource strapped. And so, jumping in and trying to solve little problems here and there may be helpful. And I think that builds the rapport. It accelerates this partnership building a little bit, and really allows you as a group to have a better understanding of the business so that you can align your priorities with the goals of the business, but also allows them to understand how you operate and what benefit you can bring to the table.

Yes, and I think this depends, again, on how the business is led. So, if the business has very clear goals and guidance, then your group should always really kind of be aligned with that. Other business units may or may not have, you know, alignment in that regard. And so, you can't just, I mean, some people are nice and they just want to field whatever request. Okay, fine, you could do that. You're still going to, you know, get some wins, but it'll be also a bit more distracting towards you achieving what you need to do as a team to become most efficient, because now you have, perhaps, extraneous projects. Whereas if the businesses, you know, some businesses might not be so clear on their objectives and goals, maybe it's new, maybe it's fluid. And so, that requires partnership with those other business units. There's a balancing act. You don't want to field all requests. It doesn't hurt to have an understanding of all the requests, so that you have an idea and they can maybe say, hey, you know what, like there's a piece of this that we might be able to work on, or a piece of this, or maybe we can just tackle this for you.

Transitioning from academia to industry

Yeah, that's a great question, and I think for data scientists, so true, because the data science skills are applicable in so many different contexts, but then you have to get up to speed on that context. I guess maybe I'll speak from my personal experience. I mean, one is just that confidence that you can learn it. There's always going to be that learning, and it sounds like you already have that, and encouraging that in others, just that sense that, no, you're not going to know everything, but what I have confidence in is my ability to learn.

In terms of, yeah, particular tips is one situation I've seen, because I also worked in high-performance computing, is maybe that the computational person comes in sort of feeling like they have the answers and not necessarily listening or asking the right questions. So coming into it, you know, you sound like you're already doing all of this, so, you know, just like with that perspective of listening, but also thinking of the right questions to ask, so spending some time not only, like, listening, but then formulating questions for people to answer. And sometimes people are better about answering them, like, in a conversation, and some people like to have time with the questions and write them, so sometimes I'll provide different mechanisms for people to give me information or feedback, because people work in different ways, so I've found that to be helpful.

I mean, I have a lot of interests, so I'm, like, between a lot of different kind of areas. I think for me, and I've also hired people that are from outside of healthcare, outside of pharma, exactly what Tracy was saying, I think being humble, but also tenacious, where, you know, if it's an area that you want to kind of go explore and apply, you know, again, your data science skills are kind of applicable to anything. It's that domain knowledge that generally people need to show in a particular interview, and so, yeah, if you're tenacious enough to be like, yeah, I could do it and you have that confidence to do it, but also recognize that you don't know everything and there may be nuances and be receptive to listening and getting that feedback. I think those are attributes that are fantastic to have because I love taking people that are outside of that area just to bring a new perspective into it, but they have to also have the right personality to be able to do so in a manner that's not like, oh, like, you know, dismiss or disregard, you know, whatever anybody else is saying because they know it all.

Data science infrastructure and shared responsibility

Sep, you had talked about a data engineering group versus a data science group, and once people start to have this perspective where they say, hey, this is my job, this is what I do, and they allow garbage data to flow through their pipelines because they were responsible for the pipeline, not what's flowing through it, I feel like that's a really bad environment and mindset. I see that responsibility for having the right infrastructure as including the data, because that's the information. Everyone that touches it, somewhere along the line, should feel that sense of commitment to make it better.

Standing ovation, Frank, standing ovation. That resonates with me 100%. I totally agree, if you're going to do work, do quality work. But I don't know, at least for me, the reality is that the people that I've engaged with are on both sides of the coin. Some people are like, hey, it's just a job, I'm going to do what I'm told. And sometimes, especially with data, it takes that little bit extra to be like, let me just double check, am I really getting what I'm supposed to be getting? And if you don't do that, then, again, it's on the next Target employee or whatever to go pick up that trash, right?

Yeah, great point. Thanks Hugh. I just put that link in the chat as well. I just found a recording there.

When to start implementing data science tools

I mean, I have a very specific experience with this. I have a belief that you have to have your data in order before you get your data scientists and analysts. A lot of times it ends up being reversed and then you have a data scientist who may have pretty reasonable experience with engineering and getting the data in shape, but that's not necessarily their area of expertise. And so as long as that's understood that that's going to be technical debt and it can be addressed later, okay, then that may be fine, but generally you're going to want to get your data in order and have a very clear strategy of how you're ingesting that data, how it's going through its various stages until it gets to its resting state to have then products or whatever built on top of it.

So I would say that should probably come first. And in many cases, if possible, it'd be great that it's done with someone that has analytics and data science expertise as well, because sometimes the data engineer doesn't have that other side of the coin. And so the two together would be the greatest kind of start, where you have a data engineer and you have a data scientist or analyst, and then they use whatever they prefer to choose. I mean, I'm a little bit R biased, but you could also do this in Python, and then you build your tool set and architecture around that, right? So once you have your pipeline, you can then say, okay, well, how are we going to do the analysis? How are we going to make this scalable? How do we deliver our products? And you now have at least some kind of foundation for a small team, literally two people, very talented people, but two people at least.

Yeah, thanks. I was actually thinking a bit, jumping to a different analogy, about cooking, right? You know, everyone can cook at home with the equipment they have to hand. You don't need the very best, but if you want Michelin star cooking, then you've got to have the right people with the right tools and the right skills to bring it all together. And I think, just as Sep was saying, it's not easy, but it's easier when there's a very small group, because you can have that one-to-one dialogue that says, okay, I know what you're doing. You know what I'm doing. We know the tools that we have. Let's come together and get this done. When you go to a really big organization, and this comes back to something that Frank was saying, there's a tendency for people to grab the tools they know and hack it, whatever they need to do to get today's work done and get it out the door. And the problem there is that in doing that, they leave a damn mess everywhere. You know, they leave the pots and pans lying about, and everything's messy, and they don't clean up afterwards.

And my, you know, I work in a big organization and, you know, there's an awful lot of people who are doing their own way, you know, and I kind of like the tidyverse thing of it's opinionated, but it's opinionated for a reason because you can then, you know, reproduce that pipeline again and again and again using the same kind of tools. You know, I have lots of colleagues that learned R 10 years ago and they're all base R and they're like, yeah, but I can do that in base R. It's like, that's great. I'm really happy for you that you can do that in base R. This isn't a competition. This isn't code golf. You know, I'd like us to be able to do it in a way that I can pass my stuff to you, you can pass your stuff to me, and, you know, we know what we're going to get and we can then be part of that, you know, chain of getting stuff out the door, but getting it out the door, looking good, smart, fulfilling the brief, you know, what everyone wants to see on their plate. I'm sorry, I'm all about analogies.

I don't think I see any other questions in the chat, but if I've missed any, please feel free to speak up too.

Yeah, I have to drop unfortunately, but I appreciate you guys giving me the time to share a little bit of my perspective. I appreciate everybody else who also shared their perspective and asked great questions. Thank you to Rachel and the RStudio team for putting this together. I look forward to the future ones as well.

Awesome. And Sep, quick question. What's the best way for people to get in touch with you if they have other follow-up questions?

Google me, I guess. Yeah, LinkedIn is perfectly fine. I'm pretty much available on whatever platform, but LinkedIn is probably easiest.

Okay, awesome. Well, thank you so much, Sep, for joining and sharing your insights. I know you have to run, but thank you all for joining as well, and same place, same time next week too, if you happen to be free. We'll also put the recordings up on YouTube, but thank you all. Have a great rest of the day.