Resources

Supporting 100 Data Scientists with a Small Team | Mike Thomson | Data Science Hangout

video
Oct 8, 2025
55:32


Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Herron, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I am so excited to introduce our featured leader today, Mike Thomson, Data Science Manager at Flatiron Health. Mike, hi, how are you doing? Hey everyone, how are you doing? I'm doing well. This is my first time showing up in this big of a data science hangout. It's been a little bit since I was on the last one, but it's really good to be here and see you all.

It is. Yes, we've grown. We keep creeping further and further over that 100 mark, 150 mark every time we meet. Well, Mike, it would be great if you could tell us a little bit about what you do and something that you like to do for fun. Yeah, definitely. I am a Data Science Manager at Flatiron Health. We are in the real-world evidence space, and I support a team of roughly 100 data scientists across research science and data analytics who use a lot of our tooling internally on the Posit Team platform. I just got a golden retriever puppy who is four months old, and that is my part-time job at this point, training and just really enjoying having a puppy. It's my childhood dream come true.

About Flatiron Health

I wanted to ask a little bit more about Flatiron itself. I know that Flatiron is a subsidiary of Roche, but you operate independently. Because we're coming from all these different industries, and not everyone understands or knows the healthcare space, can you explain, like I'm five, what Flatiron does and what your teams do? Yeah, exactly. Flatiron has been around for over 10 years. The initial idea is an oncology-specific medical record software, very similar to Epic or Cerner, but targeted towards community oncology practices that are treating patients all across America, and increasingly globally, in their community oncology setting. We have real-time data being collected from a really diverse patient population. Correspondingly, we're able to do a lot of research after de-identifying those patient records to understand what treatment patterns we're seeing real-time, real-world, separate from, for example, the big academic medical practices that may have different patient populations and different treatment norms accordingly. We're really trying to learn from the experience of every patient with cancer across the U.S., across the world, and glean a lot of scientific insights from those really rich data sources and really rich patient experiences.

Could you give us a few examples, speaking of the data that you're working with, the data types, or maybe the types of data sets or collections that you might be working with? Yeah, because it's the EHR, the electronic health record, it's mostly structured patient records. You could have things like visit information: this patient was seen on this date and received this treatment. We could also have different clinical notes that we can extract and say, do they have this specific subtype of cancer, and correspondingly be able to analyze based on that. We're actually very fortunate as scientists and analysts to have pre-cleaned data sets. I think one of the unique parts of Flatiron's model is how we transform and aggregate all of that source-level information from the EHR into analytic-ready data sets that allow scientists to take those and generate insights a lot faster.

I wanted to ask as well, do you have longitudinal data? Do you follow patients across time ever for long-term studies, or is this just observational data? It is observational, but it's relatively longitudinal. Of course, you're subject to what patients are receiving. Obviously, a patient with cancer, we don't want them to have really long treatments if they're progressing well. It's not like a clinical trial registry where we're proactively collecting data, except for some parts of the business model. It's typically observational, and the idea is how can you complement the data that may be collected in a clinical trial with observational data in the real world.

Using Quarto for reproducible outputs

I want to ask Mike about Quarto, and we have some team members from Mike's team to help chime in and talk about that, because I am really, really invested in Quarto having better outputs into Word and Excel. When I work in Word and Excel, I tend to actually go back and use R Markdown, so I can use the officer package or the officeverse to do stuff like that. I would love to talk about that. If anybody out there in the Hangout world also still finds themselves really, really frequently needing to output things to Word and Excel, even though the rest of the developer world feels like they've moved on from that, I really would love to talk about that, so throw your support behind that in Slido.

How does your team use Quarto? That would be a good way to get started. Yeah, we used R Markdown for a very long time, and we're a relatively small team. We're roughly anywhere between two to four people supporting our tooling ecosystem for a team of, as I mentioned, a hundred data scientists and data analysts. And so we really have to be precise about where we focus our energy and what tools we want to build when. I think we had a narrative that our R Markdown workflow works really well. To your point, if we're looking to create Word outputs, the officer ecosystem is extremely impressive and allows us to do that really well, at least for Word outputs. At some point, Quarto kind of became the mainstream and we said, okay, should we switch to this? One of the main drivers for doing that is that one, it's where a lot of support has been thrown into building up that ecosystem. And two, it relatively easily allows us to output multiple formats at the same time, which is honestly magical.

So for example, we could use Posit Connect as a way to reproducibly publish our analyses and generate an HTML that's visible to the clinicians our scientists are working with and sharing insights with. But at the same time, there's also a Word doc that can be downloaded in the click of a button. And that Word doc is typically what is shared with our research partners or our customers with whom we're delivering analytic reports. And so it allows us the flexibility to have the same source code and then leave the rendering and formatting details mostly to a Word template and Quarto on the backend. It took us a bit to get there, but once we did, it paid off. We worked with a wonderful team at Plymouth Analytics to lay the foundation of how we would use Quarto in our day-to-day work.
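The multi-format setup Mike describes can be sketched in a Quarto document's YAML header. This is a minimal example, not Flatiron's actual configuration, and the template filename is hypothetical:

```yaml
format:
  html: default
  docx:
    reference-doc: word-template.docx  # Word styling lives in the template
format-links: true  # HTML output links to the other rendered formats
```

With a header along these lines, `quarto render report.qmd` produces both the HTML and the Word document from the same source, and the HTML page offers the Word file as a download link.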

It relatively easily allows us to output multiple formats at the same time, which is honestly magical.

Oh, that's such a great call out for like bringing in a third party that specializes in that type of stuff. I've worked with groups like that. They're amazing. If you don't know that that exists, there are groups that will help you or help your team figure out a workflow that works for you and sort of templatize things or even help you build themes, color schemes, all of that stuff.

I know that one of your team members is here, Erica. I don't know if Erica wanted to hop in and add anything about the Quarto to Word and Quarto to Excel journey and what that looks like. Sure. Yeah, I sit on Mike's team, and one of the larger projects I've been working on is creating a function in R to easily output analytic tables from Quarto to Excel, and getting the formatting of that right as well. Most of our services teams are generating analytic outputs with dozens to hundreds of tables, and as those analyses get more complex, it's really hard to maintain formatting in Word, which is one of the reasons why we're pushing towards using Excel for our analytic outputs. But unfortunately, there's not a native way to output formatted tables from Quarto to Excel.

So we have been using the flexlsx package along with flextable to easily output formatted tables from a Quarto doc to Excel, and the function that we built fits pretty easily into our current workflow and the other functions we've built to generate those initial tables in R. Yeah, and then using flexlsx in our internal output-to-Excel function has allowed us to output to essentially an Excel template, similar to what a Word template would be.

Yeah, I know these struggles. I know these struggles so well. And Luke was saying there's also openxlsx2. There are a few different packages that have been helpful. Yeah, we're also using openxlsx2. I think some of the formatting in openxlsx2 is a little manual. We already had pre-existing formatted flextables, and flexlsx allows us to persist that formatting into the Excel output rather than having to do it manually, especially since our tables have varying numbers of columns, rows, and headers. The flexlsx package allows us to just take the existing formatting and basically paste it into Excel.
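A minimal sketch of the workflow Erica describes, assuming the flextable, openxlsx2, and flexlsx packages are installed; the sheet name, output path, and example table are placeholders, not the team's internal function:

```r
library(flextable)  # build and style the table
library(openxlsx2)  # create the Excel workbook
library(flexlsx)    # carry flextable formatting into the workbook

# A pre-formatted flextable, standing in for the tables the team's
# own generation functions produce
ft <- flextable(head(mtcars))
ft <- bold(ft, part = "header")

# Write the styled table into an openxlsx2 workbook; flexlsx's
# wb_add_flextable() persists the flextable formatting in Excel
wb <- wb_workbook()
wb <- wb_add_worksheet(wb, "Table 1")
wb <- wb_add_flextable(wb, sheet = "Table 1", ft = ft)
wb_save(wb, "analytic-tables.xlsx")
```

The point of the design is that all styling decisions stay on the flextable side, so the same formatted object can feed a Word output via officer or an Excel output via flexlsx without restyling by hand.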

Open source tools and security

Noor, you asked one about open source. Would you like to ask that one live? So basically, how do you convince stakeholders of the value of going open source instead of sticking with their ongoing contracts for proprietary tools? And secondly, how do you handle security concerns with sensitive data and open source? Because I used to work as a contractor, and that was often a concern of, like, well, we know how this works.

Yeah, really good question. I think I come from a place of privilege in that Flatiron, back in 2016, made a very intentional decision. The people who made this decision wanted us to be open source first, using R as a baseline, because we had the privilege of being a new company that could define how it wanted to operate. And I think part of it too is we're a tech company in the healthcare space, and correspondingly you have somewhat different attitudes on where you want to start. That said, I think you do end up investing a lot more in risk control frameworks, security review processes, et cetera, in order to make those viable, safe, and secure. Because at the end of the day, patient safety and scientific integrity are paramount to what we do.

And so one of the things we've invested in a fair amount is working with our quality team and security team to define the processes and review necessary before integrating open source tools. I'll give you an example with something like flexlsx and openxlsx2. Before taking those into our system, we did a review of, for example, how many dependencies does this bring in? Is the maintainer relatively active? Are we introducing business risk by taking on a dependency from open source? Because sometimes you find a magical package in the wild that does exactly everything you need, but it hasn't been touched in a very long time. And you're introducing either security risk, because there isn't active development, or technical risk and maintenance burden over time. So our team tries to be fairly intentional around what we decide to use when.

And one of the things we love about open source is you can see what's going on. And generally, if it's a very actively maintained area, then if issues do come up and it's well used, then a lot of folks will jump onto that and help support and iron out those wrinkles. And so that's why we try to really, again, anchor to those that are widely used and will have had a lot of those issues vetted already or will continue to be vetted as they come up going forward. And we just try to be intentional and grateful, especially for those in the open source community who have already paved that path because it's not always an easy journey.

Contributing to open source

Yeah, we're all thankful for our open source heroes. That actually brings up a great topic because I think that your team and people in your team under you contribute to open source and the open source ecosystem. I know that you had mentioned a few packages. What were the packages? I know one was dbplyr. Yeah, I can give an opening to Anthony, who's on our team here.

One of the things we've all probably felt when trying to use an open source library is we don't exactly know which aspects of it are well supported. And so you may think, oh, this should work off the shelf, and thankfully someone has already solved that problem really well for you. Other times, you're the first person to think about that problem or use that backend, and the path hasn't yet been paved. When we migrated from Redshift to Snowflake, we had been using dbplyr really heavily on Redshift as a way to pre-process some of the data in SQL, and then we went to Snowflake. I'll let Anthony describe the experience from there. I think this was at a time when I was relatively new to R as well. So it was a bit of a baptism by fire to learn R while learning the innards of the dbplyr package.

Definitely diving into the deep end, but effective, I think. Effective. So over the course of a few months, we were able to work with the tidyverse team to develop a set of translations from R to SQL, and we contributed those to the package. And they're in pretty prevalent use at this point. And I think we can be proud of that. So I'm excited. Yeah, no kidding. That's amazing. I cannot imagine learning R at the same time as learning the innards of a package.
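The kind of R-to-SQL translation work Anthony describes can be explored without a live database, since dbplyr ships simulated backends for exactly this purpose. A small sketch (the column name is made up, and this is an illustration of backend translation generally, not the specific translations the team contributed):

```r
library(dplyr)
library(dbplyr)

# A "lazy" frame against a simulated Snowflake connection: no real
# database is needed to see the SQL a dplyr verb translates to
lf <- lazy_frame(visit_note = "a", con = simulate_snowflake())

# show_query() prints the Snowflake SQL that dbplyr's backend
# generates for this pipeline instead of executing it
lf |>
  mutate(note_upper = toupper(visit_note)) |>
  show_query()
```

Comparing `show_query()` output across simulated backends (e.g. `simulate_redshift()` versus `simulate_snowflake()`) is also how gaps in a backend's translation coverage tend to surface, which is the sort of gap the team's contributions filled.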

What was your background before R, Anthony? Some Python. Some Python. Well, that's really inspiring. Do you have any tips for anybody who's like, I really want to get started with contributing to open source? How can I do it? Do you recommend the deep end? Like, I have a problem that exists that this package could almost solve, but it doesn't, and trying to help add that? Yeah, for sure. I'm not sure if it's my personality type or just general guidance, but I find it most useful to have a problem that I really need to solve when developing a tool.

Mike, do you have anything to add on to that, like getting started with contributing? From a management standpoint, do you allow people to have a certain amount of time in their day or week or month to contribute back to open source? Or does it need to be really focused on problems that Flatiron is solving? Yeah, it's a tough one, right? We're all busy with our day jobs, and the question is always where you focus your energy. And we generally try to, I don't want to say limit, but focus on, would this be critical for us to be able to move forward? Because we're a fairly lean team. We don't have a lot of latitude to carve out as much free capacity as we would like. And so we've tried to be intentional: let's say we're doing a big migration to Snowflake and something isn't working, then it's a great opportunity for someone like Anthony or me or Erica to jump in and try to solve that problem, not only on behalf of us, but also others who may encounter this problem going forward.

So I think it's like, when we need to, we definitely are willing to, and make sure we work with our legal and privacy team to share any considerations for where we give back. But I think importantly, I learned too, even just opening an issue or sharing feedback or discussing with people is a form of contribution. I've had some amazing conversations, even at Posit Conference, where you ask a question of, I'm trying to think about this, and they say, please open an issue. And it's not so much that you're complaining, but you're just sharing ideas and sharing feedback that, for example, you may have a system internally that others don't have access to, and you can help cross those communication gaps via GitHub or via that open source community.

Even just opening an issue or sharing feedback or discussing with people is a form of contribution.

It is such a great point to make that opening an issue or just asking a question can be so super helpful, because you might be starting a conversation, and there could be people out there in the dev team that have been thinking about it too, and have just needed the push to sort of push it over the edge. That happened just this year with tidyverse, actually with dplyr. There were some kind of holes in dplyr where it was not as easy as it probably should have been to modify a column based on the values in another column or based on the values in a named list or a named vector. And people in the community started talking about it, posting about it, asking questions about it, and it turned out that the tidyverse dev team was like, we've already been thinking about this. Now that we have all of these issues popping up, let's do it and put it in there. And so those functions now have been put out and exist. So it really does matter.

Hub and spoke support model

Nick Drew, if you are available to ask live, that'd be great. Hi, Libby. Can you hear me? Hey, I can. Great. Yeah, I think parts of my question were answered a little bit before, but I'll just go ahead and ask it. Mike, thanks for joining us today. Really good information. It sounded like you have some folks that you oversee or supervise, and it wasn't clear to me whether they report directly to you or there's a matrix reporting relationship. But I just wanted to understand a little bit about your support model for that group. When people use desktop applications and open source software, oftentimes people come up, as we've been talking about, come up with their own ways of doing things. And so how does that support look like?

Yeah, it's really evolved. I've been at Flatiron Health for six years, and we've really tried to iterate and learn from different ownership and support models. I'll share where we started and then paint the picture of where we are now. As a team of scientists, we are not always software engineers at our core. And so correspondingly, figuring out who supports the R installation scripts that we all have, or who supports the libraries that we may or may not have created as one-offs, became a question. At some point we said, okay, we're a team where we want to be able to manage some things more centrally, and how do we do that? So we created what was called the R Enablement Initiative, and I would give a shout-out to Nathaniel Phillips, who's not here, who created it, to think about a dedicated core set of maintainers who can support that ecosystem.

And so I think that's kind of a core principle where we've kept. We have a core set of tooling maintainers and a core set of tooling users. And I view that really as a gradient along a hub and spoke. The maintainers are really the hub where we can support and make sure we're prioritizing the things that can impact the most users across the most use cases. The spoke may be, I have a one-off analysis where I want to contribute a function, or I want to create something that's really not necessarily applicable to everyone, and we would push that support out to the spoke.

So currently, I manage two teams: a team of data analysts who are primary users, and a team of data scientists, including Erica, who are supporting our core tooling ecosystem. And we found that generally to be a really helpful model, where our users are focused on either execution or science or analytics, and their goal is to be empowered with the tools they need to do their work really well. And our team of maintainers are aligned with the incentive to empower our users, our analysts, our scientists, to use those tools really well. And so we are always trying to build tools that empower our users hand-in-hand. The other thing I would say is we have an engineering platform team who supports our Posit Team infrastructure and manages Workbench, Package Manager, and Connect alongside our core tooling team, to make sure our platform supports our tooling maintainers, who ultimately support our users using the platform and the tools.

This is such a big topic. And there was one more question, anonymous one on Slido that I saw that was along these lines. The anonymous question said, did you say you manage about 100 data scientists, or did I mishear that? I do not manage 100 data scientists. I manage a subset of those. Our tooling team is roughly two to four people at any point, and our data analytics team is more in the range of 10, but our broader scientific org who is using the platforms is in the realm of 100.

The add-on question to that was, do you have any tips for asynchronous working? How do you communicate with your team? In terms of communication, we generally use Slack a fair amount as our way to intake and triage, which maybe gets to the second part of that question. In interfacing with users, especially if they're less technical or not excited about the technical details, having a quick way to iterate and get down to the root issue to unblock them is so helpful, because ultimately they're not necessarily sure which question to ask, or they see an error message in a package and it's not necessarily clear where to go from there and which team to escalate to. So we really try to first triage, and then second, determine when we need to solve it and what the urgency is back to that user, back to that use case.

Is there anything besides Slack? Are you a Jira team or an Asana team for assigning tasks or tickets? We do use GitLab internally and use issues as a way to track across our internal libraries and development, in the same way most folks would use GitHub. We also use Jira as a team to operationalize against some of those developments. So we're trying to say, across these next two weeks in the sprint, what do we want to focus on? What are the areas that we want to deliver as a team, in addition to balancing our inbound user support workflows, which mostly come in through Slack and may get ticketed in Jira or GitHub or GitLab.

Getting into the healthcare industry

It says, hey, Mike, it's anonymous. I'm a data analyst working in another industry, but interested in getting into the health industry. How important is domain knowledge in your recruitment? Good question. I can give a little bit of background on how I entered this space. My background is in science, math, chemistry, and a little bit of economics. At one point, I thought I would be a doctor or a dentist or a veterinarian. But at some point, I realized my passion is doing that behind the scenes: doing the math, doing the modeling, analyzing the data, and being able to tell that patient narrative in a different way. So for me, I was always curious how I could apply that science and healthcare interest within the analytics space accordingly.

So I went into economic consulting out of grad school and was doing a lot of health insurance litigation, and decided that rather than focusing on how much to pay claims, I wanted to dive deeper into the clinical patient journey that supports those. And so, fortunately, I was able to shift into Flatiron Health and dive deeper into the space. I think it sort of depends on the role that you're entering. Some roles can be really domain agnostic. Let's say it's more specialized, as in the example that Hubert gave: maybe you're an expert in extracting unstructured records. That can be within finance reports, it can be within healthcare records, and you can still apply it across different industries. But there are other areas, let's say you're interested in epidemiology and modeling of scientific methods, where having that domain expertise and experience is pretty critical for you to be impactful in the role you enter.

So something I always value is curiosity and willingness to learn. And for certain data analytics teams, you don't necessarily need to have that clinical experience to do the analytics aspects well and support the business and your stakeholders. So I think it's just important, in the interviewing and conversation process, to understand how much is needed for the role and what you bring to the table. And if it's a role that you really want, figure out if there are ways to close those gaps, either on the job or in your own learning or in a different path towards that position. So I'm always a learner and an advocate for learning. But at the end of the day, I support people finding creative paths as well.

Tooling decisions and Positron

Rachel, would you like to ask that? Sure. I'm trying to remember which question. It might be the one I was just thinking about right now, which is, Mike, I think when we were chatting, we talked a little bit about Positron and that led us down a path of like, how do you support people in the cool new things that they always want to use? Do you have a lot of people who are always coming and asking you for the latest and greatest? Is there a process for people getting things approved to use in your environment? I think one of the most impactful questions we can ask as data scientists is, what's the impact or why are we doing something, right? Is there business value for doing this? Is it a quality of life improvement for our developers? Is it allowing us to answer more questions from the data that we have? Before we decide either what tool to use or whether to introduce a new tool, that's the question we always try to anchor to.

Let's take Positron, for example, where we haven't yet switched folks over in part because we used Posit Team and Posit Workbench as a way to, en masse, allow users to have their development environment pre-specified for them. That said, we try not to be overly prescriptive on what tools are supported. We want to work with users to hear their feedback on what we think is a good approach forward or what others have suggested to us to hopefully move in that hub and spoke so that most of the folks along the various spokes feel well supported as we decide on which tools to introduce and when.

Yes. We don't have a formal intake process for, let's say, someone wanting to use a new package. But let's say we are adding a new dependency to our core sets of projects, then, yeah, we'll make very intentional decisions around whether we want to do it. Coming back to the example we talked about earlier, Excel templates: we recognize that Word in Quarto is really well supported, and that's not something we need to worry about investing in and maintaining going forward. But if we are building out our own in-house R functions that wrap something like openxlsx2 or flexlsx, then we need to be comfortable assuming some amount of maintenance cost going forward, we need the right expertise internally, and we need to feel confident it's something we can sustain before introducing it. A high school guidance counselor once said, you can take any course that you want, but you can't take them all. It's very much the same in the tooling space: you can use whatever tool you want, but you can't use them all, and you can't always support them all well. So it's a matter of making the trade-offs that allow you to maximize your impact, maximize your focus.

Career advice

I am located in San Francisco, and I actually moved here in 20, fun fact, I moved here by meeting some people on a plane who introduced me to their son, whose company I then ended up working at, and so I never intended to be on the West Coast, but my team is all on the East Coast, and so my mornings are usually starting with coffee, taking a moment to warm up for the day, and then hopping into meetings, and then my afternoons are typically more deep-thinking time, able to dive into some coding or dive into some problem-solving and review.

Okay, well, our career question is, is there a piece of career advice that has either really, really helped you, you've really liked, that you try to give to other people, that you mentor people on your team? I think something I always anchor to is, think about what problems need to be solved, and then come up with solutions against those problems. Importantly, as you grow and as you stretch, think about what problems people aren't trying to solve yet, or what is a solution that someone hasn't yet been able to come up with to solve that problem that they care deeply about. Being able to connect with people on the problems they care about allows you to be a clear partner, be an innovative technologist as needed to match their solutions. For me, it's all about collaboration and aligning on the problem to solve before jumping into anything else.

Being able to connect with people on the problems they care about allows you to be a clear partner, be an innovative technologist as needed to match their solutions. For me, it's all about collaboration and aligning on the problem to solve before jumping into anything else.

Do you recommend getting to know your end users and your stakeholders and stuff for all of your people on your teams? Yeah, that's the most important thing. If you build a tool and people don't use it, then you haven't really been successful in building that tool. Being close to the people who need it is the most important thing, and they're ultimately the measure of success. If they're using it to deliver better science, then you've done your job well. Right. Yeah, then you contributed to delivering better science. That's the whole part of understanding the business problem, right? You've got to understand it from the perspective of the person who has the problem if you want to help solve it.

All right. This was a wonderful chat. We have one minute left, so I will say we must move on and say goodbye. Just a reminder to everybody, there's no Hangout next week. We are at PositConf next week. Mike, thank you so much for hanging out with us. This was wonderful. I hope you had a good time, too. Yeah, likewise. Thanks for having me. I'm looking forward to chatting more with you all.

Erica and Anthony, who work with Mike, thank you for being here, both unmuting and in the chat. If there are some topics that we talked about today that you would like to see us talk about in future Hangouts, fill out that survey after we leave and let us know. We would love to take that into consideration. It was so nice seeing everybody's face today. I will see you in two weeks' time for another Hangout, but I will hopefully see you next week and this weekend on the Discord server for PositConf.