Resources

Data engineering for the 99% | Will Hipson @ HIAA | Data Science Hangout

video
Jul 10, 2024
59:40


Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. Welcome to the Data Science Hangout. I'm Rachel Dempsey. I lead customer marketing at Posit. Posit is the open source data science company building tools for the individual, team, and enterprise. So thanks for hanging out with us today. The Hangout is our open space to hear what's going on in the world of data across different industries and connect with others facing similar things as you. And we get together here every Thursday at the same time, same place, unless it's a holiday. So we won't be here on July 4th in two weeks, I believe. But if you're watching this as a recording and want to join us in the future, there are details to add it to your own calendar below.

And I know people really enjoy connecting with other attendees here live as well. So if you're interested in connecting with others, I want to encourage you to say hello in the chat today and briefly introduce yourself, whether that's sharing your role or your base or something you do for fun. Some people like to share their LinkedIn there in the chat too. We're all dedicated to keeping this a friendly and welcoming space, which you all have made it. So thank you. We love hearing from you no matter your years of experience, titles, industry, or the languages that you work with. If you are hiring, feel free to share those roles in the chat here today. I love to see those open roles there. But it's also 100% okay if you just want to listen in today, although I love getting to hear from you live.

So there's three ways that you can jump in today and ask questions or provide your own perspective. So you could raise your hand on Zoom. If you're unsure how to do that, there's a little reactions button below in the Zoom bar where you can raise your hand. You could put questions in the Zoom chat and just put a little asterisk next to it if you want me to read it, or I could call on you to introduce yourself and add some context. And then lastly, we have a Slido link, which I'm sure Curtis is sharing here in the chat right now, where you can ask questions anonymously too. But with all that, thank you for being here today. I'm excited to be joined by my co-host, Will Hipson, Data Engineer at Halifax International Airport Authority. And Will, I'd love to get us started here by just having you introduce yourself and share a little bit about your role, but also something you like to do outside of work for fun too.

Yeah, absolutely. Thanks, Rachel. It's a real pleasure to be here. So I'm a data engineer at the airport here in Halifax, Nova Scotia, on the East Coast of Canada. Briefly, my role involves what you'd typically think of as data engineering, so databases and pipelines, but because it's a small team I also see a lot more of what you might call the full stack. I'm involved in a lot of presentations, so dashboards, reports, the more visual side, Shiny, that you might associate with data science and R specifically. And even the philosophy and activity that goes on within the broader team around data discovery and literacy, advancing that forward beyond just managing the data in the back end.

And in terms of something fun about myself, I like to bake sourdough bread. I got started on it right before the pandemic, actually, just a couple of months before the real craze when everyone was making bread at home while things were shut down. I just got in there a little earlier; it's almost like I anticipated it or something. That's my fun thing to do on a weekend. Not during a heat wave, obviously, because having the oven on makes the house sweltering. But it's a fun thing for me, and the family benefits from it too, because they get bread.

Data at the airport

I love it. Thank you. I think all of us love hearing examples that we can actually resonate with, like going to an airport or being on a plane. So I was curious, can you give us a few examples of the ways that you're using data across the airport?

Yeah, so just for context, we're a medium-sized Canadian airport, and we are international. So there's a lot of activity going on here, even though it's not, say, a Toronto Pearson or Vancouver. As for how we're using data: we're in a pretty early phase right now of what I'd call data adoption at the airport. Before I started, and before the data team even existed, people used dashboards that were built by enterprise companies that were paid tens of thousands of dollars to build something and then sort of go away. So people thought of data in that way. And of course people were also using data in their own space; maybe they had data specific to their domain just sitting on their computer or in SharePoint. So we started this data team fairly recently, as part of the larger IT group, to have a more top-down approach where we have all of the data in one location, a centralized repository, so that people can trust and have access to whatever data they might need to do what they're doing.

Now, you asked for practical examples. Things that come to mind are flight schedules, managing the flights that are incoming and outgoing. Observing passenger flow is another big part of it, especially now that we have recovered from the pandemic. During that pandemic recovery phase, monitoring passenger flow was a huge priority, of course, to see the status of the airport: are people traveling and coming through at the rates they were pre-pandemic? People often ask, well, have we recovered now, comparing the present to, say, 2019. So those are the typical things. And then we like to think a little beyond the immediate of what's going on right here, right now, to start thinking about whether we can use other data sources to improve the airport experience, to help people working here understand, and maybe do model predictions and things like that. So weather data, both locally and even globally, because weather impacts airports; even if it's not us but another airport within our network, that's going to affect how things look here when flights are delayed. And we're thinking about passenger experience too: how do we improve things when someone comes to the airport? Should they be able to easily find bus routes and things like that in one place? Just thinking about that whole journey, from the moment you decide to take your flight to when you've returned, and how data can be used along that lineage.

Thank you. I loved when you were explaining to me how, when you're actually working from the airport, you get to see all of these use cases in action and just be a part of it, too.

Yeah, and that's one of the things I love about this position. So I am at the airport today, and you can go out and walk around and see people coming and going; it's a very busy, active place. People are buying things, they're waiting around, they're going through security. So it really grounds me in the domain of what I'm doing to actually be physically in the space. I realize that's kind of a privilege; not every position allows you to do that. Sometimes you're a few steps removed from that, or you're working remotely, so you don't really see what's going on physically. But for me, it really helps a lot, and I love going on site. I do work a hybrid of work-from-home and in person, but just going downstairs and seeing what's going on, you can see the problems, the things people are struggling with. We can see, for example, if there's a huge backlog of people waiting to get through security, and it makes you wonder, how can we improve this situation? So yeah, it's really great for making me understand that my work does have value somewhere, physically, in this space.

Building the data team from scratch

Yeah, absolutely. I see, Ethan, you had a question in the chat. Do you want to jump in here first? Sure, yeah. Thanks, Will. It's been interesting hearing so far about your role and work, and I know we'll get into a lot more details, but I'm curious to hear a little bit more about what you said at the beginning, that the data team didn't used to exist and your role and the team sound like they're fairly new. So I'm just curious how much context you could share about how that team was created in the first place, how your role was created, and how you were hired. I'm imagining that you probably had to do a fair amount of defining what your role is, if you were creating a data team from scratch. So just curious to hear a little bit about that, if you don't mind sharing.

Yeah, so in terms of what I know about the motivation for the team, there are a couple of things that come to mind. Previously, before there was an in-house data team, what would happen is there'd be a need for something like a dashboard. People would say, I really need to see what's going on with passenger flow right now, it's critical. So they'd hire some company to build something, all that data would be managed within this single application, and it'd be extremely expensive. And because it was a contract type of thing, they'd build it and leave. Then inevitably it would break, and people would be lost, wondering, okay, now what do we do with this? Do we hire them again to fix it? Do we start from scratch? Because by then the priorities had probably shifted a little bit. So it was this constant cycle of hiring to build something, it breaks, replace it. And no knowledge was being generated for the company itself. There was nothing here that could be maintained, or a team to help the broader organization even understand how to use these tools.

And I think also, this has been a little more recent, but with AI being so much in the forefront, now there's a need to help integrate AI. People don't really know specifically what they want to use AI for yet, but they know it's there. And if they don't know its limitations, if they don't have basic data literacy, then they might run into issues. So we're facing that right now. AI is here, it's not going anywhere, people want to use it. They're excited, but they don't really know how to use it appropriately. So we're approaching that through data literacy. And this is very non-technical stuff; this is human-to-human interaction. How do we educate people on how to use these tools?

So I think it's both of those things together. And then you asked about how my role is defined. It's tricky when you're on such a small team; it's me and one other person on the data team, as part of the larger IT group. The work I do is the data engineering, the nuts and bolts of actually having data flowing in, quality checking, pipelines, all the way to the end product. So data engineering, I feel, is maybe not the best name. I feel like full-stack data something is probably better, because it's a little bit of data engineering, but then I see the full landscape, really, from raw source data to something that somebody will actually interact with.

Skipping ahead to AI without clean data

Well, I guess we've already just touched on this topic. I know many organizations are eager to jump into the latest and greatest with AI and ML, and I know you have a lot to share on what can be done without having to jump straight to AI and ML. I just wanted to have you share a little bit with us on how you're doing that.

Yeah. So this comes again from what we see here: people are excited to use AI and they want to use it right away. And this is nothing new; I think a lot of people in this industry understand this, but if you don't have good data, clean data that's ready to use for AI, it's going to lead you astray. These models are not ready for managing data at that point in the life cycle. You need to have good data, and people don't really understand that. They don't see data that way. They see data as a single thing, like a spreadsheet on their machine. They don't see that data has a lineage, a life cycle. It has to start somewhere, at a source, and that source might not be rectangular; it might be JSON. How do you manage that? How do you clean it? How do you remove duplicates? And if you're in a production scenario, which most people are, that data is changing. It's rarely ever static. If you're working off of static data, you're doing analysis, something ad hoc, maybe you're in academia. But if you're in a production setting using data, that data is changing, and you need to be able to monitor it and see whether it has changed in ways that reflect poor quality. For example, something we've implemented recently in our data pipelines is to continuously check for duplicates: verify that the primary keys we're using for our tables are in fact valid. Because this is an assumption we have, and maybe it holds at the beginning, but something along the way happens that introduces duplicates. How are you going to be aware of that? It gets into data profiling, and there's more to it than just duplicate checking.
But this whole lineage is the thing people don't see; they see the end product. And if you miss the whole lineage, the whole life cycle of the data before that, then you can't possibly be using LLMs in a way that's appropriate.
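A primary-key check like the one Will describes can be just a few lines of R. This is a minimal sketch; the function, data frame, and column names are invented for illustration.

```r
# Minimal primary-key check for a pipeline step: warn if the columns
# we assume form a unique key actually contain duplicate rows.
check_primary_key <- function(df, key_cols) {
  n_dupes <- sum(duplicated(df[key_cols]))
  if (n_dupes > 0) {
    warning(sprintf("%d duplicate value(s) found for key (%s)",
                    n_dupes, paste(key_cols, collapse = ", ")))
  }
  n_dupes == 0
}

# A duplicate has snuck into what we assumed was a unique key
flights <- data.frame(
  flight_id = c("AC601", "WS247", "AC601"),
  scheduled = c("08:00", "09:15", "08:00")
)

check_primary_key(flights, "flight_id")  # FALSE, with a warning
```

In a real pipeline you would typically fail the run or raise an alert rather than just warn, so a bad load never reaches downstream consumers.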

Thank you. I see Dom asked a question in the chat and there's a little star there, so I'll read it: you mentioned your two-person data team is part of a larger IT group. Do you have situations where some data tasks are managed by the wider team? And if so, how do you determine who's responsible for which tasks?

Yeah. I'd say almost no; it's all handled by the data team. Where we interact with the IT team is mostly around actual physical servers or things that need to be hardware. A good example: we have a vendor that manages the cameras at the airport. These cameras are designed so that you can't see a person's face; it's a face-down kind of camera. We want to use these cameras to do people counting, and that company actually does that for us. So we worked with the IT team to get that set up. But once it's a matter of, here's your way to access the data, then it's entirely on us. The IT team has their own issues and work to do. They're inundated with even just service requests and things like that, which is what you might typically think of an IT department doing: handing out hardware to people, managing issues that come up day to day. And there's production work, like if the baggage system goes down, which it does periodically, and there are physical machines involved in that. So we're definitely a few steps removed from that. It does feel like a two-person, or even one-person, team, given that they're often doing completely separate stuff from us.

Thank you. Uh, Sunday, I see you, you just put a question in the chat. Want to go next?

Hi, I think you're doing great work. I'm just thinking, do you have some kind of system where you measure, and this could be feedback from whoever your clients are, internal clients within your group, and the passengers too, some kind of feedback system, and maybe a bit of A/B testing here and there? Something that says, we've measured this, we can see this has changed, and this is better, so it makes for a strong argument for your team. Two is awesome, but it's small. It's early days, I guess, and as it grows it gets a little more bureaucratic, but bigger is better, I guess. So my point is, is there some kind of system for measuring and then being able to feed that back to the larger team and say, hey, this is the argument for what we're doing and why we need to grow this team?

Okay, great questions; lots to unpack there. I'm actually going to start at the end and push back a little bit on bigger is better. I do think a small team works very quickly. I realize that when you have a bigger team you get a lot more diverse skill sets, and that can bring value too. But in the 10 months that this data team has existed, we've moved from two dashboards that were developed by enterprise companies charging a hundred thousand dollars, which no one could maintain, to four products, several databases with about 80 tables, and about 20 production pipelines that supply that whole process. And then there are tools for people to integrate with the data and self-serve. So with a team of two, or a small team you can count on a single hand, there's agility there.

And I think scrappy, small setups can get you really far, and I want to talk about that a lot more. So that kind of veered right into that. What was the other part of your question? Yeah, asking about measuring the impact of your data products and seeing how that feeds back into the value you offer, for internal teams as well as the passengers. Yeah. So mostly what we're serving right now is internal, or that's what we're trying to do; we're trying to do self-serve. We do have a few, I guess I'll call them products, like some data that powers a passenger-facing application. But right now we're at such an early stage that any kind of quantitative measurement, KPIs, key performance indicators and things like that, just wouldn't work for us. We need to be able to actually talk with the people we're serving before we can figure out how to measure it in an automated way.

A good example of what we're struggling with right now: we have this data platform, and it's ready for people in the organization to use. How do we get them to use it? We thought this would be a case of, it's there, you build it and they come, and we're realizing, no, it's not as simple as that. People want to use data, but, I don't want to say they're scared, they're used to doing it in a way they're familiar with. Maybe that's sharing Excel files via email or an internal SharePoint. They don't know databases. They don't know tables as well, or they think of them as spreadsheets. And we've done a really good job cataloging, meticulously going through and describing the tables, giving metadata, describing all of the columns. Even that is not enough for self-serve. We have a quick button that takes you to how to get that table. And what we realized, my colleague and I, is that it's not about the tools, it's not about giving them access. It's about literacy first. They need to understand some key data principles. What I think about with that is tidy data, that sort of stuff. Things you might be familiar with doing in Excel, like color-coding cells to mean different things: no, that should be data in an actual column. Or using a bunch of separate sheets in a single Excel workbook. These things don't follow tidy data principles. And even then, you still need to know the domain you're working in and understand that data. There can be different tiers of quality: we might have transactional data that could be really messy, or a table that's undergone more cleaning, processing, and deduplication.
So what we've realized very recently is, yeah, it's not about the tools, it's about the literacy. The team needs to have literacy to be comfortable using the data platform.
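To make the tidy-data point concrete, here is a small sketch of the kind of reshaping involved, with invented numbers: a spreadsheet-style layout with one column per month, versus the tidy version where month is data in its own column.

```r
library(tidyr)

# Spreadsheet-style layout: one column per month (numbers invented)
passengers_wide <- data.frame(
  terminal = c("Domestic", "International"),
  Jan = c(41000, 12000),
  Feb = c(38000, 11000)
)

# Tidy layout: each variable is a column, each observation a row
passengers_long <- pivot_longer(
  passengers_wide,
  cols = c(Jan, Feb),
  names_to = "month",
  values_to = "passengers"
)
passengers_long
```

Once the data is in the long form, grouping, filtering, and plotting all work uniformly, which is much harder when meaning is encoded in column positions or cell colors.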

Handling KPI changes and data stories

On the topic of tidy data too, I know, Saul, you had a question on handling changes in KPI definitions. You want to jump in? Oh, yeah. Thanks so much. Maybe tell us what KPIs are too. Yeah. I guess my questions are around change in general. So KPIs, key performance indicators, metrics; there are things like conversion, for example. How do you handle changes? I've always had difficulty with things like: conversion has now changed from a week after a call to two weeks after the call, but maybe backfills aren't an option based on the data. And then long-term metrics: sometimes we'll have privacy deletions where we can't really track a certain customer over time, and we have tombstoning, where after a certain amount of time some of the columns are deleted for storage efficiency and only the absolute minimum is kept. How do you handle keeping track of changes over long periods of time, and being able to tell executives clear stories when the target, the measure, the yardstick have changed?

Right. So for us, KPIs are more about monitoring the data than about the effectiveness of what we're doing. We have KPIs set up to monitor data flow and data quality in the pipelines, similar to what was asked previously. We don't really have concrete measurements of how effective the team is, or how effective what we're producing is, because we're still at such an early point. But I do expect that will change. And that's something that comes with a more mature, well, I don't want to say mature team, because I feel like the maturity is there; it's more as the organization itself becomes more on board with using data in this way. We still have to be almost continuously promoting the team, just so the larger IT group realizes it's useful. We know it's useful, and things are happening in production that affect passengers. But where we're really trying to pitch this internally, it's still kind of going unnoticed.

I'm really interested in the topic of change management, but also always having to promote the work that your team is doing too. Yeah. And part of what we're trying to do, oh, sorry, I'll just add one more thing. We're also trying to create a culture here of being excited about data. So we're writing what we call data stories. It's a Quarto Markdown document that we host on our webpage, our data hub, which is also built with Quarto and hosted on Posit Connect. We write little vignettes about exciting things you can do with data, geared more towards a non-technical audience. So instead of a typical blog post where you might have code interwoven throughout, it's more the text and the visuals. The idea is, if people get excited about what you can do with data, that you can do that kind of forecasting, those predictions, that's an in for them to start asking more questions. And then we say to them, you can do that too.

Let me take a pause and sit on this one for a little bit. I get excited when I see all the questions in the chat, but that's really interesting to think about, the Quarto doc you're sharing with some of the non-technical stakeholders. Could you maybe share an example of one of those, and what was the impact of it?

Yeah. So a thing that commonly gets brought up here is forecasting, time series forecasting around passenger flow at particular times of the day. If you're not in the airline industry: there are periods in the week when things tend to be heavier, like Fridays and Sundays, people coming back or leaving for the weekend, and around key events; if there's a big local event, people will be coming in for that. And even on the day itself there are peaks: a morning peak when people tend to come in for the morning flights, and then another peak in the afternoon. And we have all this historic data. We have, right now, four years of all the scans: when you line up with your boarding pass and an agent scans it, we get that in our system. Not all the information about who you are, but that there was a scan at this point at this time. So we can use that to start forecasting passenger flow. And we can use that to start answering questions around how full flights are going to be. Because, if you've heard a little bit about the airline industry, airlines tend to overbook, expecting that some people will miss their flight or decide not to go at the last minute. So we can also see, too, flights that are overloaded, and we can look and do forecasting. I wrote a blog post internally here about forecasting what are called load factors. That's the proportion, the percentage, of how full a flight is going to be. So a hundred percent means that flight is full, every seat is occupied; then 105 or 110 percent would mean they've extremely overbooked that flight. And we want to see times of year, even times of day, where flights are more likely to be overbooked.
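The load-factor arithmetic itself is simple. A toy sketch, with invented routes and counts:

```r
# Load factor = booked (or boarded) passengers / seats on the aircraft.
# Values above 1 indicate overbooking. All numbers here are invented.
flights <- data.frame(
  flight = c("YHZ-YYZ", "YHZ-YUL", "YHZ-LHR"),
  booked = c(174, 130, 310),
  seats  = c(174, 140, 298)
)
flights$load_factor <- flights$booked / flights$seats

flights[flights$load_factor > 1, ]  # the overbooked flight(s)
```

The forecasting work described above would then aggregate these ratios by time of day and season to see when overbooking is most likely.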

Managing priorities and stakeholder expectations

Thank you so much. Michelle had asked a question a little bit earlier: with a small team, any tips on managing priorities or expectations with stakeholders? Sometimes it feels like everything is a high priority, but since you're also having to educate and work on documentation for existing projects, I imagine you'd be resource-constrained.

Yeah, that's a really good question. When we started, Ryan and I began using GitLab, and we're still using it. We were going to document everything, prioritize issues, and use it to its fullest extent. And that just ended up taking too much time. Honestly, creating issues for every single thing that comes up and prioritizing them, it's like, let's just use GitLab for deployment and hosting our code, a repository for our code, and maybe occasional issues that are really important to think about. We're too small a team for more than that; we can manage most of it in our heads or on a whiteboard. So a lot of it is a little more back-of-the-napkin, just keeping things in working memory. But you're right, the priorities aren't just self-generated; people come in and ask for things, and it's a lot of ad hoc. That's probably where the question came from: this thing all of a sudden becomes extremely high priority, drop everything and do it now. I'll say we're pretty fortunate here that people are strongly discouraged from doing that kind of thing, and if they do, there's an expectation that nothing can happen right away.

And I know that's not always the case. So what I would recommend is trying to give people realistic expectations about how long things are going to take. I realize that's really hard, because it's so hard to know how long a software or data project will take. Sometimes people get the impression that because you can do something really quickly with code, that's how long the whole thing will take. But sometimes the proof of concept is a small fraction of the actual time it takes to do the full piece. We've had cases where it took us 10 minutes to do a POC, but then taking it to production took three months. And it wasn't three months of building; it was three months of iterating: the data doesn't look quite right, oh, we actually need it filtered by this, it was actually that. It was that iterative going back and forth. So when stakeholders come to you with a request, build in the fact that it's not just how quickly you can build the thing, but how long it will take to iterate over it.

Simulation and the Maestro package

I will say, I don't have the chat open here, but a little message pops up every time someone posts something, a little snippet. And I saw some people talking about simulation packages, SimPy and simmer. In my previous job I used simmer a lot, because I was also in transportation, except it was maritime instead of aviation. And I loved using simmer; it's such a great package. Big kudos to its developers. It works blazingly fast because it's C++ underneath, using Rcpp. We had problems like, how can we simulate vessels coming into a port, to optimize scheduling at the port. And every time I think about problems I see here now at the airport, it's like, oh, can I use simmer for this problem? Because I really loved using simmer.

So just a shout-out, because people were talking about simulation. I think simulation is a really undervalued tool for solving problems, and even for doing prediction and forecasting. If you can lay out your scenario in discrete steps and play that over a bunch of different scenarios and permutations, it really helps stakeholders, because you can ask questions like, well, what if we try it like this? What if we add an additional lane to the security queue so that we can process people faster? How much is that going to save us in terms of waiting time? And supply chain has no shortage of those types of questions: how can we tweak the system to improve flow, to improve throughput? Yeah, it's a really fascinating space, and I can't wait to use simmer again because it's such a great package.
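A minimal simmer sketch of the security-lane question might look like the following. The arrival and screening rates here are invented; a real model would be fit to observed data.

```r
library(simmer)
set.seed(123)

# A passenger seizes a security lane, is screened, then releases it
passenger <- trajectory("passenger") %>%
  seize("lane", 1) %>%
  timeout(function() rexp(1, rate = 1 / 1.5)) %>%  # ~1.5 min screening
  release("lane", 1)

simulate_security <- function(n_lanes) {
  env <- simmer("security") %>%
    add_resource("lane", capacity = n_lanes) %>%
    add_generator("passenger", passenger, function() rexp(1, rate = 1)) %>%
    run(until = 8 * 60)  # an 8-hour shift, in minutes
  arr <- subset(get_mon_arrivals(env), finished)
  mean(arr$end_time - arr$start_time - arr$activity_time)  # mean wait
}

# What does adding a lane buy us in average waiting time?
simulate_security(n_lanes = 2)
simulate_security(n_lanes = 3)
```

Comparing the two runs answers exactly the "what if we add a lane" question from the paragraph above, and the same pattern extends to permutations over arrival peaks or staffing levels.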

It's actually my first time hearing about that one. So thank you all. But while we're talking about packages, I know you have an exciting package your team has worked on and I want to make sure I give some time to chat about that too.

Yeah, thanks. So we've developed a package for scheduling pipelines. One problem we were facing is, well, we have all these R pipelines, and by pipelines I mean mostly ETL. So you have data coming in from a source that you want to extract, that's the E. The T is you want to do some transformation on that, like filter, mutate, dplyr-type stuff. And then you need to load it somewhere, like into a database, that's the L. We have a ton of those pipelines. And what we were doing before is we'd have a project for each pipeline, put it somewhere, and schedule it. And we had like 15 or 18 of these pipelines, and it's like, well, can we do this within a single project? And how would we schedule it? Because R is not something that you typically have continuously running. You'd have to keep running it over and over again.
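To make the E-T-L shape concrete, here's a minimal sketch of one such pipeline in R. The URL, table, and column names are placeholders, and the Postgres driver is just one possibility; swap in whatever source and destination you actually have:

```r
library(dplyr)

# E -- extract: pull raw data from a source
# (the URL is a placeholder for whatever your source is)
raw <- readr::read_csv("https://example.com/raw_flights.csv")

# T -- transform: filter/mutate, dplyr-type cleanup
clean <- raw |>
  filter(!is.na(flight_id)) |>
  mutate(day = as.Date(scheduled_time))

# L -- load: append the result to a database table
con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "airport")
DBI::dbWriteTable(con, "flights_clean", clean, append = TRUE)
DBI::dbDisconnect(con)
```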

And we built a package called Maestro. The name is a nod to orchestration, for orchestrating data pipelines, and we put it out on GitHub. It's in its early stages, but if it's not already in the chat, I'll share it there. We really want people to use it. So we have 18 pipelines running in production off of Maestro. And what's nice about it is you can deploy it on something like Posit Connect, or really on any kind of server. You have a bunch of pipelines sitting in a folder, and these are just R functions, and you use roxygen tags. If you've heard of roxygen for documentation, you use roxygen tags to specify: I want this script to run once a day at 12 PM, and I want this script to run every three hours, and so on and so forth. So you have a bunch of these pipelines, these R scripts, and you have an orchestrator script that you schedule as well. And every time that orchestrator runs, it checks which of the pipelines need to go, kicks off the ones that need to go, and skips the ones that don't. You get some observability built in as well. You see how many pipelines had errors or warnings, what the logs were that came out of each pipeline, and how long they took. We were using this internally and thought, well, let's just put it out there for the public. It's the first time for myself really developing a package to be used non-internally. I had some experience building R packages for internal use, but the level of quality, checking, and documentation involved for a public package was a lot more. So it was a really great learning experience for sure. And I'm hoping people will use it, and that they'll break it and put issues up there and tell me where the documentation doesn't make sense, and then we'll rewrite it. This is what we want. We want constructive criticism, and then hopefully there will be a CRAN release at some point in the near future.
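As a rough sketch of what that looks like, a pipeline is just an R function annotated with roxygen tags, plus an orchestrator script. The tag and function names below follow Maestro's early releases and may change, so check the package documentation for your version; the pipeline name and schedule are made up for illustration:

```r
# pipelines/extract_flights.R
# The roxygen tags tell Maestro when this pipeline should run.

#' @maestroFrequency 1 day
#' @maestroStartTime 2024-07-10 12:00:00
extract_flights <- function() {
  # ...your ETL logic goes here...
}

# orchestrator.R -- this script is what you actually schedule
# (e.g., every 15 minutes on Posit Connect or cron)
library(maestro)

schedule <- build_schedule(pipeline_dir = "pipelines")
run_schedule(schedule, orch_frequency = "15 minutes")
```

Each time the orchestrator fires, it compares the current time against each pipeline's tags, runs the ones that are due, and skips the rest.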

Using Posit Connect for data engineering

Thank you so much for sharing that with all of us too. And I've been so impressed with how open your organization is to helping others learn as well and like sharing the work that you're doing so openly. I did share your team's customer story in the chat a little bit earlier as well if people want to check that out. Because I never want to put people on the spot about how they use our Pro Tools, but I thought because you already did a customer story, it might be nice to share a little bit about how you're using Posit in your workflows.

Yeah. We've been using Posit since kind of the first day here. Both myself and Ryan use R heavily; that's our main language. And I just got the message that my earbuds are probably going to die, so I'll switch over. Hope that sounds good. Okay. Yeah. So we've been using Posit Connect, and initially we thought, you know, it's great for serving content like Shiny apps, Plumber APIs, and documents. And then we kind of thought about it, like, you know what, you can run ETL pipelines off of Posit Connect too. You just deploy an R Markdown or Quarto script with the logic you want to run, you schedule it, and there you go. It can do data engineering as well. And if you're at an early stage in your organization, or you're just doing things yourself, and you don't have a database, just use pins. What we realized is it's not just for serving content; it can do a lot more than that. And I think Maestro has a place there too, because if you deploy a Maestro project, and we've done this ourselves, with your orchestrator as a Quarto document scheduled to run, then you can run all your pipelines in one project off of there.
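For instance, a scheduled script can use pins as a lightweight data store on Posit Connect instead of a database. This is a rough sketch under that assumption; the data frame, pin name, and username are placeholders:

```r
library(pins)

# In the scheduled ETL script: write the processed data to a
# board on Posit Connect (no database required)
board <- board_connect()
pin_write(board, flight_counts, name = "flight_counts")

# In a Shiny app, report, or any other piece of content:
# read the latest version back (pins on Connect are usually
# addressed as "owner/name")
flight_counts <- pin_read(board, "my_username/flight_counts")
```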

One thing I think is great about our team, myself and Ryan: every week or so, maybe even a couple of times a week, we think about how we would redo things if we needed to run on the cheap. If we were just doing stuff ourselves and didn't have an organization with tons of money to support us, how would we do things? What tools would we need? Well, you need somewhere to store data and you need somewhere to compute on it. And Posit Connect jumps out as one option for us because we're very familiar with it and we've used it before. But don't feel like you need to be swayed by these massive Fortune 500 companies. Don't worry about what Meta is doing, or the latest and greatest stuff being shared by Fortune 500 companies on LinkedIn. Don't get persuaded or feel like you need to adopt those. I love scrappy data science. I just want to schedule an R script. What's the simplest way I can do that without, you know, shelling out tons of money? Ask those questions a lot, and keep asking them. It doesn't mean you do it necessarily, but it's good to reflect on how you would do things affordably and all that stuff.

I love scrappy data science. I just want to schedule an R script. What's the simplest way I can do that without, you know, shelling out tons of money? Ask those questions a lot, and keep asking them.

Frameworks for saying no and growing the team

Does your team have a framework for rejecting work? If so, do you have any advice for communicating no to stakeholders?

Yes, we have a framework. It hasn't been widely put out there or adopted yet, but we do have one. Typically, if there would be a no, it's like, let's say someone wants to use AI in a particular department, or they want to use AI on some data. We would ask them, well, how ready are you for automation? The data that you want us to use, because not all of it is shared across the organization, some of it's specific to a department, is it tidy? Can we ingest it in a way that doesn't involve you sending us spreadsheets via email? So we'd ask those kinds of questions. And if they're not ready, it's like, no, you have to be ready. You have to meet a certain level of automation readiness before things can happen. And I think more and more we're realizing, too, that people need to come with a somewhat defined question or goal. It can't be too open-ended. "I want to predict passengers coming into the airport," well, that's a pretty broad question when you start drilling down into it. How often do you want to predict? Is it every minute? Is it every day? That sort of thing.

So I think Isaiah maybe had to drop off for another meeting, but I wanted to ask their question, especially because Isaiah said, "I use the airport a ton and remain intrigued and impressed by the two-person team." So Isaiah asked: now, 10 months in, what skill set or type of data talent do you think you need to take the great work you're doing to the next level?

Yeah, that's a really good question. I think 10 months ago, I would have said DevOps, 100%. But I feel like in these 10 months we've learned a lot of DevOps just by doing. And Alex Gold's book on DevOps for Data Science, if you haven't read that, it's a really great book and online resource, especially if you don't know anything about setting up your own server or whatnot. Yeah, I think what would really be nice to have now is someone to be a more direct interface with the different individual departments here. Because I'm not a domain expert. I work in the airport, I understand enough about it, and I'm learning a lot, but I didn't learn this in school. And most data scientists are like that, right? They come in, they know the technical capabilities, they know how to approach data, but they don't really know the problem domain; they learn it as they go. I think we would really benefit from someone who can be that liaison between departments, who can ask them more questions and understand what they need, but who still has some data literacy and technical know-how to start down that path. I want someone now to actually start really using our data, using the data platform. Because I'm the builder of the data platform. I use it, but I'm not really using it to answer questions, if that makes sense. I'm using it to provide it to people. Now I want people to use it.

Managing new ideas and exploration time

Absolutely. Grace, I see you had a question in the chat. Do you want to go next? Hi. Yeah, maybe just one quick question. How do you manage the flow of new ideas in your daily work, and the urge to incorporate new tools, for example?

Yeah, that's a good question. Because we're a small team, just me and one other person, whether we're in the office together or working remote, we pretty much just say, hey, do you want to chat about something? Sure. And we meet and write on the whiteboard. We've also decided, as a team, that every Friday afternoon for three hours we're going to set aside specific work requirements and just dream. Maybe we'll watch a conference video, or we'll talk, or we'll explore new tech. It's completely apart from work; there's no requirement to get anything done. It's just to explore and hang out and build new stuff, try new things. If there's a new R package that came out that week that we're excited about, we'll spend that time digging into it. So I think setting those specific times of the day or week, whatever makes sense for the team, to just explore, it doesn't even have to be technical stuff. Oftentimes we'll just sit and chat and complain about things. But that's good too, and ideas flow from that. The package that we built, Maestro, came out of that kind of space. You need to have the freedom to try things and make mistakes. If you're constantly in an environment where it's just getting the next thing done, getting the next thing out, it's going to lead to burnout, and you're not going to have that professional development that's fulfilling. Yeah, thank you.

Scheduling Maestro on Posit Connect

Love that. Thank you. I see a few people in the chat are really excited to check out Maestro. And Michael had a question: if you deploy to Connect as R Markdown or Quarto, do you schedule that to run at some frequency in Connect's UI?

Yeah, so the thing is, you schedule your pipelines using roxygen comments in the pipelines themselves, but you also need to schedule your orchestrator. And we're trying to figure out how to make this as easy as possible. There's really two things you need to do. You need to say in the code how often your orchestrator is going to run; maybe you want it to run every 15 minutes. And that decision is based on how many different pipelines you have and their frequencies. If I have three pipelines running every hour on different offsets, maybe one on the half hour and one on the 15, then yeah, I should really run my orchestrator every 15 minutes. And then when I go to deploy it, I need to make sure that I actually do schedule it to run every 15 minutes. The thing about Maestro is it does need an environment to actually run in. You can deploy locally and run it on cron or Task Scheduler, or you can put it in the cloud. Just saying in the orchestrator that you're going to schedule it doesn't mean it's actually going to run; you have to go deploy it and do what you said you were going to do. It's like a commitment: I'm going to schedule my orchestrator to run every 15 minutes; okay, now you need to make sure you actually do that. And on the pkgdown site, just today I wrote a vignette around scheduling and how to make sense of it. Because the thing about Maestro is that it's stateless. That's great for you, because it saves on compute, it saves on a bunch of things. But to work in that kind of way, it takes some shortcuts and makes some assumptions.
And understanding that can be a little bit different if you're coming from an environment where you've used something like Airflow, or some other continuously running service that's checking the times and kicking off pipelines based on that. So yeah, you do need to actually schedule it to run somewhere. We've done it in Google Cloud too; we just did that this
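As one illustration of "actually do what you said you were going to do": if you deploy locally on a Mac or Linux machine, the commitment could be a crontab entry like this. The project path and script name are placeholders; point them at your own project:

```shell
# Run the Maestro orchestrator every 15 minutes, appending output to a log
# (project path and script name are placeholders)
*/15 * * * * cd /home/me/maestro-project && Rscript orchestrator.R >> orchestrator.log 2>&1
```

The cron frequency here must match whatever `orch_frequency` the orchestrator code declares, which is exactly the stateless-design assumption described above.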