Resources

Data engineering for the 99% | Will Hipson @ HIAA | Data Science Hangout

video
Jul 10, 2024
59:40


Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. Welcome to the Data Science Hangout. I'm Rachel Dempsey. I lead customer marketing at Posit. Posit is the open source data science company building tools for the individual, team, and enterprise. So thanks for hanging out with us today. The Hangout is our open space to hear what's going on in the world of data across different industries and connect with others facing similar things as you. And we get together here every Thursday at the same time, same place, unless it's a holiday. So we won't be here on July 4th in two weeks, I believe. But if you're watching this as a recording and want to join us in the future, there are details to add it to your own calendar below.

And I know people really enjoy connecting with other attendees here live as well. So if you're interested in connecting with others, I want to encourage you to say hello in the chat today and briefly introduce yourself, whether that's sharing your role or your base or something you do for fun. Some people like to share their LinkedIn there in the chat too. We're all dedicated to keeping this a friendly and welcoming space, which you all have made it. So thank you. We love hearing from you no matter your years of experience, titles, industry, or the languages that you work with. If you are hiring, feel free to share those roles in the chat here today. I love to see those open roles there. But it's also 100% okay if you just want to listen in today, although I love getting to hear from you live.

So there's three ways that you can jump in today and ask questions or provide your own perspective. So you could raise your hand on Zoom. If you're unsure how to do that, there's a little reactions button below in the Zoom bar where you can raise your hand. You could put questions in the Zoom chat and just put a little asterisk next to it if you want me to read it, or I could call on you to introduce yourself and add some context. And then lastly, we have a Slido link, which I'm sure Curtis is sharing here in the chat right now, where you can ask questions anonymously too. But with all that, thank you for being here today. I'm excited to be joined by my co-host, Will Hipson, Data Engineer at Halifax International Airport Authority. And Will, I'd love to get us started here by just having you introduce yourself and share a little bit about your role, but also something you like to do outside of work for fun too.

Yeah, absolutely. Thanks, Rachel. It's a real pleasure to be here. So I'm a data engineer at the airport here in Halifax, Nova Scotia, on the East Coast of Canada. Briefly, my role involves what you'd typically think of as data engineering, so databases and pipelines, but because it's a small team I also see a lot more of what you might call the full stack. I'm involved in a lot of presentations, so dashboards, reports, the more visual side, Shiny, that you might associate with data science and R specifically. And even the philosophy and activity that goes on within the broader team around data discovery and literacy, advancing that forward beyond just managing the data in the back end.

And in terms of something fun about myself, I like to bake sourdough bread. I got started on it right before the pandemic, actually, just a couple of months before the real craze when everyone was making bread at home while things were shut down. I just got in there a little earlier; it's almost like I anticipated it or something. That's my fun thing to do on a weekend. Not during a heat wave, obviously, because having the oven on makes the house sweltering. But it's a fun thing for me, and the family benefits from it too, because they get bread.

Data at the airport

I love it. Thank you. I think all of us love hearing examples that we can actually resonate with, like going to an airport or being on a plane. So I was curious, can you give us a few examples of the ways that you're using data across the airport?

Yeah, so just for context, we're a medium-sized Canadian airport, and we are international. So there's a lot of activity going on here, even though it's not, say, a Toronto Pearson or Vancouver. As for how we're using data: we're in a pretty early phase right now of what I'd call data adoption at the airport. Before I started, and before the data team even existed, people used dashboards that were built by enterprise companies that were paid tens of thousands of dollars to build something and then sort of go away. So people thought of data in that way. And of course people were also using data in their own space; maybe they had data specific to their domain just sitting on their computer or in SharePoint. So we started this data team fairly recently, as part of the larger IT group, to have a more top-down approach where we have all of the data in one location, a centralized repository, so that people can trust and have access to whatever data they might need to do what they're doing.

Now, you asked for practical examples. Things that come to mind are flight schedules, managing the flights that are incoming and outgoing. Observing passenger flow is another big part of it, especially now that we have recovered from the pandemic. During that pandemic recovery phase, monitoring passenger flow was a huge priority, of course, to see the status of the airport: are people traveling and coming through at the rates they were pre-pandemic? People often ask, well, have we recovered now, comparing the present to, say, 2019. So those are the typical things. And then we like to think a little beyond the immediate of what's going on right here, right now, to start thinking about whether we can use other data sources to improve the airport experience, to help people working here understand, and maybe do model predictions and things like that. So weather data, both locally and even globally, because weather impacts airports; even if it's not us but another airport within our network, that's going to affect how things look here when flights are delayed. And we're thinking about passenger experience too: how do we improve things when someone comes to the airport? Should they be able to easily find bus routes and things like that in one place? Just thinking about that whole journey, from the moment you decide to take your flight to when you've returned, and how data can be used along that lineage.

Thank you. I loved when you were explaining to me how, when you're actually working from the airport, you get to see all of these use cases in action and just be a part of it, too.

Yeah, and that's one of the things I love about this position. So I am at the airport today, and you can go out and walk around and see people coming and going; it's a very busy, active place. People are buying things, they're waiting around, they're going through security. So it really grounds me in the domain of what I'm doing to actually be physically in the space. I realize that's kind of a privilege; not every position allows you to do that. Sometimes you're a few steps removed from that, or you're working remotely, so you don't really see what's going on physically. But for me, it really helps a lot, and I love going on site. I do work a hybrid of work-from-home and in person, but just going downstairs and seeing what's going on, you can see the problems, the things people are struggling with. We can see, for example, if there's a huge backlog of people waiting to get through security, and it makes you wonder, how can we improve this situation? So yeah, it's really great for making me understand that my work does have value somewhere, physically, in this space.

Building the data team from scratch

Yeah, absolutely. I see, Ethan, you had a question in the chat. Do you want to jump in here first? Sure, yeah. Thanks, Will. It's been interesting hearing so far about your role and work, and I know we'll get into a lot more details, but I'm curious to hear a little bit more about what you said at the beginning, that the data team didn't used to exist and your role and the team sound like they're fairly new. So I'm just curious how much context you could share about how that team was created in the first place, how your role was created, and how you were hired. I'm imagining that you probably had to do a fair amount of defining what your role is, if you were creating a data team from scratch. So just curious to hear a little bit about that, if you don't mind sharing.

Yeah, so in terms of what I know about the motivation for the team, there are a couple of things that come to mind. Previously, before there was an in-house data team, what would happen is there'd be a need for something like a dashboard. People would say, I really need to see what's going on with passenger flow right now, it's critical. So they'd hire some company to build something, all that data would be managed within this single application, and it'd be extremely expensive. And because it was a contract type of thing, they'd build it and leave. Then inevitably it would break, and people would be lost, wondering, okay, now what do we do with this? Do we hire them again to fix it? Do we start from scratch? Because by then the priorities had probably shifted a little bit. So it was this constant cycle of hiring to build something, it breaks, replace it. And no knowledge was being generated for the company itself. There was nothing here that could be maintained, or a team to help the broader organization even understand how to use these tools.

And I think also, this has been a little more recent, but with AI being so much in the forefront, now there's a need to help integrate AI. People don't really know specifically what they want to use AI for yet, but they know it's there. And if they don't know its limitations, if they don't have basic data literacy, then they might run into issues. So we're facing that right now. AI is here, it's not going anywhere, people want to use it. They're excited, but they don't really know how to use it appropriately. So we're approaching that through data literacy. And this is very non-technical stuff; this is human-to-human interaction. How do we educate people on how to use these tools?

So I think it's both of those things together. And then you asked about how my role is defined. It's tricky when you're on such a small team; it's me and one other person on the data team, as part of the larger IT group. The work I do is the data engineering, the nuts and bolts of actually having data flowing in, quality checking, pipelines, all the way to the end product. So data engineering, I feel, is maybe not the best name. I feel like full-stack data something is probably better, because it's a little bit of data engineering, but then I see the full landscape, really, from raw source data to something that somebody will actually interact with.

Skipping ahead to AI without clean data

Well, I guess we've already just touched on this topic. I know many organizations are eager to jump into the latest and greatest with AI and ML, and I know you have a lot to share on what can be done without having to jump straight to AI and ML. I just wanted to have you share a little bit with us on how you're doing that.

Yeah. So this comes again from what we see here: people are excited to use AI and they want to use it right away. And this is nothing new; I think a lot of people in this industry understand this, but if you don't have good data, clean data that's ready to use for AI, it's going to lead you astray. These models are not ready for managing data at that point in the life cycle. You need to have good data, and people don't really understand that. They don't see data that way. They see data as a single thing, like a spreadsheet on their machine. They don't see that data has a lineage, a life cycle. It has to start somewhere, at a source, and that source might not be rectangular; it might be JSON. How do you manage that? How do you clean it? How do you remove duplicates? And if you're in a production scenario, which most people are, that data is changing. It's rarely ever static. If you're working off of static data, you're doing analysis, something ad hoc, maybe you're in academia. But if you're in a production setting using data, that data is changing, and you need to be able to monitor it and see whether it has changed in ways that reflect poor quality. For example, something we've implemented recently in our data pipelines is to continuously check for duplicates: verify that the primary keys we're using for our tables are in fact valid. Because this is an assumption we have, and maybe it holds at the beginning, but something along the way happens that introduces duplicates. How are you going to be aware of that? It gets into data profiling, and there's more to it than just duplicate checking.
But this whole lineage is the thing people don't see; they see the end product. And if you miss the whole lineage, the whole life cycle of the data before that, then you can't possibly be using LLMs in a way that's appropriate.
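A primary-key check like the one Will describes can be just a few lines of R. This is a minimal sketch; the function, data frame, and column names are invented for illustration.

```r
# Minimal primary-key check for a pipeline step: warn if the columns
# we assume form a unique key actually contain duplicate rows.
check_primary_key <- function(df, key_cols) {
  n_dupes <- sum(duplicated(df[key_cols]))
  if (n_dupes > 0) {
    warning(sprintf("%d duplicate value(s) found for key (%s)",
                    n_dupes, paste(key_cols, collapse = ", ")))
  }
  n_dupes == 0
}

# A duplicate has snuck into what we assumed was a unique key
flights <- data.frame(
  flight_id = c("AC601", "WS247", "AC601"),
  scheduled = c("08:00", "09:15", "08:00")
)

check_primary_key(flights, "flight_id")  # FALSE, with a warning
```

In a real pipeline you would typically fail the run or raise an alert rather than just warn, so a bad load never reaches downstream consumers.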

Thank you. I see Dom asked a question in the chat and there's a little star there, so I'll read it: you mentioned your two-person data team is part of a larger IT group. Do you have situations where some data tasks are managed by the wider team? And if so, how do you determine who's responsible for which tasks?

Yeah. I'd say almost no; it's all handled by the data team. Where we interact with the IT team is mostly around actual physical servers or things that need to be hardware. A good example: we have a vendor that manages the cameras at the airport. These cameras are designed so that you can't see a person's face; it's a face-down kind of camera. We want to use these cameras to do people counting, and that company actually does that for us. So we worked with the IT team to get that set up. But once it's a matter of, here's your way to access the data, then it's entirely on us. The IT team has their own issues and work to do. They're inundated with even just service requests and things like that, which is what you might typically think of an IT department doing: handing out hardware to people, managing issues that come up day to day. And there's production work, like if the baggage system goes down, which it does periodically, and there are physical machines involved in that. So we're definitely a few steps removed from that. It does feel like a two-person, or even one-person, team, given that they're often doing completely separate stuff from us.

Thank you. Uh, Sunday, I see you, you just put a question in the chat. Want to go next?

Hi, I think you're doing great work. I'm just thinking, do you have some kind of system where you measure, and this could be feedback from whoever your clients are, internal clients within your group, and the passengers too, some kind of feedback system, and maybe a bit of A/B testing here and there? Something that says, we've measured this, we can see this has changed, and this is better, so it makes for a strong argument for your team. Two is awesome, but it's small. It's early days, I guess, and as it grows it gets a little more bureaucratic, but bigger is better, I guess. So my point is, is there some kind of system for measuring and then being able to feed that back to the larger team and say, hey, this is the argument for what we're doing and why we need to grow this team?

Okay, great questions; lots to unpack there. I'm actually going to start at the end and push back a little bit on bigger is better. I do think a small team works very quickly. I realize that when you have a bigger team you get a lot more diverse skill sets, and that can bring value too. But in the 10 months that this data team has existed, we've moved from two dashboards that were developed by enterprise companies charging a hundred thousand dollars, which no one could maintain, to four products, several databases with about 80 tables, and about 20 production pipelines that supply that whole process. And then there are tools for people to integrate with the data and self-serve. So with a team of two, or a small team you can count on a single hand, there's agility there.

And I think scrappy, small setups can get you really far, and I want to talk about that a lot more. So that kind of veered right into that. What was the other part of your question? Yeah, asking about measuring the impact of your data products and seeing how that feeds back into the value you offer, for internal teams as well as the passengers. Yeah. So mostly what we're serving right now is internal, or that's what we're trying to do; we're trying to do self-serve. We do have a few, I guess I'll call them products, like some data that powers a passenger-facing application. But right now we're at such an early stage that any kind of quantitative measurement, KPIs, key performance indicators and things like that, just wouldn't work for us. We need to be able to actually talk with the people we're serving before we can figure out how to measure it in an automated way.

A good example of what we're struggling with right now: we have this data platform, and it's ready for people in the organization to use. How do we get them to use it? We thought this would be a case of, it's there, you build it and they come, and we're realizing, no, it's not as simple as that. People want to use data, but, I don't want to say they're scared, they're used to doing it in a way they're familiar with. Maybe that's sharing Excel files via email or an internal SharePoint. They don't know databases. They don't know tables as well, or they think of them as spreadsheets. And we've done a really good job cataloging, meticulously going through and describing the tables, giving metadata, describing all of the columns. Even that is not enough for self-serve. We have a quick button that takes you to how to get that table. And what we realized, my colleague and I, is that it's not about the tools, it's not about giving them access. It's about literacy first. They need to understand some key data principles. What I think about with that is tidy data, that sort of stuff. Things you might be familiar with doing in Excel, like color-coding cells to mean different things: no, that should be data in an actual column. Or using a bunch of separate sheets in a single Excel workbook. These things don't follow tidy data principles. And even then, you still need to know the domain you're working in and understand that data. There can be different tiers of quality: we might have transactional data that could be really messy, or a table that's undergone more cleaning, processing, and deduplication.
So what we've realized very recently is, yeah, it's not about the tools, it's about the literacy. The team needs to have literacy to be comfortable using the data platform.
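To make the tidy-data point concrete, here is a small sketch of the kind of reshaping involved, with invented numbers: a spreadsheet-style layout with one column per month, versus the tidy version where month is data in its own column.

```r
library(tidyr)

# Spreadsheet-style layout: one column per month (numbers invented)
passengers_wide <- data.frame(
  terminal = c("Domestic", "International"),
  Jan = c(41000, 12000),
  Feb = c(38000, 11000)
)

# Tidy layout: each variable is a column, each observation a row
passengers_long <- pivot_longer(
  passengers_wide,
  cols = c(Jan, Feb),
  names_to = "month",
  values_to = "passengers"
)
passengers_long
```

Once the data is in the long form, grouping, filtering, and plotting all work uniformly, which is much harder when meaning is encoded in column positions or cell colors.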

Handling KPI changes and data stories

On the topic of tidy data too, I know, Saul, you had a question on handling changes in KPI definitions. You want to jump in? Oh, yeah. Thanks so much. Maybe tell us what KPIs are too. Yeah. I guess my questions are around change in general. So KPIs, key performance indicators, metrics; there are things like conversion, for example. How do you handle changes? I've always had difficulty with things like: conversion has now changed from a week after a call to two weeks after the call, but maybe backfills aren't an option based on the data. And then long-term metrics: sometimes we'll have privacy deletions where we can't really track a certain customer over time, and we have tombstoning, where after a certain amount of time some of the columns are deleted for storage efficiency and only the absolute minimum is kept. How do you handle keeping track of changes over long periods of time, and being able to tell executives clear stories when the target, the measure, the yardstick have changed?

Right. So for us, KPIs are more about monitoring the data than about the effectiveness of what we're doing. We have KPIs set up to monitor data flow and data quality in the pipelines, similar to what was asked previously. We don't really have concrete measurements of how effective the team is, or how effective what we're producing is, because we're still at such an early point. But I do expect that will change. And that's something that comes with a more mature, well, I don't want to say mature team, because I feel like the maturity is there; it's more as the organization itself becomes more on board with using data in this way. We still have to be almost continuously promoting the team, just so the larger IT group realizes it's useful. We know it's useful, and things are happening in production that affect passengers. But where we're really trying to pitch this internally, it's still kind of going unnoticed.

I'm really interested in the topic of change management, but also always having to promote the work that your team is doing too. Yeah. And part of what we're trying to do, oh, sorry, I'll just add one more thing. We're also trying to create a culture here of being excited about data. So we're writing what we call data stories. It's a Quarto Markdown document that we host on our webpage, our data hub, which is also built with Quarto and hosted on Posit Connect. We write little vignettes about exciting things you can do with data, geared more towards a non-technical audience. So instead of a typical blog post where you might have code interwoven throughout, it's more the text and the visuals. The idea is, if people get excited about what you can do with data, that you can do that kind of forecasting, those predictions, that's an in for them to start asking more questions. And then we say to them, you can do that too.

Let me take a pause and sit on this one for a little bit. I get excited when I see all the questions in the chat, but that's really interesting to think about, the Quarto doc you're sharing with some of the non-technical stakeholders. Could you maybe share an example of one of those, and what was the impact of it?

Yeah. So a thing that commonly gets brought up here is forecasting, time series forecasting around passenger flow at particular times of the day. If you're not in the airline industry: there are periods in the week when things tend to be heavier, like Fridays and Sundays, people coming back or leaving for the weekend, and around key events; if there's a big local event, people will be coming in for that. And even on the day itself there are peaks: a morning peak when people tend to come in for the morning flights, and then another peak in the afternoon. And we have all this historic data. We have, right now, four years of all the scans: when you line up with your boarding pass and an agent scans it, we get that in our system. Not all the information about who you are, but that there was a scan at this point at this time. So we can use that to start forecasting passenger flow. And we can use that to start answering questions around how full flights are going to be. Because, if you've heard a little bit about the airline industry, airlines tend to overbook, expecting that some people will miss their flight or decide not to go at the last minute. So we can also see, too, flights that are overloaded, and we can look and do forecasting. I wrote a blog post internally here about forecasting what are called load factors. That's the proportion, the percentage, of how full a flight is going to be. So a hundred percent means that flight is full, every seat is occupied; then 105 or 110 percent would mean they've extremely overbooked that flight. And we want to see times of year, even times of day, where flights are more likely to be overbooked.
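The load-factor arithmetic itself is simple. A toy sketch, with invented routes and counts:

```r
# Load factor = booked (or boarded) passengers / seats on the aircraft.
# Values above 1 indicate overbooking. All numbers here are invented.
flights <- data.frame(
  flight = c("YHZ-YYZ", "YHZ-YUL", "YHZ-LHR"),
  booked = c(174, 130, 310),
  seats  = c(174, 140, 298)
)
flights$load_factor <- flights$booked / flights$seats

flights[flights$load_factor > 1, ]  # the overbooked flight(s)
```

The forecasting work described above would then aggregate these ratios by time of day and season to see when overbooking is most likely.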

Managing priorities and stakeholder expectations

Thank you so much. Michelle had asked a question a little bit earlier: with a small team, any tips on managing priorities or expectations with stakeholders? Sometimes it feels like everything is a high priority, but since you're also having to educate and work on documentation for existing projects, I imagine you'd be resource-constrained.

Yeah, that's a really good question. When we started, Ryan and I began using GitLab, and we're still using it. We were going to document everything, prioritize issues, and use it to its fullest extent. And that just ended up taking too much time. Honestly, creating issues for every single thing that comes up and prioritizing them, it's like, let's just use GitLab for deployment and hosting our code, a repository for our code, and maybe occasional issues that are really important to think about. We're too small a team for more than that; we can manage most of it in our heads or on a whiteboard. So a lot of it is a little more back-of-the-napkin, just keeping things in working memory. But you're right, the priorities aren't just self-generated; people come in and ask for things, and it's a lot of ad hoc. That's probably where the question came from: this thing all of a sudden becomes extremely high priority, drop everything and do it now. I'll say we're pretty fortunate here that people are strongly discouraged from doing that kind of thing, and if they do, there's an expectation that nothing can happen right away.

And I know that's not always the case. So what I would recommend is trying to give people realistic expectations about how long things are going to take. I realize that's really hard, because it's so hard to know how long a software or data project will take. Sometimes people get the impression that because you can do something really quickly with code, that's how long the whole thing will take. But sometimes the proof of concept is a small fraction of the actual time it takes to do the full piece. We've had cases where it took us 10 minutes to do a POC, but then taking it to production took three months. And it wasn't three months of building; it was three months of iterating: the data doesn't look quite right, oh, we actually need it filtered by this, it was actually that. It was that iterative going back and forth. So when stakeholders come to you with a request, build in the fact that it's not just how quickly you can build the thing, but how long it will take to iterate over it.

Simulation and the Maestro package

I will say, I don't have the chat open here, but a little message pops up every time someone posts something, a little snippet. And I saw some people talking about simulation packages, SimPy and simmer. In my previous job I used simmer a lot, because I was also in transportation, except it was maritime instead of aviation. And I loved using simmer; it's such a great package. Big kudos to its developers. It works blazingly fast because it's C++ underneath, using Rcpp. We had problems like, how can we simulate vessels coming into a port, to optimize scheduling at the port. And every time I think about problems I see here now at the airport, it's like, oh, can I use simmer for this problem? Because I really loved using simmer.

So just a shout-out, because people were talking about simulation. I think simulation is a really undervalued tool for solving problems, and even for doing prediction and forecasting. If you can lay out your scenario in discrete steps and play that over a bunch of different scenarios and permutations, it really helps stakeholders, because you can ask questions like, well, what if we try it like this? What if we add an additional lane to the security queue so that we can process people faster? How much is that going to save us in terms of waiting time? And supply chain has no shortage of those types of questions: how can we tweak the system to improve flow, to improve throughput? Yeah, it's a really fascinating space, and I can't wait to use simmer again because it's such a great package.
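A minimal simmer sketch of the security-lane question might look like the following. The arrival and screening rates here are invented; a real model would be fit to observed data.

```r
library(simmer)
set.seed(123)

# A passenger seizes a security lane, is screened, then releases it
passenger <- trajectory("passenger") %>%
  seize("lane", 1) %>%
  timeout(function() rexp(1, rate = 1 / 1.5)) %>%  # ~1.5 min screening
  release("lane", 1)

simulate_security <- function(n_lanes) {
  env <- simmer("security") %>%
    add_resource("lane", capacity = n_lanes) %>%
    add_generator("passenger", passenger, function() rexp(1, rate = 1)) %>%
    run(until = 8 * 60)  # an 8-hour shift, in minutes
  arr <- subset(get_mon_arrivals(env), finished)
  mean(arr$end_time - arr$start_time - arr$activity_time)  # mean wait
}

# What does adding a lane buy us in average waiting time?
simulate_security(n_lanes = 2)
simulate_security(n_lanes = 3)
```

Comparing the two runs answers exactly the "what if we add a lane" question from the paragraph above, and the same pattern extends to permutations over arrival peaks or staffing levels.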

It's actually my first time hearing about that one. So thank you all. But while we're talking about packages, I know you have an exciting package your team has worked on and I want to make sure I give some time to chat about that too.

Yeah, thanks. So we've developed a package for scheduling pipelines. One problem we were facing is, well, we have all these R pipelines, and by pipelines I mean mostly ETL. So you have data coming in from a source that you want to extract, that's the E. The T is you want to do some transformation on that, like filter, mutate, dplyr-type stuff. And then you need to load it somewhere, like into a database, that's the L. We have a ton of those pipelines. And what we were doing before is we'd have a project for each pipeline, put it somewhere, and schedule it. And we had like 15 or 18 of these pipelines, and it's like, well, can we do this within a single project? And how would we schedule it? Because R is not something that you typically have continuously running. You'd have to keep running it over and over again.
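To make the E-T-L shape concrete, here's a minimal sketch of one such pipeline in R. The URL, table, and column names are placeholders, and the Postgres driver is just one possibility; swap in whatever source and destination you actually have:

```r
library(dplyr)

# E -- extract: pull raw data from a source
# (the URL is a placeholder for whatever your source is)
raw <- readr::read_csv("https://example.com/raw_flights.csv")

# T -- transform: filter/mutate, dplyr-type cleanup
clean <- raw |>
  filter(!is.na(flight_id)) |>
  mutate(day = as.Date(scheduled_time))

# L -- load: append the result to a database table
con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "airport")
DBI::dbWriteTable(con, "flights_clean", clean, append = TRUE)
DBI::dbDisconnect(con)
```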

And we built a package called Maestro. The name is a nod to orchestration, for orchestrating data pipelines, and we put it out on GitHub. It's in its early stages, but if it's not already in the chat, I'll share it there. We really want people to use it. So we have 18 pipelines running in production off of Maestro. And what's nice about it is you can deploy it on something like Posit Connect, or really on any kind of server. You have a bunch of pipelines sitting in a folder, and these are just R functions, and you use roxygen tags. If you've heard of roxygen for documentation, you use roxygen tags to specify: I want this script to run once a day at 12 PM, and I want this script to run every three hours, and so on and so forth. So you have a bunch of these pipelines, these R scripts, and you have an orchestrator script that you schedule as well. And every time that orchestrator runs, it checks which of the pipelines need to go, kicks off the ones that need to go, and skips the ones that don't. You get some observability built in as well. You see how many pipelines had errors or warnings, what the logs were that came out of each pipeline, and how long they took. We were using this internally and thought, well, let's just put it out there for the public. It's the first time for myself really developing a package to be used non-internally. I had some experience building R packages for internal use, but the level of quality, checking, and documentation involved for a public package was a lot more. So it was a really great learning experience for sure. And I'm hoping people will use it, and that they'll break it and put issues up there and tell me where the documentation doesn't make sense, and then we'll rewrite it. This is what we want. We want constructive criticism, and then hopefully there will be a CRAN release at some point in the near future.
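As a rough sketch of what that looks like, a pipeline is just an R function annotated with roxygen tags, plus an orchestrator script. The tag and function names below follow Maestro's early releases and may change, so check the package documentation for your version; the pipeline name and schedule are made up for illustration:

```r
# pipelines/extract_flights.R
# The roxygen tags tell Maestro when this pipeline should run.

#' @maestroFrequency 1 day
#' @maestroStartTime 2024-07-10 12:00:00
extract_flights <- function() {
  # ...your ETL logic goes here...
}

# orchestrator.R -- this script is what you actually schedule
# (e.g., every 15 minutes on Posit Connect or cron)
library(maestro)

schedule <- build_schedule(pipeline_dir = "pipelines")
run_schedule(schedule, orch_frequency = "15 minutes")
```

Each time the orchestrator fires, it compares the current time against each pipeline's tags, runs the ones that are due, and skips the rest.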

Using Posit Connect for data engineering

Thank you so much for sharing that with all of us too. And I've been so impressed with how open your organization is to helping others learn as well and like sharing the work that you're doing so openly. I did share your team's customer story in the chat a little bit earlier as well if people want to check that out. Because I never want to put people on the spot about how they use our Pro Tools, but I thought because you already did a customer story, it might be nice to share a little bit about how you're using Posit in your workflows.

Yeah. We've been using Posit since kind of the first day here. Both myself and Ryan use R heavily; that's our main language. And I just got the message that my earbuds are probably going to die, so I'll switch over. Hope that sounds good. Okay. Yeah. So we've been using Posit Connect, and initially we thought, you know, it's great for serving content like Shiny apps, Plumber APIs, and documents. And then we kind of thought about it, like, you know what, you can run ETL pipelines off of Posit Connect too. You just deploy an R Markdown or Quarto script with the logic you want to run, you schedule it, and there you go. It can do data engineering as well. And if you're at an early stage in your organization, or you're just doing things yourself, and you don't have a database, just use pins. What we realized is it's not just for serving content; it can do a lot more than that. And I think Maestro has a place there too, because if you deploy a Maestro project, and we've done this ourselves, with your orchestrator as a Quarto document scheduled to run, then you can run all your pipelines in one project off of there.
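For instance, a scheduled script can use pins as a lightweight data store on Posit Connect instead of a database. This is a rough sketch under that assumption; the data frame, pin name, and username are placeholders:

```r
library(pins)

# In the scheduled ETL script: write the processed data to a
# board on Posit Connect (no database required)
board <- board_connect()
pin_write(board, flight_counts, name = "flight_counts")

# In a Shiny app, report, or any other piece of content:
# read the latest version back (pins on Connect are usually
# addressed as "owner/name")
flight_counts <- pin_read(board, "my_username/flight_counts")
```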

One thing I think is great about our team, myself and Ryan: every week or so, maybe even a couple of times a week, we think about how we would redo things if we needed to run on the cheap. If we were just doing stuff ourselves and didn't have an organization with tons of money to support us, how would we do things? What tools would we need? Well, you need somewhere to store data and you need somewhere to compute on it. And Posit Connect jumps out as one option for us because we're very familiar with it and we've used it before. But don't feel like you need to be swayed by these massive Fortune 500 companies. Don't worry about what Meta is doing, or the latest and greatest stuff being shared by Fortune 500 companies on LinkedIn. Don't get persuaded or feel like you need to adopt those. I love scrappy data science. I just want to schedule an R script. What's the simplest way I can do that without, you know, shelling out tons of money? Ask those questions a lot, and keep asking them. It doesn't mean you do it necessarily, but it's good to reflect on how you would do things affordably and all that stuff.

I love scrappy data science. I just want to schedule an R script. What's the simplest way I can do that without, you know, shelling out tons of money? Ask those questions a lot, and keep asking them.

Frameworks for saying no and growing the team

Does your team have a framework for rejecting work? If so, do you have any advice for communicating no to stakeholders?

Yes, we have a framework. It hasn't been widely put out there or adopted yet, but we do have one. Typically, if there would be a no, it's like, let's say someone wants to use AI in a particular department, or they want to use AI on some data. We would ask them, well, how ready are you for automation? The data that you want us to use, because not all of it is shared across the organization, some of it's specific to a department, is it tidy? Can we ingest it in a way that doesn't involve you sending us spreadsheets via email? So we'd ask those kinds of questions. And if they're not ready, it's like, no, you have to be ready. You have to meet a certain level of automation readiness before things can happen. And I think more and more we're realizing, too, that people need to come with a somewhat defined question or goal. It can't be too open-ended. "I want to predict passengers coming into the airport," well, that's a pretty broad question when you start drilling down into it. How often do you want to predict? Is it every minute? Is it every day? That sort of thing.

So I think Isaiah maybe had to drop off for another meeting, but I wanted to ask their question, especially because Isaiah said, "I use the airport a ton and remain intrigued and impressed by the two-person team." So Isaiah asked: now, 10 months in, what skill set or type of data talent do you think you need to take the great work you're doing to the next level?

Yeah, that's a really good question. I think 10 months ago, I would have said DevOps, 100%. But I feel like in these 10 months we've learned a lot of DevOps just by doing. And Alex Gold's book on DevOps for Data Science, if you haven't read that, it's a really great book and online resource, especially if you don't know anything about setting up your own server or whatnot. Yeah, I think what would really be nice to have now is someone to be a more direct interface with the different individual departments here. Because I'm not a domain expert. I work in the airport, I understand enough about it, and I'm learning a lot, but I didn't learn this in school. And most data scientists are like that, right? They come in, they know the technical capabilities, they know how to approach data, but they don't really know the problem domain; they learn it as they go. I think we would really benefit from someone who can be that liaison between departments, who can ask them more questions and understand what they need, but who still has some data literacy and technical know-how to start down that path. I want someone now to actually start really using our data, using the data platform. Because I'm the builder of the data platform. I use it, but I'm not really using it to answer questions, if that makes sense. I'm using it to provide it to people. Now I want people to use it.

Managing new ideas and exploration time

Absolutely. Grace, I see you had a question in the chat. Do you want to go next? Hi. Yeah, maybe just one quick question. How do you manage the flow of new ideas in your daily work, and the urge to incorporate new tools, for example?

Yeah, that's a good question. Because we're a small team, just me and one other person, whether we're in the office together or working remote, we pretty much just say, hey, do you want to chat about something? Sure. And we meet and write on the whiteboard. We've also decided, as a team, that every Friday afternoon for three hours we're going to set aside specific work requirements and just dream. Maybe we'll watch a conference video, or we'll talk, or we'll explore new tech. It's completely apart from work; there's no requirement to get anything done. It's just to explore and hang out and build new stuff, try new things. If there's a new R package that came out that week that we're excited about, we'll spend that time digging into it. So I think setting those specific times of the day or week, whatever makes sense for the team, to just explore, it doesn't even have to be technical stuff. Oftentimes we'll just sit and chat and complain about things. But that's good too, and ideas flow from that. The package that we built, Maestro, came out of that kind of space. You need to have the freedom to try things and make mistakes. If you're constantly in an environment where it's just getting the next thing done, getting the next thing out, it's going to lead to burnout, and you're not going to have that professional development that's fulfilling. Yeah, thank you.

Scheduling Maestro on Posit Connect

Love that. Thank you. I see a few people in the chat are really excited to check out Maestro. And Michael had a question: if you deploy to Connect as R Markdown or Quarto, do you schedule that to run at some frequency in Connect's UI?

Yeah, so the thing is, you schedule your pipelines using roxygen comments in the pipelines themselves, but you also need to schedule your orchestrator. And we're trying to figure out how to make this as easy as possible. There's really two things you need to do. You need to say in the code how often your orchestrator is going to run; maybe you want it to run every 15 minutes. And that decision is based on how many different pipelines you have and their frequencies. If I have three pipelines running every hour on different offsets, maybe one on the half hour and one on the 15, then yeah, I should really run my orchestrator every 15 minutes. And then when I go to deploy it, I need to make sure that I actually do schedule it to run every 15 minutes. The thing about Maestro is it does need an environment to actually run in. You can deploy locally and run it on cron or Task Scheduler, or you can put it in the cloud. Just saying in the orchestrator that you're going to schedule it doesn't mean it's actually going to run; you have to go deploy it and do what you said you were going to do. It's like a commitment: I'm going to schedule my orchestrator to run every 15 minutes; okay, now you need to make sure you actually do that. And on the pkgdown site, just today I wrote a vignette around scheduling and how to make sense of it. Because the thing about Maestro is that it's stateless. That's great for you, because it saves on compute, it saves on a bunch of things. But to work in that kind of way, it takes some shortcuts and makes some assumptions.
And understanding that can be a little bit different if you're coming from an environment where you've used something like Airflow, or some other continuously running service that's checking the times and kicking off pipelines based on that. So yeah, you do need to actually schedule it to run somewhere. We've done it in Google Cloud too; we just did that this
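As one illustration of "actually do what you said you were going to do": if you deploy locally on a Mac or Linux machine, the commitment could be a crontab entry like this. The project path and script name are placeholders; point them at your own project:

```shell
# Run the Maestro orchestrator every 15 minutes, appending output to a log
# (project path and script name are placeholders)
*/15 * * * * cd /home/me/maestro-project && Rscript orchestrator.R >> orchestrator.log 2>&1
```

The cron frequency here must match whatever `orch_frequency` the orchestrator code declares, which is exactly the stateless-design assumption described above.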