Phillip Cloud - CI madness with Ibis: testing 20 query engines on every commit
Transcript
This transcript was generated automatically and may contain errors.
All right, so let's talk about how to test 20 databases on every commit.
So, okay: sit down if you work with zero databases. Okay, one. Okay, two or more. Interesting. The rest of you are probably suffering quite a bit.
Now sit down if you work with three or more databases. I feel your pain, and that is what this talk is about, basically.
What is Ibis?
Who am I? I'm Phillip Cloud. I work on a project called Ibis at Voltron Data. I've been doing this sort of work with tabular data tools for ten or more years. You can find me at these places on the internet.
And the question we're going to address here is: have you ever needed to test a complex system? There are a handful of things you might have encountered while needing to test a complex system. One of them is Docker. You've probably also encountered Python packaging as a challenge in some way. Or maybe you don't care about tests and you're like: screw it, everything works the first time I write it, and I don't need to run it anywhere except my laptop.
So one complex system that I work on every day is called Ibis. I'm going to talk a little bit about it, and then we'll delve into the testing bits.
So Ibis is a Python library for exploratory data analysis, general tabular data munging, data engineering, and ML preprocessing. For a lot of people in this room, the analogy of dbplyr will probably make sense: it's like dbplyr, but in Python.
So what does that look like? It's one API and 20 backends. I think we're at 20; I don't remember. And you can see we support a bunch of stuff. This is reading a Parquet file and doing a group by and a count, nothing earth-shattering. You can see that when I go from DuckDB to Polars to DataFusion to PySpark, things look basically the same. The first line is the only thing that's changing.
And when we get to the rest of the backends, things start to look kind of like this. You as the user don't see this crazy Rube Goldberg machine. But as developers of this project, this is sort of what it looks like: there's a guy putting a ball onto a conveyor belt, and things get weird.
Why is this hard to test?
So maybe you've already had this thought, but why is this hard to test? A few facts about Ibis: we've got 17 SQL backends and three non-SQL backends, two of which run in the cloud. Nine of the SQL ones are distributed and run on a cluster. The three non-SQL backends have nothing to do with SQL at all; they're just DataFrame APIs. And we support a backend that is 45 years old. It was written 45 years ago. It's Oracle.
The youngest backend is called RisingWave, a streaming SQL backend that's about two years old. Latency is all over the place: we've got stuff going from New York to Singapore, and we've got DuckDB running on a laptop. And deployment varies: somebody might be running a thousand-node Trino cluster on Kubernetes, or you might be running Polars on your laptop.
And on top of that, we still have to develop features. We're not sitting around writing tests all day, right? Although it might feel like that after this talk. So we're in a bit of a pickle here, trying to test 20 databases.
Managing complexity
So how do we manage this complexity? Well, there's basically one high-level goal, which is to iterate quickly. That's somewhat project dependent, but we found there are a few things that really matter. One is that your environment setup needs to be fast, because if you're going to run this stuff in continuous integration, say on GitHub Actions, you don't want to spend a bunch of your time solving dependency constraints.
By the way, a lot of this is not stuff we knew going in. It wasn't "our environment setup needs to be fast"; it was more like "oh, this is really slow," and here's what we've learned. Your tests also need to be fast. Don't run TPC-H at scale factor 1000 in a unit test.
In our case, we have all these backends, and running their tests should not depend on each other. If I want to test the DuckDB backend, I shouldn't have to wait for the Postgres tests to finish. And running, say, the Postgres backend locally needs to be easy: you don't want a ten-subcommand-long Docker Compose incantation just to spin up a Postgres server.
So we use something called just, which is a task runner. It's kind of like make without make, where all the targets are phony. If you don't know what that means, I'm happy to talk your ear off about it later. And CI must complete quickly: continuous integration must finish fast.
So the tools we use: we use Poetry for dependency management. If somebody wants to come berate me for that later, I'm happy to talk about that too. We use GitHub Actions; the project is hosted on GitHub, and it's just easier than any alternative at the moment. For the big backends, and by big I don't mean scale but complexity of deployment, we use Docker and Docker Compose. The small backends don't get any special treatment; you just run them on whatever machine they happen to be running on. And then, like I said, for tasks it's "just do X," "just whatever." The tool is literally called just.
And so why do we use Poetry? To make environment setup fast. Solving satisfiability constraint problems in CI is a terrible, terrible idea. That's what we used to do, and it turned out to be a bad idea. We were like, hey, why is our CI slow? Well, our tests aren't actually slow; we're just solving this massive constraint problem every single time a commit is pushed.
So Poetry is one way to do this. Generally, the principle here is: get a lock file. A lock file is essentially a giant list of pinned dependencies. Maybe one of the downsides is, okay, I can't test every single combination of constraints I could possibly come up with. But the question is: are you doing that now? And if you are, I would really like to know how that's going.
I mentioned Docker. I've seen setups for more complex projects where people have one Docker setup for local stuff and another Docker setup for CI, and they don't match exactly. Then you get "works on my machine" syndrome, which is not great. We found that using Docker in CI exactly how you would use it locally really helped cut down on bugs, and when we did get bugs, they were more reproducible.
If you're using Docker Compose, there are health checks, and you can use really dumb ones: just curl the thing at basically any address it can listen on, or use netcat to check whether a port is open on a particular address. That's often enough. Or send SELECT 1 + 1 to your database every five seconds. And it's got to be easy for people to use. I keep saying that, but it's really important that people don't have to remember a Docker Compose command, plus seven environment variables, plus "oh, you've got to have this file here and this directory there." No.
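That "check if a port is open" trick is easy to replicate outside Compose too. Here's a sketch of the same dumb health check in Python; the host and port are whatever your container happens to expose:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Dumb health check: is anything listening at host:port yet?"""
    try:
        # create_connection does the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. poll port_open("localhost", 5432) in a loop until Postgres is up
```

This doesn't prove the service is healthy, only that it's listening, which is often all a CI health check needs.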
GitHub Actions: not a whole lot here, but we ended up paying for what's called the Team plan, which is just a way to run more stuff at the same time. It's not particularly expensive: we spend $624 a year for a team of about eight developers working on Ibis full time.
Measuring CI performance
So how does this stack up? What did we do to figure out whether any of this was actually working? A bit of terminology: a job is a set of commands, so it roughly corresponds to all the things you'd need to do to test one Ibis backend: pip install Ibis, run your tests, upload coverage, that sort of thing. Then you've got workflows, which are essentially collections of those jobs. Our main workflow, the biggest bottleneck in our test suite, contains basically a job for every single backend.
So we measured some stuff, and one of the things that turned out to be really useful was the average job duration. Not a particularly complicated thing to look at. Here you can see that pre-Poetry we had this weird U shape; I'm not entirely sure what's going on there. There's a line that looks like it's trending downward, but not really, and the variability is all over the place. Once we switched to Poetry, we saw a step change in our average duration. That was great, because it allowed us to run more and more jobs concurrently, so if we wanted to add more backends, we could. Variability dropped quite a bit too.
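The measurement itself doesn't need to be fancy. As an illustration (not the actual Ibis tooling), here's the shape of the computation, assuming job records like the ones the GitHub Actions jobs API returns, with ISO-8601 `started_at`/`completed_at` timestamps:

```python
from datetime import datetime

def avg_job_duration_seconds(jobs: list[dict]) -> float:
    """Mean wall-clock duration of completed jobs.

    Assumes GitHub-Actions-style records with ISO-8601 timestamps
    ending in "Z", e.g. "2024-01-01T00:05:00Z".
    """
    def parse(ts: str) -> datetime:
        # Python < 3.11 fromisoformat doesn't accept a trailing "Z".
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    durations = [
        (parse(j["completed_at"]) - parse(j["started_at"])).total_seconds()
        for j in jobs
        if j.get("started_at") and j.get("completed_at")  # skip in-flight jobs
    ]
    return sum(durations) / len(durations) if durations else 0.0
```

Tracked over time, a single number like this is enough to spot the step changes described below.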
Then when we added the ability to do more concurrency, there wasn't a huge effect on average job duration, because concurrency doesn't help with that problem; this metric is about making each individual "test this backend" job faster. And since we switched to Poetry and the Team plan, we've added three or four new backends and it has stayed roughly the same, which is pretty nice.
So that's jobs. What about workflows? For workflows, we've got queuing time, which is very important: it's a good measure of how long your individual jobs are waiting to run. And you might say, well, before Poetry things were better; but that's only because we just weren't running that many jobs. The ones we were running, as you can see in the workflow duration plot below, were all over the place. Not a lot of jobs, so when one came in, the queue kicked it off immediately; but when it did run, it took forever.
Then you can see that once we moved to Poetry, queuing time seems to have gone up a bit, and that was actually because we were able to run more stuff at the same time. That's when we decided we should get more concurrency, so we switched to the Team plan, and things happily went back down closer to the original queuing time.
There's this weird swell going on in workflow duration that we haven't really figured out.
So those are the workflow metrics. And this is a different view on the same data, one that maybe makes the relationship between workflow duration and queuing time more obvious when things get weird. At the bottom of the left plot you've got this line at zero: tons of workflows are slow, but queuing time is really fast. That's an indication you should look into making your jobs individually faster. On the right is what the same data look like after we switched to Poetry and increased concurrency.
So the moral of the story, roughly, is that you want queuing time and workflow duration to be correlated. That's not actually sufficient, because you could have really long queuing times and really slow jobs and they'd still be perfectly correlated. But you don't want to see a vertical line here or a horizontal line there.
So: if queues are slow and workflows are fast, you probably don't have enough concurrency. If queues are fast and workflows are slow, you probably need to look at your individual CI jobs. And if both are slow, it's hard to say; you probably have slow jobs that you're trying to run a lot of, so you need to look into both.
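That rule of thumb is small enough to write down. Here's the decision table as a toy Python function (my phrasing of the talk's advice, not anything shipped with Ibis):

```python
def diagnose_ci(queue_slow: bool, workflow_slow: bool) -> str:
    """Map the queue-time / workflow-duration quadrant to the advice above."""
    if queue_slow and not workflow_slow:
        return "not enough concurrency"
    if workflow_slow and not queue_slow:
        return "speed up individual jobs"
    if queue_slow and workflow_slow:
        return "slow jobs run at volume: look at both"
    return "healthy"
```

In practice "slow" is a threshold you'd pick from your own metrics, e.g. a percentile of the historical distribution.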
Summary
So to summarize: testing complex projects is possible. Use the same Docker setup for local development and CI; don't have two different setups. Don't solve satisfiability problems in CI. Measure the things you care about. And spend some time making the complicated things easy to run. That's it.
Q&A
Thank you, Phillip. We actually have a couple minutes for some questions from the crowd. So we'll start off: I'm an R programmer who would like to use Python for data wrangling. Emily recommended learning Polars. Would you recommend I use Ibis with a Polars backend?
Tough question right off the bat. You can definitely use Ibis with the Polars backend. I would say that Ibis's API is much closer to a dplyr API in terms of naming. Also, selectors are definitely a thing that Ibis had before Polars, just a friendly bit of competition. So yeah, I would say it's possible. There are other backends, and it depends on what you want to do; if you need more scale, you might choose something else.
How do you collect the CI metrics, like queuing time and workflow duration? We have a job that runs in a GitHub Action; I think it runs every day. There's a GitHub API that you can just dump into JSON. I think we convert that to Parquet and then upload it to BigQuery. Uploading it to BigQuery is probably a mistake, but it's been doing that for a long time, and it works. And yeah, we dogfood Ibis for that: all those plots, and the analysis behind them, were done with Ibis for sure.
Speaking of which, that segues into the next question. Those plots have that nice hand-drawn feel, so what library did you use to make them? plotnine. And plotnine has Ibis integration: you can hand an Ibis table expression to plotnine and it will convert it to pandas, or whatever it's doing; it just works with Ibis.
This one, hopefully, we'll see if it's controversial: what is your favorite backend, and what is your favorite file type? These are weird questions. Let's start with file type. I don't know, Parquet. Parquet is pretty good. Sometimes I'm a bit nostalgic for CSV. I did write an Ibis backend that runs awk against CSV using named pipes; if anybody wants to talk about that later, it's a mess, but it's fun. And backend: I'll give a very political answer and say that DuckDB is the default backend for Ibis, and it gets used quite a bit by me.
What is your current guess as to what is going on with the squiggly line for the workflow duration in the current setup? Okay, so I have thought a little bit about it, and I think we made some changes to one of the backends, actually the BigQuery backend, around the time of the drop. It's a little hard to explain the creep upward, though. Another possibility is that there was a hardware shift in the underlying infrastructure for whatever runs GitHub Actions, which is probably Azure Pipelines; maybe they swapped out the hardware and things got a little better. I don't know.
