Rearchitecting CI at SeatGeek
Our CI runners were slowly ruining everyone’s day. Shared, stateful hosts where one team’s build could poison the environment for the next team. Classic noisy neighbor problem, except at the infrastructure layer where it’s way harder to figure out what went wrong.
We had about 80 hosts on weekdays, 10 on weekends. Fixed scaling. Peak hours meant developers sat in queue. Off-peak meant machines sat idle, burning money. We were paying for capacity we weren’t using and still didn’t have enough when it actually mattered.
Container builds made it worse. Tightly coupled to EC2 hosts with local Docker daemons. Settings persisted between builds, so one team’s config could quietly break the next team’s image build. Multi-arch images required emulation that absolutely destroyed build times. The whole thing was brittle and nobody trusted it.
600 Repos Can’t Flip a Switch
We had over 600 repositories on this infrastructure. You don’t just migrate that over a weekend.
We ran a four-phase rollout. Started with our own platform team repos so we ate our own dogfood first. Expanded to other platform-owned repos. Updated the shared CI jobs. Then automated the long tail with multi-gitter across hundreds of repos. Each phase surfaced problems we could fix before the blast radius got bigger. That phasing is the only reason we pulled this off without pissing everyone off.
Autoscaling CI Is Harder Than You Think
This is the part nobody warns you about. Autoscaling CI infrastructure sounds straightforward until you actually try it.
You need the right metric. We used saturation: the ratio of pending and running jobs to available slots. Most default metrics don’t capture the developer experience of staring at a queue. You need graceful shutdowns because Buildkit will just kill your build mid-way on SIGTERM if you let it. And you need capacity reservation so that scaling up doesn’t mean waiting five minutes for a node to come online. We over-provisioned with low-priority pods that pre-pulled common images, then got evicted when real work showed up.
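To make the saturation metric concrete, here’s a minimal sketch of the calculation and the scaling decision it can drive. Every name and number in it (the 75% target, four slots per runner) is a hypothetical illustration, not our production autoscaler; the real details are in Part 3.

```python
# Sketch of a saturation-based scaling decision for CI runners.
# All names, the 75% target, and slot counts are illustrative only.
import math


def saturation(pending: int, running: int, runners: int, slots_per_runner: int) -> float:
    """Jobs that want a slot, divided by the slots we actually have."""
    capacity = max(runners * slots_per_runner, 1)
    return (pending + running) / capacity


def desired_runners(pending: int, running: int, slots_per_runner: int,
                    target: float = 0.75, minimum: int = 1) -> int:
    """Pick a runner count that brings saturation back down to the target."""
    needed = (pending + running) / (target * slots_per_runner)
    return max(minimum, math.ceil(needed))


if __name__ == "__main__":
    # 25 pending + 15 running jobs against 5 runners x 4 slots: 200% saturated,
    # so the autoscaler should ask for 14 runners to get back to ~75%.
    print(saturation(pending=25, running=15, runners=5, slots_per_runner=4))  # 2.0
    print(desired_runners(pending=25, running=15, slots_per_runner=4))        # 14
```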
None of these are individually hard. Together they’re a real engineering problem.
What We Built
Kubernetes with ephemeral pods. Every CI job gets its own isolated pod that’s destroyed when the job finishes. No state leaking between builds. No noisy neighbors. Clean slate every time.
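For a feel of what “a pod per job” means mechanically, here’s a rough sketch using the official Kubernetes Python client. The namespace, labels, image, and helper name are placeholders invented for illustration; in practice the CI system’s Kubernetes executor manages this lifecycle for you.

```python
# Rough sketch: one throwaway pod per CI job, via the official
# Kubernetes Python client. Names, namespace, and image are placeholders.
import time
import uuid

from kubernetes import client, config


def run_job_in_pod(job_cmd: list[str], image: str = "alpine:3.20",
                   namespace: str = "ci") -> str:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    api = client.CoreV1Api()

    name = f"ci-job-{uuid.uuid4().hex[:8]}"
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "ci-job"}),
        spec=client.V1PodSpec(
            restart_policy="Never",  # a job runs once; no restarts, no reuse
            containers=[client.V1Container(name="job", image=image, command=job_cmd)],
        ),
    )
    api.create_namespaced_pod(namespace=namespace, body=pod)

    # Wait for the job to finish, then throw the pod away. Nothing it wrote
    # to its filesystem survives into the next build.
    try:
        while True:
            phase = api.read_namespaced_pod(name, namespace).status.phase
            if phase in ("Succeeded", "Failed"):
                return phase
            time.sleep(2)
    finally:
        api.delete_namespaced_pod(name=name, namespace=namespace)


if __name__ == "__main__":
    print(run_job_in_pod(["sh", "-c", "echo building && sleep 5"]))
```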
For container builds, we run Buildkit in its own Kubernetes deployments and connect to it via the remote driver. Completely decoupled from the CI hosts. No more shared Docker daemons.
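The wiring looks roughly like the sketch below: register the remote Buildkit endpoint as a buildx builder, then build against it. The builder name and service endpoint are made up for illustration, and a single endpoint is shown for brevity; to build each architecture natively you’d typically append a builder node per architecture.

```python
# Sketch: point builds at a remote Buildkit deployment instead of a local
# Docker daemon. The builder name and endpoint below are hypothetical.
import subprocess


def ensure_remote_builder(name: str = "ci-buildkit",
                          endpoint: str = "tcp://buildkitd.ci.svc.cluster.local:1234") -> None:
    # `docker buildx create --driver remote` registers an already-running
    # buildkitd instance as a builder; no daemon runs on the CI host itself.
    subprocess.run(
        ["docker", "buildx", "create",
         "--name", name, "--driver", "remote", endpoint],
        check=True,
    )


def build_and_push(tag: str, context: str = ".",
                   builder: str = "ci-buildkit") -> None:
    # The build executes on the remote builder and the image is pushed
    # straight to the registry; nothing is left behind on the CI host.
    subprocess.run(
        ["docker", "buildx", "build",
         "--builder", builder,
         "--platform", "linux/amd64,linux/arm64",
         "--tag", tag, "--push", context],
        check=True,
    )


if __name__ == "__main__":
    ensure_remote_builder()
    build_and_push("registry.example.com/team/app:latest")
```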
The Numbers
Average queue time dropped from 16 seconds to 2. The p98 went from over 3 minutes to under 4 seconds. Cost per job down 40%. Concurrent job capacity doubled. And zero state pollution between builds.
Go Read the Technical Posts
We wrote a three-part series on the SeatGeek blog with the full technical details:
Part 1: Rearchitecture covers the architecture decisions, the Kubernetes executor setup, and the migration strategy across all those repos.
Part 2: Building Containers with Buildkit covers how we replaced local Docker daemons with remote Buildkit deployments and all the fun edge cases that came with it.
Part 3: Optimizations covers autoscaling, caching, NVMe storage, and the performance work that got us to those final numbers.
If you’re running shared CI infrastructure and feeling the pain, start with Part 1. The problems will sound familiar.