Testing Microservices Without Cloning the World

Shared staging environments are a lie we all agree to believe. Developers need to deploy broken code to test it. Other developers need that same environment to not be broken. You can’t have both, and yet every company tries.

This is one of those problems that seems simple until you actually have microservices. With a monolith, you spin up a review app and you’re done. With dozens of services talking over HTTP, gRPC, and message queues, testing one service in isolation tells you almost nothing. Your feature touches three services and a background worker. A single-service review app is useless.

The obvious fix is cloning the whole environment. Just copy everything for each feature branch. Except that’s absurdly expensive and slow when “everything” means your entire infrastructure. Nobody’s doing that at scale.

Mocking is the other trap. It feels productive because your tests pass, but your mocks drift from reality over time. You end up with green CI and broken production. False confidence is worse than no confidence because at least with no confidence you test manually before shipping.

The Real Problem Is Partial Isolation

What you actually need is a way to route traffic to your changed services while everything else falls back to stable staging. You want isolation where it matters and shared infrastructure everywhere else.

At SeatGeek we built Ephie to solve exactly this. The core idea is DNS-based routing. Services in your Ephie environment resolve DNS within their own namespace first. If the service exists there, traffic stays local. If it doesn’t, DNS falls back to stable staging. Simple concept, and it actually works.
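
To make that resolution chain concrete, here’s a rough Go sketch of the fallback logic. The namespace names (ephie-1234, staging) and the service name are made up for illustration, and in a real Kubernetes cluster this ordering would typically come from the pod’s DNS search list rather than application code:

```go
package main

import (
	"fmt"
	"net"
)

// resolveWithFallback tries the ephemeral environment's namespace first,
// then falls back to stable staging -- mirroring the DNS search order
// inside an Ephie-style environment. Namespace names are hypothetical.
func resolveWithFallback(service string) (string, error) {
	candidates := []string{
		service + ".ephie-1234.svc.cluster.local", // this environment
		service + ".staging.svc.cluster.local",    // shared staging
	}
	for _, host := range candidates {
		if addrs, err := net.LookupHost(host); err == nil && len(addrs) > 0 {
			return addrs[0], nil
		}
	}
	return "", fmt.Errorf("no record for %s in any namespace", service)
}

func main() {
	addr, err := resolveWithFallback("orders")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("routing to", addr)
}
```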

But HTTP routing only gets you halfway. When services communicate through RabbitMQ, you can’t just redirect DNS. You need isolated message brokers too. And if you spin up workers against a shared database without thinking about it, those workers start competing with staging workers for the same jobs. So Ephie handles that by scaling worker replicas to zero when resources aren’t isolated, and optionally spinning up dedicated RabbitMQ clusters and Postgres databases per environment.
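
Here’s roughly what that scale-to-zero step could look like with client-go. This is a hypothetical sketch, not Ephie’s actual code; the namespace and deployment names are invented:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// scaleWorkersToZero parks a worker deployment so it can't compete with
// staging workers for jobs in shared queues or databases.
func scaleWorkersToZero(ctx context.Context, clientset kubernetes.Interface, namespace, deployment string) error {
	scale, err := clientset.AppsV1().Deployments(namespace).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = 0
	_, err = clientset.AppsV1().Deployments(namespace).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	// Hypothetical namespace and deployment names.
	if err := scaleWorkersToZero(context.Background(), clientset, "ephie-1234", "inventory-worker"); err != nil {
		fmt.Println("scale failed:", err)
	}
}
```

When the developer later toggles on an isolated RabbitMQ cluster or database, the same replica count can simply be scaled back up, since the workers no longer share state with staging.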

What Developers Actually See

Developers don’t care about DNS routing or namespace resolution. They care about “can I test my thing.” Ephie gives them an interactive CLI where they pick which services to test, choose a branch or MR for each, and optionally toggle on isolated databases or message brokers. The Postgres databases restore from weekly snapshots and spin up in under a minute. Slack notifications ping when your environment is ready. Datadog dashboards show you what’s happening inside it.
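
I won’t pretend this is how Ephie’s CLI is actually built, but an interactive prompt flow like that might look something like the following sketch using the survey library; the service names, prompts, and defaults here are all hypothetical:

```go
package main

import (
	"fmt"

	"github.com/AlecAivazis/survey/v2"
)

func main() {
	// Pick which services run from your branch; everything else
	// falls back to stable staging.
	var services []string
	survey.AskOne(&survey.MultiSelect{
		Message: "Which services do you want to test?",
		Options: []string{"orders", "inventory", "payments", "listings-worker"},
	}, &services)

	// Choose a branch or MR for each selected service.
	branches := map[string]string{}
	for _, svc := range services {
		var branch string
		survey.AskOne(&survey.Input{
			Message: fmt.Sprintf("Branch or MR for %s:", svc),
			Default: "main",
		}, &branch)
		branches[svc] = branch
	}

	// Optionally isolate stateful resources.
	var isolateDB bool
	survey.AskOne(&survey.Confirm{
		Message: "Spin up a dedicated Postgres database?",
	}, &isolateDB)

	fmt.Printf("creating environment: %v (isolated db: %v)\n", branches, isolateDB)
}
```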

Average startup time is just over three minutes. At peak we had 75 monthly active users (engineers) running 70 concurrent environments. Teams that used to wait days for a test cycle on shared staging cut that time dramatically. The inventory team was one of the loudest advocates because their workflows touched so many services that shared staging was basically unusable for them.

The Part That Mattered Most

The strongest signal that Ephie worked wasn’t the usage numbers. It was that teams actively campaigned for continued investment in it. When engineers go out of their way to tell leadership “keep funding this tool,” that’s about as clear a signal as you’ll get in platform engineering.

Check out the full technical details on the SeatGeek blog: Ephie: Ephemeral Environments at SeatGeek. That post covers the Kubernetes architecture, DNS resolution chain, CloudNativePG setup, and all the implementation specifics.

Get In Touch

If you'd like to get in touch, you can reach me at ben@benedmunds.com.

