Stephen Kiers - Staff+ Software Engineer

"I'm going to invite any of you to question any assumptions I make here. I have no ego. Feel free to poke holes in any argument I make."

That's what I said in my first screen recording to the tiger team. It was 2 AM, I'd been mapping a broken Kafka pipeline for weeks, and I was beginning to understand just how deep the problem went.

I'd been brought in as a contractor for various engineering projects. While investigating initial pipeline issues, I discovered a much deeper problem. Job listings weren't syncing between systems. Major clients had noticed. Revenue was at risk. The directive became clear: stop the bleeding.

What nobody told me—because nobody fully knew—was that the bleeding was coming from seven different places at once.

Phase 1: Triage (Days)

The system wasn't documented. That's not quite right—several people had partial maps, fragments of understanding scattered across different heads and old Slack threads. But no one had the full picture. Many of the people who might have helped had been recently laid off.

So I started drawing.

FigJam diagrams. Grepping through GitHub repos. AWS Secrets Manager became my dependency graph—incomplete, but useful. If a service had credentials to talk to something, that meant it talked to something.
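The credential-based inference can be sketched in a few lines. This is an illustration only: the service names, secret names, and the `SERVICE_SECRETS` inventory are hypothetical, and the real mapping was assembled by hand rather than by a script like this.

```python
from collections import defaultdict

# Hypothetical inventory of which secrets each service can read.
# None of these names come from the real system.
SERVICE_SECRETS = {
    "job-sync-worker": ["postgres/jobs-db", "elasticsearch/jobs-index"],
    "ats-importer": ["postgres/ats-db", "kafka/connect"],
    "search-api": ["elasticsearch/jobs-index"],
}


def build_dependency_graph(service_secrets):
    """Map each service to the set of systems it holds credentials for."""
    graph = defaultdict(set)
    for service, secrets in service_secrets.items():
        for secret_name in secrets:
            # Having the credential implies the service talks to that system.
            graph[service].add(secret_name)
    return dict(graph)


def dependents_of(graph, target):
    """Return the services that hold credentials for `target`, sorted by name."""
    return sorted(service for service, deps in graph.items() if target in deps)


graph = build_dependency_graph(SERVICE_SECRETS)
print(dependents_of(graph, "elasticsearch/jobs-index"))
```

The graph is only as complete as the credential inventory, which is why it was useful but incomplete: anything reachable without a stored secret never shows up.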

What I found: 7 Kafka topics. 5 Debezium CDC sources. 90+ connectors. 4 PostgreSQL databases. 3 Elasticsearch indices. 10+ ECS services. Debezium captured changes from the four PostgreSQL databases and pushed them into Kafka topics. Those topics fed Kafka Connect sinks that updated Elasticsearch indices. ECS services running custom Python scripts consumed those indices, performed business logic, and wrote back to different databases.
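One way to reason about a chain like this is to write it down as a directed graph and walk it. The sketch below uses one hypothetical node per kind of component; the real system had many of each, and the stage names are placeholders rather than the actual service names.

```python
# Hypothetical stage names, one per kind of hop in the pipeline described above.
PIPELINE = {
    "postgres": ["debezium"],
    "debezium": ["kafka-topic"],
    "kafka-topic": ["kafka-connect-sink"],
    "kafka-connect-sink": ["elasticsearch"],
    "elasticsearch": ["ecs-python-service"],
    "ecs-python-service": ["downstream-postgres"],
    "downstream-postgres": [],
}


def paths(graph, start, end, seen=()):
    """Yield every simple path from `start` to `end` using depth-first search."""
    if start == end:
        yield [start]
        return
    visited = seen + (start,)
    for nxt in graph.get(start, []):
        if nxt in visited:
            continue  # skip cycles so the walk always terminates
        for rest in paths(graph, nxt, end, visited):
            yield [start] + rest


route = next(paths(PIPELINE, "postgres", "downstream-postgres"))
print(" -> ".join(route))
print(f"{len(route) - 1} hops where a job update can be dropped")
```

Even in this simplified form there are six hand-offs between the source database and the final write, and each one is a place where a message can go missing.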

[Figure: The actual architecture. No single person had the full picture when I started.]

It was a Rube Goldberg machine—grown organically under real constraints. And somewhere in that machine, jobs were getting lost.

I made late-night screen recordings explaining what I'd found, mapping the architecture, and connecting the dots between what the C-suite was hearing and what the engineers were seeing. I posted Slack updates at 2 AM and spent many late hours translating between business urgency and technical reality. The team would wake up in different time zones and pick up where I left off.

Kyle was an external Kafka expert we brought in; the setup surprised even him because it predated modern Kafka practices. Sam knew the job sync logic and could navigate the complexity. Ethan, from the business team, asked questions that surfaced assumptions engineers had stopped noticing. Drew did manual database surgery while I slept, carrying the work across time zones.

Phase 2: Stabilization (Weeks)

The first break came when we discovered exit code 137. Out of memory errors. One of the ECS services was dying mid-processing, losing messages in flight.
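Exit code 137 is worth decoding, because it is not an application error code. Exit codes above 128 conventionally mean "killed by signal (code - 128)", and 137 - 128 = 9 is SIGKILL, which is what the kernel's OOM killer sends. A small sketch of that decoding, assuming a POSIX system (the helper name is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process or container exit code into a readable cause."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exit code {code} (no matching signal)"
        return f"killed by {sig.name} (signal {sig.value})"
    return "exited cleanly" if code == 0 else f"exited with error status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```

Any SIGKILL produces 137, so the code alone does not prove an out-of-memory kill. In a container with a memory limit it is a strong hint.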

We bumped the memory limits and merged directly to main to trigger builds faster. One-hour deploy cycles, no staging environment, just ship and watch. Without staging, we optimized for small, observable changes over theoretical correctness. Each deploy was a calculated risk.

The OOM fix helped. Jobs started flowing again. But not all of them.

The second clue came from Sam. "The issue in my estimation quite likely isn't in upsert.py," he wrote in Slack. He was on to something. He wasn't right about the specific file, but he was looking in the right area.

The upsert script wasn't actually upserting. It was supposed to be idempotent: receive a job update, apply it, done. But the logic was split into create and update paths instead of a true upsert, and if a job wasn't in the existing_jobs list, the script did nothing. NOOP. Silent failure.

Jobs that should have been created just... weren't.
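The shape of the bug is easier to see side by side. This is a reconstruction of the pattern, not the actual contents of upsert.py: the function names are invented, and `existing_jobs` is modeled here as a dict keyed by job ID purely for illustration.

```python
def apply_update_buggy(existing_jobs: dict, job: dict) -> str:
    """Update-only path: unknown jobs fall through with no error."""
    if job["id"] in existing_jobs:
        existing_jobs[job["id"]].update(job)
        return "updated"
    return "noop"  # silent failure: the job is never created


def upsert_job(existing_jobs: dict, job: dict) -> str:
    """True upsert: create when missing, update when present."""
    if job["id"] in existing_jobs:
        existing_jobs[job["id"]].update(job)
        return "updated"
    existing_jobs[job["id"]] = dict(job)
    return "created"


jobs = {}
new_job = {"id": "job-1", "title": "Data Engineer"}
print(apply_update_buggy(jobs, new_job))  # noop
print(upsert_job(jobs, new_job))          # created
print(upsert_job(jobs, new_job))          # updated
```

Replaying the same message through `upsert_job` leaves the store in the same state, which is the idempotency the script was supposed to have.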

We patched it. Deployed again. Watched the metrics. Better. But still not clean.

Then came the hard part: data surgery.

Jobs had been lost for weeks. Some were partially synced—in one database but not another. Some had stale IDs from old ATS integrations. Some companies had job listings in one system but not the other.

Drew and I started doing manual hard deletes to trigger re-creates. We re-synced ATS data across the databases. Company by company. Batch by batch.
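The core of that work is a diff between two systems, processed in small batches. The sketch below shows the idea with plain sets. The job IDs and helper names are hypothetical, and the real re-sync was manual database work rather than a script like this.

```python
from itertools import islice


def find_drift(source_ids: set, target_ids: set) -> dict:
    """Compare job IDs in two systems and report what each side lacks."""
    return {
        "missing_in_target": sorted(source_ids - target_ids),
        "orphaned_in_target": sorted(target_ids - source_ids),
    }


def batched(items, size):
    """Yield lists of at most `size` items so each batch can be verified."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# Hypothetical job IDs for one company.
source = {"a1", "a2", "a3", "a4"}
target = {"a2", "a4", "z9"}

drift = find_drift(source, target)
print(drift["orphaned_in_target"])  # ['z9']
for batch in batched(drift["missing_in_target"], 2):
    print("re-sync batch:", batch)  # re-sync batch: ['a1', 'a3']
```

Small batches keep each step observable, which mattered in an environment with no staging to rehearse in.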

By September 25th, we'd re-synced 74 companies and 26,743 jobs.

I wrote in Slack: "I addressed the symptom but the root issue will cause flare-ups if not treated."

I meant it.

Phase 3: What Didn't Happen

Here's the honest part.

We didn't redesign the architecture. We didn't add proper idempotency guarantees. We didn't build deduplication logic. We didn't create a staging environment.
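For a sense of what the missing deduplication might have looked like, here is a minimal sketch of a consumer that skips events it has already applied. Everything in it is an assumption: the `job_id` and `version` fields are hypothetical, and a real implementation would need a durable store rather than an in-memory set.

```python
class DedupingConsumer:
    """Wrap a handler so that replayed events are applied at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; lost on restart

    def handle(self, event: dict) -> bool:
        """Apply the event unless its key was seen; return True if applied."""
        key = (event["job_id"], event["version"])
        if key in self._seen:
            return False
        self._handler(event)
        # Record the key only after the handler succeeds, so a failed
        # event can be retried later.
        self._seen.add(key)
        return True


applied = []
consumer = DedupingConsumer(applied.append)
event = {"job_id": "job-1", "version": 3}
print(consumer.handle(event))  # True
print(consumer.handle(event))  # False
print(len(applied))            # 1
```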

The system remained, in my own words from the incident notes, "a very brittle Rube Goldberg machine."

Some of the early design choices had aged poorly. Missing idempotency. Tight coupling between services. No circuit breakers. No retry logic with backoff. No dead letter queues.
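Two of those missing pieces, retry with backoff and a dead letter queue, fit in a short sketch. This shows the general pattern under stated assumptions and is not code from the system: the function name, parameters, and the list standing in for a dead letter queue are all hypothetical.

```python
import time


def process_with_retry(message, handler, dead_letters, max_attempts=3,
                       base_delay=0.5, sleep=time.sleep):
    """Try `handler` up to `max_attempts` times with exponential backoff.

    A message that still fails is appended to `dead_letters` instead of
    being dropped. Returns True on success, False if it was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                dead_letters.append({"message": message, "error": repr(exc)})
                return False
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False


attempts = {"count": 0}


def flaky_handler(message):
    """Hypothetical handler that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("downstream unavailable")


dead_letters = []
ok = process_with_retry({"job_id": "job-1"}, flaky_handler, dead_letters,
                        sleep=lambda seconds: None)  # skip real sleeping in the demo
print(ok, attempts["count"], dead_letters)  # True 3 []
```

The point of the dead letter list is that a failure becomes visible and replayable rather than silent, which is the opposite of how the original pipeline lost jobs.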

We stabilized it. We didn't cure it.

My engagement ended in October 2024. The structural issues remained.

The Gap Nobody Talks About

Here's what I think matters about this story.

Most incidents end at stabilization. The site's back up. The pipeline's flowing. Metrics look green. Everyone's exhausted. You declare victory and move on.

But stabilization isn't the same as curing the disease.

Stabilization means the system works again under current conditions, with current load, with current data. Cure means you've eliminated the underlying fragility that caused the failure in the first place.

The gap between those two states is where platform maturity lives.

We stopped job listings from disappearing. We didn't make the system resilient to the next OOM error, or the next non-idempotent consumer, or the next database schema change that ripples through five different services.

And here's the thing: I don't think that gap is unique to this project.

I think most platform teams live in that gap. Between "it works" and "it works reliably." Between "we shipped a fix" and "we won't have this incident again."

Because curing the disease takes time, political capital, and convincing stakeholders that the fix isn't done just because the metrics are green.

What Made This Different

A few things kept this from being worse.

The experts we brought in once I discovered the scope were essential. But what held it together was the orchestration—mapping the architecture, connecting people across time zones, translating between C-suite urgency and engineering reality. Many late hours went into that work.

Above all, it didn't get worse because the engineers went above and beyond.

That first screen recording—the one where I invited everyone to poke holes in my assumptions—set the tone. No egos. No blame. Just a shared commitment to understanding what was actually broken.

But trust and good people can only take you so far when the system itself is fragile.

The Honest Ending

We stopped the bleeding. We didn't cure the disease.

When my engagement ended, I advised that the work wasn't done. The redesign never happened: I returned a couple of months later and confirmed it still hadn't. The fixes remained tactical, not structural. Organizational incentives pointed elsewhere.

That's not a failure of effort. All of the engineers involved worked their asses off.

Incident response rewards stabilization. It doesn't reward the slow, unglamorous work of making systems resilient.

And until we change that calculus—until "cure the disease" gets the same urgency as "stop the bleeding"—we're going to keep living in that gap.

Platform maturity isn't measured by how fast you stabilize. It's measured by whether stabilization is enough.

This story is from a contract engagement in 2024. Company and collaborator names have been changed, but the three phases, the honest gaps, and the one-hour deploy cycles were all very real.