Tag

#reliability

6 articles
War Stories May 4, 2026 13 min read

The Friday Afternoon Slack Message That Killed My Weekend

How a duplicate-payouts report from finance taught me that idempotency is a system property, not a one-service bug fix — and the path from a tactical patch to a cross-cutting framework that took at-least-once seriously.

War Stories May 3, 2026 13 min read

The Customer Support Ticket That Taught Me to Profile Before Designing

A "simple" lost-update bug took me through optimistic locking, pessimistic locking, and finally a two-stage Redis-locked aggregator — a tour of why the right concurrency primitive depends entirely on the shape of your contention.

Real Domains December 9, 2023 5 min read

SRE for Small Teams — What Actually Pays Back

Google's SRE book is 500 pages long and targets 100-engineer orgs. For a 10-person team, this is the pragmatic subset that delivers most of the benefit at a fraction of the cost.

Core Patterns December 11, 2022 5 min read

Dead Letter Queues — Handling the Unhandleable

What DLQs are, why you must have one for every message consumer, and the operational patterns that keep bad messages from blocking the good ones.

Core Patterns June 21, 2022 5 min read

Transactional Outbox — The Pattern, End to End

Why "save to DB, then publish to Kafka" is almost always wrong, and the outbox pattern that fixes it — with real Java code, schema, and production considerations.