The basics of event sourcing look tidy on a whiteboard. Running it in production teaches a different set of lessons. This article collects the ones worth knowing before the first production incident.
Lesson 1 — Pick aggregate boundaries carefully
An aggregate is the unit you replay. If you make aggregates too big, rehydration is slow and concurrent writes contend. Too small, and you lose the invariants a single aggregate enforces.
Good heuristic: the aggregate boundary is the smallest unit that enforces an invariant. An Account with balance rules is one aggregate. A Portfolio of accounts is not — it’s a query across aggregates, better handled as a projection.
Bad heuristic: one aggregate per domain entity. You’ll end up with User aggregates containing all the user’s lifetime data, thousands of events per aggregate, and painful replay times.
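To make the heuristic concrete, here is a minimal sketch in Python (the Account, its events, and the invariant check are illustrative, not taken from any framework):

```python
from dataclasses import dataclass

@dataclass
class DepositMade:
    amount: int  # cents

@dataclass
class WithdrawalMade:
    amount: int  # cents

class Account:
    """One aggregate: the smallest unit that enforces the no-overdraft invariant."""

    def __init__(self):
        self.balance = 0
        self.pending_events = []

    def deposit(self, amount: int):
        self._record(DepositMade(amount))

    def withdraw(self, amount: int):
        # The invariant is checked here, against this aggregate's own state only.
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self._record(WithdrawalMade(amount))

    def apply(self, event):
        # Replay path: state is derived purely from past events.
        if isinstance(event, DepositMade):
            self.balance += event.amount
        elif isinstance(event, WithdrawalMade):
            self.balance -= event.amount

    def _record(self, event):
        self.apply(event)
        self.pending_events.append(event)
```

A Portfolio total, by contrast, never has to veto a command, so it belongs in a projection that sums balances across many Account streams rather than in one oversized aggregate.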
Lesson 2 — Snapshots are not optional
Rehydrating 50,000 events for every read kills performance. Plan a snapshot strategy from day one:
- Snapshot every N events (500 is a common starting point)
- Store snapshots in a separate table or key-value store
- On load: get latest snapshot + events after its version
What value of N to pick? Measure. Snapshots have overhead (serialization, storage), so snapshotting too often wastes resources, while snapshotting too rarely slows reads. Start at 500 and tune based on production traces.
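A sketch of that load path, reusing the Account aggregate from Lesson 1 and in-memory stand-ins for the snapshot and event stores:

```python
SNAPSHOT_EVERY = 500   # the N above; tune it from production traces

# In-memory stand-ins for the snapshot store and the event store.
snapshots = {}       # aggregate_id -> {"balance": ..., "version": ...}
event_streams = {}   # aggregate_id -> [event, event, ...]

def load_account(account_id):
    """Rehydrate: latest snapshot, then only the events recorded after its version."""
    account = Account()   # the aggregate sketched in Lesson 1
    version = 0

    snapshot = snapshots.get(account_id)
    if snapshot is not None:
        account.balance = snapshot["balance"]
        version = snapshot["version"]

    for event in event_streams.get(account_id, [])[version:]:
        account.apply(event)
        version += 1

    return account, version

def maybe_snapshot(account_id, account, version):
    # Snapshot on version boundaries; a missing or stale snapshot only costs replay time.
    if version and version % SNAPSHOT_EVERY == 0:
        snapshots[account_id] = {"balance": account.balance, "version": version}
```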
Lesson 3 — Schema evolution will bite you
Rule one: never modify a stored event’s payload. Rule two: plan how to evolve the schema from day one. You will:
- Rename fields
- Add required fields
- Split or merge event types
- Change enum values
Two main approaches:
Weak schema (JSON). Events are JSON blobs. Code tolerates unknown fields, applies defaults for missing ones. Easy to evolve; easy to silently break.
Versioned events. OrderPlacedV1, OrderPlacedV2. Old consumers handle V1. New consumers handle both. Eventually migrate all old events through an “upcaster”.
I’ve used both in production; the versioned approach has fewer surprises but more ceremony. Pick deliberately.
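For the versioned route, a minimal upcaster sketch, assuming a hypothetical OrderPlaced event that gained a required currency field in V2:

```python
def upcast_order_placed_v1(payload: dict) -> dict:
    """OrderPlacedV1 -> OrderPlacedV2: V1 predates multi-currency, so default the new field."""
    return {**payload, "currency": "USD", "schema_version": 2}

UPCASTERS = {
    ("OrderPlaced", 1): upcast_order_placed_v1,
    # ("OrderPlaced", 2): ...  chain the next step here when V3 arrives
}

def upcast(event_type: str, payload: dict) -> dict:
    """Lift a stored payload through upcasters until it reaches the current shape."""
    version = payload.get("schema_version", 1)
    while (event_type, version) in UPCASTERS:
        payload = UPCASTERS[(event_type, version)](payload)
        version = payload["schema_version"]
    return payload
```

Whether you run this at read time or during a one-off copy-and-transform migration, handlers only ever see the latest shape and the stored payloads are never edited in place.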
Lesson 4 — Replay from scratch happens more often than you think
You’ll rebuild projections. Maybe because of a bug, maybe a new reporting need, maybe schema changes. Design for it:
- Projections should be idempotent (replaying the same events produces the same state)
- Keep event store queries efficient — indexing, partitioning by time or aggregate
- Track projection version; on mismatch, rebuild
- Full rebuild should be feasible — if it takes a week, that’s a problem
For projections that can’t be fully rebuilt (e.g., ones that send email), make that explicit. Treat them as a separate class — “side-effect projections” — and handle them with care.
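To make the idempotency and version-tracking bullets concrete, a sketch with in-memory stand-ins for the read model (table and field names are illustrative):

```python
PROJECTION_VERSION = 3   # bump whenever the projection's logic or shape changes

# In-memory stand-ins for the read model and its bookkeeping metadata.
read_model = {"order_totals": {}}
meta = {"projection_version": None, "last_position": 0}

def apply_to_read_model(event):
    # Idempotent: an upsert keyed by order_id yields the same state however often it replays.
    if event["type"] == "OrderPlaced":
        read_model["order_totals"][event["payload"]["order_id"]] = event["payload"]["total"]

def run_projection(all_events):
    """Rebuild-aware projection loop over a position-ordered list of events."""
    if meta["projection_version"] != PROJECTION_VERSION:
        read_model["order_totals"].clear()        # throw the read model away
        meta["projection_version"] = PROJECTION_VERSION
        meta["last_position"] = 0                 # and replay from the beginning

    for position, event in enumerate(all_events, start=1):
        if position <= meta["last_position"]:
            continue                              # already processed and checkpointed
        apply_to_read_model(event)
        meta["last_position"] = position          # checkpoint after each event
```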
Lesson 5 — Event streams are eternal
You can’t delete events for legal or logical reasons (GDPR complications aside — deal with those via pseudonymization or crypto-shredding). This means:
- Storage grows forever — archive cold data, don’t expect to compact
- Old event types live forever — your code must keep parsing V1 events from 3 years ago
- Breaking changes are expensive — there’s no “reset the DB”
Plan for 5-10 years of event history in any aggregate you build.
Lesson 6 — Don’t make consumers too smart
Tempting: put business logic in event handlers. “When OrderPlaced happens, also check inventory, also notify user, also update stats, and trigger fraud check if amount > $1000.”
Problem: the event has no context. It was emitted in a prior business operation; the consumer sees only the payload. If requirements change (“only notify users who opted in”), you change the consumer; if the event’s meaning drifted, you have a stealth breaking change.
Keep consumers dumb: project state, produce read models. Business decisions belong on the command side, driven by the current state of an aggregate.
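A sketch of where the opted-in decision lands under that rule (the names are illustrative, not from any framework):

```python
def handle_place_order(customer: dict, order_payload: dict) -> list:
    """Command side: the opt-in decision is made here, against current state."""
    events = [{"type": "OrderPlaced", "payload": order_payload}]
    if customer["notifications_opted_in"]:        # the business rule lives with the command
        events.append({"type": "OrderConfirmationRequested",
                       "payload": {"customer_id": customer["id"]}})
    return events                                 # the caller appends these to the store

def on_order_confirmation_requested(event: dict, send_email) -> None:
    """Consumer side: no judgment calls, just carry out what was already decided."""
    send_email(event["payload"]["customer_id"])
```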
Lesson 7 — GDPR and event sourcing
“Right to be forgotten” meets “events are immutable”. Not fun.
Approaches:
- Pseudonymization — store user identifiers separately; on deletion, break the link
- Crypto-shredding — encrypt PII fields with a per-user key; on deletion, delete the key, PII becomes unreadable
- Scrubbing — rewrite specific events to redact PII (breaks immutability in spirit but sometimes unavoidable)
Pick your strategy before you have regulator attention. Crypto-shredding is my preferred approach — it preserves the event stream structure while making deleted data unrecoverable.
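A crypto-shredding sketch using the Python cryptography package; the per-user key table is a stand-in you would keep outside the event store:

```python
from cryptography.fernet import Fernet   # pip install cryptography

# Stand-in for a per-user key table, stored separately from the events.
user_keys = {}

def encrypt_pii(user_id: str, value: str) -> bytes:
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(value.encode())

def decrypt_pii(user_id: str, token: bytes) -> str:
    key = user_keys.get(user_id)
    if key is None:
        return "[erased]"    # key shredded: the ciphertext is permanently unreadable
    return Fernet(key).decrypt(token).decode()

def forget_user(user_id: str) -> None:
    # "Deletion": the events stay exactly as written, only the key disappears.
    user_keys.pop(user_id, None)
```

The event payload stores only the ciphertext, so replays still work; projections simply render “[erased]” for shredded users.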
Lesson 8 — Know when not to event-source
Event sourcing is most valuable for domains with rich transitions and audit needs. It’s overkill for:
- Read-heavy catalog services
- Simple reference data
- Pass-through APIs that don’t own data
- Any service where “what is the current state” is 99% of the access pattern
Within a larger system, it’s fine to event-source the parts that benefit (orders, payments, positions) and use normal CRUD for the rest (catalog, profiles, static data).
Tooling notes
- Axon, EventStoreDB — full-stack event-sourcing platforms. Heavy but capable.
- Kafka as event store — works but lacks proper aggregate querying; usually paired with a DB-backed event log
- Roll your own on Postgres — an events table with (aggregate_id, version, event_type, payload, at) is 90% of what you need
- Outbox pattern — essential for coupling event publication with DB writes
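If you do roll your own, here is roughly what that table and the outbox look like in a psycopg sketch (the schema mirrors the bullet above; connection setup is omitted and the names are illustrative):

```python
import psycopg                          # pip install "psycopg[binary]"
from psycopg.types.json import Jsonb

EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    aggregate_id uuid        NOT NULL,
    version      integer     NOT NULL,
    event_type   text        NOT NULL,
    payload      jsonb       NOT NULL,
    at           timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (aggregate_id, version)  -- doubles as the optimistic-concurrency guard
)
"""

OUTBOX_DDL = """
CREATE TABLE IF NOT EXISTS outbox (
    id         bigserial PRIMARY KEY,
    event_type text    NOT NULL,
    payload    jsonb   NOT NULL,
    published  boolean NOT NULL DEFAULT false
)
"""

def append_event(conn, aggregate_id, expected_version, event_type, payload):
    """Append the event and its outbox row in one transaction; a concurrent writer
    at the same version violates the primary key and the whole write rolls back."""
    with conn.transaction():
        conn.execute(
            "INSERT INTO events (aggregate_id, version, event_type, payload) "
            "VALUES (%s, %s, %s, %s)",
            (aggregate_id, expected_version + 1, event_type, Jsonb(payload)),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (%s, %s)",
            (event_type, Jsonb(payload)),
        )
```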
When to walk away
If after six months you find yourself:
- Constantly fighting snapshot/replay performance
- Rebuilding projections every two weeks due to bugs
- Team members dreading touching the event-sourced modules
- Schema evolution consuming most of the roadmap
The pattern may not fit your domain. Migrating away is painful but recoverable — much harder than never having adopted it, easier than living with perpetual friction.
Closing note
Event sourcing, done right, is one of the most durable architectural choices — decades-old systems in banking and healthcare still thrive on it. Done badly, it’s a tar pit. The difference is in the ground-level hygiene: aggregate design, schema discipline, snapshot strategy, replay-aware projections. Get those right and it rewards you for years.