“Distributed systems” sounds like a specialized topic, but the moment your application has two processes talking to each other — one service calling another, a server talking to a database, a browser calling an API — you’ve built one. This article covers the core ideas you need to understand why distributed systems are hard and what patterns exist to tame them.
The one sentence that explains everything
In a single process, calls either succeed or throw. In a distributed system, they can also do neither.
A local function call is synchronous, reliable, and fast. A network call is none of those. It can time out without telling you whether it succeeded. It can succeed on the other side but fail to return. It can succeed, return, and then the response gets dropped. It can arrive twice. It can arrive out of order.
Once you internalize this, every pattern in distributed systems makes sense as a way to live with it.
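To make that concrete, here is a minimal Java sketch using the standard java.net.http client against a made-up payments endpoint (the URL and payload are invented for illustration). The point is the timeout branch: it tells you nothing about what actually happened on the server.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class AmbiguousCall {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://payments.example.com/charge")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(2))
                .POST(HttpRequest.BodyPublishers.ofString("{\"amount\": 100}"))
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // We got an answer: we know what happened.
            System.out.println("Charged: " + response.statusCode());
        } catch (HttpTimeoutException e) {
            // Ambiguous: the charge may have happened, may still happen,
            // or may never have reached the server. We cannot tell from here.
            System.out.println("Timed out: outcome unknown");
        }
    }
}
```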
The eight fallacies
L. Peter Deutsch listed the first seven in 1994; James Gosling added the eighth a few years later. They still hold:
- The network is reliable. It isn’t.
- Latency is zero. It isn’t.
- Bandwidth is infinite. It isn’t.
- The network is secure. It isn’t.
- Topology doesn’t change. It does.
- There is one administrator. There isn’t.
- Transport cost is zero. It isn’t.
- The network is homogeneous. It isn’t.
Every distributed-systems bug you’ll encounter traces back to assuming one of these. When designing a system, you should be able to point at each component and say what happens when network assumptions break there.
The CAP theorem, briefly
You’ve probably heard of CAP. In plain words: when there’s a network partition (some nodes can’t reach others), you must choose between:
- Consistency — every read sees the latest write (or errors)
- Availability — every request gets a response (but maybe stale)
You can’t have both during a partition. Most real systems bias toward availability and accept some staleness, because an error is usually worse than slightly old data. Banking systems often bias toward consistency — they’d rather fail than show an incorrect balance.
CAP oversimplifies (see PACELC for a more nuanced version), but the core insight holds: network partitions force trade-offs that have no perfect answer.
Consistency models
Not “consistent or not”, but a spectrum:
Strong consistency — everyone sees the same thing at the same time. Expensive across regions. Only feasible within a single database node or via consensus algorithms (Paxos, Raft).
Linearizability — writes appear instantaneous, and every read sees the latest write. The gold standard; it pays for that in latency.
Sequential consistency — all nodes agree on the order of operations, but the order might lag real time.
Eventual consistency — all nodes will converge to the same state, eventually. The default for multi-region systems. It has to be designed for, not hoped for.
Causal consistency — if A happened before B, everyone sees A before B. Weaker than linearizable, stronger than eventual. Useful for social feeds, comments.
Most application problems are solvable with eventual consistency plus careful UX. Strong consistency is worth the cost only for specific subsystems (payments, inventory).
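One concrete way systems pick a point on this spectrum is quorum configuration in a Dynamo-style replicated store. The sketch below assumes the textbook model (N replicas, W write acknowledgements, R read acknowledgements, ignoring sloppy quorums and read repair): reads are guaranteed to overlap the latest acknowledged write only when R + W > N.

```java
// Simplified quorum arithmetic for a Dynamo-style store (illustrative only).
public class QuorumCheck {

    /** Reads overlap the most recent acknowledged write exactly when R + W > N. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        // 3 replicas, write to 2, read from 2: overlap guaranteed, stronger reads.
        System.out.println(readsSeeLatestWrite(3, 2, 2)); // true
        // 3 replicas, write to 1, read from 1: fast and available, but reads can be stale.
        System.out.println(readsSeeLatestWrite(3, 1, 1)); // false
    }
}
```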
Time is a lie
In a distributed system, there is no single clock. Each node has its own clock; they drift, and they can jump forward or backward. Given two events on different nodes, you usually can’t reliably say which happened first.
Practical consequences:
- Never trust System.currentTimeMillis() for ordering events across nodes
- Never use timestamps as database keys (clocks skew, duplicates happen)
- For “happened-before” ordering, use logical clocks (Lamport timestamps, vector clocks); a minimal sketch follows this list
- If you really need monotonic time, use a hybrid logical clock or a central time service
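A minimal Lamport-clock sketch, assuming a message-passing setup where every node stamps outgoing messages with tick() and merges incoming stamps with onReceive():

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal Lamport clock. Guarantee: if event A happened-before event B,
 * then timestamp(A) < timestamp(B). The converse does not hold.
 */
public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    /** Call for every local event, including sends; stamp the outgoing message with the result. */
    public long tick() {
        return counter.incrementAndGet();
    }

    /** Call when a message arrives, passing the sender's timestamp. */
    public long onReceive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    public long current() {
        return counter.get();
    }
}
```

Vector clocks extend the same idea with one counter per node, which also lets you detect that two events are concurrent rather than ordered.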
Failure modes you must design for
Partial failure. Some nodes are up, some are down, and from any single node you can’t tell which. This is the defining property of distributed systems.
Split brain. During a partition, both sides think they’re the authority and make conflicting decisions. Consensus algorithms exist to prevent this; getting them wrong is how data corruption happens.
Cascading failure. One slow node holds up its callers, which hold up theirs, and the whole system grinds to a halt. Timeouts, circuit breakers, and bulkheads exist to prevent this.
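To make the circuit-breaker idea concrete, here is a toy sketch: it counts consecutive failures, opens for a cooldown, then closes again. The thresholds are arbitrary, and real implementations add a half-open probing state and failure-rate windows.

```java
import java.time.Duration;
import java.time.Instant;

/** Toy circuit breaker: opens after N consecutive failures, retries after a cooldown. */
public class CircuitBreaker {
    private final int failureThreshold;
    private final Duration cooldown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration cooldown) {
        this.failureThreshold = failureThreshold;
        this.cooldown = cooldown;
    }

    public synchronized boolean allowRequest() {
        if (openedAt == null) return true; // closed: let calls through
        if (Instant.now().isAfter(openedAt.plus(cooldown))) {
            openedAt = null;               // cooldown over: try the dependency again
            consecutiveFailures = 0;
            return true;
        }
        return false;                      // open: fail fast instead of piling up callers
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = null;
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = Instant.now();      // trip the breaker
        }
    }
}
```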
Byzantine failures. Nodes don’t just crash, they lie (send wrong data). Rare in internal systems, common in adversarial ones (public blockchains). Usually out of scope for normal backends.
The patterns that help
- Retries with backoff for transient failures (sketched after this list)
- Timeouts everywhere — no network call without one
- Circuit breakers to stop hammering a dead dependency
- Idempotency keys so retries don’t cause duplicates
- Outbox patterns for reliable event publishing
- Sagas for multi-step business transactions
- Consensus (Raft, Paxos) when you truly need strong agreement
- CRDTs for mergeable data structures
- Eventual consistency with UX affordances (optimistic UI, “saving…” indicators)
Each pattern is scaffolding around one of the failures above. Use the minimum set your actual requirements demand.
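To make the retry and idempotency-key items concrete, here is a sketch of a retry loop with capped exponential backoff plus full jitter, reusing one idempotency key across attempts. The Operation interface and the key-per-call convention are made up for illustration; real services define their own idempotency mechanism.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {

    /** Hypothetical operation that takes an idempotency key and may fail transiently. */
    interface Operation {
        void run(String idempotencyKey) throws Exception;
    }

    static void callWithRetries(Operation op, int maxAttempts) throws Exception {
        // One key for the whole logical operation, so server-side dedup
        // can collapse retries into a single effect.
        String idempotencyKey = UUID.randomUUID().toString();
        long backoffMillis = 100;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run(idempotencyKey);
                return; // success
            } catch (Exception e) {
                if (attempt == maxAttempts) throw e; // out of attempts: surface the failure
                // Full jitter: sleep a random amount up to the current backoff cap.
                long sleep = ThreadLocalRandom.current().nextLong(backoffMillis + 1);
                Thread.sleep(sleep);
                backoffMillis = Math.min(backoffMillis * 2, 5_000); // cap the backoff
            }
        }
    }
}
```

The important detail is that the key is generated once per logical operation, not once per attempt, so the server can recognize retries and deduplicate them.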
What “correct” means
In single-process code, “correct” is deterministic. In distributed systems, “correct” means: whatever reasonable failure happens, the system either recovers automatically or reaches a state the operator can fix. It doesn’t mean no failures — it means failures are bounded, observable, and survivable.
If your architecture diagram doesn’t show what happens when each component fails, the design isn’t done.
Starting small
You don’t need Paxos on day one. A useful starting mental checklist:
- Every network call has a timeout
- Every retryable operation is idempotent
- Every slow dependency is behind a circuit breaker
- Every event producer uses an outbox
- Every metrics stack includes success/error rates per dependency
- Every incident review asks “which of the fallacies bit us?”
Get those right and you’ve solved 90% of the distributed-systems problems you’ll actually encounter.