A single Postgres instance is a single point of failure. For any system with real uptime requirements, you need replication — and a plan for promoting a replica when the primary fails. This article covers the mechanics.

Streaming replication basics

The Postgres primary writes every change to the WAL (write-ahead log). A replica connects and streams WAL records as they are produced, applying them to its local copy of the data. Result: replicas hold (nearly) the same data as the primary, continuously.

Replication can be synchronous or asynchronous:

  • Async — the primary commits immediately; WAL streams to replicas as fast as the network allows. Low write latency, but acknowledged writes can be lost on failover.
  • Sync — the primary waits for at least one replica to confirm before acknowledging the commit. Zero data loss if you fail over to the confirming replica, at the cost of higher write latency.

A common pattern: async replicas for read scale, plus a single sync replica for durability.
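
A sketch of the sync half, assuming a standby that connects with application_name = replica1 (the name is arbitrary; standbys not listed stay async):

-- On the primary: wait for the named standby to confirm each commit
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();

synchronous_commit can also be set per transaction, so low-value writes can opt out of the latency penalty.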

Setting up a replica

High level:

  1. Take a base backup of the primary (pg_basebackup)
  2. Start the replica pointing to the primary’s WAL stream
  3. Replica continuously applies incoming WAL

In practice, managed services (RDS, Cloud SQL, Crunchy Bridge) handle the setup for you. For self-managed clusters, Patroni is the de facto standard for Postgres HA: it handles setup, monitoring, and failover.
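
For the self-managed path, steps 1 and 2 collapse into one command. A sketch, with hostname, user, and data directory as placeholders:

# Clone the primary; -R writes standby.signal and primary_conninfo
# so the copy comes up as a streaming replica.
pg_basebackup -h primary.internal -U replicator \
    -D /var/lib/postgresql/16/main -R -P --wal-method=stream

The replicator role needs the REPLICATION attribute and a matching pg_hba.conf entry; once the copy finishes, start Postgres on the replica and it begins streaming.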

Replica lag — what it is and why it matters

Lag = how far behind the replica is from the primary in time or WAL bytes. Zero lag is rare; a few hundred milliseconds is typical for healthy async replicas.

Lag sources:

  • High write volume (WAL replay on the replica is single-threaded and can fall behind)
  • Long-running queries on the replica (replay pauses to avoid cancelling them; see the settings sketch below)
  • Limited network bandwidth between primary and replica
  • Slow disk on the replica
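
The second source is tunable. Two standby-side settings trade query survival against replay lag; the values here are illustrative:

-- How long replay waits for conflicting replica queries
-- before cancelling them (30s is the default)
ALTER SYSTEM SET max_standby_streaming_delay = '30s';
-- Tell the primary about replica queries to avoid conflicts,
-- at the cost of some bloat on the primary
ALTER SYSTEM SET hot_standby_feedback = 'on';
SELECT pg_reload_conf();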

Monitor:

-- Run on the primary: WAL bytes each replica has yet to replay
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Alert on lag > 10s for OLTP workloads. Lag > 60s often means “don’t read from replica right now.”
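
The query above reports bytes. For a time-based number, run this on the replica itself; note that it measures time since the last replayed transaction, so it overstates lag when the primary is idle:

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

On Postgres 10+, pg_stat_replication on the primary also exposes write_lag, flush_lag, and replay_lag as intervals.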

Reading from replicas

The temptation is strong: route reads to replicas and scale out. The caveats are real:

  • No read-your-writes guarantee. A user creates something, the next read hits a replica, and it isn't there yet. Bad UX.
  • Lag can spike. During heavy writes, lag can jump to minutes and replica reads become very stale.
  • Some transactions break. Serializable isolation isn't available on hot standbys, so guarantees differ between primary and replica.

Rules:

  • Read from a replica only for queries that tolerate staleness (see the freshness check below)
  • Route a user’s own reads to the primary for a short window (say, 30s) after they write
  • Keep at least one replica in reserve in case the others fall too far behind
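
One pattern that supports the first two rules: before serving a staleness-tolerant read, the application (or a routing layer) verifies the node really is a replica and checks how stale it is. A sketch:

-- Run on the candidate replica before routing reads to it
SELECT pg_is_in_recovery() AS is_replica,
       now() - pg_last_xact_replay_timestamp() AS approx_staleness;

If approx_staleness exceeds your threshold, fall back to the primary.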

Failover — the critical rehearsal

Primary fails. What happens? Depends on setup:

Manual failover. Someone on-call promotes a replica by hand. Recovery takes minutes at best, which is rarely acceptable for user-facing workloads.
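
The promotion itself is one call on the chosen replica (Postgres 12+; older versions use pg_ctl promote):

-- Returns true once promotion completes, waiting up to 60 seconds
SELECT pg_promote(wait => true, wait_seconds => 60);

The slow part is everything around it: detecting the failure, fencing the old primary, and repointing clients.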

Automatic failover with Patroni (or a managed service). It monitors health and promotes a replica, typically within about 30 seconds of the primary becoming unreachable.

Failover is never instantaneous. Applications must handle:

  • Connection errors during promotion
  • DNS/IP changes to the new primary
  • The loss of any writes that hadn’t reached the replica yet (with async replication)

Test failover regularly. A failover that hasn’t been rehearsed is a theoretical feature. Kill the primary in staging; see what breaks. Fix those things. Rehearse again.

Split brain

The scariest failure mode: primary is isolated by network partition but still running; replica gets promoted; now there are two primaries accepting writes.

Prevention:

  • Fencing. The old primary is forcibly stopped before a replica is promoted. Patroni does this via a watchdog or STONITH.
  • Majority voting. At least N of M nodes must agree before promotion, so a single isolated node can’t promote itself.
  • Synchronous replication with synchronous_commit = remote_apply. Guarantees that every acknowledged commit is already applied on the sync replica, so promoting it loses no acknowledged writes (see the sketch below).

Split brain is rare if properly configured; devastating when it happens. Invest in prevention.
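
The third item is one setting on the primary, paired with synchronous_standby_names from earlier; note it adds more write latency than plain synchronous commit:

-- Commits return only after the sync standby has applied them
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();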

Backup is not replication

Replication protects against hardware failure. It does not protect against:

  • Application bug that deletes data (replicated to all nodes)
  • Someone running DROP TABLE on the primary
  • Ransomware

For those, you need backups: pg_dump for small databases, or a WAL archive plus periodic base backups for point-in-time recovery (PITR). Keep them on separate storage and test the restore process.
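
A minimal sketch of the PITR ingredients, with a placeholder archive path; real setups usually delegate this to pgBackRest or WAL-G rather than a bare cp:

-- Ship every completed WAL segment to separate storage
-- (archive_mode needs a server restart to take effect)
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command =
    'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';

Pair the archive with periodic base backups and you can restore to any point in time between them.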

I’ve seen teams conflate the two and lose data because they “had replicas.” Don’t.

Connection routing

Applications need to know which node is primary. Options:

DNS. db-primary.internal points to the current primary. Simple, but DNS TTLs and client caching decide how quickly clients find the new primary after failover.

Virtual IP. Floats between nodes on failover. Works in Linux-heavy environments.

Service discovery. Patroni publishes state to Consul/etcd. Applications query discovery.

PgBouncer + reconfiguration. The connection pooler’s config is repointed at the new primary on failover.

For most production systems, PgBouncer in front of Postgres with connection strings referencing a DNS name works well. On failover, DNS updates, PgBouncer retries, apps see a brief blip.

Logical replication

Streaming replication is physical: it ships the WAL itself, so the replica is a byte-for-byte copy of the primary. Logical replication ships row-level changes instead; subscribers can have different schemas, a subset of tables, or even a different major version.

Useful for:

  • Cross-version upgrades (replicate from PG 14 to PG 16, then switch)
  • Partial replication (replicate only some tables)
  • Cross-region with selective data
  • CDC (the same space as Debezium)

More flexible, more operationally complex. Don’t reach for logical replication unless you need its specific features.
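
For the cross-version upgrade case, a minimal sketch; the table, role, and connection details are placeholders:

-- On the source (e.g. PG 14):
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the target (e.g. PG 16), which must already have a matching
-- orders table; existing rows are copied, then changes stream
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=old-primary.internal dbname=app user=replicator'
    PUBLICATION orders_pub;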

Closing note

Replication done right is invisible — the system just stays up through node failures. Replication done wrong is an outage magnifier. The difference is in the details: sync vs async choice, replica lag monitoring, failover rehearsals, split-brain prevention. Managed services handle most of this for you; self-managed Postgres should use Patroni and a battle-tested setup. Either way, rehearse failover before you need it — the first time a real primary dies is not when you want to discover what your HA strategy actually does.