Martin Fowler named the pattern after the strangler fig — a plant that wraps around an existing tree and gradually replaces it. Applied to software: incrementally replace parts of a legacy system with new implementations, running both side by side, until the old system can be removed.
It’s the most reliable migration strategy I’ve seen. Big-bang rewrites fail far more often. This article is the practical version.
Why not big-bang
Big-bang rewrites promise a clean new system in 6 months. They deliver a broken new system in 18, usually with:
- Lost edge cases the old system handled quietly
- Missing features that existed in the legacy system but were never written into the requirements
- Performance regressions only visible in production
- Users stuck with the old system because the new one isn’t ready
Strangler fig avoids all of this because the old system never goes away until it’s genuinely not needed.
The basic pattern
        ┌────────────────────────┐
        │     Traffic Router     │
        │   (gateway / facade)   │
        └────────────────────────┘
           │                 │
           ▼                 ▼
    ┌──────────────┐  ┌──────────────┐
    │    Legacy    │  │     New      │
    │    System    │  │    System    │
    └──────────────┘  └──────────────┘

Route specific paths, features, or users to the new system. Everything else stays with the legacy system. Over time, the router shifts more traffic to the new system and less to the legacy one. When legacy traffic hits zero, remove it.
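Here is what the router box can look like in code: a minimal sketch in Go using the standard library's reverse proxy. The upstream addresses and the /api/v2/ prefix are invented placeholders, and in practice this role is often played by nginx, Envoy, or an existing API gateway rather than hand-rolled code.

```go
// Minimal strangler facade: path-prefix routing between legacy and new.
// Sketch only; upstream addresses and the /api/v2/ prefix are invented.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	legacy, _ := url.Parse("http://legacy.internal:8080") // hypothetical upstream
	modern, _ := url.Parse("http://new.internal:8081")    // hypothetical upstream

	legacyProxy := httputil.NewSingleHostReverseProxy(legacy)
	modernProxy := httputil.NewSingleHostReverseProxy(modern)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Migrated slices go to the new system; everything else
		// keeps hitting legacy, which stays completely untouched.
		if strings.HasPrefix(r.URL.Path, "/api/v2/") {
			modernProxy.ServeHTTP(w, r)
			return
		}
		legacyProxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

As slices migrate, the prefix check grows into a routing table; when it matches everything, the legacy upstream can be deleted.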
The three decision points per slice
1. What to extract first?
Pick a slice with:
- Clear bounded context
- Low coupling to the rest of the legacy system
- Real pain today (slow, buggy, blocking new features)
- Moderate complexity — not trivial (you learn nothing), not massive (the first slice has to actually finish)
Classic good first slices: an authentication system, a notification service, a reporting dashboard.
2. How to route traffic?
Options by complexity:
- URL routing. New system handles /api/v2/orders; legacy handles the rest. Simplest, least flexible.
- Feature flag. Per-user or percentage-based routing (a sketch follows this list). Maximum control; requires flag infrastructure.
- Header-based. Canary clients send a special header. Useful for A/B testing or beta access.
- Data-based. Users migrated to the new system by ID range or migration timestamp.
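For the percentage-based option, the routing decision should be deterministic so a given user always lands on the same side within a rollout stage. A minimal sketch; the X-User-ID header is a stand-in for whatever stable identifier you actually have.

```go
// Deterministic percentage routing: hash a stable user ID into a
// 0-99 bucket and route to the new system if it falls under the
// rollout percentage. The X-User-ID header is a hypothetical stand-in.
package router

import (
	"hash/fnv"
	"net/http"
)

var rolloutPercent uint32 = 10 // dial up as the gates go green

func routeToNew(r *http.Request) bool {
	userID := r.Header.Get("X-User-ID")
	if userID == "" {
		return false // unidentified traffic stays on legacy
	}
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < rolloutPercent
}
```

Hashing rather than random sampling means users don't flip between systems from request to request, which keeps sessions coherent and bug reports reproducible.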
3. How to compare behavior?
Don’t trust that the new implementation matches legacy. Run both:
- Dark traffic. Call both systems; serve legacy's response; log any divergence from the new one (sketch below)
- Dual write, dual read. Write to both systems, compare on read
- Canary. Small % of users on the new system; monitor error rates and support tickets
Catch behavior drift before cutting over.
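A minimal sketch of the dark-traffic option, assuming the slice is read-only: the user always gets the legacy response, the new system is called in the background, and mismatches get logged. Shadowing writes needs far more care (you don't want the new system double-sending emails or double-charging cards).

```go
// Dark traffic sketch: replay a request against the new system in the
// background and log any divergence from the legacy response the user
// already received. Upstream URL is a placeholder; only idempotent
// reads should ever be shadowed this way.
package router

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

func shadowToNew(path string, legacyBody []byte) {
	go func() {
		resp, err := http.Get("http://new.internal:8081" + path)
		if err != nil {
			log.Printf("dark traffic: new system failed on %s: %v", path, err)
			return
		}
		defer resp.Body.Close()
		newBody, err := io.ReadAll(resp.Body)
		if err != nil || !bytes.Equal(newBody, legacyBody) {
			log.Printf("dark traffic: divergence on %s", path)
		}
	}()
}
```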
The strangler gates
Don’t increase traffic to the new system until each gate is green:
- Functional parity — all critical paths work
- Performance parity — p99 within tolerance
- Operational parity — monitoring, alerts, on-call runbooks
- Security parity — auth, audit, compliance
- Observability — logs, metrics, traces comparable to legacy
Skipping a gate is how teams end up with new systems worse than the ones they replaced.
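The gates work best as an explicit, mechanical checklist rather than a judgment call made under rollout pressure. A toy encoding of the idea (the field names are mine, not a standard):

```go
// Promotion gates as an explicit checklist: traffic to the new system
// only increases when every gate is green. Illustrative names only.
package migration

type Gates struct {
	FunctionalParity  bool // all critical paths verified
	PerformanceParity bool // p99 within agreed tolerance
	OperationalParity bool // monitoring, alerts, on-call runbooks
	SecurityParity    bool // auth, audit, compliance sign-off
	Observability     bool // logs, metrics, traces comparable to legacy
}

func (g Gates) AllGreen() bool {
	return g.FunctionalParity && g.PerformanceParity &&
		g.OperationalParity && g.SecurityParity && g.Observability
}
```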
A realistic timeline
For a non-trivial slice:
- Week 1-2: new service scaffolded, CI/CD ready
- Week 3-6: feature parity for the slice, dark traffic enabled
- Week 7-8: divergence fixed, first 1% of traffic
- Week 9-12: progressive rollout 1% → 10% → 50% → 100%
- Week 13-14: sanity window with both running
- Week 15-16: legacy code for this slice removed
~4 months per slice is realistic. Plan accordingly.
Common failure modes
The never-ending migration. New slices keep getting started, but legacy code never gets removed. Five years later, 80% of traffic still goes to legacy. Usually caused by missing sunset dates and political reluctance to delete old code.
Parallel features. Product adds features to legacy during migration. New system never catches up. Lock down legacy to bug fixes only during the migration.
No measurement. Team “thinks” the new slice handles 20% of traffic. Actually 5%. Always instrument the router to report per-path, per-version usage.
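The instrumentation doesn't have to be elaborate. A sketch using only Go's standard library expvar package, which exposes the counters at /debug/vars on the default mux; the backend labels are made up for illustration:

```go
// Count requests per backend and per path so "we think the new system
// handles 20%" becomes a measured number. The expvar package publishes
// these counters automatically at /debug/vars.
package router

import (
	"expvar"
	"fmt"
	"net/http"
)

var routed = expvar.NewMap("router_requests")

func instrument(backend string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Key by backend version and path, e.g. "new /api/v2/orders".
		routed.Add(fmt.Sprintf("%s %s", backend, r.URL.Path), 1)
		next.ServeHTTP(w, r)
	})
}
```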
Skipping dark traffic. First real-world request reveals 15 bugs. Dark traffic for 1-2 weeks before user-facing traffic flips catches most of them.
The shared database stays shared. Intentional data decoupling gets postponed indefinitely. Two systems writing to the same tables = a distributed monolith.
When strangler doesn’t fit
- Systems with fundamentally different data models where translation is too costly
- Regulatory environments where cutover must be atomic (rare but real)
- Small systems where the migration itself is cheaper than the parallel infrastructure
Most systems bigger than “a few services” benefit from strangler over big-bang. The ones that don’t are usually small enough that rewrite-in-place is actually viable.
Closing note
Strangler fig feels slow compared to the imaginary quick rewrite. It ships. Rewrites often don’t. The best migration I’ve seen used strangler for 18 months, had zero production incidents caused by the migration, and delivered a clean new system that the team trusted. The worst migration I’ve seen chose big-bang, took 26 months, and the new system was abandoned 6 months after launch. The slow path is the fast path.