A classic example: to place an order, you must reserve inventory, charge the customer, and schedule shipping. All three must succeed together, or none at all. In a monolith, a single @Transactional method covers it. In microservices with separate databases, no local transaction can span all three.

Sagas are the usual answer. This article explains the two flavors (orchestration and choreography), their trade-offs, and how to implement them in Java.

The problem

Distributed transactions (XA, two-phase commit) do exist. In practice they’re slow, fragile when a participant or the coordinator fails mid-commit, and poorly supported by many of the databases, brokers, and HTTP APIs that microservices rely on. Most modern systems give up on ACID across services and replace it with sagas: a sequence of local transactions, where each step has a compensating action that undoes it if a later step fails.

Step 1: reserve inventory   → on failure: nothing to undo
Step 2: charge card         → on failure: release inventory
Step 3: schedule shipping   → on failure: refund card, release inventory

Instead of “all succeed or all rollback”, you get “all succeed, or the system eventually returns to a consistent state via compensations.”
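The step table above can be sketched as a tiny generic runner. This is a minimal illustration, not a production design: SagaStep and SimpleSaga are hypothetical names, only unchecked exceptions are treated as step failures, and a compensation that itself throws is not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration; they do not come from any library.
record SagaStep(String name, Runnable action, Runnable compensation) {}

final class SimpleSaga {

    /**
     * Runs the steps in order. If a step throws, runs the compensations of
     * the steps that already completed, most recent first.
     * Returns true only if every step succeeded.
     */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step); // push() puts the newest step at the head
            } catch (RuntimeException e) {
                // Iterating a Deque used as a stack visits the newest entry first.
                for (SagaStep done : completed) {
                    done.compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```

The failing step is never pushed, so its own compensation does not run; only the steps that finished are undone, which matches the table: a shipping failure triggers the refund and then the inventory release.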

Two flavors

Orchestrated saga

A central coordinator drives the flow. It calls each service in order, handles the responses, and triggers compensations on failure.

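import org.springframework.stereotype.Service;

@Service
public class PlaceOrderSaga {

    // Clients for the three downstream services, injected by Spring.
    private final InventoryClient inventoryClient;
    private final PaymentClient paymentClient;
    private final ShippingClient shippingClient;

    public PlaceOrderSaga(InventoryClient inventoryClient,
                          PaymentClient paymentClient,
                          ShippingClient shippingClient) {
        this.inventoryClient = inventoryClient;
        this.paymentClient = paymentClient;
        this.shippingClient = shippingClient;
    }

    public SagaResult run(PlaceOrderCommand cmd) {
        String reservationId = null;
        String paymentId = null;

        try {
            // A failure in reserve() propagates to the caller:
            // nothing has been done yet, so there is nothing to undo.
            reservationId = inventoryClient.reserve(cmd.items());
            paymentId = paymentClient.charge(cmd.amount(), cmd.customerId());
            shippingClient.schedule(cmd.orderId(), cmd.address());
            return SagaResult.ok();

        } catch (PaymentFailedException e) {
            // Step 2 failed: undo step 1.
            inventoryClient.release(reservationId);
            return SagaResult.failed("payment_failed");

        } catch (ShippingFailedException e) {
            // Step 3 failed: undo steps 2 and 1, most recent first.
            paymentClient.refund(paymentId);
            inventoryClient.release(reservationId);
            return SagaResult.failed("shipping_failed");
        }
    }
}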

Pros: state machine lives in one place. Easy to read, debug, and extend. Explicit compensation logic.

Cons: the coordinator becomes a god-service, knowing about every other service. It is also a single point of failure for the flow: while the coordinator is down, no saga makes progress.

Choreographed saga

No coordinator. Each service listens for events and reacts by emitting its own.

Orders publishes OrderPlaced
  → Inventory listens, reserves, publishes InventoryReserved (or Failed)
      → Payments listens on InventoryReserved, charges, publishes PaymentCharged (or Failed)
          → Shipping listens on PaymentCharged, schedules, publishes ShipmentScheduled
Compensations via Failed events flowing back up:
  PaymentFailed → Inventory listens, releases
  ShipmentFailed → Payments listens, refunds; Inventory listens, releases
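The event chain above can be simulated in a few lines. This is a sketch under loud assumptions: EventBus is a hypothetical in-memory stand-in for a real broker, delivery is synchronous, events carry only an order id, and the InventoryReleased event is invented here so the compensation is visible in the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory bus; a real system would use a message broker.
final class EventBus {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
    final List<String> published = new ArrayList<>(); // kept so the flow can be inspected

    void subscribe(String eventType, Consumer<String> handler) {
        handlers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        published.add(eventType);
        for (Consumer<String> handler : handlers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId);
        }
    }
}

final class ChoreographyDemo {

    /** Wires the three services; paymentSucceeds simulates the card outcome. */
    static EventBus wire(boolean paymentSucceeds) {
        EventBus bus = new EventBus();
        // Inventory: reserve on OrderPlaced, release on PaymentFailed.
        bus.subscribe("OrderPlaced", id -> bus.publish("InventoryReserved", id));
        bus.subscribe("PaymentFailed", id -> bus.publish("InventoryReleased", id));
        // Payments: charge once inventory is reserved.
        bus.subscribe("InventoryReserved", id ->
                bus.publish(paymentSucceeds ? "PaymentCharged" : "PaymentFailed", id));
        // Shipping: schedule once the payment is charged.
        bus.subscribe("PaymentCharged", id -> bus.publish("ShipmentScheduled", id));
        return bus;
    }
}
```

Calling wire(false).publish("OrderPlaced", "order-1") leaves OrderPlaced, InventoryReserved, PaymentFailed, InventoryReleased in the published list. Note that no class in the sketch describes the whole flow; it only emerges from the subscriptions.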

Pros: loosely coupled. No central bottleneck. Services evolve independently.

Cons: the overall flow is invisible — no single place shows the state machine. Debugging requires tracing events across services. Easy to build unintentional cycles or race conditions.

Picking one

Orchestration when:

  • Observability matters (you want to see saga state in one query)
  • Compliance requires auditable step-by-step execution
  • Compensations are complex and interact
  • Team is small enough to own the coordinator

Choreography when:

  • Services are owned by different teams that want independence
  • Each step is simple and local
  • The flow has a clear linear shape without many conditional branches
  • You can invest in distributed tracing to make the flow visible

Most medium-sized systems end up with a mix: orchestrated sagas for the critical business flows (placing an order, settling a trade) and choreographed events for side-effects (notifications, analytics, audit).

Making compensations correct

The hard part. Rules:

Compensations must be idempotent. They may be triggered twice due to retries. Releasing an already-released reservation shouldn’t fail; it should be a no-op.
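A minimal sketch of an idempotent release, assuming a hypothetical in-memory ReservationStore (a real inventory service would keep this in its database, and would also check that stock is available before reserving):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory store, for illustration only.
final class ReservationStore {
    private final Set<String> active = ConcurrentHashMap.newKeySet();
    private final AtomicInteger available;

    ReservationStore(int initialStock) {
        this.available = new AtomicInteger(initialStock);
    }

    /** Takes one unit; retrying with the same id does not take a second one. */
    void reserve(String reservationId) {
        if (active.add(reservationId)) {
            available.decrementAndGet();
        }
    }

    /**
     * Idempotent compensation: stock is returned only the first time.
     * Repeated calls, unknown ids, and null (the step never ran) are no-ops.
     */
    void release(String reservationId) {
        if (reservationId == null) {
            return;
        }
        if (active.remove(reservationId)) { // atomic: only one caller sees true
            available.incrementAndGet();
        }
    }

    int available() {
        return available.get();
    }
}
```

The design choice is to key the compensation on the reservation id rather than on a quantity, so a retried release has nothing left to act on.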

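Compensations must cope with partial state. If step 2 failed halfway, the side-effects of step 1 exist, and step 2 itself may have partly applied (for example, the card was charged but the response was lost). Compensations run against whatever state exists, not a theoretical “clean” one.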

Not every step can be compensated. “Send email” can’t be un-sent. Order the steps so irreversible actions come last, and run them only after every reversible step has succeeded.

Persisting saga state

For orchestrated sagas, put state in a queryable table (PostgreSQL syntax):

CREATE TABLE saga_instance (
    id            UUID PRIMARY KEY,
    type          TEXT NOT NULL,
    state         TEXT NOT NULL,    -- STARTED, INVENTORY_RESERVED, PAYMENT_CHARGED, COMPLETED, COMPENSATING, COMPENSATED, FAILED
    context       JSONB NOT NULL,   -- input data, step results
    last_error    TEXT,
    started_at    TIMESTAMPTZ NOT NULL,
    updated_at    TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_saga_open ON saga_instance(state) WHERE state NOT IN ('COMPLETED','COMPENSATED','FAILED');

Benefits:

  • Stuck sagas are visible in one SQL query
  • A crashed coordinator can resume by reading state
  • Compliance / support teams can see individual saga instances
  • Operators can intervene manually
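On the Java side, the state column can be mirrored by an enum that rejects illegal transitions. The sketch below is in-memory only, and the transition rules are an assumption (the table comment lists the states but not which moves are legal). SagaState and SagaInstance are hypothetical names, and a real coordinator would write each new state to saga_instance before moving on.

```java
import java.util.EnumSet;
import java.util.Set;

enum SagaState {
    STARTED, INVENTORY_RESERVED, PAYMENT_CHARGED,
    COMPLETED, COMPENSATING, COMPENSATED, FAILED;

    /** Terminal states are the ones excluded by the idx_saga_open partial index. */
    boolean isTerminal() {
        return this == COMPLETED || this == COMPENSATED || this == FAILED;
    }

    /** Assumed transition rules; adjust to the real flow. */
    Set<SagaState> allowedNext() {
        switch (this) {
            case STARTED:            return EnumSet.of(INVENTORY_RESERVED, FAILED);
            case INVENTORY_RESERVED: return EnumSet.of(PAYMENT_CHARGED, COMPENSATING);
            case PAYMENT_CHARGED:    return EnumSet.of(COMPLETED, COMPENSATING);
            case COMPENSATING:       return EnumSet.of(COMPENSATED, FAILED);
            default:                 return EnumSet.noneOf(SagaState.class); // terminal
        }
    }
}

final class SagaInstance {
    private SagaState state = SagaState.STARTED;

    SagaState state() {
        return state;
    }

    /** Moves to the next state, or throws if the move is not allowed. */
    void transitionTo(SagaState next) {
        if (!state.allowedNext().contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }
}
```

A coordinator restarting after a crash could load every instance whose state is not terminal and continue from allowedNext(), which is the same set of rows the partial index covers.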

Timeouts and stuck sagas

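Each step should have a timeout. If a step doesn’t respond within N seconds, treat it as failed and compensate, but remember that a timeout only means no answer arrived: the step may still have succeeded, so its compensation has to handle both cases. Periodically scan for sagas that have sat in an intermediate state longer than expected; they signal bugs or dead downstream services.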
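One way to bound a step is to run it asynchronously and wait with a deadline. This is a sketch using only java.util.concurrent; StepTimeouts and callWithTimeout are hypothetical names, the step is assumed to return a non-null value, and an HTTP client’s own timeout settings would usually be the first line of defense.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class StepTimeouts {

    /**
     * Runs the step on the common pool and waits at most `timeout`.
     * Returns the result, or Optional.empty() if no answer arrived in time.
     * The step must not return null.
     */
    static <T> Optional<T> callWithTimeout(Supplier<T> step, Duration timeout) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(step);
        try {
            return Optional.of(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // The step keeps running in the background and may still succeed,
            // so the caller's compensation has to tolerate both outcomes.
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("step failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

An empty result is the signal to start compensating. Because the timed-out call is not cancelled on the remote side, this only works safely together with idempotent compensations.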

Tools for orchestration

Rolling your own state machine works for simple cases. For more than ~5 states with complex transitions, use a dedicated tool:

  • Temporal — excellent, handles durability and replay automatically
  • Camunda — BPMN-based, strong for business-process-heavy workflows
  • Spring Statemachine — lighter, in-process, fine for single-service sagas
  • AWS Step Functions — managed, good for AWS-native systems

Closing note

Sagas trade the all-or-nothing simplicity of distributed transactions for the realism of “failures happen, compensations are our response”. The best saga implementations I’ve seen were clear-eyed about where irreversible actions lived, designed compensations as first-class citizens, and persisted state where operators could see it. Do those three things and most sagas are tractable. Skip any of them and you’ll be debugging stuck business processes at 2 AM.