“Save the order, then publish an event” is the most common dual-write problem in backend systems. Two resources, a database and a message broker, and no single transaction that can span both. The transactional outbox pattern solves it cleanly. This article covers the full pattern, not just the concept.

The bug everyone writes once

@Transactional
public Order placeOrder(CreateOrderRequest req) {
    Order order = orderRepo.save(Order.from(req));
    kafka.send("order.placed", order.toEvent()).get();
    return order;
}

What’s wrong: the DB commit and the Kafka send are not atomic. With @Transactional, the send runs before the commit; moving the send after the commit only moves the gap. Either way there are four failure modes:

  1. The transaction commits, but the send fails (send placed after the commit): the event is never published and downstream services silently diverge
  2. The transaction commits, then the JVM crashes before kafka.send runs: same as above
  3. The send succeeds, then the transaction rolls back (send inside the transaction, as in the code above): an event is published for an order that never existed
  4. Kafka acks the send, but the message was never fully replicated and is lost when the broker fails over

The bug surfaces as “sometimes downstream services miss events” — rare enough to be hard to reproduce, frequent enough to corrupt state over time.

The fix

Write the event into the same database transaction as the business data. A separate publisher reads the events and sends them to Kafka. If the publisher crashes, unpublished events sit waiting; when it recovers, it picks up where it left off.

┌───────────────────────────┐
│     DB transaction        │
│   ┌──────────────────┐    │
│   │  orders INSERT   │    │
│   └──────────────────┘    │
│   ┌──────────────────┐    │   (one commit)
│   │  outbox INSERT   │    │
│   └──────────────────┘    │
└───────────────────────────┘

           ▼  poller reads, sends, marks sent
        ┌─────────┐
        │  Kafka  │
        └─────────┘

The schema

CREATE TABLE outbox (
    id              UUID PRIMARY KEY,
    aggregate_type  TEXT NOT NULL,
    aggregate_id    TEXT NOT NULL,
    event_type      TEXT NOT NULL,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at    TIMESTAMPTZ
);

CREATE INDEX idx_outbox_unpublished
    ON outbox (created_at) WHERE published_at IS NULL;

The partial index is critical — it keeps the “find unpublished” query fast even when the table has millions of rows.

The producer

@Entity
@Table(name = "outbox")
public class OutboxMessage {
    @Id UUID id;
    String aggregateType;
    String aggregateId;
    String eventType;
    String payload;
    Instant createdAt;
    Instant publishedAt;
    // constructors, getters…
}

@Service
public class OrderService {
    private final OrderRepository orderRepo;
    private final OutboxRepository outboxRepo;
    private final ObjectMapper json;
    // constructor injection omitted

    @Transactional
    public Order placeOrder(CreateOrderRequest req) {
        Order order = orderRepo.save(Order.from(req));

        // outbox row goes into the same transaction as the order itself
        outboxRepo.save(new OutboxMessage(
            UUID.randomUUID(),
            "Order",
            order.getId().toString(),
            "order.placed",
            serialize(OrderPlacedEvent.from(order)),
            Instant.now(),
            null
        ));

        return order;
    }

    private String serialize(Object event) {
        try {
            return json.writeValueAsString(event);
        } catch (JsonProcessingException e) {
            // a serialization failure rolls back the whole transaction
            throw new IllegalStateException("failed to serialize event", e);
        }
    }
}

Both inserts in one transaction. Atomic. No dual-write problem.

The publisher

@Component
public class OutboxPublisher {
    private static final Logger log = LoggerFactory.getLogger(OutboxPublisher.class);

    private final OutboxRepository repo;
    private final KafkaTemplate<String, String> kafka;
    // constructor injection omitted

    @Scheduled(fixedDelay = 500)
    @Transactional
    public void publish() {
        List<OutboxMessage> batch = repo.findTop100ByPublishedAtIsNullOrderByCreatedAtAsc();

        for (OutboxMessage m : batch) {
            try {
                // block until Kafka acks, only then mark the row as published
                kafka.send(m.getEventType(), m.getAggregateId(), m.getPayload())
                     .get(5, TimeUnit.SECONDS);
                m.setPublishedAt(Instant.now()); // flushed to the DB when the transaction commits
            } catch (Exception e) {
                log.error("publish failed, will retry", e);
                break; // stop this batch, try again next tick from the oldest unpublished row
            }
        }
    }
}

500ms delay between polls is a reasonable default. Scanning is cheap thanks to the partial index.
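
For completeness, here is what the repository behind both beans could look like; a sketch, with the interface and entity names taken from the code above and Spring Data deriving the query from the method name.

public interface OutboxRepository extends JpaRepository<OutboxMessage, UUID> {
    // derived query: WHERE published_at IS NULL ORDER BY created_at ASC LIMIT 100
    List<OutboxMessage> findTop100ByPublishedAtIsNullOrderByCreatedAtAsc();
}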

Delivery guarantees

At-least-once. If Kafka acks but the update fails (row stays unpublished), the message publishes again. Consumers must be idempotent — this is non-negotiable regardless of pattern.

Not exactly-once. Real exactly-once requires transactional Kafka producers + proper coordination and is rarely worth the complexity. Design for at-least-once.
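
What consumer-side idempotency can look like in practice: a dedupe table keyed by a unique event id, written in the same transaction as the business effect. A minimal sketch, assuming the payload carries the outbox row’s id in an eventId field and a processed_event(event_id uuid primary key) table exists; neither is part of the code above, and json/jdbc are an ObjectMapper and a JdbcTemplate on the consumer bean.

@KafkaListener(topics = "order.placed")
@Transactional
public void handle(String payload) throws Exception {
    // assumption: the publisher put the outbox row's id into the payload as "eventId"
    String eventId = json.readTree(payload).get("eventId").asText();

    // insert-if-absent is the dedupe; a redelivered event inserts zero rows
    int inserted = jdbc.update(
        "INSERT INTO processed_event (event_id) VALUES (?::uuid) ON CONFLICT DO NOTHING",
        eventId);
    if (inserted == 0) {
        return; // duplicate delivery, already handled
    }

    // apply the business effect here, in the same transaction as the dedupe insert
}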

Alternatives and comparisons

Change Data Capture (Debezium). Instead of polling, Debezium tails the Postgres WAL and streams changes directly to Kafka. No polling latency, lower DB load. More ops complexity (run Debezium + Kafka Connect). For high volume, better than polling.

Listen/Notify. Postgres-specific. Use LISTEN in the publisher to wake up immediately when new outbox rows arrive, keeping the poll as a fallback. Cuts polling latency; a sketch of the Postgres side follows after these alternatives.

Inbox at the consumer. Some teams add an outbox-like table on the consumer side to deduplicate incoming events. Useful, but it solves consumer idempotency, not atomic publishing; a complementary pattern rather than an alternative.
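
For the Listen/Notify option, a minimal sketch of the Postgres side; function, trigger, and channel names are illustrative, and the publisher runs LISTEN outbox_channel on a dedicated connection while keeping the poll as a fallback.

-- fire a notification for every new outbox row
CREATE OR REPLACE FUNCTION notify_outbox() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('outbox_channel', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_notify
    AFTER INSERT ON outbox
    FOR EACH ROW
    EXECUTE FUNCTION notify_outbox();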

Cleaning up

Published rows accumulate forever if you let them. Pick one:

  • Delete on a schedule: DELETE FROM outbox WHERE published_at IS NOT NULL AND published_at < now() - interval '7 days'
  • Move published rows into an archive table
  • Partition the table by time (native partitioning or TimescaleDB) and drop old partitions

Keep a short history window — enough for debugging but not forever.
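
A scheduled sweep covering the first option is usually enough. A sketch, assuming the publisher service has a JdbcTemplate named jdbc and the same log field as above:

// hourly retention sweep; keeps a 7-day window of published rows for debugging
@Scheduled(cron = "0 0 * * * *")
public void purgePublishedRows() {
    int deleted = jdbc.update(
        "DELETE FROM outbox WHERE published_at IS NOT NULL " +
        "AND published_at < now() - interval '7 days'");
    log.info("outbox retention deleted {} rows", deleted);
}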

Ordering

Events for the same aggregate should reach Kafka in order. Single-threaded publisher preserves order; multi-threaded doesn’t without coordination. Options:

  • Publisher is single instance (simplest, lowest throughput)
  • Partition by aggregate ID — each partition handled by one consumer at a time
  • Claim outbox rows with SELECT ... FOR UPDATE SKIP LOCKED so each row is processed once

For most business events, partitioning by aggregate_id + single-thread-per-partition works.
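
The claim query for the last option looks like this; each publisher instance grabs its own batch, and rows locked by another instance are skipped instead of blocking. With several claimers you give up global ordering, so pair this with keying by aggregate_id if per-aggregate order matters.

-- claim a batch of unpublished rows; concurrent publishers skip each other's rows
SELECT *
FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;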

Monitoring

Critical metrics:

  • Unpublished count — alerts if > N
  • Oldest unpublished age — alerts if > N seconds
  • Publish rate — sudden drop → publisher stuck
  • Publish errors — repeated errors → Kafka down or schema mismatch

Without these, the pattern is invisible until users notice missing data.
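
The first two metrics come straight from the outbox table:

-- backlog size and age of the oldest unpublished event, in seconds
SELECT count(*)                                    AS unpublished_count,
       extract(epoch FROM now() - min(created_at)) AS oldest_unpublished_seconds
FROM outbox
WHERE published_at IS NULL;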

Mistakes I’ve seen in outbox code

Outbox in a separate transaction. Defeats the entire purpose. Same transaction or nothing.

Async publish without ack. Calling kafka.send(...) without waiting for the acknowledgment means the publisher marks the row as published before Kafka confirms it; if Kafka then drops the message, the event is gone. Always block on the future (.get()) or mark the row only in the send callback.

No ordering guarantee when one is required. Two events for the same order arrive out of order, and debugging it afterward is miserable. Think about ordering before production.

No cleanup. The outbox table grows to 500 million rows. Queries slow down. Add retention early.

Closing note

The outbox pattern is one of those ideas that feels like extra work until the first time it saves you from a subtle, hours-long data consistency bug. Add it from the first event your service publishes. The time investment is small; the reliability dividend is massive.