A consumer pulls a message, tries to process it, throws an exception. What should happen? If you don’t have an explicit answer, the default is usually “retry forever, block the rest of the queue, page the on-call engineer at 3 AM.” Dead letter queues are how you stop that from happening.

What a DLQ is

A Dead Letter Queue is a separate destination where messages go when they can’t be processed normally. Poison messages — ones that throw every time — don’t block the main queue. They move to a side queue for human inspection.

Every message broker has a way to configure this:

  • RabbitMQ: x-dead-letter-exchange on the queue
  • Kafka (no native DLQ): applications send bad messages to a separate .dlq topic
  • SQS: RedrivePolicy with deadLetterTargetArn
  • NATS JetStream: no native DLQ either; MaxDeliver caps delivery attempts, and the max-deliveries advisory tells you which messages to copy into a dead-letter stream

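For RabbitMQ, for example, that argument is attached when the queue is declared. A minimal sketch using Spring AMQP’s QueueBuilder; the queue, exchange, and routing key names are placeholders:

@Bean
public Queue ordersQueue() {
    // messages rejected (or expired) on this queue are re-published to the "dlx" exchange
    return QueueBuilder.durable("orders")
        .withArgument("x-dead-letter-exchange", "dlx")
        .withArgument("x-dead-letter-routing-key", "orders.dead")
        .build();
}
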
The retry + DLQ pattern

Typical flow for a consumer:

receive message
  → try to process
    success → ack, done
    transient failure (timeout, 5xx) → retry with backoff
    retries exhausted or permanent failure → move to DLQ, ack main queue

The main queue keeps flowing. The DLQ collects things needing attention.
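
Hand-rolled, the same decision tree looks roughly like the sketch below. The helpers (process, ack, publishToDlq, sleepWithBackoff), MAX_ATTEMPTS, and the two exception types are placeholders for whatever your broker client and business code actually provide:

void handle(Message msg) {
    for (int attempt = 1; ; attempt++) {
        try {
            process(msg);                     // business logic
            ack(msg);                         // success: ack, done
            return;
        } catch (TransientException e) {      // timeout, 5xx, lock contention
            if (attempt >= MAX_ATTEMPTS) {    // retries exhausted
                publishToDlq(msg, e);
                ack(msg);                     // ack the main queue so it keeps flowing
                return;
            }
            sleepWithBackoff(attempt);        // wait, then try again
        } catch (PermanentException e) {      // bad data, validation failure
            publishToDlq(msg, e);             // straight to the DLQ, no retries
            ack(msg);
            return;
        }
    }
}

The Spring Kafka example below gets the same behavior from configuration instead of a hand-written loop.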

A Kafka example in Spring

Spring Kafka with an error handler that sends to DLQ after retries:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
        (record, ex) -> new TopicPartition(record.topic() + ".dlq", record.partition()));

    FixedBackOff backOff = new FixedBackOff(1000L, 3L); // 3 retries, 1s apart
    return new DefaultErrorHandler(recoverer, backOff);
}

@KafkaListener(topics = "order.placed", groupId = "fulfillment")
public void on(OrderPlaced event) {
    fulfillmentService.process(event);
}

If fulfillmentService.process throws 4 times in a row, the record is published to order.placed.dlq with headers describing the error and the original topic.
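
Those are ordinary Kafka headers, so a listener on the DLQ topic (or any inspection tool) can read them back. A small sketch, assuming the header-name constants spring-kafka exposes on KafkaHeaders and a log field on the class:

@KafkaListener(topics = "order.placed.dlq", groupId = "fulfillment-dlq-inspector")
public void onDeadLetter(ConsumerRecord<?, ?> record) {
    // DeadLetterPublishingRecoverer writes these as raw byte[] header values
    Header origin = record.headers().lastHeader(KafkaHeaders.DLT_ORIGINAL_TOPIC);
    Header error = record.headers().lastHeader(KafkaHeaders.DLT_EXCEPTION_MESSAGE);
    log.warn("dead letter from {}: {}",
        origin == null ? "?" : new String(origin.value(), StandardCharsets.UTF_8),
        error == null ? "?" : new String(error.value(), StandardCharsets.UTF_8));
}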

Critical: include error context in the DLQ message

A DLQ message without context is nearly useless. When publishing to the DLQ, attach:

  • Original topic and partition
  • Offset where it was read
  • Timestamp of the first failure
  • Exception type and message
  • Stack trace (bounded; the first 2-3 frames are usually enough)
  • Consumer group that failed

Spring’s DeadLetterPublishingRecoverer does this via Kafka headers by default. For custom implementations, don’t skip it.
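
If you publish to the DLQ yourself with the plain Kafka producer, a minimal sketch of attaching that context by hand looks like this. The dlq.* header names are just a convention invented here, and failed, ex, groupId, and producer come from the surrounding consumer code:

void publishToDlq(Producer<String, byte[]> producer, String groupId,
                  ConsumerRecord<String, byte[]> failed, Exception ex) {
    ProducerRecord<String, byte[]> dead =
        new ProducerRecord<>(failed.topic() + ".dlq", failed.key(), failed.value());
    dead.headers()
        .add("dlq.original-topic", utf8(failed.topic()))
        .add("dlq.original-partition", utf8(Integer.toString(failed.partition())))
        .add("dlq.original-offset", utf8(Long.toString(failed.offset())))
        .add("dlq.failed-at", utf8(Instant.now().toString()))
        .add("dlq.exception-class", utf8(ex.getClass().getName()))
        .add("dlq.exception-message", utf8(String.valueOf(ex.getMessage())))
        .add("dlq.consumer-group", utf8(groupId));
    // add a bounded stack-trace header the same way if you want one
    producer.send(dead);
}

private static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
}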

What to do with DLQ messages

This is where most teams drop the ball. A DLQ is not a garbage dump — it’s a to-do list. Typical handling:

  1. Alert — DLQ depth > 0 fires an alert after 5 minutes (immediate alerting is too noisy for transient issues)
  2. Inspect — tools or dashboards let you view DLQ messages and the error that landed them there
  3. Categorize:
    • Bug on our side — fix the code, redrive the messages through the main topic
    • Bad data from upstream — report back to the producer; sometimes drop, sometimes escalate
    • Expired — messages referencing deleted resources, often safe to drop
  4. Redrive or discard — once handled, either replay into the main topic or archive

Without a redrive mechanism, DLQs become write-only — messages accumulate and nobody acts.

A simple redrive pattern

// Exposed through an admin endpoint or CLI command; deliberately not @Scheduled.
public void redriveTopic(String dlqTopic, String mainTopic, int maxRecords) {
    ConsumerFactory<String, Object> factory = ...;
    try (Consumer<String, Object> consumer = factory.createConsumer("dlq-admin", null)) {
        consumer.subscribe(List.of(dlqTopic));
        ConsumerRecords<String, Object> records = consumer.poll(Duration.ofSeconds(5));
        Map<TopicPartition, OffsetAndMetadata> redriven = new HashMap<>();
        int count = 0;
        for (ConsumerRecord<String, Object> record : records) {
            if (count >= maxRecords) break;
            // republish to the main topic, keeping the key so partitioning stays stable
            kafkaTemplate.send(mainTopic, record.key(), record.value());
            redriven.put(new TopicPartition(record.topic(), record.partition()),
                         new OffsetAndMetadata(record.offset() + 1));
            count++;
        }
        kafkaTemplate.flush(); // wait until the republished records are acknowledged
        // commit only the offsets that were actually redriven, so nothing is skipped or replayed twice
        consumer.commitSync(redriven);
    }
}

Trigger on demand, not automatically. Automatic redrive creates infinite loops for truly broken messages.

Classifying errors

The dividing line: retry vs. dead-letter.

Retry for:

  • Timeouts, connection errors
  • 5xx responses
  • Lock contention, deadlocks
  • Temporary downstream unavailability

Dead-letter immediately for:

  • Deserialization failures (message will never be parseable)
  • 4xx responses (the request is bad, not the system)
  • Business validation failures (order references deleted customer)
  • Poison messages from schema drift

Conflating the two — retrying forever on deserialization errors — is how you get queue backpressure and alert storms.
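
In the Spring setup from earlier, that split is configured on the error handler: retryable exceptions go through the backoff, not-retryable ones go straight to the recoverer. A sketch of the same errorHandler bean body with the classification made explicit; the exception choices are illustrative (spring-kafka already treats DeserializationException as fatal by default):

FixedBackOff backOff = new FixedBackOff(1000L, 3L);
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// these skip the retries and go straight to the DLQ
handler.addNotRetryableExceptions(
    DeserializationException.class,     // the bytes will never parse
    IllegalArgumentException.class,     // stand-in for a business validation failure
    HttpClientErrorException.class);    // a 4xx from a downstream call (Spring Web)
return handler;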

Don’t retry forever

The most common mistake: a retry loop with a fixed 10-second sleep and no exit condition. If the message is genuinely broken, this burns resources and blocks the queue (or the partition, in Kafka) indefinitely.

Exit conditions:

  • Fixed number of attempts (3-5 typical)
  • Time-bounded (retry for up to 30 seconds total)
  • State-bounded (track the failure count, e.g. in a message header; past a threshold, DLQ)

Exponential backoff with jitter — 1s, 2s, 4s, 8s — gives downstream systems time to recover without giving up too soon.
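
A sketch of computing those delays; the 1-second base, 8-second cap, and 20% jitter are arbitrary choices, not something any broker or framework mandates:

// delay before retry attempt n (1-based): 1s, 2s, 4s, 8s capped, plus up to 20% random jitter
static long backoffMillis(int attempt) {
    long base = 1_000L;
    long capped = Math.min(base << (attempt - 1), 8_000L);
    long jitter = ThreadLocalRandom.current().nextLong(capped / 5 + 1);
    return capped + jitter;
}

In the Spring Kafka setup above, ExponentialBackOffWithMaxRetries can take the place of FixedBackOff for the exponential part; the jitter you still add yourself.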

DLQ anti-patterns

Consume the DLQ directly. If you have a consumer reading the DLQ and re-processing, you’ve just made the DLQ the real queue. Bad messages re-circulate forever.

One DLQ for all topics. Makes monitoring and redrive harder. One DLQ per source topic.

No retention on DLQ. Bad messages accumulate for years. Set a retention policy (30 days is reasonable) and archive older stuff.
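
For a Kafka dead-letter topic, that means a per-topic retention.ms override. A sketch using the Kafka AdminClient; the topic name and bootstrap address are placeholders:

void setDlqRetention() throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (Admin admin = Admin.create(props)) {
        ConfigResource dlqTopic = new ConfigResource(ConfigResource.Type.TOPIC, "order.placed.dlq");
        AlterConfigOp thirtyDays = new AlterConfigOp(
            new ConfigEntry("retention.ms", String.valueOf(Duration.ofDays(30).toMillis())),
            AlterConfigOp.OpType.SET);
        Map<ConfigResource, Collection<AlterConfigOp>> update = Map.of(dlqTopic, List.of(thirtyDays));
        admin.incrementalAlterConfigs(update).all().get();
    }
}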

No alerting. Messages land in DLQ, nobody notices, data is silently missing from downstream systems until a customer complaint.

Kafka-specific: dead-letter topics vs. parking lot

In Kafka, the “DLQ” is just another topic. Some teams use a per-topic .dlq suffix; others use a global messages.dead-letter. Per-topic is usually better — it keeps monitoring and redrives scoped to a single source topic.

A related pattern: the parking lot topic. When a consumer sees a message it’s not ready to handle yet (e.g., it references a resource that hasn’t been created), it moves the message to a parking lot topic and retries it later. That’s different from a DLQ: the parking lot holds temporary failures, the DLQ holds terminal ones.

Closing note

DLQs are not optional. Every consumer processing real business events should have one and should have alerting on it. The amount of engineering work is tiny — a few lines of config — and it prevents a class of incidents that would otherwise take the whole queue down. The hardest part isn’t adding the DLQ; it’s being disciplined about draining it.