Introduction
When message consumers repeatedly fail to process certain messages, those messages are routed to a dead letter queue (DLQ). If the root cause affects a large volume of messages -- such as a schema change, corrupted payload, or downstream service outage -- the DLQ can grow faster than the operations team can investigate, consuming storage and masking new failures among thousands of existing ones.
Symptoms
- DLQ depth grows by thousands of messages per hour with no manual intervention
- DLQ storage approaches broker disk limits, triggering high watermark alerts
- Majority of DLQ messages share the same error type or failure reason
- Consumer reprocessing attempts from DLQ fail with identical errors
- Error message:
Deserialization failed: unexpected token at position
Common Causes
- Systematic schema incompatibility causing all new messages to fail deserialization
- Downstream API returning 500 errors, making all messages temporarily unprocessable
- Malformed payloads from a buggy producer version that writes messages in an invalid format
- Consumer code bug that fails on a specific edge case present in many messages
- DLQ consumer not running or misconfigured, leaving messages to accumulate
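The causes above split roughly into failures that will keep failing until something is fixed (schema or payload problems) and failures that may clear on their own (a downstream outage). A minimal sketch of that split follows, assuming the failure reason is available as a string. The patterns and labels are illustrative assumptions: only the "Deserialization failed" text and the 500-class status come from this article, so adjust them to the errors your consumers actually emit.

```python
import re

# Hypothetical rules mapping error text to a failure class.
# Order matters: the first matching pattern wins.
RULES = [
    (re.compile(r"deserialization failed", re.IGNORECASE),
     "permanent: schema or payload"),
    (re.compile(r"\bHTTP 5\d\d\b"),
     "transient: downstream outage"),
]


def classify(error: str) -> str:
    """Return a coarse failure class for a DLQ error string."""
    for pattern, label in RULES:
        if pattern.search(error):
            return label
    # Anything unrecognized needs a human look (e.g. a consumer code bug).
    return "unknown: inspect manually"
```

Messages in the transient class are candidates for automatic reprocessing once the downstream service recovers; the permanent class usually needs a producer or consumer fix first.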
Step-by-Step Fix
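Count the DLQ messages per error to identify the dominant failure. With the messages dumped as JSON, pipe them through:
jq -r '.headers.error' | sort | uniq -c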
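The per-error count produced by `jq -r '.headers.error' | sort | uniq -c` can also be sketched in Python, which is handy when the count feeds a script rather than a terminal. This assumes each DLQ message has been dumped as one JSON object per line with the error stored under `headers.error`, as the jq filter implies. The function name and the sample records are hypothetical.

```python
import json
from collections import Counter


def count_errors(lines):
    """Count DLQ messages per value of headers.error.

    `lines` is any iterable of JSON strings, one message per item.
    Messages without the field are counted under "<missing>".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        message = json.loads(line)
        error = message.get("headers", {}).get("error", "<missing>")
        counts[error] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    '{"headers": {"error": "Deserialization failed"}, "body": "a"}',
    '{"headers": {"error": "Deserialization failed"}, "body": "b"}',
    '{"headers": {"error": "HTTP 500"}, "body": "c"}',
    '{"headers": {}, "body": "d"}',
]

for error, n in count_errors(sample).most_common():
    print(f"{n:7d} {error}")
```

If one error accounts for most of the queue, fix that root cause before attempting any bulk reprocessing.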
Prevention
- Implement DLQ message classification to automatically categorize failures by type
- Set up automated DLQ processing pipelines that attempt reprocessing with exponential backoff
- Configure DLQ retention policies to archive old messages to cold storage automatically
- Validate messages against the schema at the producer, so malformed messages are rejected before they are published
- Implement poison pill detection that identifies and isolates permanently unprocessable messages
- Create runbooks for common DLQ failure patterns with one-click remediation procedures
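Two of the measures above, reprocessing with exponential backoff and poison pill isolation, can be combined in one loop. The sketch below is an illustration under stated assumptions, not a broker-specific implementation: `handler` stands for your own processing function, `PermanentError` is a hypothetical exception it raises when retrying cannot help, and the attempt limit and delays are arbitrary defaults.

```python
import time
from dataclasses import dataclass, field


class PermanentError(Exception):
    """Hypothetical signal from the handler that retrying cannot help."""


@dataclass
class ReprocessResult:
    succeeded: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def backoff_delays(base, factor, attempts):
    """Delays between attempts: base, base*factor, base*factor**2, ..."""
    return [base * factor ** i for i in range(attempts - 1)]


def reprocess(messages, handler, max_attempts=4, base=1.0, factor=2.0,
              sleep=time.sleep):
    """Retry each message with exponential backoff; isolate poison pills."""
    result = ReprocessResult()
    delays = backoff_delays(base, factor, max_attempts)
    for msg in messages:
        for attempt in range(max_attempts):
            try:
                handler(msg)
                result.succeeded.append(msg)
                break
            except PermanentError:
                # Poison pill: isolate immediately, no further retries.
                result.quarantined.append(msg)
                break
            except Exception:
                if attempt == max_attempts - 1:
                    # Retries exhausted: treat as unprocessable for now.
                    result.quarantined.append(msg)
                else:
                    sleep(delays[attempt])
    return result
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. In production the quarantined list would typically be written to a separate queue or to cold storage rather than held in memory.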