Introduction

When message consumers repeatedly fail to process certain messages, those messages are routed to a dead letter queue (DLQ). If the root cause affects a large volume of messages -- such as a schema change, corrupted payload, or downstream service outage -- the DLQ can grow faster than the operations team can investigate, consuming storage and masking new failures among thousands of existing ones.

Symptoms

  • DLQ depth grows by thousands of messages per hour while nobody is draining or replaying the queue
  • DLQ storage approaches broker disk limits, triggering high watermark alerts
  • Majority of DLQ messages share the same error type or failure reason
  • Consumer reprocessing attempts from DLQ fail with identical errors
  • Error message: Deserialization failed: unexpected token at position
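The first symptom can be checked numerically. The following is a minimal sketch, assuming you already collect periodic DLQ depth samples as (timestamp, depth) pairs from your broker's metrics. The function names and the 1000 messages/hour threshold are illustrative assumptions, not values taken from any particular broker.

```python
from datetime import datetime, timedelta


def growth_per_hour(samples):
    """Average DLQ growth in messages/hour between the first and last sample.

    samples: list of (timestamp, depth) tuples, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    hours = (t1 - t0).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("samples must span a positive time interval")
    return (d1 - d0) / hours


def is_overflowing(samples, threshold_per_hour=1000):
    """True when growth meets or exceeds the (hypothetical) alert threshold."""
    return growth_per_hour(samples) >= threshold_per_hour


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    samples = [(start, 500), (start + timedelta(minutes=30), 2000)]
    print(growth_per_hour(samples), is_overflowing(samples))
```

A rate computed over a window is usually a better alert signal than absolute depth, because it fires while there is still disk headroom left.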

Common Causes

  • Systematic schema incompatibility causing all new messages to fail deserialization
  • Downstream API returning 500 errors, making all messages temporarily unprocessable
  • Malformed payload from a buggy producer version writing invalid message format
  • Consumer code bug that fails on a specific edge case present in many messages
  • DLQ consumer not running or misconfigured, leaving messages to accumulate
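The causes above split roughly into failures that clear up on their own (a downstream outage) and failures that need a code or data fix (schema or payload problems). A rule-based classifier over the error string is one simple way to tell them apart. This is only a sketch: the category names and regular expressions are hypothetical and should be adapted to the error strings your consumers actually emit.

```python
import re

# Hypothetical rules, checked in order: (category, pattern).
RULES = [
    ("schema_or_payload", re.compile(r"deserializ|unexpected token|schema", re.I)),
    ("downstream_outage", re.compile(r"\b5\d\d\b|timeout|connection refused", re.I)),
]


def classify(error):
    """Return the first matching category for an error string, else 'unclassified'."""
    for category, pattern in RULES:
        if pattern.search(error):
            return category
    return "unclassified"


if __name__ == "__main__":
    print(classify("Deserialization failed: unexpected token at position"))
    print(classify("HTTP 500 from downstream"))
```

Messages in the outage category are candidates for replay once the dependency recovers, while schema or payload failures will keep failing until the producer or consumer is fixed.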

Step-by-Step Fix

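Start by counting DLQ messages per error type to find the dominant failure. Pipe a JSON export of the DLQ messages into:

jq -r '.headers.error' | sort | uniq -c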
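If jq is not available, the same count can be sketched in Python. This assumes the DLQ has been exported as one JSON object per line, each carrying the error under headers.error as in the command above. The function name and the placeholder labels for malformed lines are illustrative.

```python
import json
from collections import Counter


def count_errors(lines):
    """Count DLQ messages by headers.error, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable dump line>"] += 1
            continue
        headers = message.get("headers") if isinstance(message, dict) else None
        error = headers.get("error") if isinstance(headers, dict) else None
        counts[str(error) if error is not None else "<no error header>"] += 1
    return counts.most_common()


if __name__ == "__main__":
    import sys
    # Usage: python count_errors.py < dlq_export.ndjson
    for error, n in count_errors(sys.stdin):
        print(f"{n:7d} {error}")
```

If a single error accounts for most of the queue, fix that root cause before attempting any bulk reprocessing.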

Prevention

  • Implement DLQ message classification to automatically categorize failures by type
  • Set up automated DLQ processing pipelines that attempt reprocessing with exponential backoff
  • Configure DLQ retention policies to archive old messages to cold storage automatically
  • Monitor producer message quality with schema validation at the producer level
  • Implement poison pill detection that identifies and isolates permanently unprocessable messages
  • Create runbooks for common DLQ failure patterns with one-click remediation procedures
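Two of the measures above, reprocessing with exponential backoff and poison pill isolation, can be combined in one retry loop. The sketch below is broker-agnostic and makes several assumptions: handler is your own processing function, PermanentError is a hypothetical exception it raises for messages that can never succeed, and the attempt and delay defaults are placeholders.

```python
import time


class PermanentError(Exception):
    """Hypothetical: raised by the handler when a message can never succeed."""


def reprocess(message, handler, max_attempts=5, base_delay=1.0,
              max_delay=60.0, sleep=time.sleep):
    """Retry handler(message) with exponential backoff.

    Returns "ok" on success, or "quarantine" when the message looks like a
    poison pill (permanent failure, or all attempts exhausted).
    """
    for attempt in range(max_attempts):
        try:
            handler(message)
            return "ok"
        except PermanentError:
            return "quarantine"  # retrying cannot help
        except Exception:
            if attempt == max_attempts - 1:
                break  # out of attempts, do not sleep again
            # Delay doubles each attempt: base, 2*base, 4*base, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return "quarantine"
```

Messages that come back as "quarantine" should go to a separate store rather than back onto the DLQ, so they stop competing with recoverable messages. In production you would likely also add random jitter to the delay so that many retries do not hit a recovering service at the same moment.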