Introduction

Moving messages out of a dead-letter queue (DLQ) is not the same as fixing the original processing problem. If the consumer still cannot handle the payload, the source queue's redrive policy is misconfigured, or the visibility timeout is too short, the replayed messages fail again and land back in the DLQ. The right process is to validate the fix first, then redrive in small batches while watching the results.

Symptoms

  • Replayed DLQ messages fail again almost immediately
  • Messages bounce between the source queue and the DLQ
  • Consumers now reject a message shape or attributes that they accepted earlier
  • FIFO replay behaves unexpectedly because of deduplication or ordering assumptions

Common Causes

  • The original consumer bug or dependency outage is not actually fixed
  • Message format or schema expectations changed since the messages were first produced
  • The source queue visibility timeout is too short for replayed work
  • Redrive configuration or replay logic sends messages back without preserving the right context, such as message attributes or, for FIFO queues, the message group ID

Step-by-Step Fix

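  1. Inspect the DLQ messages before replaying
     Check body, attributes, age, and receive count so you understand what kind of backlog you are about to reintroduce.
```bash
# Sample up to 10 DLQ messages along with their system and custom attributes.
# Received messages stay hidden from other consumers until the DLQ's visibility timeout expires.
aws sqs receive-message \
  --queue-url https://sqs.region.amazonaws.com/account/my-queue-dlq \
  --max-number-of-messages 10 \
  --message-attribute-names All \
  --attribute-names All
```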
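
A raw `receive-message` dump can be hard to scan when the backlog is large. The sketch below condenses a sample into one line per message, showing its receive count and age. It assumes the AWS CLI is installed and configured, relies on the CLI's `--query` and `--output text` options, and passes `--visibility-timeout 0` so the sampled messages become visible again right away. The function name `summarize_dlq_sample` is made up for illustration.

```bash
# Hypothetical helper: print "<message id> receives=<n> age_hours=<h>" for a sample of DLQ messages.
# Assumes the AWS CLI is installed and configured; the queue URL is supplied by the caller.
summarize_dlq_sample() {
  local queue_url="$1"
  local now_ms
  now_ms=$(( $(date +%s) * 1000 ))

  # --visibility-timeout 0 returns the sampled messages to the queue immediately.
  # SentTimestamp is reported in epoch milliseconds.
  aws sqs receive-message \
    --queue-url "$queue_url" \
    --max-number-of-messages 10 \
    --visibility-timeout 0 \
    --attribute-names All \
    --query 'Messages[].[MessageId, Attributes.ApproximateReceiveCount, Attributes.SentTimestamp]' \
    --output text |
    awk -F '\t' -v now="$now_ms" \
      'NF == 3 { printf "%s\treceives=%s\tage_hours=%d\n", $1, $2, (now - $3) / 3600000 }'
}
```

Calling `summarize_dlq_sample <dlq-url>` a few times gives a rough picture of how old the backlog is and how often each message was attempted before it was dead-lettered.
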
  2. Confirm the original consumer path is fixed
     Replaying without validating the consumer fix just recreates the same outage with more load.
  3. Review source queue redrive and visibility timeout settings
     If processing still needs longer than the visibility timeout allows, messages will cycle back into failure even after the application bug is gone.
```bash
# Show the source queue's redrive policy (DLQ target and maxReceiveCount) and its visibility timeout in seconds.
aws sqs get-queue-attributes \
  --queue-url https://sqs.region.amazonaws.com/account/my-queue \
  --attribute-names RedrivePolicy VisibilityTimeout
```
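
To turn the visibility timeout review into a pass/fail gate, the sketch below compares the queue's `VisibilityTimeout` with the slowest processing time you have measured for the replayed work. The helper name `check_visibility_timeout` and the idea of passing in a measured worst-case duration are assumptions for illustration.

```bash
# Hypothetical helper: fail when the queue's visibility timeout does not exceed
# the slowest observed processing time (both values in seconds).
check_visibility_timeout() {
  local queue_url="$1"
  local max_processing_seconds="$2"
  local current

  current=$(aws sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --attribute-names VisibilityTimeout \
    --query 'Attributes.VisibilityTimeout' \
    --output text) || return 2

  if [ "$current" -le "$max_processing_seconds" ]; then
    echo "too short: VisibilityTimeout=${current}s, slowest processing=${max_processing_seconds}s"
    return 1
  fi
  echo "ok: VisibilityTimeout=${current}s"
}
```

If the check fails, the timeout can be raised before redriving with `aws sqs set-queue-attributes --queue-url <queue-url> --attributes VisibilityTimeout=<seconds>`.
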
  4. Replay in controlled batches
     Start with a small message sample, verify successful processing, then increase volume only after the path proves stable.
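
One way to run a small first batch is an SQS message move task, throttled to a low rate and cancelled after a short interval. The sketch below assumes your AWS CLI version includes the `start-message-move-task` and `cancel-message-move-task` commands, and it omits `--destination-arn` so that messages return to their original source queue. The function names and the one-message-per-second sampling approach are illustrative choices, not the only way to batch a replay.

```bash
# Hypothetical helpers around the SQS message move task commands.

# Start a rate-limited redrive from the DLQ and print the task handle.
start_slow_redrive() {
  local dlq_arn="$1"
  local rate="${2:-5}"   # messages per second; 5 is an arbitrary starting point
  aws sqs start-message-move-task \
    --source-arn "$dlq_arn" \
    --max-number-of-messages-per-second "$rate" \
    --query 'TaskHandle' \
    --output text
}

# Cancel a running redrive and print roughly how many messages it moved.
stop_redrive() {
  aws sqs cancel-message-move-task \
    --task-handle "$1" \
    --query 'ApproximateNumberOfMessagesMoved' \
    --output text
}

# Move a small sample: run at 1 message/second for the given number of seconds, then cancel.
redrive_sample() {
  local dlq_arn="$1"
  local seconds="$2"
  local handle
  handle=$(start_slow_redrive "$dlq_arn" 1) || return 1
  sleep "$seconds"
  stop_redrive "$handle"
}
```

For example, `redrive_sample <dlq-arn> 20` moves at most about 20 messages. After confirming that they were processed and did not return to the DLQ, a larger run can be started with `start_slow_redrive <dlq-arn> <rate>`. The cancel call is likely to return an error if the task has already finished, which can happen when the DLQ holds fewer messages than the sample size.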

Prevention

  • Make consumers idempotent so replay is safe
  • Monitor DLQ depth and replay outcomes together
  • Validate schema changes against old queued messages before deploying consumers
  • Use a small-batch redrive as a safety check before any large-scale replay
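
As one way to monitor DLQ depth, the sketch below creates a CloudWatch alarm that fires whenever the DLQ holds any visible messages. The helper name, alarm name, and SNS topic ARN are placeholders, and the 5-minute period and zero threshold are assumptions to adjust for your workload.

```bash
# Hypothetical helper: alarm when the DLQ has at least one visible message.
# Arguments: the DLQ queue name (not its URL) and an SNS topic ARN to notify.
create_dlq_depth_alarm() {
  local dlq_name="$1"
  local sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${dlq_name}-not-empty" \
    --namespace AWS/SQS \
    --metric-name ApproximateNumberOfMessagesVisible \
    --dimensions "Name=QueueName,Value=${dlq_name}" \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$sns_topic_arn"
}
```

During a replay, an alarm like this makes it obvious when redriven messages start returning to the DLQ.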