# Fix AWS SQS Dead Letter Queue Issues
Your dead letter queue (DLQ) is filling up with messages, or messages aren't being moved there when they fail processing. Maybe you can't figure out why certain messages keep ending up in the DLQ. SQS dead letter queues are designed to capture failed messages, but they need proper configuration and monitoring to work effectively.
This guide covers diagnosing DLQ issues, understanding why messages fail, and implementing proper redrive policies.
## Diagnosis Commands
First, identify your queues and their DLQ relationships:
```bash
aws sqs list-queues \
  --query 'QueueUrls'
```

Get details about the main queue:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names All \
  --query 'Attributes.[QueueArn,RedrivePolicy,ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible]'
```

Parse the redrive policy:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names RedrivePolicy \
  --query 'Attributes.RedrivePolicy' \
  --output text | jq .
```

Check the DLQ's message count:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible,QueueArn
```

Look at messages in the DLQ (careful: this might affect visibility):
```bash
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 30
```

Check for redrive policies on the DLQ itself (it should not have one):
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names RedrivePolicy
```

Get CloudWatch metrics for message flow:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/SQS \
  --metric-name NumberOfMessagesSent \
  --dimensions Name=QueueName,Value=my-dlq \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum \
  --output table
```

Check the approximate age of the oldest message:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names ApproximateAgeOfOldestMessage
```

## Common Causes and Solutions
### DLQ Not Configured
Messages fail repeatedly but never move to the DLQ because none is configured.
Check whether a redrive policy exists:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names RedrivePolicy
```

If it's empty, configure the DLQ:
```bash
# First, get the DLQ ARN
DLQ_ARN=$(aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names QueueArn \
  --query 'Attributes.QueueArn' \
  --output text)

# Set the redrive policy (double-quoted so $DLQ_ARN expands)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":5}\"}"
```
The `maxReceiveCount` determines how many times a message can be received before moving to the DLQ. Common values:
- 3-5: For critical messages that need attention
- 10+: For less critical, retry-tolerant workloads
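Because `RedrivePolicy` is itself a JSON-encoded string nested inside the attributes map, hand-escaping it on the CLI is error-prone. A minimal sketch of building it programmatically instead (the ARN is a placeholder; in practice you would pass the result as the `Attributes` parameter of boto3's `set_queue_attributes`):

```python
import json

def build_redrive_attributes(dlq_arn, max_receive_count):
    """Return an SQS attributes map with a correctly encoded RedrivePolicy.

    json.dumps handles the nested-JSON encoding that shell escaping
    often gets wrong.
    """
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    return {"RedrivePolicy": json.dumps(policy)}

# Hypothetical ARN for illustration
attrs = build_redrive_attributes("arn:aws:sqs:us-east-1:123456789012:my-dlq", 5)
print(attrs["RedrivePolicy"])
```

Round-tripping the result through `json.loads` is a quick sanity check that the policy parses the way SQS will read it.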
### Too Many Messages in DLQ
DLQ overflow indicates systemic processing problems:
Get the DLQ size:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names ApproximateNumberOfMessages
```

Examine sample messages to understand failure patterns:
```bash
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --max-number-of-messages 10 \
  --attribute-names All \
  --query 'Messages[*].[MessageId,Body,Attributes]'
```

Check message attributes for processing history:
```bash
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --max-number-of-messages 1 \
  --attribute-names All \
  --query 'Messages[0].Attributes'
```

Key attributes to examine:
- `ApproximateReceiveCount`: number of times the message was received (should exceed `maxReceiveCount`)
- `SentTimestamp`: when the message was originally sent
- `ApproximateFirstReceiveTimestamp`: when the first processing attempt occurred
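A short script can turn the raw `receive-message` output into a quick failure summary. A sketch, assuming `messages` is the `Messages` list as SQS returns it (`now_ms` is injectable only so the example is deterministic; omit it in real use):

```python
from datetime import datetime, timezone

def summarize_dlq_samples(messages, now_ms=None):
    """Summarize sampled DLQ messages: receive count and age in seconds."""
    if now_ms is None:
        now_ms = datetime.now(timezone.utc).timestamp() * 1000
    summary = []
    for m in messages:
        attrs = m.get("Attributes", {})
        summary.append({
            "id": m["MessageId"],
            "receives": int(attrs.get("ApproximateReceiveCount", "0")),
            # SentTimestamp is epoch milliseconds, returned as a string
            "age_seconds": round((now_ms - int(attrs["SentTimestamp"])) / 1000),
        })
    return summary

sample = [{"MessageId": "m-1",
           "Attributes": {"ApproximateReceiveCount": "6",
                          "SentTimestamp": "1700000000000"}}]
print(summarize_dlq_samples(sample, now_ms=1700000060000))
# → [{'id': 'm-1', 'receives': 6, 'age_seconds': 60}]
```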
### Messages Never Move to DLQ
Messages seem to fail but stay in the main queue:
Verify that the DLQ ARN in the redrive policy matches the actual DLQ:
```bash
# Get the configured DLQ ARN
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names RedrivePolicy \
  --query 'Attributes.RedrivePolicy' \
  --output text | jq -r '.deadLetterTargetArn'

# Verify the DLQ exists and get its ARN
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names QueueArn \
  --query 'Attributes.QueueArn'
```
Check whether the consumer is actually deleting messages or just releasing them:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/SQS \
  --metric-name NumberOfMessagesDeleted \
  --dimensions Name=QueueName,Value=my-queue \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum
```

If the delete count is near zero, messages aren't being processed successfully: they're received, processing fails, and they return to the queue when the visibility timeout expires instead of being deleted.
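The fix lives in the consumer: delete only after successful processing, and let failures fall through to the visibility timeout. A self-contained sketch of that contract using an in-memory stand-in for SQS (no real AWS calls; `FakeQueue` and `process` are illustrative):

```python
class FakeQueue:
    """In-memory stand-in for SQS, tracking which messages were deleted."""
    def __init__(self, bodies):
        self.messages = {f"receipt-{i}": body for i, body in enumerate(bodies)}
        self.deleted = []

    def receive(self):
        return list(self.messages.items())

    def delete(self, receipt_handle):
        self.messages.pop(receipt_handle)
        self.deleted.append(receipt_handle)

def process(body):
    if body == "bad":
        raise ValueError("simulated processing failure")

def consume(queue):
    for receipt_handle, body in queue.receive():
        try:
            process(body)
        except Exception:
            # No delete: the message reappears after the visibility timeout
            # and moves to the DLQ once it exceeds maxReceiveCount.
            continue
        # Delete only after successful processing
        queue.delete(receipt_handle)

queue = FakeQueue(["ok", "bad", "ok"])
consume(queue)
print(queue.deleted)         # → ['receipt-0', 'receipt-2']
print(list(queue.messages))  # → ['receipt-1']
```

With a real consumer, the `delete` call is `delete-message` with the receipt handle from `receive-message`; the structure is the same.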
### Incorrect MaxReceiveCount

- Too low: messages move to the DLQ on transient failures
- Too high: messages retry for hours before reaching the DLQ
Adjust based on your workload:
```bash
# For retry-sensitive workloads (quick retries)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":3}"}'

# For retry-tolerant workloads (extended retries)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":10}"}'
```
### Visibility Timeout Issues
Visibility timeout affects how quickly messages can be reprocessed:
```bash
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names VisibilityTimeout
```

If the visibility timeout is too short, messages can be received multiple times by different consumers before processing completes, inflating the receive count.
Set appropriate visibility timeout (should exceed processing time):
```bash
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes VisibilityTimeout=300
```

For Lambda triggers, the visibility timeout should be at least 6 times the Lambda timeout:
```bash
# If the Lambda timeout is 60 seconds
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes VisibilityTimeout=360
```

### DLQ Has Its Own DLQ (Circular)
Never point a DLQ's redrive policy back at its source queue: failed messages would cycle between the two queues indefinitely. In most setups the DLQ should have no redrive policy at all:
```bash
# Check if the DLQ has a redrive policy (should return empty)
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names RedrivePolicy
```

If it has one, remove it:
```bash
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attributes '{"RedrivePolicy":""}'
```

### Message Processing Failures
Investigate why messages end up in DLQ by examining your consumer logs:
For Lambda consumers:
```bash
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-sqs-consumer \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "?ERROR ?Exception ?Failed" \
  --query 'events[*].message'
```

Common failure reasons:
- Invalid message format (JSON parsing errors)
- Missing required fields
- Downstream service unavailable
- Timeout during processing
Fix at the source:
```python
# Example: better message handling in Lambda
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# process_message and ProcessingError are application-specific
def lambda_handler(event, context):
    for record in event['Records']:
        try:
            message = json.loads(record['body'])
            process_message(message)
        except json.JSONDecodeError as e:
            # Log and skip invalid messages instead of failing
            logger.error(f"Invalid JSON in message {record['messageId']}: {e}")
            continue
        except ProcessingError as e:
            # Specific handling for known errors
            logger.error(f"Processing failed for {record['messageId']}: {e}")
            raise  # Let it go to the DLQ after retries
```

### Redrive from DLQ
To retry messages from DLQ, use the redrive capability:
```bash
# Start a redrive task (SQS message-move API)
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789012:my-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789012:my-queue

# Monitor the task
aws sqs list-message-move-tasks \
  --source-arn arn:aws:sqs:us-east-1:123456789012:my-dlq
```
Or manually redrive by receiving from DLQ and sending to main queue:
```bash
# Receive one message from the DLQ
MESSAGE=$(aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --max-number-of-messages 1 \
  --query 'Messages[0]')

MESSAGE_BODY=$(echo "$MESSAGE" | jq -r '.Body')
RECEIPT_HANDLE=$(echo "$MESSAGE" | jq -r '.ReceiptHandle')

# Resend to the main queue
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --message-body "$MESSAGE_BODY"

# Delete from the DLQ so the message isn't redriven twice
aws sqs delete-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --receipt-handle "$RECEIPT_HANDLE"
```
### FIFO Queue DLQ Issues
FIFO queues require FIFO DLQs. Standard DLQs won't work with FIFO queues:
```bash
# Check if the queue is FIFO
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo \
  --attribute-names FifoQueue
```

Create a FIFO DLQ:
```bash
aws sqs create-queue \
  --queue-name my-dlq.fifo \
  --attributes FifoQueue=true
```

Update the redrive policy to point at the FIFO DLQ:
```bash
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq.fifo\",\"maxReceiveCount\":5}"}'
```

## Verification Steps
After configuration changes, verify DLQ behavior:
```bash
# Send a test message that will fail
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --message-body '{"test": true, "shouldFail": true}'

# Check message counts after processing attempts
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attribute-names ApproximateNumberOfMessages

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq \
  --attribute-names ApproximateNumberOfMessages
```
Set up monitoring:
```bash
# Note: the CloudWatch metric is ApproximateNumberOfMessagesVisible;
# the queue attribute of the same name lacks the "Visible" suffix
aws cloudwatch put-metric-alarm \
  --alarm-name sqs-dlq-messages \
  --alarm-description "DLQ has messages requiring attention" \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=my-dlq \
  --statistic Average \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

# Also monitor DLQ message age
aws cloudwatch put-metric-alarm \
  --alarm-name sqs-dlq-old-messages \
  --alarm-description "DLQ messages are aging" \
  --namespace AWS/SQS \
  --metric-name ApproximateAgeOfOldestMessage \
  --dimensions Name=QueueName,Value=my-dlq \
  --statistic Maximum \
  --period 300 \
  --threshold 3600 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```
Create a DLQ monitoring script:
```bash
#!/bin/bash
MAIN_QUEUE="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
DLQ="https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq"

echo "SQS Queue Health Check"
echo "====================="

echo "Main Queue:"
aws sqs get-queue-attributes \
  --queue-url "$MAIN_QUEUE" \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible,VisibilityTimeout,RedrivePolicy \
  --query 'Attributes'

echo ""
echo "Dead Letter Queue:"
aws sqs get-queue-attributes \
  --queue-url "$DLQ" \
  --attribute-names ApproximateNumberOfMessages,ApproximateAgeOfOldestMessage

echo ""
echo "Recent DLQ traffic (messages sent to DLQ in last hour):"
aws cloudwatch get-metric-statistics \
  --namespace AWS/SQS \
  --metric-name NumberOfMessagesSent \
  --dimensions Name=QueueName,Value=my-dlq \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum
```