Introduction
Celery tasks fail with ConnectionRefusedError when the broker (Redis/RabbitMQ) is unreachable, and without proper retry configuration, these failures are permanent. Even with retries enabled, linear retry intervals can overwhelm a recovering broker. The combination of connection refused errors during broker outages and unbounded retries creates thundering herd problems where thousands of tasks retry simultaneously, preventing the broker from recovering. Proper configuration requires exponential backoff with jitter, connection retry settings, and a maximum retry limit that routes permanently failed tasks to a dead letter queue.
Symptoms
```
[2026-04-09 10:00:00,000: ERROR/ForkPoolWorker-3] Task myapp.tasks.process_order[abc123] raised unexpected: ConnectionRefusedError(61, 'Connection refused')
  File "kombu/connection.py", line 275, in connect
    return self._ensure_connection()
```

Or a retry storm:
```
[2026-04-09 10:00:01,000: WARNING] Retrying task in 1 second
[2026-04-09 10:00:02,000: WARNING] Retrying task in 1 second
[2026-04-09 10:00:03,000: WARNING] Retrying task in 1 second
# All tasks retrying at the same time - the broker cannot recover
```

Common Causes
- Broker not running: Redis or RabbitMQ process crashed or not started
- No retry configuration: Tasks fail permanently on connection errors
- Linear retry interval: Fixed delay causes thundering herd on recovery
- Unlimited retries: Tasks retry forever, clogging the queue
- Connection pool exhausted: Celery workers open too many broker connections
- Task result backend unreachable: Same connection issue for storing results
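The first cause above (broker not running) is quick to rule out before touching any retry settings. A minimal reachability probe using only the standard library (`broker_reachable` is a hypothetical helper, not part of Celery; 6379 is Redis's default port):

```python
import socket

def broker_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the broker succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and DNS failures
        return False

# Example: broker_reachable('localhost', 6379) for a local Redis broker
```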
Step-by-Step Fix
Step 1: Configure task retry with exponential backoff
```python
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')

app.conf.update(
    task_default_retry_delay=10,   # Initial delay: 10 seconds
    task_default_max_retries=5,    # Max 5 retries before giving up
    task_acks_late=True,           # Acknowledge after task completes
    worker_prefetch_multiplier=1,  # One task at a time per worker
)

@app.task(bind=True, max_retries=5)
def process_order(self, order_id):
    try:
        return send_to_external_api(order_id)
    except ConnectionError as exc:
        # Exponential backoff: 10s, 20s, 40s, 80s, 160s
        raise self.retry(exc=exc, countdown=2 ** self.request.retries * 10)
    except Exception:
        # Non-retryable errors fail immediately
        raise
```
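The introduction calls for jitter, which the fixed `2 ** retries * 10` countdown above does not provide: every task that failed at the same moment retries at the same moment. A full-jitter variant can be sketched as follows (the `backoff_with_jitter` helper, its `base` of 10 seconds, and its 300-second cap are illustrative choices following the common "full jitter" pattern, not Celery API):

```python
import random

def backoff_with_jitter(retries, base=10, cap=300):
    """Full-jitter exponential backoff: a random delay drawn from
    [0, min(cap, base * 2**retries)] so retries spread out over time."""
    return random.uniform(0, min(cap, base * 2 ** retries))
```

Use it as `countdown=backoff_with_jitter(self.request.retries)` inside the `except ConnectionError` branch. Alternatively, recent Celery versions support `autoretry_for=(ConnectionError,)` together with `retry_backoff=True` and `retry_jitter=True` directly on the task decorator.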
Step 2: Configure broker connection resilience
```python
app.conf.update(
    broker_connection_retry=True,
    broker_connection_retry_on_startup=True,
    broker_connection_max_retries=10,
    broker_connection_retry_interval=5,
    broker_pool_limit=20,   # Connection pool size
    broker_heartbeat=10,    # Detect dead connections
    broker_heartbeat_checkrate=30,
)

# For the Redis broker specifically
app.conf.update(
    redis_retry_on_timeout=True,
    redis_socket_keepalive=True,
    redis_backend_health_check_interval=30,
)
```
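The settings above cover the worker side; producers can also fail to publish when the broker is down. Celery's `apply_async` accepts `retry=True` with a `retry_policy` dict for this. The schedule helper below is a hypothetical illustration that approximates kombu's linear ramp (each wait grows by `interval_step`, capped at `interval_max`); it is not part of Celery:

```python
# Publisher-side retry policy, passed to apply_async(retry=True, retry_policy=...)
RETRY_POLICY = {
    'max_retries': 3,     # Give up publishing after 3 attempts
    'interval_start': 0,  # First retry immediately
    'interval_step': 2,   # Add 2 seconds per attempt
    'interval_max': 10,   # Never wait more than 10 seconds
}

def publish_retry_schedule(policy):
    """Approximate wait (seconds) before each publish retry attempt."""
    return [
        min(policy['interval_start'] + i * policy['interval_step'],
            policy['interval_max'])
        for i in range(policy['max_retries'])
    ]
```

Usage (assuming the `process_order` task from Step 1): `process_order.apply_async(args=[order_id], retry=True, retry_policy=RETRY_POLICY)`.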
Step 3: Route failed tasks to dead letter queue
```python
from kombu import Exchange, Queue

app.conf.task_queues = (
    Queue('default', Exchange('default'), routing_key='default'),
    Queue('dead_letter', Exchange('dead_letter'), routing_key='dead_letter'),
)

@app.task(bind=True, max_retries=5)
def process_order(self, order_id):
    try:
        return send_to_external_api(order_id)
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Route to the dead letter queue instead of retrying again
            self.app.send_task(
                'myapp.tasks.handle_dead_letter',
                args=[order_id, str(exc)],
                queue='dead_letter',
            )
            return {'status': 'moved_to_dead_letter'}
        raise self.retry(exc=exc)
```
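The `myapp.tasks.handle_dead_letter` task referenced above is not shown. Whatever it does with the message (log it, store it, alert on it), it helps to record enough context to replay the task later. A sketch of such a payload (`build_dead_letter_record` and its schema are hypothetical, not Celery API):

```python
from datetime import datetime, timezone

def build_dead_letter_record(order_id, error, retries):
    """Build the payload handed to the dead letter handler (hypothetical schema)."""
    return {
        'order_id': order_id,
        'error': str(error),          # Human-readable failure reason
        'retries': retries,           # How many attempts were made
        'failed_at': datetime.now(timezone.utc).isoformat(),
    }
```

The handler task can then persist this record and increment a metric, so permanently failed orders are visible rather than silently dropped.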
Prevention
- Always set max_retries on tasks that depend on external services
- Use exponential backoff with jitter to prevent thundering herd on recovery
- Configure broker heartbeat to detect and recover from dead connections
- Monitor Celery queue length and retry rates with Flower or Prometheus
- Set task_acks_late=True to prevent message loss on worker crash
- Implement dead letter queue handling for permanently failed tasks
- Test broker failure scenarios in staging by stopping and restarting the broker