Introduction

Celery tasks fail with ConnectionRefusedError when the broker (Redis or RabbitMQ) is unreachable, and without retry configuration these failures are permanent. Even with retries enabled, a fixed retry interval can overwhelm a recovering broker: connection-refused errors during an outage combined with unbounded retries create a thundering herd in which thousands of tasks retry simultaneously, preventing the broker from ever recovering. A robust configuration needs exponential backoff with jitter, broker connection retry settings, and a maximum retry limit that routes permanently failed tasks to a dead letter queue.
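The backoff arithmetic is worth seeing on its own before wiring it into Celery. A minimal sketch in plain Python (the `base`, `cap`, and `jitter` parameters are illustrative choices, not Celery settings):

```python
import random

def retry_delay(attempt, base=10, cap=300, jitter=5):
    """Exponential backoff with jitter: base * 2**attempt, capped,
    plus a random offset so tasks don't all retry at the same instant."""
    delay = min(base * 2 ** attempt, cap)
    return delay + random.uniform(0, jitter)

# Attempts 0..4 back off at roughly 10s, 20s, 40s, 80s, 160s (plus jitter)
schedule = [retry_delay(n) for n in range(5)]
```

The cap keeps long-lived outages from producing hour-long delays, and the random jitter spreads retries out so the broker isn't hit by a synchronized wave when it comes back.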

Symptoms

```bash
[2026-04-09 10:00:00,000: ERROR/ForkPoolWorker-3] Task myapp.tasks.process_order[abc123] raised unexpected: ConnectionRefusedError(61, 'Connection refused')
  File "kombu/connection.py", line 275, in connect
    return self._ensure_connection()
```

Or a retry storm:

```bash
[2026-04-09 10:00:01,000: WARNING] Retrying task in 1 second
[2026-04-09 10:00:02,000: WARNING] Retrying task in 1 second
[2026-04-09 10:00:03,000: WARNING] Retrying task in 1 second
# All tasks retrying at the same time - broker cannot recover
```

Common Causes

  • Broker not running: Redis or RabbitMQ process crashed or not started
  • No retry configuration: Tasks fail permanently on connection errors
  • Linear retry interval: Fixed delay causes thundering herd on recovery
  • Unlimited retries: Tasks retry forever, clogging the queue
  • Connection pool exhausted: Celery workers open too many broker connections
  • Task result backend unreachable: Same connection issue for storing results
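To separate "broker not running" from a Celery misconfiguration, a quick reachability probe helps. A minimal sketch using only the standard library (the host and port values in the comments assume a local default install):

```python
import socket

def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to the broker can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. broker_reachable("localhost", 6379) for Redis,
#      broker_reachable("localhost", 5672) for RabbitMQ
```

If this returns False, the broker itself is down or unreachable, and no amount of Celery-side retry tuning will help until it is restarted; `redis-cli ping` or `rabbitmqctl status` give the same answer from the command line.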

Step-by-Step Fix

Step 1: Configure task retry with exponential backoff

```python
import random

from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')

app.conf.update(
    task_default_retry_delay=10,   # Initial delay: 10 seconds
    task_acks_late=True,           # Acknowledge only after the task completes
    worker_prefetch_multiplier=1,  # One task at a time per worker
)

@app.task(bind=True, max_retries=5)  # Max 5 retries before giving up
def process_order(self, order_id):
    try:
        return send_to_external_api(order_id)
    except ConnectionError as exc:
        # Exponential backoff with jitter: ~10s, 20s, 40s, 80s, 160s
        delay = 2 ** self.request.retries * 10 + random.uniform(0, 5)
        raise self.retry(exc=exc, countdown=delay)
    except Exception:
        # Non-connection errors are not retried; let them fail immediately
        raise
```

Step 2: Configure broker connection resilience

```python
app.conf.update(
    broker_connection_retry=True,
    broker_connection_retry_on_startup=True,  # required on Celery 5.3+
    broker_connection_max_retries=10,
    broker_pool_limit=20,            # Connection pool size
    broker_heartbeat=10,             # Detect dead connections
    broker_heartbeat_checkrate=2.0,  # Check at twice the heartbeat rate
)

# For the Redis broker/result backend specifically
app.conf.update(
    redis_retry_on_timeout=True,
    redis_socket_keepalive=True,
    redis_backend_health_check_interval=30,
)
```

Step 3: Route failed tasks to dead letter queue

```python
from kombu import Exchange, Queue

app.conf.task_queues = (
    Queue('default', Exchange('default'), routing_key='default'),
    Queue('dead_letter', Exchange('dead_letter'), routing_key='dead_letter'),
)

@app.task(bind=True, max_retries=5)
def process_order(self, order_id):
    try:
        return send_to_external_api(order_id)
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Retries exhausted: hand off to the dead letter queue
            self.app.send_task(
                'myapp.tasks.handle_dead_letter',
                args=[order_id, str(exc)],
                queue='dead_letter',
            )
            return {'status': 'moved_to_dead_letter'}
        raise self.retry(exc=exc)
```

Prevention

  • Always set max_retries on tasks that depend on external services
  • Use exponential backoff with jitter to prevent thundering herd on recovery
  • Configure broker heartbeat to detect and recover from dead connections
  • Monitor Celery queue length and retry rates with Flower or Prometheus
  • Set task_acks_late=True to prevent message loss on worker crash
  • Implement dead letter queue handling for permanently failed tasks
  • Test broker failure scenarios in staging by stopping and restarting the broker