Introduction

Gunicorn's master process kills a worker with SIGABRT when the worker fails to notify it -- by updating its heartbeat temp file -- within the configured timeout (default 30 seconds). This typically happens when a worker is stuck on a slow request: a long-running database query, an external API call with no timeout, or a CPU-bound operation that blocks the worker. In production, worker timeouts cause intermittent 502 Bad Gateway errors for users, reduced throughput while workers are recycled, and potential cascading failures when clients retry and the replacement workers receive the same slow requests.
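To reproduce the failure in isolation, a minimal WSGI app that sleeps past the default timeout (the module name and sleep duration are illustrative) will produce exactly the WORKER TIMEOUT log shown under Symptoms:

```python
# slowapp.py - minimal WSGI app that reproduces a worker timeout.
# Run with: gunicorn slowapp:app
import time

def app(environ, start_response):
    # A sync worker is completely blocked here, so it cannot update its
    # heartbeat file; after --timeout seconds the master sends SIGABRT.
    time.sleep(60)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done\n"]
```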

Symptoms

In Gunicorn error logs:

```bash
[2024-03-15 14:23:01 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:456)
[2024-03-15 14:23:01 +0000] [456] [INFO] Worker exiting (pid: 456)
[2024-03-15 14:23:02 +0000] [12] [INFO] Booting worker with pid: 789
```

Nginx access logs show intermittent 502 errors:

```bash
10.0.1.50 - - [15/Mar/2024:14:23:01 +0000] "POST /api/reports/generate HTTP/1.1" 502 166 "-" "python-requests/2.28.0"
```

Gunicorn access log shows the slow request:

```bash
10.0.1.50 - - [15/Mar/2024:14:22:35 +0000] "POST /api/reports/generate HTTP/1.1" 200 45231 "-" "python-requests/2.28.0" 28.543
```

The request took 28.5 seconds -- dangerously close to the 30-second timeout.

Common Causes

  • External API call without a timeout: requests.get() with no timeout parameter can hang indefinitely
  • Slow database query: Unindexed query or large result set processing
  • CPU-bound operation in sync worker: Data processing, image resizing, or CSV generation blocking the worker
  • Deadlock in application code: Database lock or threading deadlock prevents worker from responding to heartbeat
  • Timeout value too low: Default 30-second timeout insufficient for legitimate long-running requests
  • Worker class mismatch: Using sync workers for I/O-bound workloads that should use gevent workers

Step-by-Step Fix

Step 1: Identify the slow request path

Enable request timing to find the problematic endpoint:

```bash
# Start gunicorn with access log format including response time
gunicorn myapp:app \
    --access-logfile - \
    --access-logformat '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'
```

The %(D)s field shows the request time in microseconds. Filter for requests that took 20 seconds or more:

```bash
# Matches times of 20,000,000+ microseconds; the alternation also
# catches requests of 100 s and longer
grep -E '" [0-9]+ [0-9]+ "[^"]*" "[^"]*" ([2-9][0-9]{7}|[0-9]{9,})$' access.log
```
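grep finds individual slow lines; to rank endpoints by their worst observed latency, a short script can aggregate the final field instead (the regex assumes the exact access-log format configured above, with %(D)s microseconds as the last field):

```python
# slow_requests.py - rank endpoints by worst observed latency.
# Assumes the access-log format from Step 1, where the final field
# is the request time in microseconds (%(D)s).
import re
from collections import defaultdict

LINE = re.compile(r'"(?P<method>\S+) (?P<path>\S+) [^"]*" .* (?P<usec>\d+)$')

def slowest_by_path(lines):
    worst = defaultdict(int)
    for line in lines:
        m = LINE.search(line)
        if m:
            key = f"{m.group('method')} {m.group('path')}"
            worst[key] = max(worst[key], int(m.group('usec')))
    # Slowest first, times converted to seconds for readability
    return sorted(((path, usec / 1e6) for path, usec in worst.items()),
                  key=lambda item: -item[1])
```

Feed it the log directly, e.g. `slowest_by_path(open("access.log"))`.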

Step 2: Increase timeout for legitimately slow endpoints

If the endpoint genuinely needs more than 30 seconds:

```bash
gunicorn myapp:app \
    --workers 4 \
    --timeout 120 \
    --graceful-timeout 30 \
    --keep-alive 5
```

The --graceful-timeout 30 gives workers that are shutting down up to 30 seconds to finish their current request. --keep-alive 5 holds idle keep-alive connections open for 5 seconds, reducing connection overhead; note that it only applies to threaded and async worker classes, since sync workers do not support persistent connections.
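The same settings can live in a config file instead of flags, which is easier to review and version-control (the filename is the conventional one; the values mirror the flags above):

```python
# gunicorn.conf.py - equivalent to the command line above.
# Start with: gunicorn myapp:app -c gunicorn.conf.py
workers = 4
timeout = 120          # hard limit before the master aborts a worker
graceful_timeout = 30  # time to finish in-flight requests on shutdown
keepalive = 5          # seconds to hold idle keep-alive connections
```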

Step 3: Add timeouts to all external calls

```python
import requests

# WRONG - no timeout, worker hangs forever
response = requests.get("https://api.external.com/data")

# CORRECT - timeout ensures the worker is freed
response = requests.get(
    "https://api.external.com/data",
    timeout=(3.05, 25),  # 3.05s connect, 25s read
)
```
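For calls to flaky upstreams, a shared Session with a retry-enabled adapter keeps transient failures bounded as well. This is a sketch assuming urllib3 >= 1.26 (where the Retry parameter is named allowed_methods); the retry counts are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # Retry idempotent requests up to 2 times with backoff; combined with
    # a (connect, read) timeout this bounds the total time spent per call.
    retry = Retry(total=2, backoff_factor=0.5,
                  status_forcelist=(502, 503, 504),
                  allowed_methods=frozenset({"GET", "HEAD"}))
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# Usage:
# session = make_session()
# response = session.get("https://api.external.com/data", timeout=(3.05, 25))
```

Because each retry attempt can take the full read timeout, keep the retry total low enough that the worst case still fits inside the Gunicorn timeout.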

For database queries, set statement timeout at the connection level:

```python
from sqlalchemy import event

@event.listens_for(engine, "connect")
def set_statement_timeout(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("SET statement_timeout = '25000'")  # 25 seconds (PostgreSQL)
    cursor.close()
```

Step 4: Offload long-running work to background tasks

```python
from celery import Celery
from flask import Flask, request

celery_app = Celery("tasks", broker="redis://localhost:6379/0")  # broker URL illustrative
app = Flask(__name__)

@celery_app.task
def generate_report(report_id, params):
    # This runs in a Celery worker, not a Gunicorn worker
    report = build_report(params)
    save_report(report_id, report)
    return report_id

# Flask endpoint - returns immediately with 202 Accepted
@app.route("/api/reports/generate", methods=["POST"])
def start_report_generation():
    task = generate_report.delay(request.json["report_id"], request.json["params"])
    return {"task_id": task.id, "status_url": f"/api/reports/status/{task.id}"}, 202
```

Step 5: Use async workers for I/O-bound workloads

```bash
# For I/O-heavy applications with many concurrent connections
gunicorn myapp:app \
    --worker-class gevent \
    --workers 4 \
    --timeout 120 \
    --worker-connections 1000
```

Gevent workers use greenlets to multiplex many concurrent connections within a single OS thread per worker process. Gunicorn's gevent worker monkey-patches the standard library so that blocking I/O calls yield to other greenlets; this helps I/O-bound workloads but does nothing for CPU-bound ones.

Prevention

  • Set --timeout to slightly more than your slowest expected request
  • Add application-level request timeout middleware that logs warnings at 80% of the Gunicorn timeout
  • Use APM tools like New Relic or Datadog to track p95 and p99 request latencies
  • Configure alerting on Gunicorn worker restart rate
  • Use --max-requests 1000 (paired with --max-requests-jitter to avoid all workers restarting at once) to recycle workers periodically and prevent memory leaks
  • Always set explicit timeouts on external service calls -- never rely on OS-level TCP timeout defaults
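The warning middleware suggested above can be a plain WSGI wrapper. A sketch, assuming the 120-second Gunicorn timeout from Step 2 (class and logger names are illustrative):

```python
import logging
import time

logger = logging.getLogger("slow_requests")

class SlowRequestWarningMiddleware:
    """Log a warning when a request consumes more than a fraction
    (default 80%) of the Gunicorn timeout budget."""

    def __init__(self, app, gunicorn_timeout=120, fraction=0.8):
        self.app = app
        self.timeout = gunicorn_timeout
        self.threshold = gunicorn_timeout * fraction

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > self.threshold:
                # Only fires for requests that finished; a worker killed
                # by SIGABRT never reaches this line.
                logger.warning(
                    "%s %s took %.1fs (%.0f%% of the %ds worker timeout)",
                    environ.get("REQUEST_METHOD"), environ.get("PATH_INFO"),
                    elapsed, 100 * elapsed / self.timeout, self.timeout,
                )
```

Wrap the WSGI app once at import time, e.g. `app = SlowRequestWarningMiddleware(app, gunicorn_timeout=120)`, keeping the timeout value in sync with the --timeout flag.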