Introduction
Gunicorn's master process kills a worker with a SIGABRT signal when the worker fails to update its heartbeat within the configured timeout period (30 seconds by default). This typically happens when a worker is stuck processing a slow request -- a long-running database query, an external API call with no timeout, or a CPU-bound operation blocking the worker. In production, worker timeouts cause intermittent 502 Bad Gateway errors for users, reduced throughput as workers are recycled, and potential cascading failures when the replacement workers inherit the same slow workload.
Symptoms
In Gunicorn error logs:
[2024-03-15 14:23:01 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:456)
[2024-03-15 14:23:01 +0000] [456] [INFO] Worker exiting (pid: 456)
[2024-03-15 14:23:02 +0000] [12] [INFO] Booting worker with pid: 789
Nginx access logs show intermittent 502 errors:
10.0.1.50 - - [15/Mar/2024:14:23:01 +0000] "POST /api/reports/generate HTTP/1.1" 502 166 "-" "python-requests/2.28.0"
Gunicorn access log shows the slow request:
10.0.1.50 - - [15/Mar/2024:14:22:35 +0000] "POST /api/reports/generate HTTP/1.1" 200 45231 "-" "python-requests/2.28.0" 28.543
The request took 28.5 seconds -- dangerously close to the 30-second timeout.
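To gauge how widespread the problem is, the 502s can be aggregated per endpoint straight from the Nginx access log. A sketch assuming the default combined log format (field 9 is the status code, field 7 the request path -- adjust if your format differs):

```shell
# Count 502 responses per request path, most-affected endpoints first
awk '$9 == 502 { print $7 }' access.log | sort | uniq -c | sort -rn | head
```

If one endpoint dominates the output, start the investigation there.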
Common Causes
- External API call without timeout: requests.get() without a timeout parameter hangs indefinitely
- Slow database query: unindexed query or large result set processing
- CPU-bound operation in sync worker: Data processing, image resizing, or CSV generation blocking the worker
- Deadlock in application code: Database lock or threading deadlock prevents worker from responding to heartbeat
- Timeout value too low: Default 30-second timeout insufficient for legitimate long-running requests
- Worker class mismatch: Using sync workers for I/O-bound workloads that should use gevent workers
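Several of these causes can be reproduced locally with a deliberately slow endpoint. A minimal sketch (the module and route names are hypothetical) of a WSGI app that blocks a sync worker past the default timeout:

```python
import time

def app(environ, start_response):
    # A request to /slow blocks this sync worker for 35 seconds --
    # longer than the default 30s timeout, so the master logs
    # WORKER TIMEOUT and replaces the worker mid-request.
    if environ.get("PATH_INFO") == "/slow":
        time.sleep(35)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]
```

Run it with gunicorn repro:app, then curl localhost:8000/slow: the client sees the connection drop and the master logs WORKER TIMEOUT.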
Step-by-Step Fix
Step 1: Identify the slow request path
Enable request timing to find the problematic endpoint:
# Start gunicorn with access log format including response time
gunicorn myapp:app \
--access-logfile - \
--access-logformat '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'
The %(D)s field shows request time in microseconds. Filter for requests over 20 seconds:
grep -E '" [0-9]+ [0-9]+ "[^"]*" "[^"]*" [2-9][0-9]{7}' access.log
Step 2: Increase timeout for legitimately slow endpoints
If the endpoint genuinely needs more than 30 seconds:
gunicorn myapp:app \
--workers 4 \
--timeout 120 \
--graceful-timeout 30 \
  --keep-alive 5
The --graceful-timeout 30 gives workers that are being restarted up to 30 seconds to finish their in-flight request before being force-killed. --keep-alive 5 holds idle connections open for 5 seconds, reducing connection overhead.
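The same settings can live in a config file instead of the command line, which keeps them under version control. A sketch of a gunicorn.conf.py (the worker count of 4 is illustrative):

```python
# gunicorn.conf.py -- load with: gunicorn -c gunicorn.conf.py myapp:app
workers = 4
timeout = 120          # kill a worker after 120s without a heartbeat
graceful_timeout = 30  # grace period for in-flight requests on restart
keepalive = 5          # seconds to hold idle keep-alive connections
```

Note the setting names differ slightly from the CLI flags (keepalive, not --keep-alive).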
Step 3: Add timeouts to all external calls
```python
import requests

# WRONG - no timeout, worker hangs forever
response = requests.get("https://api.external.com/data")

# CORRECT - timeout ensures the worker is freed
response = requests.get(
    "https://api.external.com/data",
    timeout=(3.05, 25),  # 3.05s connect, 25s read
)
```
For database queries, set statement timeout at the connection level:
```python
from sqlalchemy import event

# `engine` is your existing SQLAlchemy engine
@event.listens_for(engine, "connect")
def set_statement_timeout(dbapi_connection, connection_record):
    # PostgreSQL: abort any statement running longer than 25 seconds
    cursor = dbapi_connection.cursor()
    cursor.execute("SET statement_timeout = '25000'")  # milliseconds
    cursor.close()
```
Step 4: Offload long-running work to background tasks
```python
from celery import Celery
from flask import Flask, request

app = Flask(__name__)
celery = Celery(__name__, broker="redis://localhost:6379/0")  # example broker URL

@celery.task
def generate_report(report_id, params):
    # This runs in a Celery worker, not a Gunicorn worker
    report = build_report(params)
    save_report(report_id, report)
    return report_id

# Flask endpoint - returns immediately
@app.route("/api/reports/generate", methods=["POST"])
def start_report_generation():
    task = generate_report.delay(request.json["report_id"], request.json["params"])
    return {"task_id": task.id, "status_url": f"/api/reports/status/{task.id}"}, 202
```
Step 5: Use async workers for I/O-bound workloads
# For I/O-heavy applications with many concurrent connections
gunicorn myapp:app \
--worker-class gevent \
--workers 4 \
--timeout 120 \
  --worker-connections 1000
Gevent workers use greenlets (cooperative coroutines) to handle many concurrent connections within a single OS thread per worker process. Install the extra dependency first with pip install "gunicorn[gevent]".
Prevention
- Set --timeout to slightly more than your slowest expected request
- Add application-level request timeout middleware that logs warnings at 80% of the Gunicorn timeout
- Use APM tools like New Relic or Datadog to track p95 and p99 request latencies
- Configure alerting on Gunicorn worker restart rate
- Use --max-requests 1000 to recycle workers periodically and prevent memory leaks
- Always set explicit timeouts on external service calls -- never rely on OS-level TCP timeout defaults
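The middleware suggestion above can be sketched as a small WSGI wrapper (names are hypothetical; the 120-second value must be kept in sync with --timeout):

```python
import logging
import time

logger = logging.getLogger("slow_requests")

GUNICORN_TIMEOUT = 120                   # keep in sync with --timeout
WARN_THRESHOLD = 0.8 * GUNICORN_TIMEOUT  # warn at 80% of the timeout

class SlowRequestMiddleware:
    """Logs a warning for any request that uses more than 80% of the
    Gunicorn timeout budget, so near-timeouts surface before they 502."""

    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.wsgi_app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > WARN_THRESHOLD:
                logger.warning(
                    "Slow request: %s %s took %.1fs (timeout is %ds)",
                    environ.get("REQUEST_METHOD"),
                    environ.get("PATH_INFO"),
                    elapsed,
                    GUNICORN_TIMEOUT,
                )
```

With Flask, wrap the inner WSGI callable: app.wsgi_app = SlowRequestMiddleware(app.wsgi_app).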