Introduction
A load balancer marks backend instances as healthy based on health check responses (typically HTTP 200 from a /health endpoint). However, if the health check endpoint returns 200 while the actual application logic is broken, the load balancer continues sending traffic to failing instances. This creates a situation where monitoring shows "all healthy" but users experience errors. This is a subtle and dangerous failure mode.
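The failure mode above can be sketched in a few lines. This is a minimal illustration (the `Database` class and endpoint functions are hypothetical stand-ins, not the document's application): the shallow health check never touches the dependency the real request path needs, so it keeps returning 200 after the database goes down.

```python
class Database:
    """Hypothetical dependency that can be toggled down to simulate an outage."""
    def __init__(self):
        self.up = True

    def query(self, sql):
        if not self.up:
            raise ConnectionError("database unreachable")
        return [{"id": 1}]

db = Database()

def shallow_health():
    # Passes as long as the process can respond -- never touches the DB
    return 200

def api_data():
    try:
        db.query("SELECT * FROM data")
        return 200
    except ConnectionError:
        return 500

db.up = False            # simulate the broken dependency
print(shallow_health())  # 200 -- load balancer keeps routing traffic here
print(api_data())        # 500 -- users see errors
```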
Symptoms
- Load balancer dashboard shows all targets as "healthy"
- Users receive 500 errors from the application
- Health check endpoint returns 200 OK:

  ```bash
  curl http://backend-server:8080/health
  {"status": "ok"}
  ```

- But application endpoints return errors:

  ```bash
  curl http://backend-server:8080/api/data
  {"error": "Internal Server Error"}
  ```

- Error rate spikes while health check success rate remains at 100%
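The symptom pattern above can be checked mechanically: sample recent status codes from both endpoints and flag the mismatch. A small sketch (the helper and its inputs are illustrative, not part of any monitoring stack):

```python
def mismatch(health_statuses, api_statuses):
    """True when the symptom above is present: every recent health check
    returned 200 while the application endpoint returned 5xx errors."""
    health_all_ok = all(s == 200 for s in health_statuses)
    api_5xx = any(s >= 500 for s in api_statuses)
    return health_all_ok and api_5xx

print(mismatch([200] * 10, [200, 500, 500, 200]))  # True: the silent failure mode
print(mismatch([200, 503], [500, 500]))            # False: the LB already sees it
```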
Common Causes
- Health check endpoint does not verify critical dependencies (database, cache, queue)
- Health check runs in a separate process/thread from the application
- Application has exhausted its connection pool but health endpoint uses a direct connection
- Health check cached response does not reflect current state
- Application deadlock affecting request handlers but not the health endpoint
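The connection-pool cause in the list above is worth a concrete sketch. In this toy model (the pool and handlers are hypothetical), request handlers leak connections from a bounded pool until it is exhausted, while the health endpoint never touches the pool at all and so stays green:

```python
import queue

class ConnectionPool:
    """Toy bounded pool; real pools behave similarly when connections leak."""
    def __init__(self, size):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")

    def acquire(self, timeout=0.05):
        # Raises queue.Empty once all connections are checked out
        return self._conns.get(timeout=timeout)

pool = ConnectionPool(size=2)

def handle_request():
    try:
        pool.acquire()  # leaked deliberately: never returned to the pool
        return 200
    except queue.Empty:
        return 500

def health_check():
    return 200  # uses a "direct connection"; never touches the pool

results = [handle_request() for _ in range(3)]
print(results)         # [200, 200, 500] -- third request starves
print(health_check())  # 200 -- still green
```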
Step-by-Step Fix
1. Implement a deep health check that verifies all dependencies:

```python
import shutil

# /health endpoint - checks all critical dependencies
@app.route('/health')
def health_check():
    checks = {}
    healthy = True

    # Check database
    try:
        db.execute('SELECT 1')
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'error: {str(e)}'
        healthy = False

    # Check Redis/cache
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'error: {str(e)}'
        healthy = False

    # Check disk space
    total, used, free = shutil.disk_usage('/')
    if free / total < 0.05:  # Less than 5% free
        checks['disk'] = f'critical: {free/total*100:.1f}% free'
        healthy = False
    else:
        checks['disk'] = f'ok: {free/total*100:.1f}% free'

    # Check connection pool (SQLAlchemy QueuePool: overflow() > 0 means all
    # base connections are in use and overflow connections are being opened)
    if engine.pool.overflow() > 0:
        checks['connection_pool'] = 'warning: pool under pressure'
        healthy = False
    else:
        checks['connection_pool'] = 'ok'

    status_code = 200 if healthy else 503
    return jsonify(checks), status_code
```
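The aggregation logic in the endpoint above can be exercised in isolation with stubbed dependencies. A self-contained sketch (the checker names and functions are invented for the demo) showing the 200/503 switch:

```python
def run_health_checks(checkers):
    """Run named check callables; return (detail dict, HTTP status code).
    Any check that raises marks the whole endpoint unhealthy."""
    checks, healthy = {}, True
    for name, fn in checkers.items():
        try:
            fn()
            checks[name] = 'ok'
        except Exception as e:
            checks[name] = f'error: {e}'
            healthy = False
    return checks, (200 if healthy else 503)

def ok():
    pass

def broken_redis():
    raise ConnectionError("redis unreachable")

print(run_health_checks({'database': ok, 'redis': ok})[1])            # 200
print(run_health_checks({'database': ok, 'redis': broken_redis})[1])  # 503
```

Returning the per-dependency detail dict alongside the status code makes the 503 actionable: the on-call engineer sees *which* dependency failed, not just that something did.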
2. Configure the load balancer to use the deep health check:

AWS ALB:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456:targetgroup/my-tg/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200
```

Nginx upstream health check (the `health_check` directive is Nginx Plus only; it belongs in the `location` that proxies to the upstream, and active checks require a shared-memory `zone` on the upstream):

```nginx
upstream backend {
    zone backend 64k;
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

server {
    location / {
        proxy_pass http://backend;

        # Active health check (Nginx Plus)
        health_check uri=/health interval=10 fails=3 passes=2;
    }
}
```
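The ALB settings above imply a worst-case detection window worth keeping in mind: with a 10-second interval and an unhealthy threshold of 3, a broken target can keep receiving traffic for roughly interval × threshold (plus up to one timeout) before it is marked unhealthy. Back-of-envelope:

```python
interval_s = 10          # --health-check-interval-seconds
timeout_s = 5            # --health-check-timeout-seconds
unhealthy_threshold = 3  # --unhealthy-threshold-count

# Worst case: failure begins just after a successful probe, then three
# consecutive probes must fail (the last one may hang until its timeout).
worst_case_s = interval_s * unhealthy_threshold + timeout_s
print(worst_case_s)  # 35 seconds of traffic to a broken target, worst case
```

If 35 seconds of errors is unacceptable, tighten the interval or threshold, but weigh that against flapping on transient blips.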
3. Add a separate readiness probe (Kubernetes):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz  # Separate simple endpoint for liveness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```
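The split above matters because the two probes trigger different remedies: a failing liveness probe restarts the pod, while a failing readiness probe only removes it from Service endpoints. A minimal sketch of the two endpoints the probes assume (function bodies are illustrative):

```python
def healthz():
    # Liveness: "is the process alive?" -- deliberately no dependency
    # checks, so a down database never triggers a pointless restart loop
    return 200

def health(dependencies_ok):
    # Readiness: "can this pod serve traffic right now?" -- 503 pulls the
    # pod out of rotation until dependencies recover
    return 200 if dependencies_ok else 503

print(healthz())      # 200 even during a dependency outage
print(health(False))  # 503 -> pod removed from endpoints, not restarted
```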
4. Monitor the discrepancy between health check success rate and application error rate. A Prometheus alerting rule can catch the mismatch:

```yaml
- alert: HealthCheckVsErrorRateMismatch
  expr: |
    rate(http_requests_total{status=~"5.."}[5m]) > 0
    and on()
    probe_success{job="health-check"} == 1
  # on() is needed because the two series carry different label sets
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Health check passes but application returns 5xx errors"
```
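The alert condition above, stripped of PromQL, is a simple conjunction: fire only when the application is erroring *and* the external probe still reports healthy. In plain Python (the function and sample values are illustrative):

```python
def should_alert(rate_5xx_per_s, probe_success):
    """Mirror of the alert expression: 5xx rate positive AND probe passing."""
    return rate_5xx_per_s > 0 and probe_success == 1

print(should_alert(0.0, 1))  # False: healthy, nothing to do
print(should_alert(2.5, 1))  # True: the silent failure mode this page covers
print(should_alert(2.5, 0))  # False: the LB already sees the failure
```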
5. Implement a circuit breaker at the application level:

```python
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=30)
def call_database(query):
    return db.execute(query)

# When the circuit is open, the health check returns 503
@app.route('/health')
def health():
    try:
        call_database('SELECT 1')
        return {'status': 'ok'}, 200
    except CircuitBreakerError:
        return {'status': 'circuit open'}, 503
```
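To make the behavior of the snippet above concrete without the `circuitbreaker` package, here is a minimal, self-contained sketch of the same pattern (`MiniBreaker` and its exception class are invented for the demo, not the library's API): after a run of consecutive failures the circuit opens and further calls fail fast until a recovery timeout elapses.

```python
import time

class CircuitOpenError(Exception):
    pass

class MiniBreaker:
    """Toy circuit breaker: opens after `failure_threshold` consecutive
    failures; fails fast until `recovery_timeout` seconds have passed,
    then allows one half-open trial call."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = MiniBreaker(failure_threshold=2, recovery_timeout=30.0)

def flaky_db(query):
    raise ConnectionError("database down")

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky_db, 'SELECT 1')
    except ConnectionError:
        pass

try:
    breaker.call(flaky_db, 'SELECT 1')
except CircuitOpenError as e:
    print(e)  # circuit open: failing fast
```

The fail-fast behavior is what ties the breaker back to health checks: once open, the dependency call raises immediately instead of hanging, so the /health endpoint can report 503 quickly rather than burning its own timeout.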
Prevention
- Health checks must verify all critical dependencies (database, cache, queue, disk)
- Use separate endpoints: /healthz for liveness, /health for readiness
- Set health check interval to 10 seconds or less
- Configure load balancer to mark unhealthy after 2-3 consecutive failures
- Monitor application error rate independently from health check status
- Alert on the delta between health check success and application error rate
- Use chaos engineering to test health check accuracy by intentionally breaking dependencies
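The chaos-engineering bullet above can be made executable: break a dependency on purpose and assert that the health endpoint actually flips to 503. A sketch of such a test (the check functions are stand-ins; in a real experiment you would kill the actual dependency and hit the real endpoint); a shallow check fails this test, a deep one passes:

```python
def shallow_health(checks):
    return 200  # ignores dependencies entirely

def deep_health(checks):
    for check in checks.values():
        try:
            check()
        except Exception:
            return 503
    return 200

def db_ok():
    pass

def db_broken():
    raise ConnectionError("chaos experiment: db connections dropped")

print(deep_health({'database': db_ok}))         # 200 before the experiment
print(shallow_health({'database': db_broken}))  # 200 -- check lies: FAIL
print(deep_health({'database': db_broken}))     # 503 -- check notices: PASS
```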