Introduction When load balancer health checks incorrectly mark healthy backends as down, traffic is unnecessarily removed from those backends. This reduces capacity and can cause cascading failures.
Symptoms - Backend flapping between healthy and unhealthy - Healthy backend serving traffic then suddenly removed - Health check logs showing intermittent timeouts - Backend marked unhealthy during deployment rollout - Only some health check intervals failing
Common Causes - Health check timeout too short for backend response - Check interval too frequent causing backend overload - Health check endpoint depends on external services - Backend garbage collection causing pause - Health check during backend restart or deploy
Step-by-Step Fix 1. **Check current health check configuration': ```bash # AWS ALB aws elbv2 describe-target-groups --query "TargetGroups[*].{Name:TargetGroupName,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,Threshold:HealthyThresholdCount}" # HAProxy grep -E "inter|fall|rise" /etc/haproxy/haproxy.cfg ```
- 1.**Adjust health check parameters':
- 2.```yaml
- 3.# Recommended settings
- 4.interval: 30s # Not too frequent
- 5.timeout: 10s # Allow for slow responses
- 6.healthy_threshold: 2 # Not too many
- 7.unhealthy_threshold: 3 # Not too sensitive
- 8.
` - 9.**Use a lightweight health check endpoint':
- 10.```python
- 11.@app.route('/health')
- 12.def health():
- 13.return {"status": "ok"}, 200 # No DB checks, no external calls
- 14.
`