Introduction When load balancer health checks incorrectly mark healthy backends as down, traffic is unnecessarily removed from those backends. This reduces capacity and can cause cascading failures.

Symptoms - Backend flapping between healthy and unhealthy - Healthy backend serving traffic then suddenly removed - Health check logs showing intermittent timeouts - Backend marked unhealthy during deployment rollout - Only some health check intervals failing

Common Causes - Health check timeout too short for backend response - Check interval too frequent causing backend overload - Health check endpoint depends on external services - Backend garbage collection causing pause - Health check during backend restart or deploy

Step-by-Step Fix 1. **Check current health check configuration': ```bash # AWS ALB aws elbv2 describe-target-groups --query "TargetGroups[*].{Name:TargetGroupName,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,Threshold:HealthyThresholdCount}" # HAProxy grep -E "inter|fall|rise" /etc/haproxy/haproxy.cfg ```

  1. 1.**Adjust health check parameters':
  2. 2.```yaml
  3. 3.# Recommended settings
  4. 4.interval: 30s # Not too frequent
  5. 5.timeout: 10s # Allow for slow responses
  6. 6.healthy_threshold: 2 # Not too many
  7. 7.unhealthy_threshold: 3 # Not too sensitive
  8. 8.`
  9. 9.**Use a lightweight health check endpoint':
  10. 10.```python
  11. 11.@app.route('/health')
  12. 12.def health():
  13. 13.return {"status": "ok"}, 200 # No DB checks, no external calls
  14. 14.`

Prevention - Use separate readiness and liveness endpoints - Set health check timeout to 3x p99 response time - Monitor health check failure rate per backend - Implement graceful shutdown with deregistration delay - Use startup probes for slow-starting backends