Introduction

Load balancers use health check endpoints to determine whether backend instances should receive traffic. A superficial health check (e.g., just returning HTTP 200) may pass even when the backend is degraded -- unable to process requests, database connections exhausted, or dependent services unreachable. This causes the load balancer to continue routing traffic to unhealthy backends, resulting in client-facing errors.
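To make the anti-pattern concrete, here is a minimal sketch (illustrative only, not taken from any particular service) of a superficial health check that passes regardless of dependency state:

```python
# Anti-pattern sketch: a health check that only proves the process is up.
def superficial_health_check():
    # No dependency verification at all -- the mere fact that this code
    # runs is the only signal, so it returns 200 even when the database
    # or downstream services are broken.
    return 200, {"status": "ok"}

status, body = superficial_health_check()
print(status, body)  # 200 {'status': 'ok'}
```

Because this endpoint can only fail if the process itself is down, the load balancer keeps routing traffic to a backend whose request path is already erroring.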

Symptoms

  • Load balancer shows all backends as healthy but clients receive 500 errors
  • Health check endpoint returns 200 OK while the main API returns errors
  • Gradual increase in error rate as more backends degrade
  • Removing and re-adding a backend to the pool temporarily improves performance
  • Error message: No load balancer error -- clients see 500/502 from the degraded backend

Common Causes

  • Health check endpoint only verifies the process is running, not service functionality
  • Health check not testing critical dependencies (database, cache, message queue)
  • Health check interval too long, not detecting degradation quickly enough
  • Health check threshold (unhealthy count) too high, tolerating too many failures
  • Backend degrading slowly (memory leak, connection pool exhaustion), staying under the health check's radar
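The interval and threshold causes above compound: a backend keeps receiving traffic for roughly (check interval) × (consecutive failures required) before it is removed. A quick sketch of that arithmetic (the numbers below are illustrative, not any load balancer's defaults):

```python
# Worst-case time to mark a backend unhealthy: roughly the check
# interval times the consecutive-failure threshold, plus one per-check
# timeout for the final probe to give up.
def worst_case_detection_seconds(interval_s, unhealthy_threshold, timeout_s):
    return interval_s * unhealthy_threshold + timeout_s

# A 30 s interval with a threshold of 5 tolerates ~2.5 minutes of errors:
print(worst_case_detection_seconds(30, 5, 5))   # 155.0 seconds
# Tightening to a 10 s interval and threshold of 2 cuts that sharply:
print(worst_case_detection_seconds(10, 2, 5))   # 25.0 seconds
```

This is why the Prevention section below recommends short intervals: the detection window scales linearly with both knobs.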

Step-by-Step Fix

  1. Check the current health check configuration: See what is being verified.

```bash
# AWS ALB example
aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN
# Check health check path, interval, and thresholds

# Check what the health endpoint actually does
curl -v https://backend.example.com/health
```

  2. Implement a comprehensive health check endpoint: Verify all dependencies.

```python
# Python FastAPI health check
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# db and redis are assumed to be the application's async database and
# Redis clients, initialized at startup.

@app.get("/health")
async def health_check():
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "healthy"
    except Exception as e:
        checks["database"] = f"unhealthy: {str(e)}"

    # Redis cache
    try:
        await redis.ping()
        checks["cache"] = "healthy"
    except Exception as e:
        checks["cache"] = f"unhealthy: {str(e)}"

    # Return 200 only if all critical dependencies are healthy
    status = all(v == "healthy" for v in checks.values())
    return JSONResponse(
        status_code=200 if status else 503,
        content={"status": "healthy" if status else "degraded",
                 "checks": checks},
    )
```
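The endpoint above fails the check when any dependency is unhealthy. If some dependencies are non-critical, the aggregation can be split so only critical failures return 503. A sketch of that logic (the critical/non-critical split here is illustrative):

```python
# Only *critical* dependency failures should flip the health check to
# 503; non-critical ones are reported in the body but do not fail it.
CRITICAL = {"database", "cache"}  # illustrative choice of critical deps

def aggregate(checks):
    critical_ok = all(v == "healthy"
                      for k, v in checks.items() if k in CRITICAL)
    return 200 if critical_ok else 503

# A non-critical metrics sink failing does not pull the backend out:
print(aggregate({"database": "healthy", "cache": "healthy",
                 "metrics": "unhealthy: timeout"}))        # 200
# A critical database failure does:
print(aggregate({"database": "unhealthy: refused",
                 "cache": "healthy"}))                     # 503
```

This keeps the backend in rotation during partial degradation while still surfacing the failing dependency in the response body.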


Prevention

  • Implement health checks that verify all critical dependencies, not just process liveness
  • Use separate readiness and liveness probes -- readiness for load balancer, liveness for restart
  • Set health check intervals to 10 seconds or less for rapid degradation detection
  • Monitor the gap between health check status and actual error rates
  • Implement graceful degradation that returns 503 from health checks when non-critical dependencies fail
  • Test health check behavior by simulating dependency failures in staging environments
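The readiness/liveness split recommended above can be sketched as two separate endpoints with different contracts (the function names are illustrative):

```python
# Liveness answers "is the process alive?" -- an orchestrator uses a
# failing liveness check to restart the process. Readiness answers
# "can it serve traffic?" -- the load balancer uses it for routing.
def liveness():
    # The process running and answering is all liveness asserts.
    return 200

def readiness(dependency_checks):
    # Only advertise readiness when every critical dependency passes,
    # so the load balancer stops routing without triggering a restart.
    return 200 if all(dependency_checks.values()) else 503

print(liveness())                                     # 200
print(readiness({"database": True, "cache": False}))  # 503
```

Keeping the two separate avoids restart loops: a backend whose database is down is not ready, but restarting the process will not fix the database.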