Introduction
Load balancers use health check endpoints to determine whether backend instances should receive traffic. A superficial health check (e.g., just returning HTTP 200) may pass even when the backend is degraded -- unable to process requests, database connections exhausted, or dependent services unreachable. This causes the load balancer to continue routing traffic to unhealthy backends, resulting in client-facing errors.
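The contrast can be sketched in a few lines of plain Python (hypothetical functions, no framework assumed): a superficial check answers 200 unconditionally, while a dependency-aware check returns 503 the moment a critical dependency fails.

```python
# Hypothetical sketch: why a superficial health check passes while the service fails.
def superficial_health() -> int:
    # Returns 200 as long as the process can answer at all --
    # says nothing about the database or downstream services.
    return 200

def deep_health(db_ok: bool, cache_ok: bool) -> int:
    # Returns 200 only when every critical dependency responds.
    return 200 if db_ok and cache_ok else 503

# A backend whose database connection pool is exhausted:
print(superficial_health())                      # 200 -> LB keeps routing traffic
print(deep_health(db_ok=False, cache_ok=True))   # 503 -> LB drains the backend
```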
Symptoms
- Load balancer shows all backends as healthy but clients receive 500 errors
- Health check endpoint returns 200 OK while the main API returns errors
- Gradual increase in error rate as more backends degrade
- Removing and re-adding a backend to the pool temporarily improves performance
- Error message: none from the load balancer -- clients see 500/502 responses from the degraded backend
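The first symptom can be quantified by comparing what the health endpoint reports against the status codes real clients receive. A hypothetical helper (name and 5% threshold are illustrative) that flags the mismatch:

```python
# Sketch: flag the "healthy in the LB, failing for clients" gap.
def health_gap(health_status: int, api_statuses: list[int]) -> dict:
    errors = sum(1 for s in api_statuses if s >= 500)
    error_rate = errors / len(api_statuses) if api_statuses else 0.0
    return {
        "health_reports_healthy": health_status == 200,
        "error_rate": error_rate,
        # The defining symptom: health says OK while clients see 5xx
        "gap": health_status == 200 and error_rate > 0.05,
    }

print(health_gap(200, [200, 500, 502, 200]))  # gap is True -> health check is lying
```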
Common Causes
- Health check endpoint only verifies the process is running, not service functionality
- Health check not testing critical dependencies (database, cache, message queue)
- Health check interval too long, not detecting degradation quickly enough
- Health check threshold (unhealthy count) too high, tolerating too many failures
- Backend degrading slowly (memory leak, connection pool exhaustion) below the health check radar
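The interval and threshold causes compound: a failing backend keeps receiving traffic for roughly interval × unhealthy-threshold seconds before the load balancer reacts. A quick calculation with illustrative values:

```python
def worst_case_detection_seconds(interval_s: float, unhealthy_threshold: int) -> float:
    # The LB must observe `unhealthy_threshold` consecutive failed probes,
    # one per `interval_s`, before it stops routing to the backend.
    return interval_s * unhealthy_threshold

print(worst_case_detection_seconds(30, 5))   # 150.0 -> 2.5 minutes of client errors
print(worst_case_detection_seconds(10, 2))   # 20.0
```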
Step-by-Step Fix
1. Check the current health check configuration: See what is being verified.

```bash
# AWS ALB example
aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN \
  | jq '.TargetHealthDescriptions[]'
# Check health check path, interval, and thresholds

# Check what the health endpoint actually does
curl -v https://backend.example.com/health
```
2. Implement a comprehensive health check endpoint: Verify all dependencies.

```python
# Python FastAPI health check
from fastapi.responses import JSONResponse

@app.get("/health")
async def health_check():
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "healthy"
    except Exception as e:
        checks["database"] = f"unhealthy: {str(e)}"

    # Redis cache
    try:
        await redis.ping()
        checks["cache"] = "healthy"
    except Exception as e:
        checks["cache"] = f"unhealthy: {str(e)}"

    # Return 200 only if all critical dependencies are healthy
    status = all(v == "healthy" for v in checks.values())
    return JSONResponse(
        status_code=200 if status else 503,
        content={"status": "healthy" if status else "degraded", "checks": checks},
    )
```
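One gap in a check like the one above: a hung dependency (e.g. an exhausted connection pool) can make the health endpoint itself hang instead of failing. A runnable sketch using `asyncio.wait_for` to bound each probe -- `check_db` and `check_cache` are hypothetical stand-ins for the real dependency calls:

```python
import asyncio

async def check_db():
    await asyncio.sleep(0.01)   # stand-in for `await db.execute("SELECT 1")`

async def check_cache():
    await asyncio.sleep(5)      # stand-in for a hung `await redis.ping()`

async def run_checks(timeout: float = 0.5) -> dict:
    checks = {}
    for name, probe in {"database": check_db, "cache": check_cache}.items():
        try:
            # Bound each probe so one hung dependency cannot stall the endpoint
            await asyncio.wait_for(probe(), timeout=timeout)
            checks[name] = "healthy"
        except asyncio.TimeoutError:
            checks[name] = f"unhealthy: timed out after {timeout}s"
        except Exception as e:
            checks[name] = f"unhealthy: {e}"
    return checks

print(asyncio.run(run_checks()))
# database passes; cache reports a timeout instead of hanging the health check
```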
Prevention
- Implement health checks that verify all critical dependencies, not just process liveness
- Use separate readiness and liveness probes -- readiness for load balancer, liveness for restart
- Set health check intervals to 10 seconds or less for rapid degradation detection
- Monitor the gap between health check status and actual error rates
- Implement graceful degradation that returns 503 from health checks when non-critical dependencies fail
- Test health check behavior by simulating dependency failures in staging environments
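The readiness/liveness split above can be sketched as two separate responses -- liveness only confirms the process can answer (restart signal), while readiness gates traffic on critical dependencies (load balancer signal). The endpoint logic and the `critical` set below are illustrative assumptions:

```python
def liveness() -> int:
    # Process is up and able to respond -> 200; never checks dependencies,
    # so a dependency outage does not trigger restart loops.
    return 200

def readiness(dependency_checks: dict[str, bool]) -> int:
    # Only critical dependencies gate readiness; a failed non-critical
    # dependency should degrade features, not drain the backend.
    critical = {"database", "message_queue"}
    ok = all(up for name, up in dependency_checks.items() if name in critical)
    return 200 if ok else 503

checks = {"database": True, "message_queue": True, "cache": False}
print(liveness(), readiness(checks))   # 200 200 -- cache is non-critical
checks["database"] = False
print(liveness(), readiness(checks))   # 200 503 -- drain traffic, don't restart
```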