Introduction

A load balancer marks backend instances as healthy based on health check responses (typically HTTP 200 from a /health endpoint). However, if the health check endpoint returns 200 while the actual application logic is broken, the load balancer continues sending traffic to failing instances. This creates a situation where monitoring shows "all healthy" but users experience errors. This is a subtle and dangerous failure mode.
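To see why this happens, compare a shallow health endpoint with a real request handler. A minimal, framework-free sketch (the `Database` class, `shallow_health`, and `api_data` are hypothetical stand-ins, not the real application):

```python
class Database:
    """Hypothetical stand-in for a real database client."""
    def __init__(self):
        self.up = True

    def execute(self, query):
        if not self.up:
            raise ConnectionError("connection refused")
        return [{"id": 1}]

db = Database()

def shallow_health():
    # The anti-pattern: returns 200 unconditionally, never touching a dependency.
    return 200, {"status": "ok"}

def api_data():
    # A real request handler fails as soon as the database is down.
    try:
        return 200, {"data": db.execute("SELECT * FROM data")}
    except ConnectionError:
        return 500, {"error": "Internal Server Error"}

db.up = False                 # simulate the broken dependency
print(shallow_health()[0])    # 200 -- the load balancer keeps routing traffic
print(api_data()[0])          # 500 -- users see errors
```

The load balancer only ever calls `shallow_health`, so from its point of view nothing is wrong.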

Symptoms

  • Load balancer dashboard shows all targets as "healthy"
  • Users receive 500 errors from the application
  • Health check endpoint returns 200 OK:

```bash
curl http://backend-server:8080/health
{"status": "ok"}
```

  • But application endpoints return errors:

```bash
curl http://backend-server:8080/api/data
{"error": "Internal Server Error"}
```

  • Error rate spikes while the health check success rate remains at 100%

Common Causes

  • Health check endpoint does not verify critical dependencies (database, cache, queue)
  • Health check runs in a separate process/thread from the application
  • Application has exhausted its connection pool but health endpoint uses a direct connection
  • Health check cached response does not reflect current state
  • Application deadlock affecting request handlers but not the health endpoint
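The "cached response" cause above is easy to reproduce: once a verdict is cached, a dependency failure inside the TTL window is invisible to the load balancer. A minimal sketch (names are illustrative):

```python
import time

_CACHE = {"result": None, "ts": 0.0}
CACHE_TTL = 60.0  # seconds

def cached_health(check_dependencies):
    # Serves a cached verdict for up to CACHE_TTL seconds; a dependency
    # that fails inside that window is invisible to the load balancer.
    now = time.time()
    if _CACHE["result"] is None or now - _CACHE["ts"] > CACHE_TTL:
        _CACHE["result"] = 200 if check_dependencies() else 503
        _CACHE["ts"] = now
    return _CACHE["result"]

print(cached_health(lambda: True))   # 200, freshly computed and cached
print(cached_health(lambda: False))  # still 200 -- stale for up to 60 s
```

If health responses must be cached for cost reasons, keep the TTL well below the load balancer's unhealthy-threshold window.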

Step-by-Step Fix

  1. Implement a deep health check that verifies all dependencies:

```python
import shutil

from flask import jsonify

# Assumes `app` (Flask), `db`, `redis_client`, and `engine` (SQLAlchemy)
# are initialized elsewhere in the application.

@app.route('/health')
def health_check():
    checks = {}
    healthy = True

    # Check database
    try:
        db.execute('SELECT 1')
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'error: {e}'
        healthy = False

    # Check Redis/cache
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'error: {e}'
        healthy = False

    # Check disk space
    total, used, free = shutil.disk_usage('/')
    if free / total < 0.05:  # less than 5% free
        checks['disk'] = f'critical: {free / total * 100:.1f}% free'
        healthy = False
    else:
        checks['disk'] = f'ok: {free / total * 100:.1f}% free'

    # Check connection pool (SQLAlchemy QueuePool: treat the pool as
    # exhausted when every base connection is checked out)
    if engine.pool.checkedout() >= engine.pool.size():
        checks['connection_pool'] = 'warning: pool exhausted'
        healthy = False
    else:
        checks['connection_pool'] = 'ok'

    status_code = 200 if healthy else 503
    return jsonify(checks), status_code
```

  2. Configure the load balancer to use the deep health check:

AWS ALB:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456:targetgroup/my-tg/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200
```

Nginx upstream health check (the `health_check` directive is an NGINX Plus feature and belongs in the proxying `location`, with a shared-memory `zone` on the upstream; open-source nginx only supports passive checks via `max_fails`/`fail_timeout`):

```nginx
upstream backend {
    zone backend 64k;   # shared memory zone required by health_check
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

server {
    location / {
        proxy_pass http://backend;
        # Active health check (NGINX Plus)
        health_check uri=/health interval=10 fails=3 passes=2;
    }
}
```

  3. Add separate readiness and liveness probes (Kubernetes):

```yaml
readinessProbe:
  httpGet:
    path: /health        # deep check: verifies dependencies
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz       # separate simple endpoint for liveness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```
  4. Monitor the discrepancy between the health check and the error rate. A Prometheus alerting rule that fires when the application serves 5xx errors while the health check probe still reports success:

```yaml
- alert: HealthCheckVsErrorRateMismatch
  expr: |
    rate(http_requests_total{status=~"5.."}[5m]) > 0
    and
    probe_success{job="health-check"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Health check passes but application returns 5xx errors"
```
  5. Implement a circuit breaker at the application level:

```python
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=30)
def call_database(query):
    return db.execute(query)

# While the circuit is open, the health check returns 503
@app.route('/health')
def health():
    try:
        call_database('SELECT 1')
        return {'status': 'ok'}, 200
    except CircuitBreakerError:
        return {'status': 'circuit open'}, 503
```
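The `circuitbreaker` package handles failure counting and recovery timers for you. For illustration, the core idea can be hand-rolled in a few lines (this sketch has no recovery timeout or half-open state, and all names are illustrative):

```python
class CircuitOpen(Exception):
    """Raised when the breaker fails fast without calling the dependency."""

class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            raise CircuitOpen("circuit open, failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0   # any success closes the circuit again
        return result

breaker = CircuitBreaker(threshold=3)

def flaky():
    raise ConnectionError("db down")

for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
except CircuitOpen:
    print("circuit open -> health check should now return 503")
```

A production implementation also needs a recovery timeout (half-open state) so the circuit can close again once the dependency recovers.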

Prevention

  • Health checks must verify all critical dependencies (database, cache, queue, disk)
  • Use separate endpoints: /healthz for liveness, /health for readiness
  • Set health check interval to 10 seconds or less
  • Configure load balancer to mark unhealthy after 2-3 consecutive failures
  • Monitor application error rate independently from health check status
  • Alert on the delta between health check success and application error rate
  • Use chaos engineering to test health check accuracy by intentionally breaking dependencies
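The last prevention item can be exercised even in a unit test: inject each dependency failure in turn and assert that the health verdict actually flips. A minimal, framework-free sketch (`deep_health` is an illustrative stand-in for the real endpoint):

```python
def deep_health(db_ok, cache_ok):
    """Tiny model of a deep health check: healthy only if every dependency is."""
    checks = {"database": db_ok, "cache": cache_ok}
    return (200 if all(checks.values()) else 503), checks

# Chaos-style test: break each dependency in turn and verify the
# health check reports it (i.e. flips to 503).
assert deep_health(True, True)[0] == 200
assert deep_health(False, True)[0] == 503   # database "outage"
assert deep_health(True, False)[0] == 503   # cache "outage"
print("health check reflects every injected failure")
```

In a real environment, the same idea means stopping the database or blackholing the cache in a staging cluster and watching the load balancer drain the affected targets.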