Introduction
A load balancer marks backend instances as healthy based on health check responses (typically HTTP 200 from a /health endpoint). However, if the health check endpoint returns 200 while the actual application logic is broken, the load balancer continues sending traffic to failing instances. This creates a situation where monitoring shows "all healthy" but users experience errors. This is a subtle and dangerous failure mode.
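The failure mode above can be sketched in a few lines. This is a minimal illustration (the `Database` class and endpoint functions are hypothetical stand-ins, not the document's application): the shallow health check never touches the dependency the real request path needs, so it keeps returning 200 after the database goes down.

```python
class Database:
    """Hypothetical dependency that can be toggled down to simulate an outage."""
    def __init__(self):
        self.up = True

    def query(self, sql):
        if not self.up:
            raise ConnectionError("database unreachable")
        return [{"id": 1}]

db = Database()

def shallow_health():
    # Passes as long as the process can respond -- never touches the DB
    return 200

def api_data():
    try:
        db.query("SELECT * FROM data")
        return 200
    except ConnectionError:
        return 500

db.up = False            # simulate the broken dependency
print(shallow_health())  # 200 -- load balancer keeps routing traffic here
print(api_data())        # 500 -- users see errors
```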
Symptoms
- Load balancer dashboard shows all targets as "healthy"
- Users receive 500 errors from the application
- Health check endpoint returns 200 OK:

  ```bash
  curl http://backend-server:8080/health
  {"status": "ok"}
  ```

- But application endpoints return errors:

  ```bash
  curl http://backend-server:8080/api/data
  {"error": "Internal Server Error"}
  ```

- Error rate spikes while health check success rate remains at 100%
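The symptom pattern above can be checked mechanically: sample recent status codes from both endpoints and flag the mismatch. A small sketch (the helper and its inputs are illustrative, not part of any monitoring stack):

```python
def mismatch(health_statuses, api_statuses):
    """True when the symptom above is present: every recent health check
    returned 200 while the application endpoint returned 5xx errors."""
    health_all_ok = all(s == 200 for s in health_statuses)
    api_5xx = any(s >= 500 for s in api_statuses)
    return health_all_ok and api_5xx

print(mismatch([200] * 10, [200, 500, 500, 200]))  # True: the silent failure mode
print(mismatch([200, 503], [500, 500]))            # False: the LB already sees it
```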
Common Causes
- Health check endpoint does not verify critical dependencies (database, cache, queue)
- Health check runs in a separate process/thread from the application
- Application has exhausted its connection pool but health endpoint uses a direct connection
- Health check cached response does not reflect current state
- Application deadlock affecting request handlers but not the health endpoint
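The connection-pool cause in the list above is worth a concrete sketch. In this toy model (the pool and handlers are hypothetical), request handlers leak connections from a bounded pool until it is exhausted, while the health endpoint never touches the pool at all and so stays green:

```python
import queue

class ConnectionPool:
    """Toy bounded pool; real pools behave similarly when connections leak."""
    def __init__(self, size):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")

    def acquire(self, timeout=0.05):
        # Raises queue.Empty once all connections are checked out
        return self._conns.get(timeout=timeout)

pool = ConnectionPool(size=2)

def handle_request():
    try:
        pool.acquire()  # leaked deliberately: never returned to the pool
        return 200
    except queue.Empty:
        return 500

def health_check():
    return 200  # uses a "direct connection"; never touches the pool

results = [handle_request() for _ in range(3)]
print(results)         # [200, 200, 500] -- third request starves
print(health_check())  # 200 -- still green
```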
Step-by-Step Fix
1. Implement a deep health check that verifies all dependencies:

```python
import shutil

# /health endpoint - checks all critical dependencies
@app.route('/health')
def health_check():
    checks = {}
    healthy = True

    # Check database
    try:
        db.execute('SELECT 1')
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'error: {str(e)}'
        healthy = False

    # Check Redis/cache
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'error: {str(e)}'
        healthy = False

    # Check disk space
    total, used, free = shutil.disk_usage('/')
    if free / total < 0.05:  # Less than 5% free
        checks['disk'] = f'critical: {free/total*100:.1f}% free'
        healthy = False
    else:
        checks['disk'] = f'ok: {free/total*100:.1f}% free'

    # Check connection pool (SQLAlchemy QueuePool: overflow() > 0 means all
    # base connections are in use and overflow connections are being opened)
    if engine.pool.overflow() > 0:
        checks['connection_pool'] = 'warning: pool under pressure'
        healthy = False
    else:
        checks['connection_pool'] = 'ok'

    status_code = 200 if healthy else 503
    return jsonify(checks), status_code
```
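The aggregation logic in the endpoint above can be exercised in isolation with stubbed dependencies. A self-contained sketch (the checker names and functions are invented for the demo) showing the 200/503 switch:

```python
def run_health_checks(checkers):
    """Run named check callables; return (detail dict, HTTP status code).
    Any check that raises marks the whole endpoint unhealthy."""
    checks, healthy = {}, True
    for name, fn in checkers.items():
        try:
            fn()
            checks[name] = 'ok'
        except Exception as e:
            checks[name] = f'error: {e}'
            healthy = False
    return checks, (200 if healthy else 503)

def ok():
    pass

def broken_redis():
    raise ConnectionError("redis unreachable")

print(run_health_checks({'database': ok, 'redis': ok})[1])            # 200
print(run_health_checks({'database': ok, 'redis': broken_redis})[1])  # 503
```

Returning the per-dependency detail dict alongside the status code makes the 503 actionable: the on-call engineer sees *which* dependency failed, not just that something did.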
2. Configure the load balancer to use the deep health check:

AWS ALB:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456:targetgroup/my-tg/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200
```

Nginx upstream health check (the `health_check` directive is Nginx Plus only; it belongs in the `location` that proxies to the upstream, and active checks require a shared-memory `zone` on the upstream):

```nginx
upstream backend {
    zone backend 64k;
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

server {
    location / {
        proxy_pass http://backend;

        # Active health check (Nginx Plus)
        health_check uri=/health interval=10 fails=3 passes=2;
    }
}
```
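The ALB settings above imply a worst-case detection window worth keeping in mind: with a 10-second interval and an unhealthy threshold of 3, a broken target can keep receiving traffic for roughly interval × threshold (plus up to one timeout) before it is marked unhealthy. Back-of-envelope:

```python
interval_s = 10          # --health-check-interval-seconds
timeout_s = 5            # --health-check-timeout-seconds
unhealthy_threshold = 3  # --unhealthy-threshold-count

# Worst case: failure begins just after a successful probe, then three
# consecutive probes must fail (the last one may hang until its timeout).
worst_case_s = interval_s * unhealthy_threshold + timeout_s
print(worst_case_s)  # 35 seconds of traffic to a broken target, worst case
```

If 35 seconds of errors is unacceptable, tighten the interval or threshold, but weigh that against flapping on transient blips.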
3. Add a separate readiness probe (Kubernetes):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz  # Separate simple endpoint for liveness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```
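The split above matters because the two probes trigger different remedies: a failing liveness probe restarts the pod, while a failing readiness probe only removes it from Service endpoints. A minimal sketch of the two endpoints the probes assume (function bodies are illustrative):

```python
def healthz():
    # Liveness: "is the process alive?" -- deliberately no dependency
    # checks, so a down database never triggers a pointless restart loop
    return 200

def health(dependencies_ok):
    # Readiness: "can this pod serve traffic right now?" -- 503 pulls the
    # pod out of rotation until dependencies recover
    return 200 if dependencies_ok else 503

print(healthz())      # 200 even during a dependency outage
print(health(False))  # 503 -> pod removed from endpoints, not restarted
```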
4. Monitor the discrepancy between health check success rate and application error rate. A Prometheus alerting rule can catch the mismatch:

```yaml
- alert: HealthCheckVsErrorRateMismatch
  expr: |
    rate(http_requests_total{status=~"5.."}[5m]) > 0
    and on()
    probe_success{job="health-check"} == 1
  # on() is needed because the two series carry different label sets
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Health check passes but application returns 5xx errors"
```
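The alert condition above, stripped of PromQL, is a simple conjunction: fire only when the application is erroring *and* the external probe still reports healthy. In plain Python (the function and sample values are illustrative):

```python
def should_alert(rate_5xx_per_s, probe_success):
    """Mirror of the alert expression: 5xx rate positive AND probe passing."""
    return rate_5xx_per_s > 0 and probe_success == 1

print(should_alert(0.0, 1))  # False: healthy, nothing to do
print(should_alert(2.5, 1))  # True: the silent failure mode this page covers
print(should_alert(2.5, 0))  # False: the LB already sees the failure
```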
5. Implement a circuit breaker at the application level:

```python
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=30)
def call_database(query):
    return db.execute(query)

# When the circuit is open, the health check returns 503
@app.route('/health')
def health():
    try:
        call_database('SELECT 1')
        return {'status': 'ok'}, 200
    except CircuitBreakerError:
        return {'status': 'circuit open'}, 503
```
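To make the behavior of the snippet above concrete without the `circuitbreaker` package, here is a minimal, self-contained sketch of the same pattern (`MiniBreaker` and its exception class are invented for the demo, not the library's API): after a run of consecutive failures the circuit opens and further calls fail fast until a recovery timeout elapses.

```python
import time

class CircuitOpenError(Exception):
    pass

class MiniBreaker:
    """Toy circuit breaker: opens after `failure_threshold` consecutive
    failures; fails fast until `recovery_timeout` seconds have passed,
    then allows one half-open trial call."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = MiniBreaker(failure_threshold=2, recovery_timeout=30.0)

def flaky_db(query):
    raise ConnectionError("database down")

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky_db, 'SELECT 1')
    except ConnectionError:
        pass

try:
    breaker.call(flaky_db, 'SELECT 1')
except CircuitOpenError as e:
    print(e)  # circuit open: failing fast
```

The fail-fast behavior is what ties the breaker back to health checks: once open, the dependency call raises immediately instead of hanging, so the /health endpoint can report 503 quickly rather than burning its own timeout.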
Prevention
- Health checks must verify all critical dependencies (database, cache, queue, disk)
- Use separate endpoints: /healthz for liveness, /health for readiness
- Set health check interval to 10 seconds or less
- Configure load balancer to mark unhealthy after 2-3 consecutive failures
- Monitor application error rate independently from health check status
- Alert on the delta between health check success and application error rate
- Use chaos engineering to test health check accuracy by intentionally breaking dependencies
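The chaos-engineering bullet above can be made executable: break a dependency on purpose and assert that the health endpoint actually flips to 503. A sketch of such a test (the check functions are stand-ins; in a real experiment you would kill the actual dependency and hit the real endpoint); a shallow check fails this test, a deep one passes:

```python
def shallow_health(checks):
    return 200  # ignores dependencies entirely

def deep_health(checks):
    for check in checks.values():
        try:
            check()
        except Exception:
            return 503
    return 200

def db_ok():
    pass

def db_broken():
    raise ConnectionError("chaos experiment: db connections dropped")

print(deep_health({'database': db_ok}))         # 200 before the experiment
print(shallow_health({'database': db_broken}))  # 200 -- check lies: FAIL
print(deep_health({'database': db_broken}))     # 503 -- check notices: PASS
```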