Introduction

Load balancer health check failures occur when the load balancer's periodic probes to backend servers return an unhealthy status, causing it to stop routing traffic to the affected instances. Health checks are the primary mechanism by which load balancers detect failed, overloaded, or degraded backends and maintain high availability. When health checks fail, instances are marked unhealthy and traffic shifts to the remaining healthy instances; if all instances become unhealthy, the service becomes completely unavailable (503 Service Unavailable). Fixing these failures requires understanding health check protocols (HTTP, HTTPS, TCP, SSL), timeout and threshold configuration, backend health endpoints, network security groups, connection draining, and graceful degradation patterns. This guide provides production-proven troubleshooting for health check scenarios across AWS ALB/NLB, NGINX, HAProxy, F5, and other cloud load balancers.

Symptoms

  • Load balancer dashboard shows backend instances as Unhealthy
  • Traffic drops to zero when all instances fail health checks
  • Users receive 503 Service Unavailable or 502 Bad Gateway errors
  • Health check logs show timeout, connection refused, or non-200 status codes
  • Instances cycle between healthy and unhealthy states (flapping)
  • New instances never pass health checks after deployment
  • Health checks pass manually but fail from load balancer
  • Only some instances fail while others remain healthy

Common Causes

  • Health check endpoint returning non-200 status code (500, 503, 404)
  • Application not started or crashed on backend server
  • Health check timeout too short for application response time
  • Security group blocking load balancer health check probes
  • Backend server firewall rejecting health check source IPs
  • Health check path misconfigured (wrong URL, port, protocol)
  • Application listening on different port than health check
  • DNS resolution failure for hostname-based health checks
  • SSL/TLS certificate mismatch for HTTPS health checks
  • Backend server overloaded, cannot respond within timeout
  • Network ACL blocking health check traffic
  • Application health check logic too strict (failing on dependent service errors)

Step-by-Step Fix

### 1. Confirm health check failure diagnosis

Check load balancer health status:

```bash
# AWS ALB/NLB - Check target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-tg/1234567890abcdef

# Output shows health status:
# {
#   "TargetHealthDescriptions": [
#     {
#       "Target": {"Id": "i-1234567890abcdef0", "Port": 80},
#       "HealthCheckPort": "80",
#       "TargetHealth": {
#         "State": "unhealthy",
#         "Reason": "Target.FailedHealthCheck",
#         "Description": "Health checks failed"
#       }
#     }
#   ]
# }

# Check target group attributes
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-tg/1234567890abcdef

# AWS Classic ELB
aws elb describe-instance-health \
  --load-balancer-name my-load-balancer

# Output:
# InstanceStates:
# - InstanceId: i-1234567890abcdef0
#   State: OutOfService
#   ReasonCode: Instance
#   Description: Instance has failed at least the UnhealthyThreshold number of health checks consecutively.

# NGINX Plus upstream health status
curl -s http://localhost:8080/api/2/http/upstreams | jq

# HAProxy runtime API
echo "show servers state" | socat stdio /var/run/haproxy.sock
```

Test health check endpoint manually:

```bash
# Test from the backend server itself
curl -v http://localhost/health

# Expected healthy response:
# < HTTP/1.1 200 OK
# < Content-Type: application/json
# {"status": "healthy"}

# Test from another instance (simulating the load balancer)
curl -v http://<backend-ip>/health

# Test with a timeout (match the LB timeout)
time curl --max-time 5 http://<backend-ip>/health

# If the manual test fails, the problem is application-side
# If it succeeds but the LB check fails, the problem is network/LB config
```
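
To reproduce what the load balancer actually does, it can help to script the probe with the same timeout and status-matching semantics as the LB configuration. A minimal sketch using only the standard library (the `probe` helper and its defaults are illustrative, not any load balancer's real implementation):

```python
import urllib.request
import urllib.error

def probe(url, timeout=5.0, expected=range(200, 300)):
    """Return True if `url` answers within `timeout` with an expected
    status, mirroring an HTTP health check's pass/fail decision."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status in expected
    except urllib.error.HTTPError as e:
        # The server answered, but with a non-success status
        return e.code in expected
    except (urllib.error.URLError, OSError):
        # Timeout, connection refused, DNS failure - all count as unhealthy
        return False
```

Running this once from the backend itself and once from a host in the load balancer's subnet is a quick differential test: a result that differs between the two points at security groups or NACLs rather than the application.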

### 2. Check health check configuration

Verify load balancer health check settings:

```bash
# AWS ALB health check configuration
aws elbv2 describe-target-groups \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-tg/1234567890abcdef \
  --query 'TargetGroups[0].[HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthCheckIntervalSeconds,HealthCheckTimeoutSeconds,HealthyThresholdCount,UnhealthyThresholdCount,Matcher]'

# Output:
# [
#   "HTTP",
#   "traffic-port",
#   "/health",
#   30,
#   5,
#   5,
#   2,
#   {"HttpCode": "200"}
# ]

# Key parameters:
# - HealthCheckProtocol: HTTP, HTTPS, TCP, SSL
# - HealthCheckPort: Port number or "traffic-port"
# - HealthCheckPath: URL path for HTTP/HTTPS
# - HealthCheckIntervalSeconds: Seconds between checks (default 30)
# - HealthCheckTimeoutSeconds: Wait time for a response (default 5)
# - HealthyThresholdCount: Consecutive successes to mark healthy
# - UnhealthyThresholdCount: Consecutive failures to mark unhealthy
# - Matcher: Expected status codes (200, 200-299, "200,302")

# Fix: Update health check configuration
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-tg/1234567890abcdef \
  --health-check-path /healthz \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2 \
  --matcher HttpCode=200

# NGINX health check configuration
# /etc/nginx/nginx.conf
upstream backend {
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}

server {
    location / {
        proxy_pass http://backend;
        # Active health checks (NGINX Plus only; the directive
        # belongs in the proxying location, not the upstream block)
        health_check interval=10s fails=3 passes=2 uri=/health match=health_check_match;
    }
}

match health_check_match {
    status 200;
    header Content-Type ~ text/html;
    body ~ "healthy";
}

# HAProxy health check configuration
# /etc/haproxy/haproxy.cfg
backend app_servers
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ localhost
    http-check expect status 200
    # HAProxy 2.2+ can also match response headers with "http-check expect hdr"
    server app1 10.0.1.1:8080 check inter 5s fall 3 rise 2
    server app2 10.0.1.2:8080 check inter 5s fall 3 rise 2
    server app3 10.0.1.3:8080 check inter 5s fall 3 rise 2

# inter: health check interval (5s)
# fall: failures before marking the server down (3)
# rise: successes before marking it up again (2)
```

### 3. Fix security group rules

Ensure load balancer can reach backend:

```bash
# AWS Security Group for backend instances
# Check current rules
aws ec2 describe-security-groups \
  --group-ids sg-1234567890abcdef0 \
  --query 'SecurityGroups[0].IpPermissions'

# Add a rule allowing health checks from the load balancer
# First, get the load balancer's security group
LB_SG=$(aws elbv2 describe-load-balancers \
  --names my-alb \
  --query 'LoadBalancers[0].SecurityGroups[0]' \
  --output text)

# Add an inbound rule to the backend security group
# ($LB_SG already contains the sg- prefix)
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend \
  --protocol tcp \
  --port 80 \
  --source-group "$LB_SG"

# Or allow from a specific CIDR (load balancer subnet)
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend \
  --protocol tcp \
  --port 80 \
  --cidr 10.0.0.0/24

# Verify the Network ACL allows the traffic
# (JMESPath literals need backticks)
aws ec2 describe-network-acls \
  --filters "Name=association.subnet-id,Values=subnet-12345678" \
  --query 'NetworkAcls[0].Entries[?Egress==`false`]'

# The NACL must allow inbound traffic on the health check port, e.g.:
# Rule#  Type  Protocol  Port Range  Source     Allow/Deny
# 100    HTTP  TCP       80          0.0.0.0/0  ALLOW
```

### 4. Fix health check endpoint

Ensure application has proper health endpoint:

```python
# Flask health check endpoints
from flask import Flask, jsonify
import psycopg2
import redis

app = Flask(__name__)

@app.route('/health')
def health():
    """Basic health check - is the app running"""
    return jsonify({"status": "healthy"}), 200

@app.route('/healthz')
def healthz():
    """Readiness check - can the app handle requests"""
    try:
        # Check database connection
        conn = psycopg2.connect("dbname=mydb user=myuser")
        conn.close()

        # Check Redis connection
        r = redis.Redis(host='redis', port=6379)
        r.ping()

        return jsonify({"status": "ready"}), 200
    except Exception as e:
        return jsonify({"status": "not ready", "error": str(e)}), 503

@app.route('/live')
def liveness():
    """Liveness probe - is the app deadlocked"""
    return jsonify({"status": "alive"}), 200

# Kubernetes-style probes:
# /healthz  - Liveness (is the process running)
# /ready    - Readiness (can it handle traffic)
# /startup  - Startup (still initializing)
```
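
The `/healthz` pattern above generalizes: run each dependency check, collect the failures, and map the result onto 200/503. A framework-agnostic sketch of that aggregation (the `readiness` helper and check names are illustrative):

```python
def readiness(checks):
    """Run each named dependency check callable and return (status_code, body).

    A check passes by returning normally and fails by raising; all checks
    run even after a failure, so the response names every broken dependency."""
    errors = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            errors[name] = str(exc)
    if errors:
        return 503, {"status": "not ready", "errors": errors}
    return 200, {"status": "ready"}
```

Wiring this into any web framework is then a one-line handler that serializes the body and sets the status code, which keeps the pass/fail policy in one testable place.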

Spring Boot health endpoint:

```xml
<!-- pom.xml - Add Spring Boot Actuator -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
```

```yaml
# application.yml - Configure health endpoints
management:
  endpoints:
    web:
      exposure:
        include: health,info
  endpoint:
    health:
      show-details: when-authorized
      probes:
        enabled: true  # Expose Kubernetes-style probe groups

# Health check endpoints:
# /actuator/health           - Full health with details
# /actuator/health/liveness  - Liveness probe
# /actuator/health/readiness - Readiness probe

# Ensure health responses come back within the LB timeout; for embedded
# Tomcat the relevant setting is the connection timeout
server:
  tomcat:
    connection-timeout: 5s
```

### 5. Tune health check timeouts

Adjust timeout and threshold values:

```bash
# AWS ALB - Recommended settings for different workloads

# Standard web application
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2

# Slow-starting applications (Java, .NET)
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 3

# High-availability critical services
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 3 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2

# Detection time calculation:
# Time to mark unhealthy = Interval x UnhealthyThreshold
# Example: 30s x 2 = 60s until marked unhealthy
#
# Time to mark healthy = Interval x HealthyThreshold
# Example: 30s x 2 = 60s until marked healthy

# NGINX Plus timeout tuning
upstream backend {
    server 10.0.1.1:8080;
}

server {
    location / {
        proxy_pass http://backend;
        # Check every 10s; mark down after 3 failures, up after 2 successes
        health_check interval=10s fails=3 passes=2;
    }
}

# HAProxy timeout tuning
backend app_servers
    timeout check 3s   # Health check timeout (a backend/defaults setting)
    server app1 10.0.1.1:8080 check inter 5s fall 3 rise 2
    # inter: check interval; fall: failures before down; rise: successes before up
```
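
The detection-time arithmetic in the comments above is worth making explicit, because interval and thresholds trade detection speed against flapping. A small sketch of the worst-case timings (assuming evenly spaced checks and no timeout overrun):

```python
def time_to_unhealthy(interval_s, unhealthy_threshold):
    """Worst-case seconds from the first failed probe until the LB
    marks the target unhealthy."""
    return interval_s * unhealthy_threshold

def time_to_healthy(interval_s, healthy_threshold):
    """Worst-case seconds from recovery until the LB routes traffic again."""
    return interval_s * healthy_threshold

# Standard settings above: 30s interval, threshold 2 -> 60s to detect
assert time_to_unhealthy(30, 2) == 60
# High-availability settings: 10s interval, threshold 2 -> 20s to detect
assert time_to_unhealthy(10, 2) == 20
```

Raising the unhealthy threshold (or `fall` in HAProxy) slows detection linearly but filters out one-off probe failures, which is usually the right trade for flapping instances.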

### 6. Implement connection draining

Gracefully remove unhealthy instances:

```bash
# AWS ALB - Enable connection draining (deregistration delay)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $TG_ARN \
  --attributes Key=deregistration_delay.timeout_seconds,Value=300

# Deregistration delay:
# - Time to wait for in-flight requests to complete
# - During this time the instance is "draining" - no new connections
# - Typical value: 60-300 seconds

# AWS Classic ELB
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-elb \
  --load-balancer-attributes "{\"ConnectionDraining\":{\"Enabled\":true,\"Timeout\":300}}"

# NGINX Plus - Put an upstream server into draining state via the API
curl -X PATCH -d '{"drain":true}' \
  http://localhost:8080/api/2/http/upstreams/backend/servers/0

# HAProxy - Graceful shutdown
# Put the server into maintenance mode (no new connections)
echo "disable server backend/app1" | socat stdio /var/run/haproxy.sock

# Forcibly terminate any remaining sessions (after waiting for them to drain)
echo "shutdown sessions server backend/app1" | socat stdio /var/run/haproxy.sock

# Wait for connections to drain before stopping the process
```
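
On the application side, draining pairs naturally with a shutdown hook: on SIGTERM, start failing the health check so the load balancer stops sending new connections, keep serving in-flight requests, and only exit after the deregistration delay. A minimal sketch (the `draining` flag and handler names are illustrative, not a specific framework's API):

```python
import signal
import threading

draining = threading.Event()

def health_status():
    """Status code the /health endpoint should return."""
    # 503 once draining starts, so the LB marks this instance unhealthy
    # and stops opening new connections; existing requests keep flowing
    return 503 if draining.is_set() else 200

def handle_sigterm(signum, frame):
    draining.set()
    # A real service would now wait at least the LB's deregistration
    # delay (e.g. the 300s configured above) before exiting

signal.signal(signal.SIGTERM, handle_sigterm)
```

The ordering matters: flipping the health check *before* exiting gives the LB `interval x unhealthy-threshold` seconds to notice, so shutdown scripts should sleep at least that long plus the drain window.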

### 7. Fix SSL/TLS health check issues

HTTPS health check configuration:

```bash
# AWS ALB - HTTPS health check
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-protocol HTTPS \
  --health-check-port 443 \
  --health-check-path /health

# Note: ALB does not validate the backend certificate, so self-signed
# certificates work for HTTPS health checks. Certificate mismatches matter
# for load balancers that do validate (e.g. some on-prem appliances):

# Option 1: Use an HTTP health check against an HTTP port on the backend
aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-protocol HTTP \
  --health-check-port 80

# Option 2: Install a valid certificate (Let's Encrypt, ACM)
# The certificate must match the backend hostname

# Option 3: Disable SSL verification for health checks
# (Not recommended for production)

# NLB - TLS health check
aws elbv2 create-target-group \
  --name my-nlb-tg \
  --protocol TCP \
  --port 443 \
  --vpc-id vpc-12345678 \
  --health-check-protocol TLS \
  --health-check-port 443 \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10

# TLS health checks verify that the handshake completes, which catches
# expired or broken TLS configuration even without chain validation
```

### 8. Handle high-traffic scenarios

Health checks during traffic spikes:

```python
# Problem: Health checks fail during traffic spikes due to overload
# Solution: Implement health check degradation
import os

import psutil
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    """Health check that degrades gracefully under load"""
    # Check system load
    load_avg = os.getloadavg()[0]
    cpu_percent = psutil.cpu_percent()
    memory_percent = psutil.virtual_memory().percent

    # Get active connections (application-specific helper)
    connections = get_active_connections()

    # If overloaded, return degraded status
    # Caution: if every instance is overloaded, this marks them all
    # unhealthy at once - pair it with autoscaling or a minimum-healthy floor
    if load_avg > 10 or cpu_percent > 90 or memory_percent > 90:
        return jsonify({
            "status": "degraded",
            "load": load_avg,
            "cpu": cpu_percent,
            "memory": memory_percent,
            "connections": connections
        }), 503  # Mark unhealthy to shed load

    # Basic dependency checks (application-specific helpers)
    try:
        check_database()
        check_cache()
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 503

    return jsonify({
        "status": "healthy",
        "load": load_avg,
        "cpu": cpu_percent,
        "memory": memory_percent
    }), 200

# Alternative: Run a lightweight health server on a separate port for the LB
from http.server import HTTPServer, BaseHTTPRequestHandler

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/health':
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(b'{"status":"healthy"}')
        else:
            self.send_response(404)
            self.end_headers()

# Start the health server on port 8081
HTTPServer(('0.0.0.0', 8081), HealthHandler).serve_forever()

# Configure the LB to check port 8081 while the app serves traffic on 8080
```

### 9. Monitor health check status

Set up health check monitoring:

```bash
# CloudWatch alarm for unhealthy targets
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-UnhealthyHosts" \
  --alarm-description "ALB has unhealthy targets" \
  --metric-name UnHealthyHostCount \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --alarm-actions arn:aws:sns:region:account:alerts

# CloudWatch metrics to monitor:
# - HealthyHostCount: Number of healthy targets
# - UnHealthyHostCount: Number of unhealthy targets
# - TargetConnectionErrorCount: Connection errors to targets
# - TargetResponseTime: Average response time from targets
# - HTTPCode_Target_5XX_Count: 5xx errors from targets

# Custom health check monitoring script
#!/bin/bash

TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890abcdef"

# Get target health states
HEALTH=$(aws elbv2 describe-target-health \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --query 'TargetHealthDescriptions[*].TargetHealth.State' \
  --output text)

# Count healthy vs unhealthy
# (match whole lines: "unhealthy" also contains the substring "healthy")
HEALTHY=$(echo "$HEALTH" | tr -s '[:space:]' '\n' | grep -cx "healthy")
UNHEALTHY=$(echo "$HEALTH" | tr -s '[:space:]' '\n' | grep -cx "unhealthy")

echo "Healthy: $HEALTHY, Unhealthy: $UNHEALTHY"

if [ "$UNHEALTHY" -gt 0 ]; then
  # Get details of the unhealthy targets (JMESPath literals need backticks)
  aws elbv2 describe-target-health \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --query 'TargetHealthDescriptions[?TargetHealth.State==`unhealthy`]' \
    --output table

  # Send an alert
  aws sns publish \
    --topic-arn arn:aws:sns:region:account:alerts \
    --subject "ALB Health Check Alert" \
    --message "$UNHEALTHY targets are unhealthy"
fi
```
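
The same tally can be done without the grep pitfalls by parsing the JSON that `describe-target-health` returns. A small sketch (the sample document mirrors the output shape shown in step 1):

```python
import json
from collections import Counter

def count_states(describe_target_health_output):
    """Tally TargetHealth states from `aws elbv2 describe-target-health` JSON."""
    doc = json.loads(describe_target_health_output)
    return Counter(d["TargetHealth"]["State"]
                   for d in doc["TargetHealthDescriptions"])

# Sample output (instance IDs are illustrative)
sample = '''{"TargetHealthDescriptions": [
  {"Target": {"Id": "i-aaa", "Port": 80}, "TargetHealth": {"State": "healthy"}},
  {"Target": {"Id": "i-bbb", "Port": 80}, "TargetHealth": {"State": "unhealthy"}},
  {"Target": {"Id": "i-ccc", "Port": 80}, "TargetHealth": {"State": "unhealthy"}}
]}'''

states = count_states(sample)
print(states["healthy"], states["unhealthy"])  # 1 2
```

Because a `Counter` returns 0 for missing keys, the same code also handles target groups where every target is healthy (or where targets are still in `initial` or `draining` states).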

Prometheus metrics:

```yaml
# Prometheus alerting rules for health checks
groups:
  - name: load_balancer_health
    rules:
      - alert: ALBUnhealthyTargets
        expr: aws_alb_target_group_unhealthy_host_count_average > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ALB has unhealthy targets"
          description: "{{ $value }} targets unhealthy on {{ $labels.load_balancer }}"

      - alert: ALBAllTargetsUnhealthy
        expr: aws_alb_target_group_healthy_host_count_average == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "ALB all targets unhealthy"
          description: "All targets unhealthy on {{ $labels.load_balancer }} - service down!"

      - alert: ALBHealthCheckLatencyHigh
        expr: aws_alb_target_group_target_response_time_average > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ALB target response time high"
          description: "Target response time is {{ $value }}s on {{ $labels.load_balancer }}"
```

Prevention

  • Use dedicated health check endpoints separate from business logic
  • Configure appropriate timeout and threshold values for workload
  • Implement health check degradation under high load
  • Enable connection draining for graceful removal
  • Set up monitoring and alerting for health check failures
  • Test health check configuration in staging before production
  • Document health check requirements for each service
  • Use TCP health checks for simple availability, HTTP for application health
  • Implement circuit breaker pattern to prevent cascading failures
  • Consider separate health check port for high-traffic services
Related Error Codes

  • **503 Service Unavailable**: All backend instances unhealthy
  • **502 Bad Gateway**: Load balancer cannot connect to the backend
  • **504 Gateway Timeout**: Backend not responding within the timeout
  • **Target.FailedHealthCheck**: Health check probe failed
  • **Target.ResponseCodeMismatch**: Health check returned an unexpected status code