Introduction

Load balancer health check failures occur when the load balancer cannot successfully probe backend instances, causing them to be marked unhealthy and removed from rotation. The result is 503 Service Unavailable errors, reduced capacity, traffic shifting onto the remaining healthy instances, and potentially cascading failures.

Health checks verify backend availability by sending periodic probes (HTTP, HTTPS, TCP, or gRPC) and evaluating the responses against expected criteria. Common causes of failure include: the application not listening on the health check port, the health endpoint returning a non-2xx status, a health check timeout shorter than the application's response time, an overly aggressive unhealthy threshold, an overloaded backend timing out, a firewall or security group blocking health check traffic, NAT or network routing issues, a deadlocked or unresponsive application, a TLS certificate mismatch for HTTPS health checks, or a misconfigured health check path.

Fixing these failures requires understanding health check protocols, proper timeout and threshold configuration, network path validation, and application-level health indicators. This guide provides production-proven troubleshooting for health check failures across AWS ALB/NLB, NGINX, HAProxy, F5, and Kubernetes environments.
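To make the threshold mechanics concrete, here is a minimal sketch of how a load balancer flips a target between healthy and unhealthy. This is illustrative only; real balancers implement this internally, and the parameter names merely mirror AWS ALB settings:

```python
# Illustrative state machine for healthy/unhealthy thresholds.
# A target changes state only after a full streak of consecutive
# opposite results, which is what prevents flapping.

class TargetHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = "healthy"
        self.streak = 0  # consecutive results opposing the current state

    def record_probe(self, passed: bool) -> str:
        """Feed one probe result; flip state only after a full streak."""
        if self.state == "healthy" and not passed:
            self.streak += 1
            if self.streak >= self.unhealthy_threshold:
                self.state, self.streak = "unhealthy", 0
        elif self.state == "unhealthy" and passed:
            self.streak += 1
            if self.streak >= self.healthy_threshold:
                self.state, self.streak = "healthy", 0
        else:
            self.streak = 0  # streak broken; state unchanged
        return self.state

target = TargetHealth()
for result in [False, False, False]:  # three consecutive failed probes
    state = target.record_probe(result)
print(state)  # unhealthy
```

Note that a single intermittent success resets the failure streak, which is why a target under partial load can bounce between states for a long time without ever being marked unhealthy.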

Symptoms

  • Load balancer shows backend instances as Unhealthy or OutOfService
  • 503 Service Unavailable errors to end users
  • Traffic routing to fewer backend instances than expected
  • CloudWatch/monitoring shows UnHealthyHostCount increasing
  • Health check logs show timeout, connection refused, or status mismatch
  • Backend instances cycling between healthy and unhealthy states
  • Auto-scaling launching new instances but they never become healthy
  • Target.ResponseCode shows 5xx or 0 (timeout)
  • Target.HealthReason shows Response code mismatch or Health checks timed out
  • Application logs show no incoming health check requests

Common Causes

  • Health check endpoint not implemented or returning wrong status code
  • Application not bound to expected port or interface
  • Health check timeout shorter than application response time
  • Security group or firewall blocking load balancer IP ranges
  • Backend instance CPU/memory exhausted, not responding
  • Application thread pool exhausted, health endpoint unresponsive
  • TLS certificate expired or hostname mismatch for HTTPS health checks
  • Health check interval too aggressive for application startup time
  • Network ACL or routing preventing load balancer to backend communication
  • Backend application crashed or in bad state
  • DNS resolution failure for health check hostname
  • Connection limit reached on backend instance
  • Health check path returns 200 but application not actually healthy
  • Container not ready to serve traffic (Kubernetes readiness probe)

Step-by-Step Fix

### 1. Diagnose health check failure

Check load balancer health status:

```bash
# AWS ALB/NLB - Check target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/xyz \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]'

# AWS Classic ELB
aws elb describe-instance-health --load-balancer-name my-lb

# NGINX Plus API
curl http://localhost:8080/api/8/http/upstreams/backend/peers

# HAProxy stats page
echo "show stat" | socat stdio /var/run/haproxy.sock | grep -E "^backend|DOWN"

# Kubernetes
kubectl get endpoints
kubectl describe service my-service
kubectl get pods -l app=my-app -o wide
```

Analyze health check configuration:

```bash
# AWS ALB - Get health check settings
aws elbv2 describe-target-groups \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/xyz \
  --query 'TargetGroups[0].{Protocol:HealthCheckProtocol,Path:HealthCheckPath,Port:HealthCheckPort,IntervalSeconds:HealthCheckIntervalSeconds,TimeoutSeconds:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,Matcher:Matcher}'

# Expected output:
# {
#     "Protocol": "HTTP",
#     "Path": "/health",
#     "Port": "traffic-port",
#     "IntervalSeconds": 30,
#     "TimeoutSeconds": 5,
#     "HealthyThreshold": 2,
#     "UnhealthyThreshold": 3,
#     "Matcher": {"HttpCode": "200"}
# }
```

Test health check manually from backend:

```bash
# SSH to backend instance and test locally
curl -v http://localhost:8080/health

# Test with expected headers
curl -v -H "Host: app.example.com" http://localhost:8080/health

# Check if application is listening
netstat -tlnp | grep 8080
ss -tlnp | grep 8080
lsof -i :8080

# Check application logs
tail -f /var/log/app/application.log | grep -i health
```

### 2. Fix health check endpoint

Implement proper health endpoint:

```python
# Python/Flask - Health endpoint example
from flask import Flask, jsonify

import healthcheck  # Custom health checks

app = Flask(__name__)

@app.route('/health')
def health():
    """
    Health check endpoint for load balancer.
    Returns 200 OK when application can serve traffic.
    """
    # Basic liveness - is the process running?
    return jsonify({"status": "healthy"}), 200

@app.route('/ready')
def readiness():
    """
    Readiness check - is the application ready to serve traffic?
    Check database connections, cache, external dependencies.
    """
    checks = {
        "database": healthcheck.check_database(),
        "cache": healthcheck.check_redis(),
        "external_api": healthcheck.check_external_api(),
    }

    if all(checks.values()):
        return jsonify({"status": "ready", "checks": checks}), 200
    else:
        failed = [k for k, v in checks.items() if not v]
        return jsonify({"status": "not_ready", "failed": failed}), 503

# For load balancer health checks, use /health (liveness)
# For Kubernetes readiness probes, use /ready
```

```go
// Go - Health endpoint with dependency checks
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type HealthResponse struct {
	Status string          `json:"status"`
	Checks map[string]bool `json:"checks,omitempty"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")

	// Simple liveness check
	json.NewEncoder(w).Encode(HealthResponse{Status: "healthy"})
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")

	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	checks := map[string]bool{
		"database": checkDatabase(ctx),
		"cache":    checkCache(ctx),
	}

	allHealthy := true
	for _, v := range checks {
		if !v {
			allHealthy = false
			break
		}
	}

	if allHealthy {
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(HealthResponse{Status: "ready", Checks: checks})
	} else {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(HealthResponse{Status: "not_ready", Checks: checks})
	}
}

func checkDatabase(ctx context.Context) bool {
	// Implement database ping with timeout
	return true
}

func checkCache(ctx context.Context) bool {
	// Implement cache ping with timeout
	return true
}
```

Health check response requirements:

```bash
# Health check must:
# 1. Return within timeout period (typically 2-5 seconds)
# 2. Return expected status code (200, 204, 301, 302)
# 3. Not require authentication
# 4. Not depend on external services (for basic liveness)
# 5. Be idempotent and side-effect free

# Test response time
time curl -s http://localhost:8080/health

# Should return in < 1 second for most configurations
```
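The status-code criterion can be codified for quick local testing. A small sketch, in which `matches` and `passes_health_check` are illustrative helpers (not part of any library) and the matcher string format mirrors the ALB `HttpCode` matcher shown in step 1, e.g. `"200"` or `"200-299,301"`:

```python
# Evaluate an observed response against a load-balancer-style matcher
# and timeout. Useful for sanity-checking an endpoint before pointing
# a real health check at it.

def matches(http_code: int, matcher: str) -> bool:
    """True if http_code satisfies a matcher like '200' or '200-299,301'."""
    for part in matcher.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            if int(lo) <= http_code <= int(hi):
                return True
        elif http_code == int(part):
            return True
    return False

def passes_health_check(status_code, elapsed_s, matcher="200", timeout_s=5.0):
    # Both conditions must hold: answered in time AND matched the code.
    return elapsed_s < timeout_s and matches(status_code, matcher)

print(passes_health_check(200, 0.3))  # True
print(passes_health_check(200, 6.0))  # False (timed out)
```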

### 3. Fix timeout and threshold configuration

AWS ALB health check tuning:

```bash
# Default ALB settings may be too aggressive for some applications
# Adjust based on application characteristics

# For slow-starting applications:
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/xyz \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

# Recommended settings for typical web applications:
# - Interval: 30 seconds (balance between detection and load)
# - Timeout: 5-10 seconds (allow for GC pauses, slow queries)
# - Healthy threshold: 2 (confirm healthy before routing)
# - Unhealthy threshold: 2-3 (avoid flapping)

# For Java applications with GC pauses:
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/xyz \
  --health-check-timeout-seconds 15 \
  --unhealthy-threshold-count 4
```
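When choosing these numbers, it helps to work out the implied detection and recovery times: worst case, failure detection takes roughly interval x unhealthy threshold (plus up to one timeout), and recovery takes interval x healthy threshold. A back-of-the-envelope sketch using the settings above:

```python
# Rough worst-case detection/recovery times for a health check
# configuration. Assumes the failure starts just after a successful
# probe, so every subsequent probe must fail (or time out) in turn.

def detection_seconds(interval, timeout, unhealthy_threshold):
    return interval * unhealthy_threshold + timeout

def recovery_seconds(interval, healthy_threshold):
    return interval * healthy_threshold

# Interval 30s, timeout 10s, unhealthy threshold 3:
print(detection_seconds(interval=30, timeout=10, unhealthy_threshold=3))  # 100
# Interval 30s, healthy threshold 2:
print(recovery_seconds(interval=30, healthy_threshold=2))  # 60
```

So with the "slow-starting application" settings above, an instance can serve errors for over a minute and a half before being pulled; tighten the interval if that detection window is unacceptable.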

NGINX health check configuration:

```nginx
# NGINX Plus active health checks
upstream backend {
    zone backend_zone 64k;

    server backend1.example.com weight=5;
    server backend2.example.com weight=5;
    server backend3.example.com backup;
}

server {
    location / {
        proxy_pass http://backend;

        # Active health check settings (health_check belongs in the
        # location that proxies to the upstream, not the upstream block)
        health_check interval=10s fails=3 passes=2 uri=/health match=health_check_response;
    }
}

# Define expected health check response
match health_check_response {
    status 200;
    body ~ "healthy";
    header Content-Type ~ "application/json";
}

# For NGINX Open Source (passive health checks only)
upstream backend_oss {
    server backend1.example.com max_fails=3 fail_timeout=30s;
    server backend2.example.com max_fails=3 fail_timeout=30s;
    server backend3.example.com backup;
}

server {
    location / {
        proxy_pass http://backend_oss;
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}
```

HAProxy health check configuration:

```haproxy
backend app_servers
    balance roundrobin

    # HTTP health check
    option httpchk GET /health HTTP/1.1\r\nHost:\ app.example.com
    http-check expect status 200

    # Health check timing
    timeout check 5s
    # inter: interval between checks; fall: consecutive failures before
    # marking down; rise: consecutive successes before marking up
    default-server inter 5s fall 3 rise 2

    server app1 10.0.1.1:8080 check
    server app2 10.0.1.2:8080 check
    server app3 10.0.1.3:8080 check backup

# TCP health check (for databases, Redis, etc.)
backend db_servers
    option tcp-check
    timeout check 10s
    default-server inter 10s fall 3 rise 2

    server db1 10.0.2.1:5432 check
    server db2 10.0.2.2:5432 check
```

Kubernetes liveness and readiness probes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 8080

      # Liveness probe - is the container running?
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30   # Wait before first check
        periodSeconds: 10         # Check interval
        timeoutSeconds: 5         # Request timeout
        failureThreshold: 3       # Consecutive failures before restart
        successThreshold: 1

      # Readiness probe - is the container ready to serve traffic?
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5    # Check sooner for readiness
        periodSeconds: 5          # More frequent checks
        timeoutSeconds: 3
        failureThreshold: 3
        successThreshold: 1

      # Startup probe - for slow-starting containers
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 30      # Allow up to 5 minutes for startup
```

### 4. Fix security group and firewall rules

AWS Security Group configuration:

```bash
# Allow load balancer to reach backend instances
# Get load balancer security group
aws elbv2 describe-load-balancers \
  --names my-alb \
  --query 'LoadBalancers[0].SecurityGroups[0]'

# Get load balancer IP ranges (for NLB or Classic ELB)
# AWS publishes these ranges:
# https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

# Add inbound rule to backend security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend-instances \
  --protocol tcp \
  --port 8080 \
  --source-group sg-load-balancer

# Verify rule added
aws ec2 describe-security-groups \
  --group-ids sg-backend-instances \
  --query 'SecurityGroups[0].IpPermissions'

# For a specific ALB health check port
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend-instances \
  --protocol tcp \
  --port 8080 \
  --source-group sg-my-alb
```

Linux firewall (iptables/ufw):

```bash
# Check current firewall rules
sudo ufw status verbose
sudo iptables -L -n -v

# Allow health check traffic from load balancer subnet
sudo ufw allow from 10.0.0.0/16 to any port 8080 proto tcp

# Or for specific load balancer IPs
sudo ufw allow from 10.0.1.100 to any port 8080 proto tcp
sudo ufw allow from 10.0.1.101 to any port 8080 proto tcp

# Verify and reload
sudo ufw status
sudo ufw reload

# For iptables directly
sudo iptables -A INPUT -p tcp -s 10.0.0.0/16 --dport 8080 -j ACCEPT
sudo iptables-save | sudo tee /etc/iptables/rules.v4
```

### 5. Fix TLS certificate issues

HTTPS health check with certificate validation:

```bash
# AWS ALB HTTPS health check
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/xyz \
  --health-check-protocol HTTPS \
  --health-check-path /health \
  --health-check-port 443

# If using a self-signed certificate, disable SSL verification
# (only for internal load balancers with private CAs)
```

```nginx
# For NGINX Plus with HTTPS health checks
upstream backend {
    server backend1.example.com:443;
    server backend2.example.com:443;
}

server {
    location / {
        proxy_pass https://backend;
        health_check interval=10s uri=/health;

        # If backend uses a self-signed cert
        proxy_ssl_verify off;  # Disable verification for internal traffic
    }
}
```

Certificate validation troubleshooting:

```bash
# Check certificate validity
openssl s_client -connect backend.example.com:443 -servername backend.example.com </dev/null 2>/dev/null | \
  openssl x509 -noout -dates

# Verify certificate chain
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt backend.crt

# Check hostname matches certificate
openssl s_client -connect backend.example.com:443 </dev/null 2>/dev/null | grep -A1 "subject="

# Common issues:
# - Certificate expired
# - Certificate issued for different hostname
# - Missing intermediate certificate in chain
# - Self-signed certificate not in trust store
```
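The same checks can be scripted for monitoring. A minimal sketch using Python's standard `ssl` module; the hostname is an example, and expiry alerting thresholds are up to you:

```python
# Connect to a backend, pull the peer certificate, and report days
# until expiry. With the default context, an expired or hostname-
# mismatched certificate raises ssl.SSLCertVerificationError during
# the handshake, which itself flags the problem.

import datetime
import socket
import ssl

def days_remaining(not_after: str) -> int:
    """Parse a certificate's notAfter field, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expires = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires - datetime.datetime.utcnow()).days

def cert_days_remaining(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_remaining(cert["notAfter"])

# Example (requires network access):
# print(cert_days_remaining("backend.example.com"))
```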

### 6. Fix backend application issues

Application resource exhaustion:

```bash
# Check CPU and memory on backend
top -bn1 | head -20
free -m
df -h

# Check application thread pool
# Java: check for thread starvation
jstack <pid> | grep -c "RUNNABLE"

# Check connection pool status
# - Database connections exhausted?
# - Redis connections exhausted?

# Review application logs for errors
tail -100 /var/log/app/application.log

# Check for OOM killer activity
dmesg | grep -i "killed process"

# Restart application if in bad state
sudo systemctl restart my-app
```

Slow health check response:

```python
# Profile health endpoint response time by adding timing to the handler
import time

from flask import jsonify

@app.route('/health')
def health():
    start = time.time()

    # Health check logic
    result = {"status": "healthy"}

    elapsed = time.time() - start
    result["response_time_ms"] = int(elapsed * 1000)

    return jsonify(result), 200

# If response time > timeout, optimize:
# - Remove external service checks from liveness endpoint
# - Cache expensive health check results
# - Use async health checks with timeout
# - Move dependency checks to readiness probe only
```
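One way to implement the "cache expensive health check results" idea is to run the dependency checks on a background timer and have the endpoint return the cached result. A framework-agnostic sketch; `CachedHealth` and the check callables are illustrative, not from any library:

```python
# Run expensive dependency checks periodically in a background thread
# so the health endpoint itself only reads a cached result and stays
# fast regardless of how slow the checks are.

import threading
import time

class CachedHealth:
    def __init__(self, checks, interval=10.0):
        self.checks = checks          # dict of name -> zero-arg callable
        self.interval = interval
        self.results = {name: True for name in checks}  # optimistic start
        self._lock = threading.Lock()

    def refresh(self):
        """Run every check once and atomically swap in the results."""
        results = {name: bool(fn()) for name, fn in self.checks.items()}
        with self._lock:
            self.results = results

    def start(self):
        """Refresh on a timer in a daemon thread."""
        def loop():
            while True:
                self.refresh()
                time.sleep(self.interval)
        threading.Thread(target=loop, daemon=True).start()

    def healthy(self) -> bool:
        with self._lock:
            return all(self.results.values())

# In the endpoint: return 200 if health.healthy() else 503
health = CachedHealth({"database": lambda: True, "cache": lambda: True})
health.refresh()
print(health.healthy())  # True
```

The trade-off is staleness: a dependency failure is only noticed on the next refresh, so keep the refresh interval comfortably below the load balancer's detection window.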

### 7. Fix network routing issues

Verify network path:

```bash # From load balancer to backend (if accessible) # Or from another instance in same subnet

# Test connectivity nc -zv 10.0.1.100 8080 telnet 10.0.1.100 8080

# Trace route traceroute 10.0.1.100 mtr 10.0.1.100

# Check network ACLs aws ec2 describe-network-acls \ --filters "Name=association.subnet-id,Values=subnet-xxx"

# Verify route table aws ec2 describe-route-tables \ --filters "Name=association.subnet-id,Values=subnet-xxx"

# Check for NAT gateway issues (if backend in private subnet) aws ec2 describe-nat-gateways --filter "Name=subnet-id,Values=subnet-xxx" ```

DNS resolution for health checks:

```bash
# If health check uses hostname instead of IP
nslookup backend.example.com
dig backend.example.com

# Check /etc/hosts for incorrect entries
cat /etc/hosts

# For Kubernetes services
kubectl get svc my-service -o yaml
kubectl get endpoints my-service

# Verify CoreDNS is working
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes
```

### 8. Advanced debugging

Enable load balancer access logs:

```bash
# AWS ALB access logs
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/xyz \
  --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=my-alb-logs

# S3 log format includes health check results
# Look for the "target_status_code" field

# NGINX Plus: upstream status is already included in access_log
```

```haproxy
# HAProxy detailed logging
defaults
    option httplog
    option log-health-checks
```

Health check packet capture:

```bash
# Capture health check traffic on backend
sudo tcpdump -i any -s 0 -w health-check.pcap port 8080

# Or capture only connection attempts (SYNs) on the health check port
sudo tcpdump -i any -s 0 -w health-check-syn.pcap 'tcp port 8080 and (tcp[tcpflags] & tcp-syn != 0)'

# Analyze in Wireshark:
# - Look for SYN retransmissions without SYN-ACK (filtered or no listener)
# - Look for RST packets (connection refused, application not responding)
# - Check time between request and response

# Real-time health check monitoring
watch -n 1 'curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://localhost:8080/health'
```

Prevention

  • Implement both liveness (/health) and readiness (/ready) endpoints
  • Set the health check timeout to at least 3x the p99 response time of the health endpoint
  • Use unhealthy threshold of 2-3 to prevent flapping
  • Configure health check interval based on acceptable detection time
  • Exclude external dependencies from liveness checks
  • Monitor health check latency as leading indicator of problems
  • Document load balancer IP ranges for firewall rules
  • Use connection draining to avoid dropping in-flight requests
  • Test health check configuration during chaos engineering exercises
  • Set up alerts for UnHealthyHostCount > 0

Related Issues

  • **Load balancer SSL handshake failed**: TLS configuration mismatch
  • **Load balancer 503 Service Unavailable**: No healthy backends
  • **Load balancer sticky session failure**: Session affinity misconfigured
  • **Load balancer health check failure 503**: Backend returning errors