## Introduction
Load balancer health check failures occur when backend instances fail to respond correctly to health probes, causing the load balancer to mark them unhealthy and stop routing traffic to them. When all backends are unhealthy, or healthy capacity falls below threshold, users see 503 Service Unavailable errors. Health checks validate backend availability by sending periodic requests to a configured endpoint or port. Common causes include: a health check timeout shorter than the application's startup or response time, a health check path returning a non-200 status, security groups blocking health check traffic, a crashed or hung backend application, network ACLs dropping health check packets, an overly aggressive health check interval during deployments, application deadlock while serving the health check, SSL/TLS certificate problems on HTTPS health checks, and backend connection pool exhaustion preventing health check responses. Fixing these issues requires understanding health check algorithms, timeout configuration, backend registration workflows, and the relevant debugging tools. This guide provides production-proven troubleshooting for load balancer health issues across AWS ALB/NLB, NGINX, HAProxy, Azure Load Balancer, and GCP Cloud Load Balancing.
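The check algorithm itself is simple: probe on an interval, time out each probe, and flip state after a run of consecutive failures. A minimal sketch of that loop in shell — the URL, interval, timeout, and threshold values are illustrative defaults, not tied to any particular load balancer:

```shell
#!/usr/bin/env bash
# Sketch of the per-target health-check loop a load balancer runs.
# All parameters are illustrative placeholders.
probe_target() {
  local url=${1:-http://localhost:8080/health}
  local interval=${2:-5} timeout=${3:-2} unhealthy_threshold=${4:-3}
  local failures=0
  while true; do
    if curl -sf -m "$timeout" -o /dev/null "$url"; then
      failures=0
      echo "healthy"
    else
      failures=$((failures + 1))
      echo "probe failed ($failures/$unhealthy_threshold)"
      if [ "$failures" -ge "$unhealthy_threshold" ]; then
        echo "target marked unhealthy"
        return 1
      fi
    fi
    sleep "$interval"
  done
}

# Example: probe_target http://localhost:8080/health 5 2 3
```

Note how a timeout shorter than the endpoint's real response time produces exactly the failure pattern this guide troubleshoots: every probe "fails" even though the backend is serving traffic.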
## Symptoms

- Load balancer returns `503 Service Unavailable` to all requests
- Backend instances show `unhealthy` or `draining` status
- Health check endpoint returns `500` or times out
- `Target failed health check` status in load balancer logs
- All instances in `Initial` or `Unused` state
- Intermittent 503s affecting a subset of requests
- `No healthy upstream` in load balancer access logs
- Connection refused or reset during health check
- Health check passes manually but load balancer marks the target unhealthy
- Deployment causes a temporary 503 spike during instance rotation
## Common Causes
- Health check timeout shorter than application response time
- Health check path endpoint broken or returning errors
- Security group missing rule for health check source IPs
- Backend application not started or crashed
- Network ACL blocking health check traffic
- SSL certificate mismatch for HTTPS health checks
- Application deadlocked or out of memory
- Health check interval too frequent during startup
- Load balancer backend pool exhausted
- Deregistration delay too short causing premature termination
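The first four causes on this list can be ruled out directly on a backend instance before touching any load balancer settings. A hypothetical triage helper — the service name, port, and health path are placeholders for your environment:

```shell
#!/usr/bin/env bash
# Hypothetical first-pass triage on a backend instance.
# SERVICE, PORT, and HEALTH_PATH are placeholders - adjust for your stack.
triage() {
  local service=${SERVICE:-myapp} port=${PORT:-8080} path=${HEALTH_PATH:-/health}

  echo "== process =="
  pgrep -a -f "$service" || echo "no $service process found"

  echo "== listener =="
  ss -tln "( sport = :$port )"   # is anything bound to the port?

  echo "== endpoint =="
  curl -s -o /dev/null -m 5 \
    -w "status=%{http_code} total=%{time_total}s connect=%{time_connect}s\n" \
    "http://localhost:${port}${path}" || echo "curl failed (refused or timed out)"
}

# Usage: SERVICE=myapp PORT=8080 triage
```

If the endpoint check succeeds here but the load balancer still marks the target unhealthy, the problem is usually on the network path (security groups, NACLs) rather than in the application.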
## Step-by-Step Fix
### 1. Diagnose health check status
Check backend health status:
```bash
# AWS ALB - describe target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890

# Output:
# {
#   "TargetHealthDescriptions": [
#     {
#       "Target": {"Id": "i-abc123", "Port": 80},
#       "HealthCheckPort": "80",
#       "TargetHealth": {
#         "State": "unhealthy",
#         "Reason": "Target.FailedHealthChecks",
#         "Description": "Health checks failed"
#       }
#     }
#   ]
# }

# AWS ALB - check target group attributes
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890

# Check health check settings
aws elbv2 describe-target-groups \
  --names my-target-group \
  --query 'TargetGroups[0].{HealthCheckProtocol:HealthCheckProtocol,HealthCheckPath:HealthCheckPath,HealthCheckInterval:HealthCheckIntervalSeconds,HealthCheckTimeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,Matcher:Matcher}'

# Output:
# {
#   "HealthCheckProtocol": "HTTP",
#   "HealthCheckPath": "/health",
#   "HealthCheckInterval": 30,
#   "HealthCheckTimeout": 5,
#   "HealthyThreshold": 2,
#   "UnhealthyThreshold": 3,
#   "Matcher": {"HttpCode": "200-399"}
# }
```
NGINX upstream health:
```bash
# Check upstream status via the NGINX stub_status module
# Enable in nginx.conf:
# location /nginx_status {
#     stub_status on;
#     allow 127.0.0.1;
#     deny all;
# }

curl http://localhost/nginx_status

# Output:
# Active connections: 10
# server accepts handled requests
#  1000 1000 5000
# Reading: 5 Writing: 3 Waiting: 2

# Check upstream peers (NGINX Plus API)
curl http://localhost/api/3/http/upstreams

# Or dump the full configuration and inspect upstream blocks
nginx -T | grep -A20 "upstream"
```
HAProxy backend status:
```bash
# HAProxy stats socket
# CSV fields: 1=pxname, 2=svname, 18=status, 37=check_status
echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d',' -f1,2,18,37

# Output format:
# pxname,svname,status,check_status
# http_front,FRONTEND,OPEN,
# be_app,s1,DOWN,L7STS
# be_app,s2,UP,L7OK

# Detailed backend state
echo "show servers state" | socat stdio /var/run/haproxy.sock

# Show stick tables
echo "show table" | socat stdio /var/run/haproxy.sock
```
### 2. Fix health check configuration
AWS ALB health check tuning:
```bash
# Health check too aggressive (causes false failures)
# Current settings:
# - Interval: 5 seconds
# - Timeout: 2 seconds
# - Healthy threshold: 2
# - Unhealthy threshold: 2

# Problem: Application takes 3+ seconds to respond during startup.
# Health check times out at 2s and marks the target unhealthy after
# 2 consecutive failures (~10s).

# Recommended settings for typical web applications:
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --health-check-protocol HTTP \
  --health-check-port 80 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200-399

# Settings explained:
# - Interval 30s: Health check every 30 seconds (reduces load)
# - Timeout 10s: Wait up to 10s for a response (allows slow responses)
# - Healthy 2: Need 2 consecutive successes to mark healthy
# - Unhealthy 3: Need 3 consecutive failures to mark unhealthy
# - Matcher 200-399: Accept 2xx and 3xx as healthy
```
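Rather than guessing, the timeout can be sized from the endpoint's observed response times: sample it repeatedly and set `--health-check-timeout-seconds` comfortably above the slowest sample. A rough sampler sketch — the URL and sample count are placeholders:

```shell
#!/usr/bin/env bash
# Sample a health endpoint's latency and report median/max.
# Pick the health check timeout comfortably above the max observed.
sample_health_latency() {
  local url=${1:-http://localhost:8080/health} n=${2:-20}
  for _ in $(seq "$n"); do
    curl -s -o /dev/null -m 10 -w "%{time_total}\n" "$url"
  done | sort -n | awk '
    { t[NR] = $1 }
    END {
      if (NR == 0) exit 1
      printf "samples=%d median=%.3fs max=%.3fs\n", NR, t[int((NR + 1) / 2)], t[NR]
    }'
}

# Usage: sample_health_latency http://localhost:8080/health 50
```

Sampling during peak load or just after a deploy matters more than sampling at idle, since that is when checks actually time out.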
Health check for slow-starting applications:
```bash
# Application takes 60+ seconds to start (Spring Boot, etc.)
# Use a startup health check before switching to the runtime check.

# Option 1: Separate startup and readiness endpoints
# /health/startup - returns 200 as soon as the process is alive
# /health/ready   - returns 200 only when fully initialized

# Target group during initial deployment
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --health-check-path /health/startup \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5

# After deployment, switch to the readiness check
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --health-check-path /health/ready \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10

# Option 2: Use an NLB with TCP health checks
# (passes as soon as the port accepts connections)
aws elbv2 create-target-group \
  --name my-tcp-tg \
  --protocol TCP \
  --port 8080 \
  --vpc-id vpc-123456 \
  --health-check-protocol TCP \
  --health-check-interval-seconds 10 \
  --health-check-timeout-seconds 5
```
NGINX health check configuration:
```nginx
# NGINX Plus active health checks
upstream backend {
    zone backend_zone 64k;

    server app1.example.com:8080 max_fails=3 fail_timeout=30s;
    server app2.example.com:8080 max_fails=3 fail_timeout=30s;
    server app3.example.com:8080 max_fails=3 fail_timeout=30s backup;
}

# Health check match criteria
match healthy {
    status 200-399;
    header Content-Type ~ text/;
    body ~ /status.*ok/i;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 3;

        # Active health checks (NGINX Plus only; valid in location context)
        health_check interval=10s fails=3 passes=2 uri=/health match=healthy;
    }
}

# NGINX Open Source (passive health checks only)
upstream backend {
    server app1.example.com:8080 max_fails=3 fail_timeout=30s;
    server app2.example.com:8080 max_fails=3 fail_timeout=30s;

    # Passive: NGINX marks a server unavailable after connection failures
    # max_fails: failures before marking the server unavailable
    # fail_timeout: window for counting failures, and how long the
    #               server is then considered unavailable
}
```
HAProxy health check configuration:
```haproxy
# HAProxy configuration
global
    log /dev/log local0
    maxconn 4096

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5s
    timeout client 50s
    timeout server 50s
    retries 3

frontend http_front
    bind *:80
    default_backend http_back

backend http_back
    balance roundrobin

    # Health check options
    option httpchk GET /health HTTP/1.1\r\nHost:\ localhost

    # HTTP 200-399 = healthy
    http-check expect status 200-399

    # Backend servers
    server app1 10.0.1.1:8080 check inter 5s fall 3 rise 2
    server app2 10.0.1.2:8080 check inter 5s fall 3 rise 2
    server app3 10.0.1.3:8080 check inter 5s fall 3 rise 2 backup

# Health check parameters:
# - inter 5s: Check every 5 seconds
# - fall 3: Mark unhealthy after 3 consecutive failures
# - rise 2: Mark healthy after 2 consecutive successes
```
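These parameters determine how long an outage goes unnoticed: worst-case time to detect a dead server is roughly `inter × fall`, and time to bring a recovered server back is roughly `inter × rise`. A quick sanity check of the values above:

```shell
#!/usr/bin/env bash
# Back-of-envelope detection/recovery math for the check settings above
inter=5   # seconds between checks
fall=3    # consecutive failures to mark DOWN
rise=2    # consecutive successes to mark UP

detection=$((inter * fall))
recovery=$((inter * rise))
echo "worst-case detection: ${detection}s"
echo "worst-case recovery:  ${recovery}s"
```

Tightening `inter` shortens detection but multiplies check traffic across all servers; adjusting `fall` is usually the cheaper lever.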
### 3. Fix security group and network issues
Security group rules for health checks:
```bash
# AWS ALB health check source IPs
# ALB health checks come from the ALB's interfaces in its subnets,
# not from a fixed published range

# For an ALB in public subnets:
# Allow inbound from the ALB security group (preferred)
# or from the ALB subnet CIDRs

# Create security group for the ALB
aws ec2 create-security-group \
  --group-name alb-sg \
  --description "Security group for ALB" \
  --vpc-id vpc-123456

# Allow inbound HTTP/HTTPS from clients
aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0

# Backend security group - allow app traffic from the ALB SG
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend456 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb123

# Health check port (if different from the traffic port)
aws ec2 authorize-security-group-ingress \
  --group-id sg-backend456 \
  --protocol tcp \
  --port 80 \
  --source-group sg-alb123

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids sg-backend456 \
  --query 'SecurityGroups[0].IpPermissions'
```
Network ACL configuration:
```bash
# Check NACL rules
aws ec2 describe-network-acls \
  --network-acl-ids nacl-123456 \
  --query 'NetworkAcls[0].Entries'

# Ensure health check traffic is allowed on the backend subnet:
# Inbound: the health check port (80, 443, 8080)
# Outbound: ephemeral ports (1024-65535) for health check responses

# Inbound rule for the health check port
aws ec2 create-network-acl-entry \
  --network-acl-id nacl-123456 \
  --rule-number 100 \
  --protocol tcp \
  --port-range From=80,To=80 \
  --cidr-block 0.0.0.0/0 \
  --rule-action allow \
  --no-egress

# Outbound rule for ephemeral response ports
aws ec2 create-network-acl-entry \
  --network-acl-id nacl-123456 \
  --rule-number 100 \
  --protocol tcp \
  --port-range From=1024,To=65535 \
  --cidr-block 0.0.0.0/0 \
  --rule-action allow \
  --egress

# For private subnets, verify the route table (e.g. NAT Gateway route)
aws ec2 describe-route-tables \
  --route-table-ids rtb-123456
```
Test health check connectivity:
```bash
# From a backend instance, test the health check path directly
curl -v http://localhost:8080/health

# Simulate an ALB health check (with expected headers)
curl -v -H "User-Agent: ELB-HealthChecker/2.0" \
  -H "Host: localhost" \
  http://localhost:8080/health

# Test along the network path the ALB actually uses
# with VPC Reachability Analyzer
aws ec2 create-network-insights-path \
  --source <alb-eni-id> \
  --destination <backend-eni-id> \
  --protocol tcp \
  --destination-port 80

aws ec2 start-network-insights-analysis \
  --network-insights-path-id <path-id>
```
### 4. Fix application health check endpoint
Implement proper health check endpoint:
```yaml
# Spring Boot Actuator health endpoint (application.yml)
management:
  endpoints:
    web:
      exposure:
        include: health,info
  endpoint:
    health:
      show-details: when-authorized  # or "always" for LB health checks
      probes:
        enabled: true  # exposes liveness/readiness health groups
```

```java
// Health check response
// GET /actuator/health
// HTTP 200 OK
// {
//   "status": "UP",
//   "components": {
//     "db": {"status": "UP"},
//     "redis": {"status": "UP"},
//     "diskSpace": {"status": "UP"}
//   }
// }

// Custom health indicator
@Component
public class CustomHealthIndicator implements HealthIndicator {
    @Override
    public Health health() {
        try {
            // Check a critical dependency
            checkExternalService();
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
```
Node.js health endpoint:
```javascript
const express = require('express');
const app = express();

// Liveness check (process alive)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness check (ready for traffic)
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: 'unknown',
    cache: 'unknown'
  };
  let healthy = true;

  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (e) {
    checks.database = 'error';
    healthy = false;
  }

  try {
    await redis.ping();
    checks.cache = 'ok';
  } catch (e) {
    checks.cache = 'error';
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ready' : 'not_ready',
    checks
  });
});
// Startup check (for slow-starting apps)
// Express has no built-in "ready" event; set a flag once init completes
let startupComplete = false;

async function init() {
  // connect to databases, warm caches, etc.
  startupComplete = true;
}

app.get('/health/startup', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});
```
### 5. Fix connection draining and deregistration
Connection draining during deployments:
```bash
# AWS ALB - deregistration delay
# When a target is deregistered, the ALB stops sending new connections
# but allows existing connections to complete (up to the delay)

aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=300

# Common settings:
# - 300 seconds (5 min): Typical for HTTP requests
# - 60 seconds: Fast iteration, short-lived requests
# - 600 seconds (10 min): Long-running requests, WebSocket

# For Lambda targets
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-lambda-tg/1234567890 \
  --attributes Key=lambda.multi_value_headers.enabled,Value=true
```
Graceful shutdown with draining:
```yaml
# Spring Boot graceful shutdown (application.yml)
server:
  shutdown: graceful  # Spring Boot 2.3+

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s  # Max wait for in-flight requests
```

AWS CodeDeploy lifecycle hooks:

```yaml
# appspec.yml
version: 0.0
os: linux
files:
  - source: /app
    destination: /opt/app
hooks:
  ApplicationStop:
    - location: scripts/stop.sh
      timeout: 300
      runas: root
  BeforeAllowTraffic:
    - location: scripts/before-allow.sh
      timeout: 60
  AfterAllowTraffic:
    - location: scripts/after-allow.sh
      timeout: 60
```

```bash
#!/bin/bash
# stop.sh - graceful shutdown with draining
echo "Starting graceful shutdown..."

# Deregister this instance so the ALB stops sending new requests
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws elbv2 deregister-targets \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --targets Id="$INSTANCE_ID"

# Give the ALB time to stop routing traffic
sleep 30

# Wait for established connections to drain before stopping the app
for i in {1..30}; do
  connections=$(ss -tn state established '( sport = :8080 )' | tail -n +2 | wc -l)
  if [ "$connections" -eq 0 ]; then
    echo "All connections drained"
    break
  fi
  echo "Waiting for connections to drain... ($connections)"
  sleep 10
done

# Stop the application
systemctl stop myapp
```
### 6. Debug intermittent 503 errors
Intermittent failures often indicate capacity issues:
```bash
# Check request volume per target
# (RequestCountPerTarget only supports the Sum statistic;
# the TargetGroup dimension value is the ARN suffix, not the full ARN)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCountPerTarget \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/1234567890 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum

# Check unhealthy host count
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name UnHealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/1234567890 \
    Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average

# Check target response time
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average
```
Access log analysis:
```bash
# ALB access logs (S3 bucket)
# Parse with Athena or locally

# Download logs
aws s3 cp s3://alb-logs-bucket/AWSLogs/account/elb/region/2026/03/31/ . --recursive

# Extract 503 errors (field 9 = elb_status_code)
zcat *.gz | awk '$9 == "503"' | head -20

# Key fields in ALB logs:
#  1: type
#  2: timestamp
#  3: elb
#  4: client_ip:port
#  5: target_ip:port
#  6: request_processing_time
#  7: target_processing_time
#  8: response_processing_time
#  9: elb_status_code
# 10: target_status_code
# 11: received_bytes
# 12: sent_bytes

# Analyze 503 patterns (timestamp, target, target_status)
zcat *.gz | awk '$9 == "503" { print $2, $5, $10 }' | sort | uniq -c | sort -rn | head -20

# Check request distribution per target
zcat *.gz | awk '{print $5}' | sort | uniq -c | sort -rn
```
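Building on the field positions above, a small awk helper can turn raw log lines into a per-target 503 rate, which quickly shows whether errors cluster on one backend or are spread evenly (a cluster points at a sick instance; an even spread points at shared capacity or configuration):

```shell
#!/usr/bin/env bash
# Per-target 503 rate from ALB access log lines on stdin.
# Uses the field positions listed above: $5 = target, $9 = elb_status_code.
compute_503_rate() {
  awk '{
    total[$5]++
    if ($9 == "503") errs[$5]++
  }
  END {
    for (t in total)
      printf "%s %d/%d %.1f%%\n", t, errs[t], total[t], 100 * errs[t] / total[t]
  }'
}

# Usage: zcat *.gz | compute_503_rate | sort -t/ -k2 -rn
```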
### 7. Fix SSL/TLS health check issues
HTTPS health check configuration:
```bash
# HTTPS health checks and backend certificates
# ALB does not validate the backend certificate chain for HTTPS health
# checks, but TLS handshake failures (no certificate configured,
# protocol or cipher mismatch) still cause check failures

# Option 1: Fix the backend TLS configuration
# Install a working certificate (Let's Encrypt or CA-signed)

# Option 2: Health check over plain HTTP on a separate port
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --health-check-protocol HTTP \
  --health-check-port 8080

# Option 3: TCP health check (NLB target groups; no TLS handshake)
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/1234567890 \
  --health-check-protocol TCP \
  --health-check-port 443

# For NLB with TLS passthrough
aws elbv2 create-target-group \
  --name my-nlb-tg \
  --protocol TCP \
  --port 443 \
  --vpc-id vpc-123456 \
  --health-check-protocol TCP \
  --health-check-port 443
```
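To see what the health check's TLS handshake actually encounters, inspect the certificate the backend presents from inside the VPC. A sketch using openssl — the host and port are placeholders:

```shell
#!/usr/bin/env bash
# Print subject, issuer, and validity dates of a backend's certificate.
# Expired dates or an unexpected subject here explain HTTPS check failures.
# HOST and PORT are placeholders for your backend.
check_backend_cert() {
  local host=${1:-10.0.1.1} port=${2:-443}
  echo | openssl s_client -connect "$host:$port" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -subject -issuer -dates
}

# Usage: check_backend_cert backend.internal 8443
```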
### 8. Monitor load balancer health
CloudWatch alarms:
```bash
# Alarm for unhealthy targets (fires when more than one target is unhealthy)
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-UnhealthyHosts" \
  --alarm-description "More than one target unhealthy for 2 minutes" \
  --metric-name UnHealthyHostCount \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/1234567890 Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --alarm-actions arn:aws:sns:region:account:alerts-topic

# Alarm for ELB 5xx errors
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-503Errors" \
  --alarm-description "More than 100 ELB 5xx responses per minute" \
  --metric-name HTTPCode_ELB_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --alarm-actions arn:aws:sns:region:account:alerts-topic

# Alarm for high latency
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-HighLatency" \
  --alarm-description "Target response time above 1s" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --alarm-actions arn:aws:sns:region:account:alerts-topic
```
Grafana alerting rules:

```yaml
# Grafana alerting rules for load balancer health
groups:
  - name: load_balancer
    rules:
      - alert: UnhealthyTargets
        expr: aws_alb_unhealthy_host_count > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Load balancer has unhealthy targets"

      - alert: High503Rate
        expr: |
          sum(rate(aws_alb_http_5xx_count[5m]))
            / sum(rate(aws_alb_request_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Load balancer 503 error rate above 5%"

      - alert: HighTargetLatency
        expr: aws_alb_target_response_time_avg > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Target response time above 1 second"
```
## Prevention
- Configure health check timeouts based on p99 response times
- Implement separate liveness and readiness endpoints
- Set appropriate deregistration delay for request duration
- Use connection draining during deployments
- Monitor unhealthy host count with alerts
- Test health check endpoint under load before production
- Document health check requirements for each service
- Use TCP health checks for simple availability when HTTP is unreliable
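The load-testing bullet above needs no special tooling; a crude xargs-based sketch (URL, request count, and concurrency are placeholders) shows the status-code distribution of the health endpoint under parallel load:

```shell
#!/usr/bin/env bash
# Hit a health endpoint with concurrent requests and summarize status codes.
# Non-200s or failures ("000") under load predict health-check flapping.
# URL, request count, and concurrency are placeholders.
load_test_health() {
  local url=${1:-http://localhost:8080/health}
  local requests=${2:-200} concurrency=${3:-20}
  seq "$requests" | xargs -P "$concurrency" -I{} \
    curl -s -o /dev/null -m 5 -w "%{http_code}\n" "$url" \
    | sort | uniq -c | sort -rn
}

# Usage: load_test_health http://localhost:8080/health 500 50
```

Run this while the application is also serving normal traffic: a health endpoint that shares the request thread pool will start failing exactly when the app is busiest.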
## Related Errors
- **502 Bad Gateway**: Backend connection failed or refused
- **504 Gateway Timeout**: Backend response timeout
- **503 Service Unavailable**: No healthy backends available
- **Connection Refused**: Backend not listening on health check port
- **SSL Certificate Error**: HTTPS health check certificate validation failed