Introduction

AWS Application Load Balancer (ALB) target group health check failures occur when backend targets (EC2 instances, ECS tasks, IP addresses, or Lambda functions) fail to respond correctly to health check probes, causing the ALB to mark them unhealthy and stop routing traffic to them. When no target is left to serve traffic, clients typically receive 503 Service Unavailable. Health checks validate target availability by sending periodic HTTP or HTTPS requests to a configured path and port; ALB target groups do not support TCP health checks.

Common causes include a health check path that returns a status code outside the configured success codes, a health check timeout shorter than the application's response time, a security group or network ACL that blocks health check traffic from the ALB, a target application that has not started or has crashed, a wrong health check port, a target that failed to register with the target group, a deregistration delay too short to drain connections during deployments, and targets too overloaded to answer health checks in time.

Fixing these failures requires understanding how the ALB evaluates health checks, how targets are registered, how security groups are configured, and which debugging tools are available. This guide covers troubleshooting ALB health check issues across EC2, ECS, IP, and Lambda targets.

Symptoms

  • ALB returns 503 Service Unavailable to all requests
  • Target group shows Status: unhealthy or Status: unused
  • Target health details show the reason code Target.FailedHealthChecks
  • AWS Console shows the description "Health checks failed" for the target
  • Targets show Initial status and never become Healthy
  • Intermittent 503s as targets cycle between healthy/unhealthy
  • Deployment causes all targets to go unhealthy
  • Target.ResponseCodeMismatch reason code in the describe-target-health output
  • Target.Timeout reason code when health check requests time out
  • Target deregistration takes too long or fails

Common Causes

  • Health check path /health returns 500 instead of 200
  • Application takes longer than timeout to respond
  • Security group missing inbound rule for ALB
  • NACL blocking traffic on health check port
  • Target application not listening on health check port
  • Health check protocol mismatch (HTTP vs HTTPS)
  • Missing IAM or resource permissions (for example, a Lambda target that Elastic Load Balancing is not allowed to invoke)
  • Container health check command failing (ECS)
  • Lambda function timeout or error
  • Deregistration delay too short for connection draining

Step-by-Step Fix

### 1. Diagnose target health status

Check target health via Console:

```
# AWS Console > EC2 > Target Groups
# Select target group > Targets tab

# Status meanings:
# - Initial: Target is registering or initial health checks are in progress
# - Healthy: Target passed health checks
# - Unhealthy: Target failed health checks
# - Unused: Target is not registered, the target group is not attached to a
#   listener, the target's Availability Zone is not enabled, or the instance
#   is stopped or terminated
# - Draining: Target is being deregistered

# Click on unhealthy target to see:
# - Reason: Target.FailedHealthChecks
# - Description: Health checks failed
# - Port: Port the target is registered on
# - Health Check Port: Configured health check port
```
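
The reason code is usually the fastest pointer to the next step. The sketch below maps reason codes reported by `describe-target-health` to a first troubleshooting hint. The function name `reason_hint` and the hint wording are illustrative assumptions, not official AWS guidance.

```bash
# Map a target health reason code to a first troubleshooting hint.
# Hint texts are illustrative; adjust them to your own runbook.
reason_hint() {
  case "$1" in
    Elb.RegistrationInProgress|Elb.InitialHealthChecking)
      echo "wait: registration or initial health checks still in progress" ;;
    Target.ResponseCodeMismatch)
      echo "check the health check path and the configured success codes" ;;
    Target.Timeout)
      echo "check security groups, NACLs, and application response time" ;;
    Target.FailedHealthChecks)
      echo "check that the application is running and listening on the health check port" ;;
    Target.NotRegistered|Target.NotInUse)
      echo "register the target and attach the target group to a listener" ;;
    Target.InvalidState)
      echo "check that the instance is running, not stopped or terminated" ;;
    Target.DeregistrationInProgress)
      echo "wait: target is draining" ;;
    *)
      echo "unknown reason: read the Description field of describe-target-health" ;;
  esac
}

reason_hint "Target.Timeout"
```

The hint for each code is only a starting point; the steps that follow cover each area in detail.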

Check target health via CLI:

```bash
# Describe target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456

# Output:
# {
#   "TargetHealthDescriptions": [
#     {
#       "Target": {
#         "Id": "i-0abc123def456",
#         "Port": 8080
#       },
#       "HealthCheckPort": "8080",
#       "TargetHealth": {
#         "State": "unhealthy",
#         "Reason": "Target.FailedHealthChecks",
#         "Description": "Health checks failed"
#       }
#     }
#   ]
# }

# Check target group configuration
aws elbv2 describe-target-groups \
  --names my-target-group \
  --query 'TargetGroups[0].{
    Protocol:Protocol,
    Port:Port,
    HealthCheckProtocol:HealthCheckProtocol,
    HealthCheckPath:HealthCheckPath,
    HealthCheckPort:HealthCheckPort,
    HealthCheckInterval:HealthCheckIntervalSeconds,
    HealthCheckTimeout:HealthCheckTimeoutSeconds,
    HealthyThreshold:HealthyThresholdCount,
    UnhealthyThreshold:UnhealthyThresholdCount,
    Matcher:Matcher
  }'

# Output:
# {
#   "Protocol": "HTTP",
#   "Port": 80,
#   "HealthCheckProtocol": "HTTP",
#   "HealthCheckPath": "/health",
#   "HealthCheckPort": "8080",
#   "HealthCheckInterval": 30,
#   "HealthCheckTimeout": 5,
#   "HealthyThreshold": 2,
#   "UnhealthyThreshold": 3,
#   "Matcher": {"HttpCode": "200-399"}
# }
```
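
The interval and the two thresholds together determine how long a state change takes. As a rough sketch (it ignores the timeout and the exact timing of the first probe), the delay is about the interval multiplied by the number of consecutive checks required. The helper name `seconds_to_flip` is hypothetical.

```bash
# Approximate seconds before a target changes state:
# interval between checks x consecutive checks required.
seconds_to_flip() {
  echo $(( $1 * $2 ))
}

# Values from the configuration output above: interval 30s, thresholds 3 and 2.
echo "to unhealthy: $(seconds_to_flip 30 3)s"
echo "to healthy:   $(seconds_to_flip 30 2)s"
```

With these settings a failing target keeps receiving traffic for roughly 90 seconds, and a recovered target waits roughly 60 seconds before it is used again.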

Check CloudWatch metrics:

```bash
# The TargetGroup and LoadBalancer dimensions take the ARN suffix
# (targetgroup/... and app/...), not the full ARN.
# date -d is GNU date syntax (Linux).

# Unhealthy host count
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name UnHealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abc123def456 \
               Name=LoadBalancer,Value=app/my-alb/abc123 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average

# Request count per target (Sum is the meaningful statistic for this metric)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCountPerTarget \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abc123def456 \
               Name=LoadBalancer,Value=app/my-alb/abc123 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum

# Target response time
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abc123def456 \
               Name=LoadBalancer,Value=app/my-alb/abc123 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average
```

### 2. Fix health check configuration

Update health check settings:

```bash
# Standard web application settings
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --health-check-protocol HTTP \
  --health-check-port 8080 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200-399

# Settings explained:
# - Protocol: HTTP or HTTPS (ALB target groups do not support TCP health checks)
# - Port: traffic-port (the port the target is registered on) or a specific port (8080)
# - Path: Health check endpoint (/health)
# - Interval: Seconds between health checks (30)
# - Timeout: Seconds to wait for response (10)
# - Healthy threshold: Consecutive successes to mark healthy (2)
# - Unhealthy threshold: Consecutive failures to mark unhealthy (3)
# - Matcher: HTTP codes considered healthy (200-399)

# For slow-starting applications: allow a longer timeout and more failures.
# The timeout must stay smaller than the interval.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 15 \
  --unhealthy-threshold-count 5
```
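
Before applying new settings, it can help to sanity-check them locally. The sketch below assumes that Elastic Load Balancing rejects a health check timeout that is not smaller than the interval; verify that rule against the current documentation for your target type. The function name `validate_health_check` is hypothetical.

```bash
# validate_health_check INTERVAL_SECONDS TIMEOUT_SECONDS
# Assumption: the timeout must be strictly smaller than the interval.
validate_health_check() {
  if [ "$2" -ge "$1" ]; then
    echo "invalid: timeout (${2}s) must be smaller than interval (${1}s)"
    return 1
  fi
  echo "ok"
}

validate_health_check 30 10
validate_health_check 10 15 || true   # reports the invalid combination
```

Running the check in a deployment script catches an invalid combination before the `modify-target-group` call fails.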

Health check path requirements:

```bash
# Health check endpoint should:
# - Return 200 OK when application is healthy
# - Return 5xx when application is unhealthy
# - Respond quickly (faster than the configured timeout)
# - Not depend on external services (database, cache)
# - Be lightweight (no heavy processing)

# Example Spring Boot health endpoint
# GET /health
# HTTP 200 OK
# {
#   "status": "UP",
#   "components": {
#     "diskSpace": {"status": "UP"},
#     "ping": {"status": "UP"}
#   }
# }

# Example Node.js (Express) health endpoint, shown as a comment because it is JavaScript:
#   app.get('/health', (req, res) => {
#     res.status(200).json({ status: 'healthy' });
#   });

# TCP health checks (layer 4) are not available on ALB target groups
# (protocol HTTP/HTTPS). They apply to Network Load Balancer target groups:
aws elbv2 modify-target-group \
  --target-group-arn <nlb-target-group-arn> \
  --health-check-protocol TCP \
  --health-check-port 8080
```
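
Whether a response counts as healthy depends on the configured success codes, which can be a single value, a comma-separated list, or a range. The sketch below mimics that comparison locally so a status code returned by `curl` can be checked against a matcher string. The function name `code_matches` is hypothetical, and the logic is an approximation of the matcher, not the ALB's implementation.

```bash
# code_matches STATUS_CODE MATCHER
# MATCHER examples: "200", "200,202", "200-399".
# Returns 0 when the status code matches, 1 otherwise.
code_matches() {
  code=$1
  old_ifs=$IFS
  IFS=','
  set -- $2            # split the matcher on commas
  IFS=$old_ifs
  for part in "$@"; do
    case $part in
      *-*)
        lo=${part%-*}
        hi=${part#*-}
        [ "$code" -ge "$lo" ] && [ "$code" -le "$hi" ] && return 0 ;;
      *)
        [ "$code" -eq "$part" ] && return 0 ;;
    esac
  done
  return 1
}

code_matches 301 "200-399" && echo "301 counts as healthy"
code_matches 500 "200-399" || echo "500 counts as unhealthy"
```

A typical use is to feed it the output of `curl -s -o /dev/null -w '%{http_code}'` run from a host that can reach the target on the health check port.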

### 3. Fix security group rules

Configure security group for ALB:

```bash
# ALB security group (allows internet traffic)
aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0

# Target security group (allows ALB traffic)
# CRITICAL: Health checks come from the ALB, not from the internet,
# so reference the ALB security group as the source
aws ec2 authorize-security-group-ingress \
  --group-id sg-target456 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb123

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids sg-target456 \
  --query 'SecurityGroups[0].IpPermissions'

# Output should include:
# {
#   "IpProtocol": "tcp",
#   "FromPort": 8080,
#   "ToPort": 8080,
#   "UserIdGroupPairs": [
#     {
#       "GroupId": "sg-alb123"
#     }
#   ]
# }
```

Check NACL rules:

```bash
# Describe NACL for subnet
aws ec2 describe-network-acls \
  --filters "Name=association.subnet-id,Values=subnet-abc123"

# NACLs are stateless, so both directions need rules.
# Target subnet NACL:
# - Inbound: health check port (8080) from the ALB subnets
# - Outbound: ephemeral ports (1024-65535) to the ALB subnets for responses
# ALB subnet NACL:
# - Outbound: health check port (8080) to the target subnets
# - Inbound: ephemeral ports (1024-65535) from the target subnets

# If the ALB subnet NACL blocks outbound health checks, add a rule:
aws ec2 create-network-acl-entry \
  --network-acl-id acl-123 \
  --egress \
  --rule-number 100 \
  --protocol tcp \
  --port-range From=8080,To=8080 \
  --cidr-block 10.0.0.0/16 \
  --rule-action allow
```

### 4. Fix EC2 target issues

EC2 target registration:

```bash
# Register EC2 instance with target group
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --targets Id=i-0abc123def456,Port=8080

# Verify registration
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --targets Id=i-0abc123def456,Port=8080

# Check instance state
aws ec2 describe-instances \
  --instance-ids i-0abc123def456 \
  --query 'Reservations[0].Instances[0].{State:State.Name,Status:StateReason}'

# Verify application is running (requires the SSM agent on the instance)
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=instanceIds,Values=i-0abc123def456" \
  --parameters 'commands=["systemctl status myapp"]'

# Check application listening on port
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=instanceIds,Values=i-0abc123def456" \
  --parameters 'commands=["ss -tlnp | grep 8080"]'
```

EC2 user data for automatic registration:

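```bash
#!/bin/bash
# EC2 User Data script
# The instance profile must allow elasticloadbalancing:RegisterTargets

# Install application
yum install -y myapp

# Start application
systemctl start myapp
systemctl enable myapp

# Register with ALB (IMDSv2: request a session token first)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')

aws elbv2 register-targets \
  --region "$REGION" \
  --target-group-arn "arn:aws:elasticloadbalancing:$REGION:123456789012:targetgroup/my-tg/abc123" \
  --targets "Id=$INSTANCE_ID,Port=8080"

# Verify health
sleep 30
curl -f http://localhost:8080/health || systemctl restart myapp
```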
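
A single fixed `sleep 30` before the local check can restart an application that is merely slow to boot. One alternative is to poll the endpoint a bounded number of times. The sketch below shows a generic retry helper; the name `retry_until` and the attempt counts are illustrative assumptions.

```bash
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Offline demonstration with a command that always succeeds:
retry_until 3 0 true && echo "healthy"

# In user data this could replace the fixed sleep, for example:
# retry_until 12 5 curl -fsS http://localhost:8080/health || systemctl restart myapp
```

With 12 attempts and a 5 second delay the application gets about a minute to come up before the restart is triggered.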

### 5. Fix ECS target issues

ECS task health check:

```json
{
  "family": "my-app",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-app:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}
```

ECS service with load balancer:

```bash
# Create service with load balancer
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-service \
  --task-definition my-app:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123,containerName=app,containerPort=8080 \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-target456],assignPublicIp=ENABLED}"

# Check service health (recent events mention targets that fail ALB health checks)
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].{Status:status,RunningCount:runningCount,PendingCount:pendingCount,RecentEvents:events[:3].message}'

# Check task health
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service

aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].{LastStatus:lastStatus,Health:containers[0].healthStatus}'
```

### 6. Fix deregistration issues

Configure deregistration delay:

```bash
# Update deregistration delay (connection draining)
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=300

# Common settings:
# - 300 seconds (5 min): Default, typical for HTTP requests
# - 60 seconds: Fast iteration, short requests
# - 600 seconds (10 min): Long-running requests, WebSocket

# For Lambda targets the deregistration delay does not apply;
# the Lambda-specific attribute is multi-value header support
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-lambda-tg/abc123def456 \
  --attributes Key=lambda.multi_value_headers.enabled,Value=true

# Deregister target manually
aws elbv2 deregister-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --targets Id=i-0abc123def456,Port=8080

# Check deregistration status
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123def456 \
  --targets Id=i-0abc123def456,Port=8080

# Output during draining:
# "TargetHealth": {
#   "State": "draining",
#   "Reason": "Target.DeregistrationInProgress",
#   "Description": "Target deregistration in progress"
# }
```

### 7. Debug with ALB access logs

Enable access logging:

```bash
# Enable ALB access logs (the bucket policy must allow Elastic Load Balancing to write)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=access_logs.s3.enabled,Value=true \
               Key=access_logs.s3.bucket,Value=my-alb-logs \
               Key=access_logs.s3.prefix,Value=alb-logs

# Logs stored in S3:
# s3://my-alb-logs/alb-logs/AWSLogs/<account-id>/elasticloadbalancing/<region>/<year>/<month>/<day>/
# Note: access logs record client requests only; health check requests are not logged.
```

Analyze access logs:

```bash
# Download logs
aws s3 cp s3://my-alb-logs/alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2026/04/01/ . --recursive

# Parse logs (gzip compressed)
zcat *.gz | head -20

# ALB log format (space-separated; awk field numbers in parentheses):
# type(1) time(2) elb(3) client:port(4) target:port(5)
# request_processing_time(6) target_processing_time(7) response_processing_time(8)
# elb_status_code(9) target_status_code(10) received_bytes(11) sent_bytes(12)
# "request" "user_agent" ssl_cipher ssl_protocol target_group_arn "trace_id"
# "domain_name" "chosen_cert_arn" matched_rule_priority request_creation_time
# "actions_executed" "redirect_url" "error_reason" ...

# Filter 503 errors returned by the load balancer
zcat *.gz | awk '$9 == "503"' | head -20

# Check target response codes
zcat *.gz | awk '{print $10}' | sort | uniq -c | sort -rn

# Check for slow requests
zcat *.gz | awk '$6 > 10' | head -20   # Request processing > 10s
zcat *.gz | awk '$7 > 10' | head -20   # Target processing > 10s
```
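
The filters above can be combined into one summary pass. The sketch below counts load balancer status codes and slow target responses from log lines on stdin, assuming the field positions of the ALB access log layout (field 9 is `elb_status_code`, field 7 is `target_processing_time`). The function names, the 1 second threshold, and the abbreviated sample lines are made up for illustration.

```bash
# Summarize ALB access log lines read from stdin:
# one line per elb_status_code with its count, plus a count of
# requests whose target_processing_time exceeded 1 second.
summarize_alb_log() {
  awk '{ count[$9]++ }
       $7 > 1 { slow++ }
       END {
         for (c in count) print c, count[c]
         print "slow", slow + 0
       }' | sort
}

# Abbreviated, fabricated sample lines (real entries carry more trailing fields).
sample_log() {
  cat <<'EOF'
http 2026-04-01T12:00:00.000000Z app/my-alb/abc123 203.0.113.10:40000 10.0.1.5:8080 0.001 0.020 0.000 200 200 120 512 "GET http://example.com:80/ HTTP/1.1"
http 2026-04-01T12:00:01.000000Z app/my-alb/abc123 203.0.113.10:40001 10.0.1.5:8080 0.001 2.500 0.000 200 200 120 512 "GET http://example.com:80/slow HTTP/1.1"
http 2026-04-01T12:00:02.000000Z app/my-alb/abc123 203.0.113.10:40002 - -1 -1 -1 503 - 0 0 "GET http://example.com:80/ HTTP/1.1"
EOF
}

sample_log | summarize_alb_log
```

Against real logs the same function can be used as `zcat *.gz | summarize_alb_log`. Requests that never reached a target show `-1` for the processing times, so they are not counted as slow.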

### 8. Monitor and alert on health check failures

CloudWatch alarms:

```bash
# The TargetGroup and LoadBalancer dimensions take the ARN suffix, not the full ARN.

# Alarm for unhealthy targets
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-UnhealthyHosts" \
  --alarm-description "One or more targets unhealthy" \
  --metric-name UnHealthyHostCount \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abc123 \
               Name=LoadBalancer,Value=app/my-alb/abc123 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts-topic

# Alarm for 503 errors generated by the load balancer
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-503Errors" \
  --alarm-description "503 error rate high" \
  --metric-name HTTPCode_ELB_503_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts-topic

# Alarm for high latency
aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-HighLatency" \
  --alarm-description "Target response time above 1s" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abc123 \
               Name=LoadBalancer,Value=app/my-alb/abc123 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts-topic
```

Prevention

  • Configure health check timeouts based on p99 response times
  • Use separate health check endpoint that doesn't depend on external services
  • Set appropriate deregistration delay for request duration
  • Monitor unhealthy host count with CloudWatch alarms
  • Test health check endpoint under load before production
  • Document health check requirements for each service
  • Keep the health check endpoint minimal; ALB target groups support only HTTP and HTTPS health checks (TCP health checks require a Network Load Balancer)
  • Implement connection draining for graceful deployments

Related Errors

  • **503 Service Unavailable**: No healthy targets available
  • **502 Bad Gateway**: Target connection failed
  • **504 Gateway Timeout**: Target response timeout
  • **Target.ResponseCodeMismatch**: Health check returned a status code outside the configured success codes
  • **Target.Timeout**: Health check request timed out