## Introduction
AWS Application Load Balancer (ALB) 503 Service Unavailable errors occur when the ALB cannot route requests to healthy targets. The error typically indicates that the target group has no registered or usable targets, that all targets are failing health checks, that connection attempts are failing, or that the ALB itself is misconfigured. Unlike 502 Bad Gateway (the target responded, but with an invalid or malformed response), a 503 is generated by the load balancer itself because no target returned a valid response. Common causes include health check failures, security groups blocking traffic, target application crashes, and ALB listener or routing misconfiguration.
## Symptoms
- ALB returns HTTP 503 Service Unavailable to all requests
- CloudWatch metric `HTTPCode_ELB_503_Count` spikes
- Target group shows all targets as `unhealthy` or `unused`
- ALB access logs show `target_status_code: -` (no target response)
- Issue appears after deploy, security group change, Auto Scaling event, or region outage
- Health check endpoint returns 200 when accessed directly but targets still marked unhealthy
## Common Causes
- All targets marked unhealthy due to health check failures
- Security group blocking ALB health check traffic
- Target group port doesn't match application listening port
- Application not started or crashed on target instances
- Health check path incorrect or requires authentication
- Target registration pending or deregistration in progress
- ALB subnet or routing misconfiguration
## Step-by-Step Fix
### 1. Check target health status
Verify target group health:
```bash
# Check target health via AWS CLI
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/TG-NAME/ID

# Output shows each target's state:
# {
#   "TargetHealthDescriptions": [
#     {
#       "Target": {"Id": "i-0abc123", "Port": 80},
#       "HealthCheckPort": "80",
#       "TargetHealth": {
#         "State": "unhealthy",
#         "Reason": "Target.FailedHealthChecks",
#         "Description": "Health checks failed"
#       }
#     }
#   ]
# }

# Check via console
# EC2 > Target Groups > [select your TG] > Targets tab
# Look for "unhealthy" status and reason
```
Target health states:
- initial: Health checks in progress
- healthy: Passing health checks
- unhealthy: Failing health checks
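- unused: Target is not registered, its Availability Zone is not enabled for the ALB, or the target group is not used by any listener rule
- draining: Target is deregistering and connection draining is in progress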
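Each state points to a different next step. As a rough sketch, a small shell helper can map the `State` value from `describe-target-health` to where to look next. The function name `next_step_for_state` and the advice strings are illustrative assumptions that summarize this guide; they are not AWS tooling or AWS output.

```bash
# Hypothetical helper: map a target health state to a suggested next check.
# The advice strings summarize this guide; they are not produced by AWS.
next_step_for_state() {
  case "$1" in
    initial)   echo "wait: health checks still in progress" ;;
    healthy)   echo "ok: check listener rules and routing instead" ;;
    unhealthy) echo "check: health check config, security groups, app status" ;;
    unused)    echo "check: target registration, enabled AZs, listener rules" ;;
    draining)  echo "wait: deregistration in progress" ;;
    *)         echo "unknown state: $1" ;;
  esac
}

next_step_for_state unhealthy
```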
### 2. Check health check configuration
Verify health check settings match application:
```bash
# Get target group health check config
aws elbv2 describe-target-groups \
  --names <target-group-name> \
  --query 'TargetGroups[0].{
    Protocol:Protocol,
    Port:Port,
    HealthCheckProtocol:HealthCheckProtocol,
    HealthCheckPort:HealthCheckPort,
    HealthCheckPath:HealthCheckPath,
    HealthCheckIntervalSeconds:HealthCheckIntervalSeconds,
    HealthCheckTimeoutSeconds:HealthCheckTimeoutSeconds,
    HealthyThresholdCount:HealthyThresholdCount,
    UnhealthyThresholdCount:UnhealthyThresholdCount,
    Matcher:Matcher
  }'

# Expected output:
# {
#   "Protocol": "HTTP",
#   "Port": 80,
#   "HealthCheckProtocol": "HTTP",
#   "HealthCheckPort": "traffic-port",
#   "HealthCheckPath": "/health",
#   "HealthCheckIntervalSeconds": 30,
#   "HealthCheckTimeoutSeconds": 5,
#   "HealthyThresholdCount": 2,
#   "UnhealthyThresholdCount": 3,
#   "Matcher": {"HttpCode": "200"}
# }
```
Common misconfigurations:
- HealthCheckPath points to non-existent endpoint
- Matcher.HttpCode expects 200 but app returns 204 or 301
- HealthCheckPort doesn't match application port
- HealthCheckTimeoutSeconds too short for app response time
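The matcher mismatch is easy to reason about offline. The sketch below assumes the matcher is a single code, a comma-separated list, or a range (for example `200`, `200,204`, or `200-299`). The helper name `code_matches` is hypothetical and only mimics the comparison for illustration.

```bash
# Hypothetical helper: does an HTTP status code satisfy a matcher string
# such as "200", "200,204", or "200-299"?
code_matches() {
  code=$1
  # Split the matcher on commas and test each entry in turn.
  for part in $(printf '%s' "$2" | tr ',' ' '); do
    case "$part" in
      *-*)
        low=${part%-*}    # text before the dash
        high=${part#*-}   # text after the dash
        if [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]; then
          return 0
        fi
        ;;
      *)
        if [ "$code" -eq "$part" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# A 301 redirect does not satisfy a matcher of "200" ...
code_matches 301 "200" || echo "301 fails matcher 200"
# ... but a 204 satisfies "200-299".
code_matches 204 "200-299" && echo "204 passes matcher 200-299"
```

If the application legitimately answers the health check path with a redirect or `204`, either widen the matcher or point the health check at an endpoint that returns `200`.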
Update health check settings:
```bash
# Update health check path
aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-path /healthz \
  --health-check-port 8080 \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200-299
```
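These settings determine how quickly a target changes state. A back-of-the-envelope calculation, assuming the values from the `modify-target-group` command above and treating the delay as roughly threshold times interval (actual timing depends on when individual checks land):

```bash
# Values taken from the modify-target-group example above.
interval=30            # --health-check-interval-seconds
timeout=5              # --health-check-timeout-seconds
healthy_threshold=2    # --healthy-threshold-count
unhealthy_threshold=3  # --unhealthy-threshold-count

# Approximate detection and recovery delays (consecutive checks x interval).
time_to_unhealthy=$((interval * unhealthy_threshold))
time_to_healthy=$((interval * healthy_threshold))

echo "approx. seconds until a failing target is marked unhealthy: $time_to_unhealthy"
echo "approx. seconds until a recovered target is marked healthy: $time_to_healthy"

# Sanity check: the timeout should be shorter than the interval.
if [ "$timeout" -ge "$interval" ]; then
  echo "warning: timeout ($timeout s) is not shorter than interval ($interval s)"
fi
```

With these numbers a newly fixed target needs about a minute of passing checks before it receives traffic again, which is worth remembering when a 503 persists briefly after a fix.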
### 3. Verify security group allows ALB traffic
Security groups must allow health check traffic:
```bash
# Get ALB security group
aws elbv2 describe-load-balancers \
  --names <alb-name> \
  --query 'LoadBalancers[0].SecurityGroups'

# Get target instance security group
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].[InstanceId,SecurityGroups[*].GroupId]'

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids <target-sg-id> \
  --query 'SecurityGroups[0].IpPermissions'

# Expected: Inbound rule allowing ALB security group or CIDR
# {
#   "IpProtocol": "tcp",
#   "FromPort": 8080,
#   "ToPort": 8080,
#   "UserIdGroupPairs": [
#     {"GroupId": "sg-alb-id"}
#   ]
# }
```
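Required security group rules. Security groups are stateful, so health check responses are allowed back automatically and the targets need no outbound rule for them. The ALB's own security group, however, must allow outbound traffic to the targets on the health check port:
```
Inbound Rules (target security group):
  - Type: Custom TCP
  - Port: <health-check-port> (e.g., 8080)
  - Source: <ALB-security-group-id> OR <VPC-CIDR>

Outbound Rules (ALB security group):
  - Type: Custom TCP (or All traffic)
  - Port: <health-check-port>
  - Destination: <target-security-group-id> OR <VPC-CIDR>
```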
Update security group:
```bash
# Add inbound rule for ALB
aws ec2 authorize-security-group-ingress \
  --group-id <target-sg-id> \
  --protocol tcp \
  --port 8080 \
  --source-group <alb-sg-id>
```
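When an existing rule uses a port range rather than a single port, the health check port has to fall between the rule's `FromPort` and `ToPort`. A minimal sketch of that comparison, with a hypothetical helper name (`port_allowed`) and made-up port numbers:

```bash
# Hypothetical helper: is a port covered by a rule's FromPort..ToPort range?
# Arguments: port, FromPort, ToPort (all integers).
port_allowed() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Example values only: a rule opening 8000-8100 covers a health check on 8080,
# while a rule opening only port 80 does not.
port_allowed 8080 8000 8100 && echo "8080 is covered by 8000-8100"
port_allowed 8080 80 80 || echo "8080 is NOT covered by 80-80"
```

This only checks the port; the rule's source (ALB security group or CIDR) still has to match as described above.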
### 4. Check ALB health check source addresses
ALB health checks do not come from a published list of AWS IP ranges. They originate from the load balancer nodes, which use private IP addresses in the subnets enabled for the ALB:
```
# ALB health check sources
# - Private IP addresses of the ALB nodes in each enabled subnet
# - These addresses can change as the ALB scales

# Health check documentation:
# https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html

# If the target security group does not use the ALB SG as its source,
# it must allow the CIDR ranges of the ALB subnets
```
For Network Load Balancer (NLB) targets:
- Health checks come from the NLB's private IP addresses
- Allow traffic from the NLB subnet CIDRs
### 5. Test health check endpoint directly
Verify application responds correctly:
```bash
# SSH to target instance
ssh ec2-user@<instance-ip>

# Test health endpoint locally
# (-I sends a HEAD request; ALB health checks use GET, so drop -I if the app treats them differently)
curl -I http://localhost:8080/health

# Test from instance private IP
curl -I http://<instance-private-ip>:8080/health

# Expected response:
# HTTP/1.1 200 OK
# Content-Type: application/json
# Connection: keep-alive

# Check response time (must be < timeout)
time curl -s http://localhost:8080/health

# Check application is listening on correct interface
netstat -tlnp | grep :8080

# Expected: 0.0.0.0:8080 or :::8080
# Problematic: 127.0.0.1:8080 (localhost only)
```
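The listening-interface check can be turned into a quick classification. The sketch below takes the local address column as printed by `netstat` or `ss` and reports whether the socket is reachable from the ALB. The helper name `classify_listen_addr` and the sample addresses are assumptions for illustration.

```bash
# Hypothetical helper: classify the local address column from netstat/ss.
classify_listen_addr() {
  case "$1" in
    # Loopback bindings: only reachable from the instance itself.
    127.*|"[::1]":*|::1:*)         echo "loopback-only" ;;
    # Wildcard bindings: reachable on every interface.
    0.0.0.0:*|"[::]":*|:::*|"*":*) echo "all-interfaces" ;;
    # Anything else is bound to one specific address.
    *)                             echo "specific-interface" ;;
  esac
}

classify_listen_addr "127.0.0.1:8080"   # the problematic case from above
classify_listen_addr "0.0.0.0:8080"
```

A `loopback-only` result explains the symptom where `curl` against `localhost` succeeds on the instance while the ALB still marks the target unhealthy.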
### 6. Check application status on targets
Verify application is running:
```bash
# Check via SSM Session Manager (no SSH needed)
aws ssm start-session --target <instance-id>

# Check application process
sudo systemctl status myapp

# Check application logs
sudo journalctl -u myapp -n 50 --no-pager

# Check listening port
sudo ss -tlnp | grep :8080

# Check for recent crashes
sudo journalctl -u myapp --since "1 hour ago" | grep -iE "error|crash|fatal"
```
For containerized applications (ECS):
```bash
# Check ECS task status
aws ecs describe-tasks \
  --cluster <cluster-name> \
  --tasks <task-arn> \
  --query 'tasks[0].{status:lastStatus,containers:containers}'

# Check container health (reported in the healthStatus field)
aws ecs describe-tasks \
  --cluster <cluster-name> \
  --tasks <task-arn> \
  --query 'tasks[0].containers[0].healthStatus'
```
### 7. Check ALB listener and routing rules
Verify ALB is configured correctly:
```bash
# Get listener configuration
aws elbv2 describe-listeners \
  --load-balancer-arn <alb-arn>

# Expected:
# {
#   "Protocol": "HTTP",
#   "Port": 80,
#   "DefaultActions": [
#     {
#       "Type": "forward",
#       "TargetGroupArn": "arn:aws:..."
#     }
#   ]
# }

# Get routing rules
aws elbv2 describe-rules \
  --listener-arn <listener-arn>
```
Common listener issues:
- Default action points to wrong target group
- Listener port doesn't match expected (80 vs 443)
- HTTPS listener with invalid/expired certificate
- Rule conditions too restrictive, no targets match
### 8. Check Auto Scaling health status
Auto Scaling may mark instances unhealthy:
```bash
# Get Auto Scaling group instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <asg-name> \
  --query 'AutoScalingGroups[0].Instances[]'

# Check instance health status
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <asg-name> \
  --query 'AutoScalingGroups[0].Instances[*].{
    InstanceId:InstanceId,
    HealthStatus:HealthStatus,
    LifecycleState:LifecycleState
  }'

# HealthStatus can be:
# - Healthy: ELB and ASG health checks pass
# - Unhealthy: ELB or ASG health check failed
```
If instances marked unhealthy:
```bash
# Check ASG health check settings
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <asg-name> \
  --query 'AutoScalingGroups[0].{
    HealthCheckType:HealthCheckType,
    HealthCheckGracePeriod:HealthCheckGracePeriod
  }'

# HealthCheckType:
# - EC2: Only EC2 status checks
# - ELB: Uses ALB health checks (recommended)
```
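When `HealthCheckType` is `ELB`, a `HealthCheckGracePeriod` that is too short can cause new instances to be replaced before they ever pass a health check. One way to estimate a floor for the grace period is sketched below. The formula is a rule of thumb assumed for this example, not an AWS-documented calculation, and the startup time is a made-up figure that should be measured for the real application.

```bash
# Assumed inputs: measure the startup time for your own application.
startup_seconds=45     # hypothetical time for the app to begin serving
interval=30            # HealthCheckIntervalSeconds
healthy_threshold=2    # HealthyThresholdCount

# Rule of thumb (an assumption, not an AWS formula): allow the app to start,
# then pass enough consecutive health checks to be marked healthy.
min_grace_period=$((startup_seconds + interval * healthy_threshold))
echo "suggested minimum HealthCheckGracePeriod: ${min_grace_period}s"
```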
### 9. Check ALB CloudWatch metrics
Monitor ALB health with metrics:
```bash
# Note: date -d is GNU date syntax (Linux); adjust on macOS/BSD.
# Dimension values are ARN suffixes: app/<alb-name>/<id> and targetgroup/<tg-name>/<id>.

# Get 503 count
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_503_Count \
  --dimensions Name=LoadBalancer,Value=<alb-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum

# Get unhealthy host count (target group metrics also need the LoadBalancer dimension)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name UnHealthyHostCount \
  --dimensions Name=TargetGroup,Value=<tg-id> Name=LoadBalancer,Value=<alb-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average

# Get target response time
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=TargetGroup,Value=<tg-id> Name=LoadBalancer,Value=<alb-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average
```
Key metrics to alert on:
- HTTPCode_ELB_503_Count > 0: Service unavailable
- UnHealthyHostCount > 0: Targets failing health
- TargetResponseTime > threshold: Slow responses
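An alarm on `UnHealthyHostCount` could be created along the following lines. This is a sketch under stated assumptions: the wrapper name `unhealthy_host_alarm`, the alarm name, the evaluation settings, and the dimension values are all placeholders chosen for illustration. The wrapper prints the `aws cloudwatch put-metric-alarm` command by default and only executes it when `DRY_RUN=0`, so it can be reviewed first.

```bash
# Hypothetical wrapper: print the alarm command by default (DRY_RUN=1),
# run it only when DRY_RUN=0. Names and thresholds are example values.
unhealthy_host_alarm() {
  tg_dim=$1   # e.g. targetgroup/<tg-name>/<id>
  lb_dim=$2   # e.g. app/<alb-name>/<id>
  # Build the command as the function's positional parameters.
  set -- aws cloudwatch put-metric-alarm \
    --alarm-name alb-unhealthy-hosts \
    --namespace AWS/ApplicationELB \
    --metric-name UnHealthyHostCount \
    --dimensions "Name=TargetGroup,Value=$tg_dim" "Name=LoadBalancer,Value=$lb_dim" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Placeholder dimension values; substitute the real ARN suffixes.
unhealthy_host_alarm "targetgroup/my-tg/0123456789abcdef" "app/my-alb/0123456789abcdef"
```

An alarm action (for example an SNS topic passed with `--alarm-actions`) would normally be added before using this in production.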
### 10. Check ALB access logs
Enable and analyze access logs:
```bash
# Enable access logs (the S3 bucket policy must allow ELB log delivery)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <alb-arn> \
  --attributes Key=access_logs.s3.enabled,Value=true \
    Key=access_logs.s3.bucket,Value=<log-bucket-name>

# Download access logs
# Format: type time elb client:port target:port request_processing_time target_processing_time response_processing_time elb_status_code target_status_code ...
aws s3 cp s3://<log-bucket>/AWSLogs/<account-id>/elasticloadbalancing/<region>/<date>/ . --recursive

# Find 503 errors (elb_status_code is field 9)
zcat *.gz | awk '$9 == 503 {print}' | head -20

# Check target_status_code (field 10) for 503s
zcat *.gz | awk '$9 == 503 {print $10}' | sort | uniq -c

# target_status_code = - means no target responded
# target_status_code = 0 means connection failed before response
```
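The field positions can be tried out on sample data before running against real logs. ALB access log entries begin with a `type` field, so `elb_status_code` is the 9th whitespace-separated field and `target_status_code` the 10th. The log lines below are synthetic, truncated after `sent_bytes`, and invented purely for illustration; the function name `count_503_by_target_status` is likewise hypothetical.

```bash
# Synthetic sample lines (made up for illustration) in ALB field order:
# type time elb client:port target:port request_processing_time
# target_processing_time response_processing_time elb_status_code
# target_status_code received_bytes sent_bytes
sample_logs='http 2024-01-01T00:00:00.000000Z app/demo/abc 203.0.113.10:50000 - -1 -1 -1 503 - 0 0
http 2024-01-01T00:00:01.000000Z app/demo/abc 203.0.113.10:50001 10.0.1.5:8080 0.001 0.020 0.000 200 200 120 340
http 2024-01-01T00:00:02.000000Z app/demo/abc 203.0.113.11:50002 - -1 -1 -1 503 - 0 0'

# Count 503 responses grouped by target_status_code (field 10).
count_503_by_target_status() {
  awk '$9 == 503 { counts[$10]++ } END { for (c in counts) print counts[c], c }'
}

printf '%s\n' "$sample_logs" | count_503_by_target_status
```

For real logs, the same function can be fed with `zcat *.gz | count_503_by_target_status`.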
## Prevention
- Configure health check endpoints that return quickly (< 2 seconds)
- Set `HealthCheckTimeoutSeconds` to 3x p99 response time
- Use `HealthyThresholdCount: 2` and `UnhealthyThresholdCount: 3`
- Enable access logs for troubleshooting
- Set up CloudWatch alarms for `UnHealthyHostCount` and `503` errors
- Configure Auto Scaling health checks to use ELB
- Test health check behavior in staging before production
- Document expected startup time and set `HealthCheckGracePeriod`
## Related Errors
- **502 Bad Gateway**: Target responded with invalid/empty response
- **504 Gateway Timeout**: Target response exceeded timeout
- **Connection Refused**: Application not listening on the target port (a blocking security group usually causes a timeout instead)
- **Target.FailedHealthChecks**: Reason code reported for a target that is failing health checks