# Fix AWS ECS Service Unstable
Your ECS service keeps stopping tasks, deployments fail repeatedly, or tasks are in a constant cycle of starting and stopping. The service status shows "ACTIVE" but the running count never reaches the desired count. Meanwhile, your application is experiencing intermittent outages.
ECS service instability is often a cascade of issues: health checks failing, resource constraints, networking problems, or application errors. Let's methodically track down the root cause.
## Diagnosis Commands
First, check the service status:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,status,runningCount,desiredCount,deployments[0].status]' \
  --output table
```

Get detailed deployment info:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deployments[*].[id,status,taskDefinition,rolloutState,rolloutStateReason]'
```

List the tasks and their status:
```bash
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --query 'taskArns'
```

Check why tasks are stopping:
```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,lastStatus,stoppedReason,stopCode,containers[*].lastStatus]' \
  --output table
```

Get task definition details:
```bash
aws ecs describe-task-definition \
  --task-definition my-task-def:1 \
  --query 'taskDefinition.[family,revision,requiresCompatibilities,networkMode,cpu,memory]'
```

Check CloudWatch for task events:
```bash
# filter-log-events takes epoch milliseconds (hence the trailing 000)
aws logs filter-log-events \
  --log-group-name /aws/ecs/my-cluster \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].message'
```

Look for stopped task events:
```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,stoppedReason,stopCode,containers[*].exitCode]'
```

Check service events:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:10].[createdAt,message]'
```

## Common Causes and Solutions
### Health Check Failures
The most common cause of instability is failing health checks. Check the target group health:
```bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]'
```

If targets are unhealthy, check the health check configuration:
```bash
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetGroups[*].HealthCheckConfig'
```

Common health check issues:
Health check path returns wrong status:
Make sure your app returns 200 OK on the health check path:
```bash
# Test from a running task (requires ECS Exec to be enabled on the service)
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "curl -v http://localhost:8080/health" \
  --interactive

# Or check container logs
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs/my-container/$(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text | cut -d'/' -f3) \
  --limit 100
```
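When tuning the target group settings, it helps to know how long a failing target stays in rotation: the load balancer marks a target unhealthy only after `unhealthy-threshold-count` consecutive failed checks, spaced one `health-check-interval-seconds` apart. A quick sketch of that arithmetic (plain shell, no AWS calls):

```bash
# Approximate seconds before the load balancer marks a target unhealthy:
# one failed check per interval, until the unhealthy threshold is hit.
seconds_to_unhealthy() {
  local interval=$1 unhealthy_threshold=$2
  echo $(( interval * unhealthy_threshold ))
}

# And how long a recovered target takes to return to service:
seconds_to_healthy() {
  local interval=$1 healthy_threshold=$2
  echo $(( interval * healthy_threshold ))
}

# With a 30s interval, 3 unhealthy / 2 healthy thresholds:
seconds_to_unhealthy 30 3   # -> 90
seconds_to_healthy 30 2     # -> 60
```

A 90-second detection window means a bad task can receive traffic for a minute and a half before being drained, which is often the "intermittent outage" symptom.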
Update health check configuration:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```

Container health check (container-level):
Add a container health check in your task definition:
```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

Register a new task definition:
```bash
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 256 \
  --memory 512 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --task-role-arn arn:aws:iam::123456789012:role/ecsTaskRole
```

### Out of Memory Errors
Tasks killed by OOM often show exit code 137:
```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,containers[*].exitCode,containers[*].reason]'
```

Check if memory is the issue:
```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

Increase task memory:
```bash
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 512 \
  --memory 1024 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE

# Update service with new task definition
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```
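Note that Fargate only accepts specific CPU/memory pairings; an invalid pair is rejected when the task definition is used. A rough validity check for the common sizes (a sketch covering the 256-4096 CPU tiers only; larger tiers exist and are omitted here):

```bash
# Validate a Fargate cpu/memory pair (cpu in CPU units, memory in MiB).
# Covers the classic 256-4096 CPU tiers; 8192/16384 tiers are not handled.
valid_fargate_combo() {
  local cpu=$1 mem=$2
  case $cpu in
    256)  [ "$mem" = 512 ] || [ "$mem" = 1024 ] || [ "$mem" = 2048 ] ;;
    512)  [ "$mem" -ge 1024 ]  && [ "$mem" -le 4096 ]  && [ $(( mem % 1024 )) -eq 0 ] ;;
    1024) [ "$mem" -ge 2048 ]  && [ "$mem" -le 8192 ]  && [ $(( mem % 1024 )) -eq 0 ] ;;
    2048) [ "$mem" -ge 4096 ]  && [ "$mem" -le 16384 ] && [ $(( mem % 1024 )) -eq 0 ] ;;
    4096) [ "$mem" -ge 8192 ]  && [ "$mem" -le 30720 ] && [ $(( mem % 1024 )) -eq 0 ] ;;
    *)    return 1 ;;
  esac
}

valid_fargate_combo 512 1024 && echo "ok"       # -> ok (the sizes used above)
valid_fargate_combo 256 4096 || echo "invalid"  # -> invalid (too much memory for 256 CPU)
```

Running this check before registering a new revision avoids a failed deployment just to discover an unsupported size.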
### Insufficient CPU
High CPU utilization can cause timeouts and crashes:
```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name CpuUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

If CPU is consistently high, register a new task definition with a larger `cpu` value (as shown above for memory), then point the service at it:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```

### Application Errors
Check container logs for application errors:
```bash
# Get the log group from the task definition
LOG_GROUP=$(aws ecs describe-task-definition \
  --task-definition my-task-def \
  --query 'taskDefinition.containerDefinitions[0].logConfiguration.options."awslogs-group"' \
  --output text)

# Get recent logs
aws logs tail "$LOG_GROUP" --since 1h --format short
```
Or via CloudWatch Logs Insights:
```bash
# Note: start-query takes epoch seconds (unlike filter-log-events, which takes milliseconds)
aws logs start-query \
  --log-group-name /ecs/my-service \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|Failed/ | sort @timestamp desc'

# Then fetch results with: aws logs get-query-results --query-id <id>
```

### Deployment Configuration Issues
Check deployment settings:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deploymentConfiguration'
```

For rolling deployments, ensure the minimum healthy percent allows enough tasks:
```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration minimumHealthyPercent=100,maximumPercent=200
```

This ensures old tasks stay running until new ones are healthy.
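The percentages translate into hard task-count bounds during a rollout: the minimum is rounded up and the maximum rounded down, per the ECS documentation. A quick way to sanity-check a setting (plain arithmetic; the rounding behavior is the assumption here):

```bash
# Task-count bounds during a deployment:
#   minimum running tasks = ceil(desired * minimumHealthyPercent / 100)
#   maximum total tasks   = floor(desired * maximumPercent / 100)
rollout_bounds() {
  local desired=$1 min_pct=$2 max_pct=$3
  local min=$(( (desired * min_pct + 99) / 100 ))   # integer ceiling
  local max=$(( desired * max_pct / 100 ))          # integer floor
  echo "$min $max"
}

rollout_bounds 4 100 200   # -> "4 8": all 4 old tasks stay up while 4 new ones start
rollout_bounds 3 50 150    # -> "2 4": up to 1 old task may be stopped early
```

If the gap between the two bounds is zero extra tasks, ECS has no room to start replacements and the deployment stalls, which surfaces as exactly this kind of instability.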
For blue/green deployments via CodeDeploy:
```bash
aws deploy get-deployment-group \
  --application-name my-app \
  --deployment-group-name my-dg \
  --query 'deploymentGroupInfo.[deploymentConfigName,serviceRoleArn]'
```

### Networking Issues
For FARGATE tasks with awsvpc networking, check the security groups:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].networkConfiguration.awsvpcConfiguration'
```

Verify security group rules allow required traffic:
```bash
aws ec2 describe-security-groups \
  --group-ids sg-12345678 \
  --query 'SecurityGroups[*].IpPermissions[*].[FromPort,ToPort,IpProtocol,IpRanges[*].CidrIp]'
```

Check if tasks can reach dependencies:
```bash
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "nslookup mydb.xxxxx.us-east-1.rds.amazonaws.com" \
  --interactive
```

### Service Auto Scaling Issues
Auto scaling can cause instability if configured incorrectly:
```bash
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/my-cluster/my-service

aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service
```
Check if scaling activities are causing churn:
```bash
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --max-results 10
```

Adjust scaling policies to be less aggressive:
```bash
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-config.json
```

Where scaling-config.json contains:
```json
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 300,
  "ScaleInCooldown": 300
}
```

### Circuit Breaker
Enable deployment circuit breaker to fail fast:
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--enable-execute-command \
--deployment-configuration '{"deploymentCircuitBreaker": {"enable": true, "rollback": true}}'This automatically rolls back failed deployments.
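With the circuit breaker enabled, the deployment's `rolloutState` (returned by the `describe-services` query shown earlier) tells you the outcome. A small helper to interpret it (a sketch; the state string would come from the AWS CLI output):

```bash
# Interpret deployments[0].rolloutState from describe-services.
rollout_status() {
  case $1 in
    COMPLETED)   echo "deployment succeeded" ;;
    FAILED)      echo "deployment failed; check rolloutStateReason" ;;
    IN_PROGRESS) echo "deployment still rolling out" ;;
    *)           echo "unknown state: $1" ;;
  esac
}

rollout_status FAILED   # -> "deployment failed; check rolloutStateReason"
```

A `FAILED` state with rollback enabled means ECS has already reverted the service to the last steady-state task definition.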
## Verification Steps
After making changes, verify service stability:
```bash
# Watch service events
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:5].[createdAt,message]' \
  --output table

# Check running vs desired count
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

# Monitor task count over time
watch -n 30 'aws ecs list-tasks --cluster my-cluster --service-name my-service --query "length(taskArns)"'

# Verify target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]'
```
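These checks boil down to one question: does the running count match the desired count? That condition is worth scripting as a pass/fail helper (a sketch; the counts would be fed in from the `describe-services` output above):

```bash
# Pass/fail stability check given runningCount and desiredCount.
service_stable() {
  local running=$1 desired=$2
  if [ "$running" -eq "$desired" ] && [ "$desired" -gt 0 ]; then
    echo "stable"
  else
    echo "unstable ($running/$desired running)"
  fi
}

service_stable 3 3   # -> "stable"
service_stable 1 3   # -> "unstable (1/3 running)"
```

Run it a few times over 10-15 minutes: a service that toggles between the two states is still churning even if it momentarily reaches the desired count.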
Create CloudWatch alarms for monitoring:
```bash
# RunningTaskCount is published under ECS/ContainerInsights
# (requires Container Insights to be enabled on the cluster)
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-service-running-count \
  --alarm-description "ECS service running count below desired" \
  --namespace ECS/ContainerInsights \
  --metric-name RunningTaskCount \
  --dimensions Name=ServiceName,Value=my-service Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```