# Fix an Unstable AWS ECS Service

Your ECS service keeps stopping tasks, deployments fail repeatedly, or tasks are in a constant cycle of starting and stopping. The service status shows "ACTIVE" but the running count never reaches the desired count. Meanwhile, your application is experiencing intermittent outages.

ECS service instability is often a cascade of issues: failing health checks, resource constraints, networking problems, or application errors. Let's methodically track down the root cause.

## Diagnosis Commands

First, check the service status:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,status,runningCount,desiredCount,deployments[0].status]' \
  --output table
```

Get detailed deployment info:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deployments[*].[id,status,taskDefinition,rolloutState,rolloutStateReason]'
```

List the tasks and their status:

```bash
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --query 'taskArns'
```

Check why tasks are stopping:

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,lastStatus,stoppedReason,stopCode,containers[*].lastStatus]' \
  --output table
```
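
The `stopCode` is coarser than `stoppedReason` but is the fastest signal to triage on. A throwaway helper (hypothetical, for illustration only; the `stopCode` values themselves are real ECS API enums) that summarizes the values you will most often see:

```bash
# Hypothetical helper: summarize what each common ECS stopCode usually means.
explain_stop_code() {
  case "$1" in
    TaskFailedToStart)         echo "task never started -- image pull, ENI, or secret fetch failed" ;;
    EssentialContainerExited)  echo "an essential container exited -- check exit codes and app logs" ;;
    ServiceSchedulerInitiated) echo "ECS stopped it -- often failed health checks or a scale-in" ;;
    UserInitiated)             echo "someone ran stop-task" ;;
    *)                         echo "unrecognized stopCode: $1" ;;
  esac
}

explain_stop_code EssentialContainerExited
```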

Get task definition details:

```bash
aws ecs describe-task-definition \
  --task-definition my-task-def:1 \
  --query 'taskDefinition.[family,revision,requiresCompatibilities,networkMode,cpu,memory]'
```

Check CloudWatch for task events:

```bash
aws logs filter-log-events \
  --log-group-name /aws/ecs/my-cluster \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].message'
```
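
Note the trailing `000` on `--start-time`: `filter-log-events` expects epoch milliseconds, while `date +%s` emits seconds (the `-d '1 hour ago'` form is GNU date syntax). A quick sanity check of the conversion:

```bash
# date +%s prints epoch seconds; CloudWatch Logs APIs want milliseconds.
secs=$(date -u -d '1 hour ago' +%s)
millis="${secs}000"
# Appending three zeros is the same as multiplying by 1000.
[ "$millis" -eq $(( secs * 1000 )) ] && echo "conversion ok"
```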

Look for stopped task events:

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,stoppedReason,stopCode,containers[*].exitCode]'
```

Check service events:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:10].[createdAt,message]'
```

## Common Causes and Solutions

### Health Check Failures

The most common cause of instability is failing health checks. Check the target group health:

```bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]'
```

If targets are unhealthy, check the health check configuration:

```bash
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetGroups[*].HealthCheckConfig'
```

Common health check issues:

The health check path returns the wrong status:

Make sure your app returns 200 OK on the health check path:

```bash
# Test from a running task
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "curl -v http://localhost:8080/health" \
  --interactive

# Or check the container logs
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs/my-container/$(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text | cut -d'/' -f3) \
  --limit 100
```
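
The `cut -d'/' -f3` above pulls the task ID out of the task ARN, which has the form `arn:aws:ecs:REGION:ACCOUNT:task/CLUSTER/TASK_ID`. With a made-up ARN:

```bash
# Task ARNs put the task ID after the second slash: field 3 when split on '/'.
arn="arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef0123456789abcdef"
task_id=$(echo "$arn" | cut -d'/' -f3)
echo "$task_id"
```

The task ID is what the `awslogs` driver uses as the log stream suffix.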

Update health check configuration:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```
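
These settings also determine how long a failing target keeps receiving traffic: roughly interval × unhealthy threshold seconds to be marked unhealthy, and interval × healthy threshold to recover. With the values above:

```bash
# Approximate time-to-transition for the health check settings above.
interval=30; unhealthy_threshold=3; healthy_threshold=2
echo "marked unhealthy after ~$(( interval * unhealthy_threshold ))s of consecutive failures"
echo "marked healthy after ~$(( interval * healthy_threshold ))s of consecutive passes"
```

If tasks need more than ~90 seconds to warm up, lengthen the interval or threshold so they aren't killed mid-startup.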

Container-level health check:

Add a container health check in your task definition:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

Register a new task definition:

```bash
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 256 \
  --memory 512 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --task-role-arn arn:aws:iam::123456789012:role/ecsTaskRole
```
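
For reference, a minimal `containers.json` matching the command above might look like this; the image, port, and log settings are placeholders for your own values:

```json
[
  {
    "name": "my-container",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
    "essential": true,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 60
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-service",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]
```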

### Out of Memory Errors

Tasks killed by the OOM killer typically show exit code 137 (128 + SIGKILL):

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,containers[*].exitCode,containers[*].reason]'
```
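
Exit codes above 128 mean the container died from a signal (code − 128). A throwaway helper (illustrative only) for decoding the common ones:

```bash
# Exit codes >128 encode a fatal signal: exit code - 128 = signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error -- check the container logs" ;;
    137) echo "SIGKILL (128+9) -- usually the OOM killer; raise the memory limit" ;;
    139) echo "SIGSEGV (128+11) -- the process crashed" ;;
    143) echo "SIGTERM (128+15) -- ECS asked the task to stop" ;;
    *)   echo "exit code $1" ;;
  esac
}

explain_exit_code 137
```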

Check whether memory is the issue (these metrics require Container Insights to be enabled on the cluster):

```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

Increase task memory:

```bash
# Register a new revision with more memory
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 512 \
  --memory 1024 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE

# Update the service with the new task definition
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```

### Insufficient CPU

High CPU utilization can cause timeouts and crashes:

```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name CpuUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

If CPU is consistently high, register a new task definition revision with a larger `cpu` value (as shown for memory above) and point the service at it:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```

### Application Errors

Check container logs for application errors:

```bash
# Get the log group from the task definition
LOG_GROUP=$(aws ecs describe-task-definition \
  --task-definition my-task-def \
  --query 'taskDefinition.containerDefinitions[0].logConfiguration.options."awslogs-group"' \
  --output text)

# Tail recent logs
aws logs tail $LOG_GROUP --since 1h --format short
```

Or via CloudWatch Logs Insights:

```bash
aws logs start-query \
  --log-group-name /ecs/my-service \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --end-time $(date -u +%s)000 \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|Failed/ | sort @timestamp desc'

# start-query returns a queryId; fetch the results with:
aws logs get-query-results --query-id <query-id-from-above>
```

### Deployment Configuration Issues

Check deployment settings:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deploymentConfiguration'

For rolling deployments, ensure minimum healthy percent allows enough tasks:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration minimumHealthyPercent=100,maximumPercent=200
```

This ensures old tasks stay running until new ones are healthy.
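
To see what those percentages mean in task counts, assume an illustrative desired count of 4:

```bash
# Task-count bounds during a rolling deployment (illustrative numbers).
desired=4
min_pct=100   # minimumHealthyPercent
max_pct=200   # maximumPercent
echo "ECS keeps at least $(( desired * min_pct / 100 )) tasks running"
echo "and launches at most $(( desired * max_pct / 100 )) tasks in total during the rollout"
```

With `maximumPercent=200`, ECS can start a full replacement set before draining any old tasks; if your cluster or subnets cannot fit double the task count, lower it.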

For blue/green deployments via CodeDeploy:

```bash
aws deploy get-deployment-group \
  --application-name my-app \
  --deployment-group-name my-dg \
  --query 'deploymentGroupInfo.[deploymentConfigName,serviceRoleArn]'
```

### Networking Issues

For Fargate tasks with awsvpc networking, check the security groups:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].networkConfiguration.awsvpcConfiguration'
```

Verify security group rules allow required traffic:

```bash
aws ec2 describe-security-groups \
  --group-ids sg-12345678 \
  --query 'SecurityGroups[*].IpPermissions[*].[FromPort,ToPort,IpProtocol,IpRanges[*].CidrIp]'
```

Check if tasks can reach dependencies:

```bash
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "nslookup mydb.xxxxx.us-east-1.rds.amazonaws.com" \
  --interactive
```

### Service Auto Scaling Issues

Auto scaling can cause instability if configured incorrectly:

```bash
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/my-cluster/my-service

aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service
```

Check if scaling activities are causing churn:

```bash
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --max-results 10
```

Adjust scaling policies to be less aggressive:

```bash
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-config.json
```

Where scaling-config.json contains:

```json
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 300,
  "ScaleInCooldown": 300
}
```

### Circuit Breaker

Enable the deployment circuit breaker to fail fast:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration '{"deploymentCircuitBreaker": {"enable": true, "rollback": true}}'
```

This automatically rolls back failed deployments.

## Verification Steps

After making changes, verify service stability:

```bash
# Watch service events
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:5].[createdAt,message]' \
  --output table

# Check running vs desired count
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

# Monitor the task count over time
watch -n 30 "aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'length(taskArns)'"

# Verify target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]'
```

Create a CloudWatch alarm for ongoing monitoring. `RunningTaskCount` is a Container Insights metric, so Container Insights must be enabled on the cluster; set `--threshold` to your desired running count:

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-service-running-count \
  --alarm-description "ECS service running count below desired" \
  --namespace ECS/ContainerInsights \
  --metric-name RunningTaskCount \
  --dimensions Name=ServiceName,Value=my-service Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```