# Fix an Unstable AWS ECS Service

Your ECS service keeps stopping tasks, deployments fail repeatedly, or tasks are in a constant cycle of starting and stopping. The service status shows "ACTIVE" but the running count never reaches the desired count. Meanwhile, your application is experiencing intermittent outages.

ECS service instability is often a cascade of issues: failing health checks, resource constraints, networking problems, or application errors. Let's methodically track down the root cause.

## Diagnosis Commands

First, check the service status:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,status,runningCount,desiredCount,deployments[0].status]' \
  --output table
```

Get detailed deployment info:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deployments[*].[id,status,taskDefinition,rolloutState,rolloutStateReason]'
```

List the tasks and their status:

```bash
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --query 'taskArns'
```

Check why tasks are stopping:

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,lastStatus,stoppedReason,stopCode,containers[*].lastStatus]' \
  --output table
```
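
The `stopCode` is coarser than `stoppedReason` but is the fastest signal to triage on. A throwaway helper (hypothetical, for illustration only; the `stopCode` values themselves are real ECS API enums) that summarizes the values you will most often see:

```bash
# Hypothetical helper: summarize what each common ECS stopCode usually means.
explain_stop_code() {
  case "$1" in
    TaskFailedToStart)         echo "task never started -- image pull, ENI, or secret fetch failed" ;;
    EssentialContainerExited)  echo "an essential container exited -- check exit codes and app logs" ;;
    ServiceSchedulerInitiated) echo "ECS stopped it -- often failed health checks or a scale-in" ;;
    UserInitiated)             echo "someone ran stop-task" ;;
    *)                         echo "unrecognized stopCode: $1" ;;
  esac
}

explain_stop_code EssentialContainerExited
```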

Get task definition details:

```bash
aws ecs describe-task-definition \
  --task-definition my-task-def:1 \
  --query 'taskDefinition.[family,revision,requiresCompatibilities,networkMode,cpu,memory]'
```

Check CloudWatch for task events:

```bash
aws logs filter-log-events \
  --log-group-name /aws/ecs/my-cluster \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].message'
```
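
Note the trailing `000` on `--start-time`: `filter-log-events` expects epoch milliseconds, while `date +%s` emits seconds (the `-d '1 hour ago'` form is GNU date syntax). A quick sanity check of the conversion:

```bash
# date +%s prints epoch seconds; CloudWatch Logs APIs want milliseconds.
secs=$(date -u -d '1 hour ago' +%s)
millis="${secs}000"
# Appending three zeros is the same as multiplying by 1000.
[ "$millis" -eq $(( secs * 1000 )) ] && echo "conversion ok"
```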

Look for stopped task events:

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,stoppedReason,stopCode,containers[*].exitCode]'
```

Check service events:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:10].[createdAt,message]'
```

## Common Causes and Solutions

### Health Check Failures

The most common cause of instability is failing health checks. Check the target group health:

```bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]'
```

If targets are unhealthy, check the health check configuration:

```bash
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetGroups[*].HealthCheckConfig'
```

Common health check issues:

The health check path returns the wrong status:

Make sure your app returns 200 OK on the health check path:

```bash
# Test from a running task
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "curl -v http://localhost:8080/health" \
  --interactive

# Or check the container logs
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs/my-container/$(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text | cut -d'/' -f3) \
  --limit 100
```
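
The `cut -d'/' -f3` above pulls the task ID out of the task ARN, which has the form `arn:aws:ecs:REGION:ACCOUNT:task/CLUSTER/TASK_ID`. With a made-up ARN:

```bash
# Task ARNs put the task ID after the second slash: field 3 when split on '/'.
arn="arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef0123456789abcdef"
task_id=$(echo "$arn" | cut -d'/' -f3)
echo "$task_id"
```

The task ID is what the `awslogs` driver uses as the log stream suffix.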

Update health check configuration:

```bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```
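
These settings also determine how long a failing target keeps receiving traffic: roughly interval × unhealthy threshold seconds to be marked unhealthy, and interval × healthy threshold to recover. With the values above:

```bash
# Approximate time-to-transition for the health check settings above.
interval=30; unhealthy_threshold=3; healthy_threshold=2
echo "marked unhealthy after ~$(( interval * unhealthy_threshold ))s of consecutive failures"
echo "marked healthy after ~$(( interval * healthy_threshold ))s of consecutive passes"
```

If tasks need more than ~90 seconds to warm up, lengthen the interval or threshold so they aren't killed mid-startup.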

Container-level health check:

Add a container health check in your task definition:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

Register a new task definition:

```bash
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 256 \
  --memory 512 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --task-role-arn arn:aws:iam::123456789012:role/ecsTaskRole
```
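
For reference, a minimal `containers.json` matching the command above might look like this; the image, port, and log settings are placeholders for your own values:

```json
[
  {
    "name": "my-container",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
    "essential": true,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 60
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-service",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]
```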

### Out of Memory Errors

Tasks killed by the OOM killer typically show exit code 137 (128 + SIGKILL):

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED --query 'taskArns[]' --output text) \
  --query 'tasks[*].[taskArn,containers[*].exitCode,containers[*].reason]'
```
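
Exit codes above 128 mean the container died from a signal (code − 128). A throwaway helper (illustrative only) for decoding the common ones:

```bash
# Exit codes >128 encode a fatal signal: exit code - 128 = signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error -- check the container logs" ;;
    137) echo "SIGKILL (128+9) -- usually the OOM killer; raise the memory limit" ;;
    139) echo "SIGSEGV (128+11) -- the process crashed" ;;
    143) echo "SIGTERM (128+15) -- ECS asked the task to stop" ;;
    *)   echo "exit code $1" ;;
  esac
}

explain_exit_code 137
```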

Check whether memory is the issue (these metrics require Container Insights to be enabled on the cluster):

```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

Increase task memory:

```bash
# Register a new revision with more memory
aws ecs register-task-definition \
  --family my-task-def \
  --container-definitions file://containers.json \
  --cpu 512 \
  --memory 1024 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE

# Update the service with the new task definition
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```

### Insufficient CPU

High CPU utilization can cause timeouts and crashes:

```bash
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name CpuUtilized \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --output table
```

If CPU is consistently high, register a new task definition revision with a larger `cpu` value (as shown for memory above) and point the service at it:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task-def:2
```

### Application Errors

Check container logs for application errors:

```bash
# Get the log group from the task definition
LOG_GROUP=$(aws ecs describe-task-definition \
  --task-definition my-task-def \
  --query 'taskDefinition.containerDefinitions[0].logConfiguration.options."awslogs-group"' \
  --output text)

# Tail recent logs
aws logs tail $LOG_GROUP --since 1h --format short
```

Or via CloudWatch Logs Insights:

```bash
aws logs start-query \
  --log-group-name /ecs/my-service \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --end-time $(date -u +%s)000 \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|Failed/ | sort @timestamp desc'

# start-query returns a queryId; fetch the results with:
aws logs get-query-results --query-id <query-id-from-above>
```

### Deployment Configuration Issues

Check deployment settings:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].deploymentConfiguration'

For rolling deployments, ensure minimum healthy percent allows enough tasks:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration minimumHealthyPercent=100,maximumPercent=200
```

This ensures old tasks stay running until new ones are healthy.
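
To see what those percentages mean in task counts, assume an illustrative desired count of 4:

```bash
# Task-count bounds during a rolling deployment (illustrative numbers).
desired=4
min_pct=100   # minimumHealthyPercent
max_pct=200   # maximumPercent
echo "ECS keeps at least $(( desired * min_pct / 100 )) tasks running"
echo "and launches at most $(( desired * max_pct / 100 )) tasks in total during the rollout"
```

With `maximumPercent=200`, ECS can start a full replacement set before draining any old tasks; if your cluster or subnets cannot fit double the task count, lower it.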

For blue/green deployments via CodeDeploy:

```bash
aws deploy get-deployment-group \
  --application-name my-app \
  --deployment-group-name my-dg \
  --query 'deploymentGroupInfo.[deploymentConfigName,serviceRoleArn]'
```

### Networking Issues

For Fargate tasks with awsvpc networking, check the security groups:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].networkConfiguration.awsvpcConfiguration'
```

Verify security group rules allow required traffic:

```bash
aws ec2 describe-security-groups \
  --group-ids sg-12345678 \
  --query 'SecurityGroups[*].IpPermissions[*].[FromPort,ToPort,IpProtocol,IpRanges[*].CidrIp]'
```

Check if tasks can reach dependencies:

```bash
aws ecs execute-command \
  --cluster my-cluster \
  --task $(aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'taskArns[0]' --output text) \
  --container my-container \
  --command "nslookup mydb.xxxxx.us-east-1.rds.amazonaws.com" \
  --interactive
```

### Service Auto Scaling Issues

Auto scaling can cause instability if configured incorrectly:

```bash
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/my-cluster/my-service

aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service
```

Check if scaling activities are causing churn:

```bash
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --max-results 10
```

Adjust scaling policies to be less aggressive:

```bash
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-config.json
```

Where scaling-config.json contains:

```json
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 300,
  "ScaleInCooldown": 300
}
```

### Circuit Breaker

Enable the deployment circuit breaker to fail fast:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration '{"deploymentCircuitBreaker": {"enable": true, "rollback": true}}'
```

This automatically rolls back failed deployments.

## Verification Steps

After making changes, verify service stability:

```bash
# Watch service events
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].events[:5].[createdAt,message]' \
  --output table

# Check running vs desired count
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

# Monitor the task count over time
watch -n 30 "aws ecs list-tasks --cluster my-cluster --service-name my-service --query 'length(taskArns)'"

# Verify target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890123456 \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]'
```

Create a CloudWatch alarm for ongoing monitoring. `RunningTaskCount` is a Container Insights metric, so Container Insights must be enabled on the cluster; set `--threshold` to your desired running count:

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-service-running-count \
  --alarm-description "ECS service running count below desired" \
  --namespace ECS/ContainerInsights \
  --metric-name RunningTaskCount \
  --dimensions Name=ServiceName,Value=my-service Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```