# AWS ECS Task Stopped

## Common Error Patterns

ECS task failures typically show:

```text
Task stopped at: 2024-01-15T10:30:00Z. Reason: Essential container in task exited
```

```text
CannotPullContainerError: inspect image has been retried 5 time(s)
```

```text
ResourceInitializationError: failed to validate logger args
```

```text
Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:...)
```

## Root Causes and Solutions

### 1. Container Exit Code Analysis

Check the container exit code to understand the failure:

```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789012:task/my-task \
  --query 'tasks[0].containers[*].[name,exitCode,reason]'
```

| Exit Code | Meaning | Common Cause |
| --- | --- | --- |
| 0 | Success | Application completed intentionally |
| 1 | General error | Application error |
| 137 | SIGKILL | Out of memory or manual kill |
| 139 | Segmentation fault | Application bug |
| 255 | Exit status out of range | Application error |
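
The following triage helper is an added example, not code from the original page. It maps the stop reasons and exit codes above to a likely root cause and generalises the table: an exit code above 128 means the process was killed by signal number (code - 128). The function names, hint texts and the sample `describe-tasks` output are assumptions constructed for illustration.

```python
import json
import signal

# Reason prefixes -> likely root cause. The first three are the error patterns
# quoted on this page; OutOfMemoryError is the reason ECS reports for an OOM kill.
REASON_HINTS = {
    "CannotPullContainerError": "image pull: check tag, registry reachability, execution role",
    "ResourceInitializationError": "task setup: check log configuration, secrets, network path",
    "Task failed ELB health checks": "health check: check grace period and health endpoint",
    "OutOfMemoryError": "memory: container exceeded its hard memory limit",
}

def explain_exit_code(code):
    """Exit codes above 128 mean the process was killed by signal (code - 128)."""
    if code == 0:
        return "exited cleanly"
    if 128 < code < 255:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return f"killed by signal {code - 128}"
    return f"application exited with status {code}"

def classify_task(task):
    """Return (source, diagnosis) pairs for one entry of describe-tasks 'tasks'."""
    findings = []
    sources = [("task", task.get("stoppedReason", ""), None)]
    for c in task.get("containers", []):
        sources.append((c["name"], c.get("reason", ""), c.get("exitCode")))
    for name, reason, code in sources:
        findings += [(name, hint) for key, hint in REASON_HINTS.items() if key in reason]
        if code is not None:
            findings.append((name, explain_exit_code(code)))
    return findings

# Constructed sample in the shape of `aws ecs describe-tasks` output.
SAMPLE = json.loads("""{"tasks": [
  {"stoppedReason": "Essential container in task exited",
   "containers": [{"name": "my-container", "exitCode": 137,
                   "reason": "OutOfMemoryError: Container killed due to memory usage"}]},
  {"stoppedReason": "CannotPullContainerError: inspect image has been retried 5 time(s)",
   "containers": [{"name": "my-container"}]}
]}""")

if __name__ == "__main__":
    for task in SAMPLE["tasks"]:
        for source, diagnosis in classify_task(task):
            print(f"{source}: {diagnosis}")
```

Exit code 137 alone does not distinguish an out-of-memory kill from a manual kill; the container `reason` field does, which is why the helper reads both.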

Solution for Exit Code 137:

Increase task memory:

```bash
aws ecs register-task-definition \
  --family my-task \
  --memory 1024 \
  --container-definitions '[
    {
      "name": "my-container",
      "image": "my-image",
      "memory": 512,
      "memoryReservation": 256
    }
  ]'
```

### 2. Image Pull Failures

ECS cannot pull the container image.

Solution:

Check that the image exists and is accessible:

```bash
# Verify image exists
docker pull my-registry/my-image:latest

# For ECR, ensure permissions
aws ecr describe-images \
  --repository-name my-repo \
  --image-ids imageTag=latest
```

Ensure the task execution role has ECR permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogStream",
      "Resource": "*"
    }
  ]
}
```
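
The policy above grants every action on `"Resource": "*"`. Only `ecr:GetAuthorizationToken` needs that, because it does not support resource-level permissions; the image-pull and log actions can be limited to one repository and one log group. The check below is an added example, not part of the original page: it flags wildcard grants that could be scoped, and the scoped ARNs are assumptions built from the page's placeholder account, region, repository and log group names.

```python
# Among the actions in this page's policy, only this one must keep Resource "*".
WILDCARD_ONLY = {"ecr:GetAuthorizationToken"}

def as_list(value):
    """IAM accepts a string or a list for Action/Resource/Statement; normalise."""
    return value if isinstance(value, list) else [value]

def audit_policy(policy):
    """Return (statement_index, action) for Allow grants on '*' that could be scoped."""
    findings = []
    for index, statement in enumerate(as_list(policy["Statement"])):
        if statement.get("Effect") != "Allow":
            continue
        if "*" not in as_list(statement.get("Resource", [])):
            continue
        for action in as_list(statement.get("Action", [])):
            if action not in WILDCARD_ONLY:
                findings.append((index, action))
    return findings

ECR_PULL = ["ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage"]

# The policy exactly as shown on this page.
PAGE_POLICY = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": ["ecr:GetAuthorizationToken"] + ECR_PULL, "Resource": "*"},
    {"Effect": "Allow", "Action": "logs:CreateLogStream", "Resource": "*"},
]}

# Least-privilege variant (constructed ARNs using the page's placeholder names).
SCOPED_POLICY = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*"},
    {"Effect": "Allow", "Action": ECR_PULL,
     "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-repo"},
    {"Effect": "Allow", "Action": "logs:CreateLogStream",
     "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/ecs/my-task:*"},
]}

if __name__ == "__main__":
    for index, action in audit_policy(PAGE_POLICY):
        print(f"statement {index}: {action} is granted on '*' but can be scoped")
    print("scoped policy findings:", audit_policy(SCOPED_POLICY))
```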

### 3. Health Check Failures

Container fails health checks and is stopped.

Solution:

Review health check configuration:

```bash
# healthCheckGracePeriodSeconds is a field of the service, not of a deployment
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].healthCheckGracePeriodSeconds'
```

Increase health check grace period:

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 300
```

Verify container health check:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

### 4. Resource Constraints

Insufficient CPU or memory for the task.

Solution:

Check service events:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:5]'
```

Increase task resources:

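```bash
# register-task-definition also requires --container-definitions, and Fargate
# requires awsvpc networking; container-definitions.json is a placeholder file name
aws ecs register-task-definition \
  --family my-task \
  --cpu 512 \
  --memory 1024 \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --container-definitions file://container-definitions.json
```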

### 5. Environment Variable Issues

Missing or incorrect environment variables.

Solution:

Check task definition environment:

```bash
aws ecs describe-task-definition \
  --task-definition my-task \
  --query 'taskDefinition.containerDefinitions[0].environment'
```

Use Secrets Manager or Parameter Store for sensitive values:

```json
{
  "secrets": [
    {
      "name": "DATABASE_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"
    }
  ]
}
```
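
Values placed in `environment` are stored in the task definition in plaintext and are returned to anyone who can run the `describe-task-definition` command shown above, which is why sensitive values belong in `secrets`. The audit below is an added example, not part of the original page: it scans a task definition for plaintext credentials in `environment`. The name and value heuristics and the sample task definition are assumptions constructed for illustration.

```python
import re

# Heuristics (assumptions for this example): variable names that suggest a
# credential, and values that look like one whatever the variable is called.
SENSITIVE_NAME = re.compile(r"PASSWORD|PASSWD|SECRET|TOKEN|API_?KEY|PRIVATE_?KEY", re.I)
SENSITIVE_VALUE = [
    ("AWS access key id", re.compile(r"\b(AKIA|ASIA)[0-9A-Z]{16}\b")),
    ("credentials embedded in URL", re.compile(r"://[^/\s:@]+:[^/\s@]+@")),
]

def audit_environment(task_definition):
    """Return (container, variable, finding) for plaintext secrets in 'environment'.

    Values are never returned, so the report itself is safe to log.
    Entries under 'secrets' are resolved at launch and are not flagged.
    """
    findings = []
    for container in task_definition.get("containerDefinitions", []):
        for env in container.get("environment", []):
            name, value = env["name"], env.get("value", "")
            for label, pattern in SENSITIVE_VALUE:
                if pattern.search(value):
                    findings.append((container["name"], name, label))
                    break
            else:
                if value and SENSITIVE_NAME.search(name):
                    findings.append((container["name"], name,
                                     "sensitive name with plaintext value"))
    return findings

# Constructed fragment in the shape of describe-task-definition output.
SAMPLE = {"containerDefinitions": [{
    "name": "my-container",
    "environment": [
        {"name": "LOG_LEVEL", "value": "info"},
        {"name": "API_TOKEN", "value": "constructed-not-a-real-token"},
        {"name": "DATABASE_URL", "value": "postgres://app:constructed-pw@db.internal:5432/app"},
    ],
    "secrets": [{"name": "DATABASE_PASSWORD",
                 "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"}],
}]}

if __name__ == "__main__":
    for container, variable, finding in audit_environment(SAMPLE):
        print(f"{container}: {variable}: {finding}")
```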

### 6. Network Configuration Issues

Task cannot communicate with required services.

Solution:

Verify VPC configuration:

```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].networkConfiguration'
```

Check security groups allow required traffic:

```bash
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0
```

Ensure tasks can reach:

- Container registry (ECR, Docker Hub)
- External services (APIs, databases)
- Internal services (ALB, other services)

### 7. Task Definition Issues

Invalid or outdated task definition.

Solution:

Validate task definition:

```bash
aws ecs describe-task-definition \
  --task-definition my-task:1 \
  --query 'taskDefinition'
```

Common issues:

- Invalid log configuration
- Missing essential flag on container
- Invalid port mappings
- Incorrect CPU/memory ratio

## Debugging Commands

```bash
# Get task details
aws ecs describe-tasks --cluster my-cluster --tasks my-task-id

# View task logs
aws logs get-log-events \
  --log-group-name /ecs/my-task \
  --log-stream-name ecs/my-container/my-task-id

# List stopped tasks
aws ecs list-tasks \
  --cluster my-cluster \
  --desired-status STOPPED

# Get task definition
aws ecs describe-task-definition --task-definition my-task

# Check service events
aws ecs describe-services --cluster my-cluster --services my-service
```

## Fargate-Specific Issues

### Platform Version Issues

```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --platform-version LATEST
```

### Subnet Configuration

Ensure tasks have:

- Public subnet with NAT Gateway (for public images)
- Private subnet with VPC endpoints (for ECR)
- Security groups allowing required traffic

## ECS Exec Debugging

Enable ECS Exec for interactive debugging:

```bash
# Enable ECS Exec
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --enable-execute-command

# Connect to container
aws ecs execute-command \
  --cluster my-cluster \
  --task my-task-id \
  --container my-container \
  --command "/bin/bash" \
  --interactive
```

## Prevention Tips

1. Set up CloudWatch alarms for task failures
2. Use health checks with appropriate grace periods
3. Configure proper resource limits
4. Implement circuit breakers in code
5. Use X-Ray for distributed tracing

## Quick Reference

| Issue | Command |
| --- | --- |
| View task status | `aws ecs describe-tasks` |
| Check exit codes | `--query 'tasks[0].containers[*].exitCode'` |
| View service events | `aws ecs describe-services` |
| Check logs | `aws logs get-log-events` |
| List stopped tasks | `aws ecs list-tasks --desired-status STOPPED` |