Introduction

When an ECR lifecycle policy deletes images that are still referenced by ECS task definitions, Kubernetes deployments, or Docker Compose files, subsequent image pulls fail with ImagePullBackOff or ManifestUnknown errors. This commonly happens when lifecycle rules are too aggressive with the "untagged" or "image count more than N" settings.

Symptoms

  • ECS task fails with: CannotPullContainerError: failed to resolve reference "account.dkr.ecr.region.amazonaws.com/repo:tag": manifest unknown
  • Kubernetes pod shows ErrImagePull or ImagePullBackOff with a "manifest unknown" or "not found" message
  • aws ecr describe-images returns no entry for the expected tag, or fails with ImageNotFoundException when queried by tag
  • CloudTrail shows PolicyExecutionEvent (lifecycle expiry) or BatchDeleteImage (manual or scripted deletion) events around the time of the failure
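
The describe-images symptom can be checked from a shell. The sketch below wraps `aws ecr describe-images` in a small helper; the function name `image_tag_exists` and the repository and tag values are illustrative, and it assumes the AWS CLI is already configured for the right account and region.

```bash
# Return 0 if the tag exists in the repository, non-zero otherwise.
# describe-images exits non-zero (ImageNotFoundException) when the tag is gone.
# image_tag_exists is a hypothetical helper name, not part of the AWS CLI.
image_tag_exists() {
  local repo="$1" tag="$2"
  aws ecr describe-images \
    --repository-name "$repo" \
    --image-ids imageTag="$tag" \
    --query 'imageDetails[0].imageDigest' \
    --output text >/dev/null 2>&1
}
```

A call such as `image_tag_exists my-repo v1.2.3 || echo "tag missing"` then gives a quick yes/no answer before digging into CloudTrail.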

Common Causes

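  • A count-based rule (for example, "image count more than 10") with tagStatus "any" expires tagged images as well as untagged ones
  • An "untagged" rule removes images that lost their tag when the same tag was pushed again, even if running workloads still reference the old image
  • Multiple lifecycle rules whose priorities interact unexpectedly (rules are applied from the lowest rulePriority value to the highest)
  • CI/CD pipeline reuses tags (e.g., "latest"), causing older images to become untagged
  • Count-based rules whose tagPrefixList is missing or too broad, so long-lived release tags compete with short-lived build tags for the same slots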
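
To see why a count-based rule removes images that are still in use, it helps to model what "image count more than N" does. The sketch below is a simplified local simulation, not ECR's evaluator: it ignores tag filters and rule priorities, and the function name `would_expire` and the input format are assumptions made for illustration.

```bash
# Simplified model of countType=imageCountMoreThan.
# stdin: one "<pushed-epoch-seconds> <tag-or-digest>" pair per line, any order.
# $1:    countNumber. The newest N images are kept; the rest are printed.
# would_expire is a hypothetical helper; it never contacts ECR.
would_expire() {
  local keep="$1"
  sort -rn -k1,1 | tail -n "+$((keep + 1))" | awk '{print $2}'
}
```

Feeding it eleven images with a count of 10 prints the oldest one, whether or not a running service still references it. The rule only looks at push order.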

Step-by-Step Fix

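  1. Identify deleted images in CloudTrail. Expirations performed by a lifecycle policy are logged as PolicyExecutionEvent rather than BatchDeleteImage, so search for that event name as well:
     ```bash
     # GNU date syntax; on macOS use: date -v-24H +%s
     aws cloudtrail lookup-events \
       --lookup-attributes AttributeKey=EventName,AttributeValue=BatchDeleteImage \
       --start-time "$(date -d '24 hours ago' +%s)" \
       --query 'Events[*].{Time:EventTime,User:Username}'
     ```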
  2. Check the current lifecycle policy:
     ```bash
     aws ecr get-lifecycle-policy --repository-name my-repo
     ```
     Look for rules with "selection": {"tagStatus": "untagged"}, or with "countType": "imageCountMoreThan" and "tagStatus": "any", which counts and expires tagged images too.
  3. Update the lifecycle policy so that count-based rules apply only to specific tag prefixes. Use one rule per prefix: when tagPrefixList contains several prefixes, the rule selects only images that carry a tag for every one of them.
     ```bash
     aws ecr put-lifecycle-policy \
       --repository-name my-repo \
       --lifecycle-policy-text '{
         "rules": [
           {
             "rulePriority": 1,
             "description": "Keep last 20 prod images",
             "selection": {
               "tagStatus": "tagged",
               "tagPrefixList": ["prod"],
               "countType": "imageCountMoreThan",
               "countNumber": 20
             },
             "action": {"type": "expire"}
           },
           {
             "rulePriority": 2,
             "description": "Keep last 20 staging images",
             "selection": {
               "tagStatus": "tagged",
               "tagPrefixList": ["staging"],
               "countType": "imageCountMoreThan",
               "countNumber": 20
             },
             "action": {"type": "expire"}
           },
           {
             "rulePriority": 3,
             "description": "Delete untagged images older than 7 days",
             "selection": {
               "tagStatus": "untagged",
               "countType": "sinceImagePushed",
               "countUnit": "days",
               "countNumber": 7
             },
             "action": {"type": "expire"}
           }
         ]
       }'
     ```
  4. Rebuild and push the missing image:
     ```bash
     docker build -t account.dkr.ecr.region.amazonaws.com/repo:tag .
     aws ecr get-login-password | docker login --username AWS --password-stdin account.dkr.ecr.region.amazonaws.com
     docker push account.dkr.ecr.region.amazonaws.com/repo:tag
     ```
  5. Restart affected services. For Kubernetes workloads, kubectl rollout restart on the affected deployment serves the same purpose.
     ```bash
     aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment
     ```
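
Before applying a new policy as in the steps above, it may be worth asking ECR which images the policy would expire. The sketch below chains the `start-lifecycle-policy-preview`, `wait lifecycle-policy-preview-complete`, and `get-lifecycle-policy-preview` subcommands. The helper name `preview_lifecycle_policy` and the file name `policy.json` are assumptions for illustration, and the policy JSON is expected to be saved in that file.

```bash
# Print "<rule priority> <first tag> <digest>" for every image the policy
# in $2 would expire in repository $1, without deleting anything.
# preview_lifecycle_policy is a hypothetical helper name.
preview_lifecycle_policy() {
  local repo="$1" policy_file="$2"
  aws ecr start-lifecycle-policy-preview \
    --repository-name "$repo" \
    --lifecycle-policy-text "file://$policy_file" >/dev/null || return 1
  # Block until ECR has finished evaluating the preview.
  aws ecr wait lifecycle-policy-preview-complete \
    --repository-name "$repo" || return 1
  aws ecr get-lifecycle-policy-preview \
    --repository-name "$repo" \
    --query 'previewResults[*].[appliedRulePriority,imageTags[0],imageDigest]' \
    --output text
}
```

Running `preview_lifecycle_policy my-repo policy.json` and comparing the output against the images your services reference shows whether the policy is safe to apply. Untagged images have no first tag, so that column will be empty or a placeholder depending on the CLI output format.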

Prevention

  • Use unique, immutable tags (include the commit SHA or build number) and enable tag immutability on the repository
  • Scope count-based rules with tagPrefixList; images that match no rule are never expired
  • Set countNumber above the number of images your rollback window requires
  • Preview lifecycle policy changes with start-lifecycle-policy-preview before applying them, and monitor PutLifecyclePolicy events in CloudTrail
  • Reference images by digest (repo@sha256:...) in task definitions instead of mutable tags; the digest must still exist in the repository
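
The tagging and digest recommendations can be sketched as two small shell helpers. Both function names, the prefix-SHA-build tag layout, and the argument order are assumptions chosen for illustration; adapt them to your pipeline's conventions.

```bash
# Build a unique tag such as "prod-1a2b3c4-57" from a prefix, a commit SHA,
# and a build number. The prefix keeps the tag matchable by tagPrefixList.
make_image_tag() {
  local prefix="$1" sha="$2" build="$3" short
  short="$(printf '%s' "$sha" | cut -c1-7)"
  printf '%s-%s-%s\n' "$prefix" "$short" "$build"
}

# Resolve a tag to "<registry>/<repo>@<digest>" for use in a task definition.
# Relies on aws ecr describe-images; returns non-zero if the tag is missing.
pinned_image_ref() {
  local registry="$1" repo="$2" tag="$3" digest
  digest="$(aws ecr describe-images \
    --repository-name "$repo" \
    --image-ids imageTag="$tag" \
    --query 'imageDetails[0].imageDigest' \
    --output text)" || return 1
  printf '%s/%s@%s\n' "$registry" "$repo" "$digest"
}
```

Tag immutability itself is a repository setting, which can be enabled with `aws ecr put-image-tag-mutability --repository-name my-repo --image-tag-mutability IMMUTABLE`.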