Introduction

An AWS EKS cluster upgrade fails when its prerequisites are not met, such as addons that do not support the target Kubernetes version or managed node groups running too far behind the control plane. This guide provides step-by-step diagnosis and resolution using the AWS CLI.

Symptoms

Typical error output:

```bash
$ aws eks update-cluster-version --name my-cluster --kubernetes-version 1.29

An error occurred (InvalidParameterException) when calling the
UpdateClusterVersion operation
```

The exact message varies by blocker. If the update was accepted but failed later, list it with aws eks list-updates --name my-cluster and inspect the errors with aws eks describe-update.

Common Causes

An EKS upgrade commonly fails for one of these reasons:

  1. Pod specification issues, such as restrictive PodDisruptionBudgets or workloads on deprecated APIs, that block node replacement
  2. IAM role or policy misconfiguration on the cluster role or the calling principal
  3. Network issues, such as deleted cluster subnets or too few free IP addresses in them
  4. Addons or managed node groups that do not support the target Kubernetes version
  5. Capacity or scaling constraints when replacement nodes launch
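
One prerequisite worth checking before anything else: EKS upgrades the control plane one minor version at a time, so a request that skips a version is rejected. A minimal sketch of that pre-check in plain bash (version numbers are examples):

```shell
# Sketch of the one-minor-version-at-a-time rule. EKS rejects an
# update-cluster-version request that skips a minor version, so it is
# worth checking before calling the API. Versions here are examples.
can_upgrade_minor() {
  local current="$1" target="$2"
  local cur_minor="${current#*.}" tgt_minor="${target#*.}"
  if (( tgt_minor - cur_minor == 1 )); then
    echo "ok"
  else
    echo "blocked: ${current} -> ${target} is not a single minor step"
  fi
}

can_upgrade_minor 1.27 1.28   # prints "ok"
can_upgrade_minor 1.27 1.29   # prints "blocked: ..."
```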

Step-by-Step Fix

Step 1: Check Current State

```bash
aws eks describe-cluster --name my-cluster --query "cluster.version" --output text
kubectl get nodes -o wide
kubectl get pods -A
aws eks list-addons --cluster-name my-cluster
```

Step 2: Identify Root Cause

Review the output for failed updates, addon versions that do not support the target release, and nodes running an outdated kubelet. EKS also reports upgrade blockers in the cluster.health.issues field of the describe-cluster output.
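
Part of this review is node version skew: kubelets too far behind the target control-plane version must be upgraded first. A sketch of that check, with the allowed skew passed in because the Kubernetes policy has changed over time (two minor versions historically, three since 1.25); all values are examples:

```shell
# Sketch: flag nodes whose kubelet minor version is too far behind the
# target control-plane minor version. Inputs are a plain minor number
# plus "1.x" node versions; in practice they would come from
# `kubectl get nodes`. All values here are examples.
check_node_skew() {
  local target_minor="$1" allowed_skew="$2"
  shift 2
  local node_version node_minor ok=1
  for node_version in "$@"; do
    node_minor="${node_version#*.}"
    if (( target_minor - node_minor > allowed_skew )); then
      echo "node at ${node_version} too old for control plane 1.${target_minor}"
      ok=0
    fi
  done
  if (( ok )); then
    echo "all nodes within skew"
  fi
}

# Upgrading the control plane to 1.29, with nodes on 1.28 and 1.25:
check_node_skew 29 3 1.28 1.25   # flags the 1.25 node
```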

Step 3: Apply Primary Fix

```bash
# Primary fix: resolve the blocker, then request the control-plane update.
# The version number is an example; EKS moves one minor version at a time.
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.29
```

Step 4: Apply Alternative Fix

```bash
# Alternative fix: bring incompatible addons to a supported version first.
# Addon name and version values are examples.
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.29
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.16.0-eksbuild.1 --resolve-conflicts OVERWRITE
```

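Before retrying the cluster upgrade, confirm the installed addon version appears in the set of versions supported on the target Kubernetes release (in practice that set comes from aws eks describe-addon-versions). A minimal sketch of the comparison, with the list passed in so it runs without AWS access; version strings are examples:

```shell
# Sketch: check whether an installed addon version is in the supported
# list for the target Kubernetes version. In practice the list would
# come from `aws eks describe-addon-versions`; here it is passed as
# arguments so the check runs offline. All version strings are examples.
addon_supported() {
  local installed="$1"
  shift
  local v
  for v in "$@"; do
    if [ "$v" = "$installed" ]; then
      echo "supported"
      return 0
    fi
  done
  echo "unsupported: update the addon first"
  return 1
}

addon_supported v1.15.1-eksbuild.1 v1.15.1-eksbuild.1 v1.16.0-eksbuild.1   # prints "supported"
```
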
Step 5: Verify the Fix

```bash
aws eks describe-cluster --name my-cluster --query "cluster.status" --output text
aws eks describe-cluster --name my-cluster --query "cluster.version" --output text
```
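
An upgrade keeps the cluster in the UPDATING state for a while, so verification usually means polling until the status returns to ACTIVE (the AWS CLI also ships a built-in waiter, aws eks wait cluster-active). A generic polling sketch with the command injected as parameters so it can be exercised without AWS access:

```shell
# Sketch: poll a command until it prints the expected value or attempts
# run out. With AWS access the injected command would be
#   aws eks describe-cluster --name my-cluster \
#     --query cluster.status --output text
# and the expected value ACTIVE; here both are parameters.
wait_for_status() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i status
  for (( i = 1; i <= attempts; i++ )); do
    status="$("$@")"
    if [ "$status" = "$expected" ]; then
      echo "reached ${expected} after ${i} check(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "gave up waiting for ${expected}"
  return 1
}

# Exercised with a stub in place of the AWS CLI call:
wait_for_status ACTIVE 3 0 echo ACTIVE   # prints "reached ACTIVE after 1 check(s)"
```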

Common Pitfalls

  • Service quotas and limits
  • IAM policy misconfiguration
  • Missing required parameters
  • Region-specific service availability
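
For the IAM pitfall in particular, a quick sanity check is to scan the caller's policy document for the actions the upgrade path needs (eks:UpdateClusterVersion and eks:DescribeUpdate are real EKS actions). A sketch against an inline stand-in policy; in practice the JSON would come from aws iam get-policy-version:

```shell
# Sketch: check a policy document for the EKS actions an upgrade needs.
# The inline JSON is a stand-in; a real check would fetch the document
# with `aws iam get-policy-version` and should also account for
# wildcards like "eks:*", which a plain substring match misses.
policy='{"Statement":[{"Effect":"Allow","Action":["eks:DescribeCluster","eks:DescribeUpdate"]}]}'

for action in eks:UpdateClusterVersion eks:DescribeUpdate; do
  if printf '%s' "$policy" | grep -q "$action"; then
    echo "found ${action}"
  else
    echo "missing ${action}"
  fi
done
```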

Best Practices

  • Follow AWS Well-Architected Framework
  • Implement proper tagging strategy
  • Use Infrastructure as Code
  • Monitor and alert on key metrics