# Fix AWS EC2 Instance Terminated Unexpectedly

You log into the AWS console and notice your production EC2 instance is gone. No warning, no error message in your application logs—just a terminated state or nothing at all. This scenario is more common than you'd think, and the causes range from Spot instance interruptions to accidental human error.

## Understanding What Happened

When an EC2 instance terminates, AWS doesn't just delete it silently. The termination event is recorded in multiple places, and understanding where to look is half the battle.

The first thing to check is whether the instance was a Spot instance. Spot instances can be reclaimed whenever AWS needs the capacity back, with only a two-minute interruption notice. If you weren't prepared for this, it can feel like a sudden, unexplained termination.

## Diagnosis Commands

Start by checking the instance's termination reason through CloudTrail or the EC2 API. If you have the instance ID, terminated instances remain queryable for about an hour after termination:

```bash
aws ec2 describe-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,StateTransitionReason,InstanceLifecycle]' \
  --output table
```

If the instance was a Spot instance, you'll see `spot` in the `InstanceLifecycle` field. To check the Spot request's status and any interruption message:

```bash
aws ec2 describe-spot-instance-requests \
  --spot-instance-request-ids sir-12345678 \
  --query 'SpotInstanceRequests[*].[SpotInstanceRequestId,Status.Code,Status.Message]' \
  --output table
```

For a broader view of recent terminations in your account, query CloudTrail:

```bash
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[*].[EventTime,Username,Resources[0].ResourceName]' \
  --output table
```

Check if Auto Scaling was involved by examining your Auto Scaling groups:

```bash
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 20 \
  --query 'Activities[*].[ActivityId,StartTime,Description,Cause]' \
  --output table
```

If you have CloudWatch configured, check for system status checks that might have triggered a replacement:

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Maximum \
  --output table
```

## Common Causes and Solutions

### Spot Instance Interruption

If your instance was a Spot instance, you need to architect for interruption resilience:

```bash
# Create a Spot instance with interruption handling
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --instance-market-options '{"MarketType": "spot", "SpotOptions": {"SpotInstanceType": "persistent", "InstanceInterruptionBehavior": "stop"}}' \
  --user-data file://spot-handler.sh
```

The user-data script should poll the instance metadata service for the two-minute interruption notice and shut down gracefully when one arrives:

```bash
#!/bin/bash
# Save this as spot-handler.sh
# Poll the instance metadata service (IMDSv2) for a Spot interruption notice.
# The token is valid for six hours; a long-running daemon should refresh it.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
while true; do
  HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action)
  if [ "$HTTP_CODE" = "200" ]; then
    # Interruption notice received: roughly two minutes remain for cleanup
    systemctl stop my-application
    sync
    break
  fi
  sleep 5
done &
```

A better approach is to use a Spot Fleet with capacity-optimized allocation:

```bash
aws ec2 request-spot-fleet \
  --spot-fleet-request-config file://spot-fleet-config.json
```

Where spot-fleet-config.json includes:

```json
{
  "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
  "AllocationStrategy": "capacity-optimized",
  "TargetCapacity": 2,
  "SpotPrice": "0.05",
  "LaunchSpecifications": [
    {
      "ImageId": "ami-0abcdef1234567890",
      "InstanceType": "t3.micro",
      "KeyName": "my-key-pair"
    }
  ]
}
```
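
After submitting the request, you can confirm the fleet reached a healthy state (the `sfr-` ID below is a placeholder for the one returned by the previous command):

```bash
aws ec2 describe-spot-fleet-requests \
  --spot-fleet-request-ids sfr-12345678-1234-1234-1234-123456789012 \
  --query 'SpotFleetRequestConfigs[*].[SpotFleetRequestId,SpotFleetRequestState,ActivityStatus]' \
  --output table
```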

### Accidental Termination

If CloudTrail shows a user terminated the instance, enable termination protection on your critical instances. Note that this only blocks termination through the API and console; it does not prevent Auto Scaling scale-in or Spot interruptions:

```bash
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --disable-api-termination
```

You can also set this at launch time:

```bash
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --disable-api-termination
```

### Auto Scaling Replacement

Auto Scaling can terminate instances during scale-in events or after failed health checks. Start by checking the group's size settings:

```bash
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-name my-asg \
  --query 'AutoScalingGroups[*].[AutoScalingGroupName,MinSize,MaxSize,DesiredCapacity]'
```

Check the termination policies:

```bash
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-name my-asg \
  --query 'AutoScalingGroups[*].TerminationPolicies'
```

If you need instances to stay running, consider using instance protection:

```bash
aws autoscaling set-instance-protection \
  --instance-ids i-1234567890abcdef0 \
  --auto-scaling-group-name my-asg \
  --protected-from-scale-in
```

### System Status Check Failure

If the underlying host hardware fails, AWS schedules the instance for retirement: EBS-backed instances are stopped, while instances with instance-store root volumes are terminated outright. Enable detailed monitoring to get one-minute granularity on instance metrics:

```bash
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0
```

Then create a CloudWatch alarm for automatic recovery (supported only on EBS-backed instances of certain instance types):

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "recover-instance-i-1234567890abcdef0" \
  --alarm-description "Recover instance on system status check failure" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover
```
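
To confirm the alarm was created and check its current state, describe it by the name used above:

```bash
aws cloudwatch describe-alarms \
  --alarm-names "recover-instance-i-1234567890abcdef0" \
  --query 'MetricAlarms[*].[AlarmName,StateValue,StateReason]' \
  --output table
```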

## Prevention Best Practices

Always enable termination protection for production instances. This blocks termination through the API and console until the attribute is explicitly disabled:

```bash
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --disable-api-termination
```

For stateful workloads, implement regular backup strategies using EBS snapshots:

```bash
# Create a snapshot
aws ec2 create-snapshot \
  --volume-id vol-1234567890abcdef0 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Or use AWS Backup for automated snapshots
aws backup create-backup-plan --backup-plan file://backup-plan.json
```
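
The command above expects a plan definition in backup-plan.json. A minimal sketch, assuming a daily schedule and 30-day retention (the plan and rule names are placeholders):

```json
{
  "BackupPlanName": "daily-ec2-backups",
  "Rules": [
    {
      "RuleName": "daily-snapshot",
      "TargetBackupVaultName": "Default",
      "ScheduleExpression": "cron(0 5 * * ? *)",
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 120,
      "Lifecycle": { "DeleteAfterDays": 30 }
    }
  ]
}
```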

Tag your instances so you can quickly identify their purpose:

```bash
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=Environment,Value=production Key=Critical,Value=true
```
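
Those tags also make it easy to audit which critical instances exist, for example:

```bash
# List all instances tagged as critical
aws ec2 describe-instances \
  --filters Name=tag:Critical,Values=true \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' \
  --output table
```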

## Verification Steps

After implementing your solution, verify the configuration is correct:

```bash
# Check termination protection is enabled
aws ec2 describe-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --attribute disableApiTermination \
  --query 'DisableApiTermination.Value'

# Verify instance protection in Auto Scaling
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'AutoScalingInstances[*].ProtectedFromScaleIn'

# Check Spot request status if applicable
aws ec2 describe-spot-instance-requests \
  --filters Name=instance-id,Values=i-1234567890abcdef0 \
  --query 'SpotInstanceRequests[*].Status.Code'
```

For comprehensive monitoring, set up a CloudWatch dashboard that tracks termination events and status checks across all your instances. This gives you visibility into patterns that might indicate systemic issues.
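
As a starting point, you can create such a dashboard from the CLI. A minimal sketch with a single status-check widget for the example instance (the dashboard name and layout are placeholders):

```bash
aws cloudwatch put-dashboard \
  --dashboard-name ec2-termination-watch \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "metrics": [["AWS/EC2", "StatusCheckFailed_System", "InstanceId", "i-1234567890abcdef0"]],
          "period": 300,
          "stat": "Maximum",
          "region": "us-east-1",
          "title": "System status check failures"
        }
      }
    ]
  }'
```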