# Fix AWS EC2 Instance Terminated Unexpectedly
You log into the AWS console and notice your production EC2 instance is gone. No warning, no error message in your application logs—just a terminated state or nothing at all. This scenario is more common than you'd think, and the causes range from Spot instance interruptions to accidental human error.
## Understanding What Happened
When an EC2 instance terminates, AWS doesn't just delete it silently. The termination event is recorded in multiple places, and understanding where to look is half the battle.
The first thing to check is whether the instance was a Spot instance. Spot instances can be interrupted with only a two-minute warning when AWS needs the capacity back. If you weren't prepared for this, it can feel like a sudden, unexplained termination.
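On the instance itself, you can watch for that warning by polling the instance metadata service: the `spot/instance-action` path returns 404 until an interruption is scheduled, then a small JSON notice. A minimal sketch, assuming IMDSv2 and using hypothetical helper names (`parse_action`, `check_spot_interruption`):

```shell
#!/bin/bash
# Sketch: detect a Spot interruption notice via the instance metadata
# service (IMDSv2). Once an interruption is scheduled, the endpoint
# returns a body like: {"action":"terminate","time":"2024-05-01T12:00:00Z"}

# parse_action: pull the "action" field out of the notice JSON
# (hypothetical helper; plain sed, so no extra tools are needed)
parse_action() {
  echo "$1" | sed -n 's/.*"action"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# check_spot_interruption: print the pending action ("terminate",
# "stop", or "hibernate") if a notice exists; only meaningful on EC2
check_spot_interruption() {
  local token body
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  body=$(curl -s -f -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action") || return 1
  parse_action "$body"
}
```

A supervisor process can call `check_spot_interruption` every few seconds and start draining work as soon as it prints an action.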
## Diagnosis Commands
Start by checking the instance's termination reason through CloudTrail or the EC2 API. If you have the instance ID, you can still query information about terminated instances for a period of time:
```bash
aws ec2 describe-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,StateTransitionReason,InstanceLifecycle]' \
  --output table
```

If the instance was a Spot instance, you'll see `spot` in the `InstanceLifecycle` field. To check Spot interruption notices:
```bash
aws ec2 describe-spot-instance-requests \
  --spot-instance-request-ids sir-12345678 \
  --query 'SpotInstanceRequests[*].[SpotInstanceRequestId,Status.Code,Status.Message]' \
  --output table
```

For a broader view of recent terminations in your account, query CloudTrail:
```bash
# Note: GNU date syntax; on macOS use: date -u -v-7d +%Y-%m-%dT%H:%M:%SZ
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[*].[EventTime,Username,Resources[0].ResourceName]' \
  --output table
```

Check if Auto Scaling was involved by examining your Auto Scaling groups:
```bash
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 20 \
  --query 'Activities[*].[ActivityId,StartTime,Description,Cause]' \
  --output table
```

If you have CloudWatch configured, check for system status checks that might have triggered a replacement:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Maximum \
  --output table
```

## Common Causes and Solutions
### Spot Instance Interruption
If your instance was a Spot instance, you need to architect for interruption resilience:
```bash
# Create a Spot instance with interruption handling
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"persistent","InstanceInterruptionBehavior":"stop"}}' \
  --user-data file://spot-handler.sh
```

The user-data script should include logic to handle the two-minute warning:
```bash
#!/bin/bash
# Save this as spot-handler.sh
# Poll the instance metadata service for a Spot interruption notice.
# The endpoint returns 404 until an interruption is scheduled, so
# `curl -f` fails harmlessly in the normal case.
while true; do
  if curl -s -f http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    # Graceful shutdown logic here
    systemctl stop my-application
    sync
    break
  fi
  sleep 5
done &
```

A better approach is to use a Spot Fleet with capacity-optimized allocation:
```bash
aws ec2 request-spot-fleet \
  --spot-fleet-request-config file://spot-fleet-config.json
```

Where `spot-fleet-config.json` includes:
```json
{
  "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
  "AllocationStrategy": "capacityOptimized",
  "TargetCapacity": 2,
  "SpotPrice": "0.05",
  "LaunchSpecifications": [
    {
      "ImageId": "ami-0abcdef1234567890",
      "InstanceType": "t3.micro",
      "KeyName": "my-key-pair"
    }
  ]
}
```

### Accidental Termination
If CloudTrail shows a user terminated the instance, enable termination protection on your critical instances:
```bash
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --disable-api-termination
```

You can also set this at launch time:
```bash
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --disable-api-termination
```

### Auto Scaling Replacement
Auto Scaling might terminate instances during scale-in events or health check failures. To understand why:
```bash
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].[AutoScalingGroupName,MinSize,MaxSize,DesiredCapacity]'
```

Check the termination policies:
```bash
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].TerminationPolicies'
```

If you need instances to stay running, consider using instance protection:
```bash
aws autoscaling set-instance-protection \
  --instance-ids i-1234567890abcdef0 \
  --auto-scaling-group-name my-asg \
  --protected-from-scale-in
```

### System Status Check Failure
If the underlying hardware failed, AWS might terminate and replace the instance. This is more common with instance types that don't support automatic recovery. Enable detailed monitoring:
```bash
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0
```

Then create a CloudWatch alarm for automatic recovery:
```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "recover-instance-i-1234567890abcdef0" \
  --alarm-description "Recover instance on system status check failure" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover
```

## Prevention Best Practices
Always enable termination protection for production instances. While the attribute is set, termination requests from the API and console are rejected until you explicitly disable it, which stops accidental terminations:
```bash
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --disable-api-termination
```

For stateful workloads, implement regular backup strategies using EBS snapshots:
```bash
# Create a snapshot
aws ec2 create-snapshot \
  --volume-id vol-1234567890abcdef0 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Or use AWS Backup for automated snapshots
aws backup create-backup-plan --backup-plan file://backup-plan.json
```
Tag your instances so you can quickly identify their purpose:
```bash
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=Environment,Value=production Key=Critical,Value=true
```

## Verification Steps
After implementing your solution, verify the configuration is correct:
```bash
# Check termination protection is enabled
aws ec2 describe-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --attribute disableApiTermination \
  --query 'DisableApiTermination.Value'

# Verify instance protection in Auto Scaling
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'AutoScalingInstances[*].ProtectedFromScaleIn'

# Check Spot request status if applicable
aws ec2 describe-spot-instance-requests \
  --filters Name=instance-id,Values=i-1234567890abcdef0 \
  --query 'SpotInstanceRequests[*].Status.Code'
```
For comprehensive monitoring, set up a CloudWatch dashboard that tracks termination events and status checks across all your instances. This gives you visibility into patterns that might indicate systemic issues.
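As a starting point, a minimal dashboard body might look like the sketch below (the region and instance ID are placeholders; pass the file with `aws cloudwatch put-dashboard --dashboard-name terminations --dashboard-body file://dashboard.json`):

```json
{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "title": "System status check failures",
        "region": "us-east-1",
        "stat": "Maximum",
        "period": 300,
        "metrics": [
          ["AWS/EC2", "StatusCheckFailed_System", "InstanceId", "i-1234567890abcdef0"]
        ]
      }
    }
  ]
}
```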