What's Actually Happening
Your Terraform state file is damaged, unreadable, or contains inconsistent data. When you run any Terraform command, you receive JSON parsing errors, checksum failures, or the state doesn't accurately reflect your actual infrastructure. This is a critical situation requiring careful recovery.
The Error You'll See
``` Error: Failed to load state: state data is not valid JSON
Error reading state: invalid character 'e' looking for beginning of value
Error: state snapshot was created by Terraform v1.5.0, which is newer than current v1.2.0
Error: Checksum verification failed for state file
Error: Invalid state file: missing required field 'version'
Error: Unexpected EOF in state file
Error: State contains resources with duplicate IDs ```
Why This Happens
State corruption typically results from:
- 1.Incomplete writes - Terraform crashed or was killed mid-operation while writing state
- 2.Concurrent modifications - Multiple Terraform processes modified state simultaneously without proper locking
- 3.Storage failures - S3, network, or disk issues during state persistence
- 4.Manual edits - Someone modified the state file directly with incorrect JSON
- 5.Version incompatibility - Using newer Terraform version on state, then reverting
- 6.Git merge conflicts - State file conflicts improperly resolved
- 7.Backend misconfiguration - Switching backends without proper state migration
- 8.Disk full - Storage ran out of space during state write
Step 1: Stop All Operations Immediately
Do not make the situation worse:
```bash # Kill any running Terraform processes pkill -f terraform
# Check for background processes ps aux | grep terraform
# Release any stale locks terraform force-unlock -force LOCK_ID # If applicable
# For S3 + DynamoDB: aws dynamodb scan --table-name terraform-locks ```
**Critical: Do NOT run terraform apply or terraform destroy until state is recovered.**
Step 2: Create Backup of Corrupted State
Before any recovery attempt, preserve the corrupted state:
```bash # Backup corrupted state locally terraform state pull > corrupted-state-backup-$(date +%Y%m%d-%H%M%S).tfstate
# For local state files cp terraform.tfstate terraform.tfstate.corrupted-backup-$(date +%Y%m%d-%H%M%S)
# Backup the automatic backup if it exists cp terraform.tfstate.backup terraform.tfstate.backup-$(date +%Y%m%d-%H%M%S) 2>/dev/null || true
# Store these backups in a safe location mkdir -p ~/terraform-state-backups/ mv *.tfstate* ~/terraform-state-backups/ ```
Step 3: Diagnose the Corruption Type
Determine what kind of corruption occurred:
```bash # Check if file is valid JSON jq . terraform.tfstate 2>&1 | head -20
# Python JSON validation with error location python3 -c "import json; json.load(open('terraform.tfstate'))" 2>&1
# Check file structure head -100 terraform.tfstate tail -100 terraform.tfstate wc -l terraform.tfstate
# Check state version terraform state pull 2>&1 | grep -i "version" ```
Specific diagnosis patterns:
```bash # For truncated files - check if JSON ends properly tail -c 20 terraform.tfstate
# For encoding issues file terraform.tfstate
# For zero-byte or empty files ls -la terraform.tfstate ```
Step 4: Attempt Automatic Recovery
Terraform has some built-in recovery mechanisms:
```bash # Try pulling state - sometimes backend has valid copy terraform state pull > recovered-state.tfstate
# Check if recovered state is valid jq . recovered-state.tfstate
# If valid, push it back terraform state push recovered-state.tfstate ```
Use the automatic backup:
```bash # Terraform creates .tfstate.backup before writes # Check if backup exists and is valid ls -la terraform.tfstate.backup jq . terraform.tfstate.backup
# If backup is valid cp terraform.tfstate.backup terraform.tfstate terraform plan -refresh-only ```
Step 5: Restore from Backend Version History
If using S3 with versioning enabled:
```bash # List available versions aws s3api list-object-versions \ --bucket my-terraform-state \ --prefix prod/terraform.tfstate \ --query 'Versions[].{Date:LastModified,VersionId:VersionId}' \ --output table
# Download a specific version aws s3api get-object \ --bucket my-terraform-state \ --key prod/terraform.tfstate \ --version-id YOUR_VERSION_ID \ recovered-state.tfstate
# Verify recovered state jq . recovered-state.tfstate
# Restore to backend aws s3 cp recovered-state.tfstate s3://my-terraform-state/prod/terraform.tfstate ```
For Git-tracked state (though this is not recommended):
git log --oneline terraform.tfstate
git show HEAD~5:terraform.tfstate > recovered-state.tfstateStep 6: Manual JSON Repair
For minor JSON corruption, attempt manual repair:
```python # Save this as repair-state.py import json import sys
def repair_json(content): # Attempt to find truncation point lines = content.strip().split('\n')
# Find last complete line for i in range(len(lines) - 1, -1, -1): line = lines[i].strip() if not line: continue # Check if line ends properly if line.endswith(',') or line.endswith('}') or line.endswith(']'): break
# Truncate to valid portion content = '\n'.join(lines[:i+1])
# Count open brackets/arrays open_braces = content.count('{') - content.count('}') open_arrays = content.count '[' - content.count(']')
# Add closing brackets while open_arrays > 0: content += ']' open_arrays -= 1 while open_braces > 0: content += '}' open_braces -= 1
return content
with open('terraform.tfstate', 'r') as f: content = f.read()
repaired = repair_json(content)
try: data = json.loads(repaired) with open('repaired-state.tfstate', 'w') as f: json.dump(data, f, indent=2) print("Repair successful") except json.JSONDecodeError as e: print(f"Repair failed: {e}") sys.exit(1) ```
Run the repair:
python3 repair-state.py
terraform state push repaired-state.tfstateStep 7: Rebuild State from Existing Infrastructure
When no backup is available and repair fails:
```bash # First, inventory your actual infrastructure # AWS example aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' aws rds describe-db-instances --query 'DBInstances[].DBInstanceIdentifier' aws s3 ls aws vpc describe-vpcs --query 'Vpcs[].VpcId'
# Create an empty state rm terraform.tfstate terraform init
# Import each resource terraform import aws_vpc.main vpc-12345678 terraform import aws_subnet.public[0] subnet-11111111 terraform import aws_instance.web i-0123456789abcdef0 terraform import aws_db_instance.production my-db-instance
# For Terraform 1.5+, use import blocks import { to = aws_vpc.main id = "vpc-12345678" } ```
Generate configuration from imports (Terraform 1.5+):
terraform plan -generate-config-out=imported-resources.tfStep 8: Fix Duplicate Resource IDs
If state has duplicate entries:
```bash # List all resources terraform state list
# Find duplicates terraform state list | sort | uniq -d
# Remove the duplicate terraform state rm aws_instance.web_duplicate
# Keep the correct one terraform state show aws_instance.web ```
Step 9: Verify State Integrity After Recovery
Comprehensive validation:
```bash # Check state is valid JSON terraform state pull | jq .
# Count resources terraform state list | wc -l terraform state pull | jq '.resources | length'
# Check for orphaned or missing resources terraform plan -refresh-only
# Compare with actual infrastructure terraform state pull | jq '.resources[].attributes.id' ```
Step 10: Enable State Protection
Prevent future corruption:
```bash # Enable S3 bucket versioning aws s3api put-bucket-versioning \ --bucket my-terraform-state \ --versioning-configuration Status=Enabled
# Create backup lifecycle policy aws s3api put-bucket-lifecycle-configuration \ --bucket my-terraform-state \ --lifecycle-configuration '{ "Rules": [{ "ID": "StateBackupPolicy", "Status": "Enabled", "NoncurrentVersionExpiration": {"NoncurrentDays": 90}, "NoncurrentVersionTransition": { "NoncurrentDays": 30, "StorageClass": "GLACIER" } }] }' ```
Configure proper backend:
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}Verify Complete Recovery
Final validation steps:
```bash # Verify state works terraform state list terraform plan -refresh-only
# Should output: No changes. Infrastructure matches configuration.
# Test a minor operation terraform plan ```
Document the incident:
echo "$(date): State corruption recovery - $(whoami)" >> ~/terraform-state-backups/recovery-log.txt
echo "Cause: [documented cause]" >> ~/terraform-state-backups/recovery-log.txt
echo "Resolution: [steps taken]" >> ~/terraform-state-backups/recovery-log.txt