What's Actually Happening

Your Terraform state file is damaged, unreadable, or contains inconsistent data. When you run any Terraform command, you get JSON parsing errors, checksum failures, or the state doesn't match your actual infrastructure.

The Error You'll See

``` Error: Failed to load state: state data is not valid JSON

Error: state snapshot was created by Terraform v1.5.0, which is newer than current v1.2.0; upgrade to Terraform v1.5.0 or greater to work with this state

Error: Checksum verification failed for state file

Error: failed to persist state to backend: the state is out of date and must be refreshed

Error: Invalid state file: missing required field 'version'

Error: Unexpected EOF in state file

Error: State contains resources with duplicate IDs ```

Why This Happens

State corruption typically occurs from:

  1. 1.Incomplete writes - Terraform crashed while writing state
  2. 2.Concurrent access - Multiple Terraform processes modified state simultaneously
  3. 3.Storage failures - S3, network, or disk issues during write
  4. 4.Manual edits - Someone modified the state file directly
  5. 5.Version incompatibility - Using newer Terraform on older state
  6. 6.Git merge conflicts - State file conflicts from branching
  7. 7.Backend configuration changes - Switching backends incorrectly

Step 1: Stop All Operations Immediately

Don't make it worse:

```bash # Kill any running Terraform processes pkill -f terraform

# Check for locks in remote backend # For S3 + DynamoDB: aws dynamodb scan --table-name terraform-locks

# Remove any stale locks if safe terraform force-unlock -force LOCK_ID ```

**Do NOT run terraform apply or terraform destroy until state is recovered.**

Step 2: Create Backup of Corrupted State

Before attempting any recovery:

```bash # Backup the corrupted state locally terraform state pull > corrupted-state-backup-$(date +%Y%m%d-%H%M%S).tfstate

# If using local state, copy it cp terraform.tfstate terraform.tfstate.corrupted-backup-$(date +%Y%m%d-%H%M%S)

# Also backup state backup if it exists cp terraform.tfstate.backup terraform.tfstate.backup-$(date +%Y%m%d-%H%M%S) ```

Step 3: Attempt Automatic Recovery

Terraform has built-in recovery for some issues:

```bash # Try to recover from backup terraform state pull

# If that fails, try manual backup restoration # For local state: cp terraform.tfstate.backup terraform.tfstate

# For remote state, restore from version-controlled backup aws s3 cp s3://my-terraform-state-bucket/backups/terraform.tfstate-2026-04-02 ./terraform.tfstate ```

Step 4: Diagnose Specific Corruption Types

JSON Parse Errors:

```bash # Check if file is valid JSON jq . terraform.tfstate 2>&1 | head -20

# Find where JSON is broken python3 -c "import json; json.load(open('terraform.tfstate'))" 2>&1

# Try to find the corruption point head -100 terraform.tfstate tail -100 terraform.tfstate ```

Version Mismatch:

```bash # Check state version terraform state pull | jq '.terraform_version'

# Check your Terraform version terraform version

# Upgrade Terraform to match or newer # Download from: https://www.terraform.io/downloads ```

Checksum Errors:

bash
# The state file includes a checksum
# Try to regenerate it by reformatting
terraform state pull > temp.tfstate
terraform state push temp.tfstate

Step 5: Manual State Repair

For serious corruption requiring manual intervention:

```bash # Pull state to local file for editing terraform state pull > state.json

# Try to fix JSON issues python3 << 'EOF' import json

with open('state.json', 'r') as f: content = f.read()

# Try to find and fix truncated JSON try: data = json.loads(content) print("State is valid JSON") except json.JSONDecodeError as e: print(f"JSON error at position {e.pos}: {e.msg}") # Attempt repair if content[-1] != '}': # Try to close the JSON properly lines = content.split('\n') # Remove incomplete last line for i in range(len(lines)-1, -1, -1): if lines[i].strip(): break content = '\n'.join(lines[:i+1]) # Add closing brackets open_brackets = content.count('{') - content.count('}') open_arrays = content.count('[') - content.count(']') content += '}' * open_brackets + ']' * open_arrays

try: data = json.loads(content) with open('state-repaired.json', 'w') as f: json.dump(data, f, indent=2) print("Repair successful") except: print("Manual repair required") EOF ```

Step 6: Restore from Backup

If you have state versioning enabled:

```bash # List available backups (S3 versioning) aws s3api list-object-versions \ --bucket my-terraform-state \ --prefix prod/terraform.tfstate \ --query 'Versions[].{Date:LastModified,VersionId:VersionId}'

# Restore a specific version aws s3api get-object \ --bucket my-terraform-state \ --key prod/terraform.tfstate \ --version-id YOUR_VERSION_ID \ recovered-state.tfstate

# For local versioning with Git git log --oneline terraform.tfstate git show HEAD~1:terraform.tfstate > recovered-state.tfstate ```

Step 7: Rebuild State from Infrastructure

When no backup is available, rebuild state:

```bash # First, understand what resources exist # For AWS: aws resourcegroupstaggingapi get-resources --tag-filters Key=Environment,Values=prod

# List all resources manually aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' aws rds describe-db-instances --query 'DBInstances[].DBInstanceIdentifier' aws s3 ls

# Import resources into new state terraform import aws_vpc.main vpc-12345678 terraform import aws_subnet.public subnet-12345678 terraform import aws_instance.web i-0123456789abcdef0 terraform import aws_db_instance.main my-db-instance

# Use import blocks for bulk import (Terraform 1.5+) # generate import blocks from existing infrastructure terraform plan -generate-config-out=imported.tf ```

Step 8: Fix Duplicate Resources

If state has duplicate IDs:

```bash # List all resources in state terraform state list

# Find duplicates terraform state list | sort | uniq -d

# Remove duplicate entry terraform state rm aws_instance.duplicate

# Re-import if needed terraform import aws_instance.main i-0123456789abcdef0 ```

Step 9: Repair Specific Resource Issues

For individual corrupted resources:

```bash # Check specific resource state terraform state show aws_instance.main

# If corrupted, remove and re-import terraform state rm aws_instance.main terraform import aws_instance.main i-0123456789abcdef0

# For modules terraform state rm module.eks.aws_eks_cluster.main terraform import module.eks.aws_eks_cluster.main my-cluster ```

Step 10: Push Recovered State

After repairing:

```bash # Validate the recovered state terraform state pull | jq .

# Push to backend terraform state push recovered-state.tfstate

# Verify state is usable terraform plan -refresh-only

# Should see: No changes. Your infrastructure matches the configuration. ```

Prevent Future Corruption

Enable state versioning and backups:

```hcl terraform { backend "s3" { bucket = "my-terraform-state" key = "prod/terraform.tfstate" region = "us-east-1" dynamodb_table = "terraform-locks"

# Enable versioning on the S3 bucket # (set in S3 bucket configuration) } } ```

S3 bucket versioning setup:

```bash # Enable versioning on state bucket aws s3api put-bucket-versioning \ --bucket my-terraform-state \ --versioning-configuration Status=Enabled

# Create a backup policy aws s3api put-bucket-lifecycle-configuration \ --bucket my-terraform-state \ --lifecycle-configuration file://lifecycle.json ```

lifecycle.json:

json
{
  "Rules": [
    {
      "ID": "KeepOldVersions",
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      },
      "NoncurrentVersionTransition": {
        "NoncurrentDays": 30,
        "StorageClass": "GLACIER"
      }
    }
  ]
}

Verify Recovery

Complete validation:

```bash # Check state integrity terraform state pull | jq '.resources | length' terraform state list | wc -l

# Compare with actual infrastructure terraform plan -refresh-only

# No errors should occur # Output should be: No changes ```

Run a full refresh:

bash
terraform refresh
terraform plan