What's Actually Happening

You run terraform apply and it starts creating or updating resources, but then it hangs. The operation doesn't progress, doesn't fail with a clear error, and eventually times out. This leaves your infrastructure in an incomplete state and blocks further operations.

The Error You'll See

```
Error: waiting for EC2 Instance (i-0123456789abcdef0) creation: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)

Error: context deadline exceeded

Error: waiting for RDS Cluster (my-cluster) to become available: timeout while waiting for state to become 'available' (timeout: 30m0s)

Error: timeout - last error: dial tcp 10.0.0.1:22: i/o timeout
```

Or the eternal spinner:

```bash
module.eks.aws_eks_cluster.main: Still creating... [10m30s elapsed]
module.eks.aws_eks_cluster.main: Still creating... [11m0s elapsed]
module.eks.aws_eks_cluster.main: Still creating... [12m0s elapsed]
... (continues without end)
```

Why This Happens

Timeouts occur due to:

  1. Slow resource provisioning - Large databases and EKS clusters genuinely take 30+ minutes
  2. Default timeouts too short - Terraform's 10-minute default is insufficient for many resources
  3. Network connectivity issues - Terraform cannot reach the resource for status polling
  4. Provider polling bugs - Incorrect status detection in provider code
  5. API rate limiting - Cloud provider throttling API calls
  6. Resource quotas exceeded - Hit service limits preventing completion
  7. Dependency bottlenecks - Waiting on slow prerequisite resources
  8. Provisioner failures - Remote-exec or local-exec provisioners timing out
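Several of these causes can be ruled out from the CLI before touching Terraform. A sketch for cause 6 (quotas) on EC2, assuming the standard quota code L-1216C47A for Running On-Demand Standard instances; note this quota is measured in vCPUs, not instance count:

```shell
# Check the account's on-demand Standard instance quota (in vCPUs).
# L-1216C47A is the quota code for Running On-Demand Standard instances.
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --query 'Quota.Value'

# List running instance types to estimate current vCPU usage against it
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceType'
```

If usage is at or near the quota, new instances sit in 'pending' until capacity frees up, and no timeout value will help.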

Step 1: Increase Resource Timeout Values

Configure explicit timeouts for slow resources:

```hcl
# EC2 instances (default 10m may be too short)
resource "aws_instance" "large" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  timeouts {
    create = "30m" # Increase from default 10m
    update = "30m"
    delete = "30m"
  }
}

# RDS databases (can take 1-2 hours for Multi-AZ)
resource "aws_db_instance" "production" {
  allocated_storage = 100
  engine            = "postgres"
  engine_version    = "15.0"
  instance_class    = "db.r5.large"

  timeouts {
    create = "3h" # Multi-AZ creation is very slow
    update = "3h"
    delete = "2h"
  }
}

# EKS clusters (20-40 minutes typical)
resource "aws_eks_cluster" "main" {
  name     = "my-cluster"
  role_arn = aws_iam_role.eks.arn

  timeouts {
    create = "45m"
    update = "60m"
    delete = "45m"
  }
}

# EKS node groups
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "main"

  timeouts {
    create = "30m"
    update = "60m"
    delete = "30m"
  }
}

# CloudFront distributions (15-30 minutes)
resource "aws_cloudfront_distribution" "cdn" {
  # ... distribution configuration ...

  timeouts {
    create = "1h"
    update = "1h"
  }
}

# Lambda with VPC attachment (can be slow)
resource "aws_lambda_function" "vpc_lambda" {
  function_name = "vpc-function"

  vpc_config {
    subnet_ids         = aws_subnet.private[*].id
    security_group_ids = [aws_security_group.lambda.id]
  }

  timeouts {
    create = "15m"
    update = "15m"
  }
}
```

Step 2: Check Resource Status During Timeout

Verify what's happening with the actual resource:

```bash
# For EC2 instances
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].{State:State.Name,LaunchTime:LaunchTime}'

# For RDS databases
aws rds describe-db-instances \
  --db-instance-identifier my-db \
  --query 'DBInstances[].{Status:DBInstanceStatus,Progress:PercentProgress}'

# For EKS clusters
aws eks describe-cluster \
  --name my-cluster \
  --query 'cluster.{Status:status,Endpoint:endpoint}'

# For CloudFront
aws cloudfront get-distribution \
  --id E1234567890ABC \
  --query 'Distribution.{Status:Status,InProgressInvalidations:InProgressInvalidationBatches}'
```

If the resource is actually ready but Terraform timed out:

```bash
# Import the completed resource
terraform import aws_db_instance.production my-db-instance
```

Step 3: Handle Provisioner Timeouts

For SSH-based provisioners that timeout:

```hcl
resource "aws_instance" "with_provisioner" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo yum install -y nginx",
    ]

    connection {
      type        = "ssh"
      user        = "ec2-user"
      host        = self.public_ip
      private_key = file(var.private_key_path)
      timeout     = "10m" # Connection timeout
    }
  }

  timeouts {
    create = "30m" # Total resource timeout including provisioner
  }
}
```
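If the provisioner itself is the bottleneck, one alternative worth considering is cloud-init user data, which runs at first boot and removes the SSH polling loop entirely. A sketch with the same packages as above (resource name is illustrative):

```hcl
resource "aws_instance" "with_user_data" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  # Executed at first boot via cloud-init; there is no connection
  # block, so there is no SSH connection timeout for Terraform to hit.
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y nginx
  EOF
}
```

The trade-off is that Terraform reports the instance as created once it is 'running', before the boot script finishes.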

Step 4: Check Network Connectivity

Verify Terraform can reach resources for status polling:

```bash
# For resources requiring SSH access
ssh -v -i my-key.pem ec2-user@10.0.0.1

# Check security groups allow required ports
aws ec2 describe-security-groups \
  --group-ids sg-12345678 \
  --query 'SecurityGroups[].IpPermissions[]'

# Verify NAT Gateway is functioning for private resources
aws ec2 describe-nat-gateways --nat-gateway-ids nat-12345678

# Check VPN/Direct Connect status if applicable
aws ec2 describe-vpn-connections --vpn-connection-ids vpn-12345678
```

Step 5: Reduce API Rate Limiting

When timeouts are caused by provider API throttling:

```bash
# Reduce parallel operations (default is 10)
terraform apply -parallelism=5

# Or even lower for rate-sensitive operations
terraform apply -parallelism=2
```

Configure provider retry behavior:

```hcl
provider "aws" {
  region = "us-east-1"

  # Increase retry attempts for rate limiting
  max_retries = 25

  # Or use adaptive retry mode
  retry_mode = "adaptive"
}
```
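The same retry settings can also come from the environment, which the AWS provider reads at runtime; this is convenient in CI where editing provider blocks is awkward. A sketch:

```shell
# Equivalent to max_retries / retry_mode in the provider block;
# the AWS provider picks these up from the environment.
export AWS_MAX_ATTEMPTS=25
export AWS_RETRY_MODE=adaptive
# ...then run `terraform apply` as usual in the same shell.
```

Environment variables take effect without a configuration change, so they are easy to apply temporarily during a throttled rollout.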

Step 6: Break Large Operations into Stages

Apply resources incrementally to avoid timeout accumulation:

```bash
# Stage 1: Networking
terraform apply -target=module.networking

# Stage 2: Compute
terraform apply -target=module.compute

# Stage 3: Database
terraform apply -target=aws_db_instance.production

# Stage 4: Full apply for remaining resources
terraform apply
```

Or use -target for problematic resources:

```bash
# Apply just the slow resource with more time
terraform apply -target=aws_eks_cluster.main

# Wait, then apply the rest
terraform apply
```

Step 7: Recover from Timeout State

When apply times out leaving partial infrastructure:

```bash
# Check current state
terraform state list

# Find incomplete resources
terraform state show aws_instance.problematic

# If resource exists but state is wrong, force recreation
terraform taint aws_instance.problematic

# If resource doesn't exist but state thinks it does
terraform state rm aws_instance.problematic

# Re-import if resource was actually created
terraform import aws_instance.problematic i-0123456789abcdef0
```
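On recent Terraform versions, the same recovery can be expressed declaratively instead of with state commands, so the change goes through plan review. These are alternatives: use one or the other for a given address. A sketch:

```hcl
# Adopt a resource that was actually created before the timeout
# (import blocks, Terraform 1.5+); reviewed on the next plan/apply.
import {
  to = aws_instance.problematic
  id = "i-0123456789abcdef0"
}

# Or drop a resource from state without destroying it
# (removed blocks, Terraform 1.7+), the declarative
# equivalent of `terraform state rm`.
removed {
  from = aws_instance.problematic

  lifecycle {
    destroy = false
  }
}
```

The declarative forms are easier to audit in version control than one-off `terraform state` commands run from a workstation.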

Step 8: Enable Debugging for Timeout Analysis

Get detailed information on what's timing out:

```bash
# Enable Terraform debug logs
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log
terraform apply

# Analyze timeout patterns (-E enables the | alternation)
grep -Ei "timeout|deadline|exceeded|waiting" terraform-debug.log

# Find which API calls are slow
grep -Ei "polling|status|describe" terraform-debug.log
```
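When the log is long, a short awk pass can summarize the last elapsed time each resource reported. A sketch, assuming the standard `Still creating... [XmYs elapsed]` progress lines; the sample log here is illustrative:

```shell
# Sample apply output (stand-in for a real terraform-debug.log)
cat > /tmp/apply.log <<'EOF'
module.eks.aws_eks_cluster.main: Still creating... [10m30s elapsed]
aws_db_instance.production: Still creating... [55m0s elapsed]
module.eks.aws_eks_cluster.main: Still creating... [12m0s elapsed]
EOF

# Keep the last elapsed figure reported per resource address:
# strip brackets/colons, then let later lines overwrite earlier ones.
summary=$(awk '/Still creating/ { gsub(/[\[\]:]/, ""); last[$1] = $4 }
               END { for (r in last) print r, last[r] }' /tmp/apply.log)
echo "$summary"
```

This prints one line per resource with its final reported elapsed time, which quickly shows where the wall-clock time went.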

Step 9: Handle Specific Resource Timeouts

RDS Multi-AZ Creation:

```hcl
resource "aws_db_instance" "multi_az" {
  multi_az          = true # Makes creation much slower
  allocated_storage = 100
  engine            = "mysql"

  timeouts {
    create = "4h" # Multi-AZ with large storage takes hours
    update = "4h"
    delete = "2h"
  }

  skip_final_snapshot = true # Faster deletion
}
```

EKS Cluster and Node Groups:

```hcl
resource "aws_eks_cluster" "main" {
  name = "production"

  timeouts {
    create = "45m"
    delete = "45m"
  }
}

# Create node group separately after cluster
resource "aws_eks_node_group" "workers" {
  depends_on = [aws_eks_cluster.main]

  timeouts {
    create = "30m"
    update = "60m"
  }
}
```

S3 Bucket with Many Objects:

```bash
# For buckets with thousands of objects, deletion can timeout
terraform apply -target=aws_s3_bucket.main

# Then manually empty before destroy
aws s3 rm s3://my-bucket --recursive
terraform destroy -target=aws_s3_bucket.main
```
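Alternatively, the provider can empty the bucket itself at destroy time: `force_destroy` deletes all objects (including object versions) before removing the bucket. A sketch; note that this also makes accidental destroys more destructive:

```hcl
resource "aws_s3_bucket" "main" {
  bucket = "my-bucket"

  # Provider deletes all objects (and object versions) on destroy,
  # avoiding the manual `aws s3 rm` step.
  force_destroy = true
}
```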

Verify the Fix

After adding timeout configuration:

```bash
# Validate configuration
terraform validate

# Run plan to check timeouts are recognized
terraform plan

# Apply with increased timeouts
terraform apply
```

Verify resources complete within configured time:

```bash
# Monitor during creation
watch -n 30 'aws eks describe-cluster --name my-cluster --query cluster.status'

# After successful apply
terraform state list
```

Prevention Best Practices

Pre-configure timeouts in your resource templates:

```hcl
# Standard timeout template for all large resources
resource "aws_db_instance" "template" {
  timeouts {
    create = "2h"
    update = "2h"
    delete = "1h"
  }
}

resource "aws_eks_cluster" "template" {
  timeouts {
    create = "45m"
    update = "60m"
    delete = "45m"
  }
}
```
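To keep these values consistent across a codebase, the timeout strings can come from a shared `locals` map instead of being repeated per resource; a sketch (names are illustrative):

```hcl
locals {
  db_timeouts = {
    create = "2h"
    update = "2h"
    delete = "1h"
  }
}

resource "aws_db_instance" "example" {
  allocated_storage = 100
  engine            = "postgres"
  instance_class    = "db.r5.large"

  # Timeout values referenced from one place, so tuning them
  # later is a single-line change.
  timeouts {
    create = local.db_timeouts.create
    update = local.db_timeouts.update
    delete = local.db_timeouts.delete
  }
}
```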

Document expected creation times:

```markdown
## Resource Creation Times
- RDS Single-AZ: 15-30 minutes
- RDS Multi-AZ: 45-120 minutes
- EKS Cluster: 20-40 minutes
- EKS Node Group: 10-20 minutes per group
- CloudFront Distribution: 15-30 minutes
```