Introduction

AWS Lambda timeout and memory errors occur when a function exceeds its configured execution time limit, runs out of allocated memory, or fails under resource constraints. Functions are bounded by a configurable timeout (max 15 minutes), memory (128 MB-10 GB), ephemeral /tmp storage (512 MB-10 GB), and account-level concurrency limits. Common causes include a timeout set too low for long-running operations, memory set too low (causing OOM kills), VPC misconfiguration blocking external calls, database connection exhaustion, cold start latency eating into execution time, slow downstream services, oversized event payloads, recursive invocations, and missing provisioned concurrency on latency-sensitive workloads. Fixing these errors requires understanding the Lambda execution model, resource sizing, VPC networking, external service integration, and CloudWatch metrics. This guide provides production-proven troubleshooting for Lambda errors across synchronous (API Gateway) and asynchronous (SQS, EventBridge) invocation patterns.

Symptoms

  • Task timed out after 3.001 seconds
  • RequestId: xxx Error: Runtime exited with error: signal: killed (OOM)
  • Function succeeds locally but times out in Lambda
  • Intermittent timeouts under load
  • Connection timeout to RDS/VPC resources
  • Cold start latency > 5 seconds (Java/.NET)
  • Too many open files error
  • /tmp directory full errors
  • Concurrent invocation limit exceeded
  • Response size exceeds API Gateway 10MB limit
  • Function memory usage consistently > 90% of allocation
  • DLQ filling with timeout errors

Common Causes

  • Timeout configured too low for operation complexity
  • Memory allocation too small causing slow CPU and OOM
  • VPC-enabled function missing NAT Gateway for internet access
  • Database connections not reused (new connection per invocation)
  • External API calls without timeout configuration
  • Large event payloads causing processing delays
  • S3 operations on large objects without streaming
  • Recursive Lambda invocations
  • Downstream service (RDS, DynamoDB, API) slow response
  • Layer initialization adding to cold start
  • Node.js event loop keeping function alive
  • Python packages too large causing slow initialization
  • Java class loading during cold start
  • Connection pool exhaustion to external services

Step-by-Step Fix

### 1. Diagnose timeout issues

Check CloudWatch Logs:

```bash
# Find the function log group
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function

# Get recent logs with timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "Task timed out" \
  --limit 10

# Check duration statistics
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "REPORT RequestId" \
  --limit 100

# REPORT line format:
# REPORT RequestId: xxx Duration: 2999.12 ms Billed Duration: 3000 ms Memory Size: 256 MB Max Memory Used: 241 MB Init Duration: 456.78 ms

# Parse duration and memory stats
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "REPORT RequestId" \
  --query 'events[*].message' \
  --output text | grep REPORT | awk '{print $5, $11, $15}'
```
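If you would rather post-process the log lines in Python than with `awk`, a REPORT line can be parsed with a small helper (a sketch; the field layout follows the REPORT format shown above):

```python
import re

# Pull the key fields out of a Lambda REPORT log line
REPORT_RE = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms.*"
    r"Memory Size: (?P<memory_size>\d+) MB\s+"
    r"Max Memory Used: (?P<max_used>\d+) MB"
)

def parse_report(line):
    """Return (duration_ms, memory_size_mb, max_used_mb), or None if no match."""
    m = REPORT_RE.search(line)
    if not m:
        return None
    return (float(m.group("duration")),
            int(m.group("memory_size")),
            int(m.group("max_used")))

line = ("REPORT RequestId: xxx Duration: 2999.12 ms Billed Duration: 3000 ms "
        "Memory Size: 256 MB Max Memory Used: 241 MB Init Duration: 456.78 ms")
print(parse_report(line))  # (2999.12, 256, 241)
```

Feeding every REPORT line through this parser lets you compute duration percentiles and memory headroom locally without CloudWatch Insights.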

Analyze duration breakdown:

```python
# Add timing to your function
import time
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    start = time.time()

    # Initialization (cold start)
    init_start = time.time()
    # ... lazy initialization ...
    logger.info(f"Init: {time.time() - init_start:.3f}s")

    # Main processing
    process_start = time.time()
    result = process_event(event)
    logger.info(f"Process: {time.time() - process_start:.3f}s")

    # External calls (external_time / db_time are timings you capture
    # around each call, as with process_start above)
    logger.info(f"External API: {external_time:.3f}s")
    logger.info(f"Database: {db_time:.3f}s")

    # Total
    total = time.time() - start
    logger.info(f"Total: {total:.3f}s")
    logger.info(f"Remaining time: {context.get_remaining_time_in_millis()}ms")

    return result
```

CloudWatch Insights query:

```sql
# Find timeout patterns
fields @timestamp, @message, @requestId
| filter @message like /Task timed out/
| stats count() by bin(1h)

# Duration percentiles
fields @timestamp, @duration
| filter ispresent(@duration)
| stats avg(@duration) as avg,
        pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99
  by bin(1h)

# Memory usage over time
fields @timestamp, @memorySize, @maxMemoryUsed
| stats avg(@maxMemoryUsed / @memorySize * 100) as memory_percent by bin(1h)
```

### 2. Fix timeout configuration

Calculate appropriate timeout:

```python
# Rule of thumb: timeout = p99 duration * 1.5 (50% buffer)
# Example: if p99 is 2 seconds, set the timeout to 3 seconds.
# For critical paths, use a lower timeout to fail fast.

# Current timeout (seconds)
CURRENT_TIMEOUT = 3

# If you consistently hit the timeout:
# 1. Optimize code first
# 2. Increase the timeout if legitimately needed
# 3. Move long operations to async processing (SQS, Step Functions)

# Effective maximum timeout by invocation type:
# - API Gateway (sync): 29 seconds (API Gateway integration timeout)
# - ALB (sync): bounded by the ALB idle timeout (default 60 s, configurable)
# - SQS/SNS (async): 900 seconds (15 minutes, the Lambda maximum)
# - EventBridge (async): 900 seconds
```
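The rule of thumb above can be wrapped in a small helper that also clamps the result to the effective ceiling for each trigger (an illustrative sketch; the caps follow the comments above):

```python
import math

# Effective timeout ceilings per trigger (seconds); Lambda's own max is 900
MAX_TIMEOUT = {
    "api_gateway": 29,
    "sqs": 900,
    "eventbridge": 900,
}

def recommended_timeout(p99_seconds, trigger="api_gateway", buffer=1.5):
    """p99 duration * buffer, rounded up, clamped to the trigger's ceiling."""
    target = math.ceil(p99_seconds * buffer)
    return min(max(target, 1), MAX_TIMEOUT[trigger])

print(recommended_timeout(2))          # 3
print(recommended_timeout(60, "sqs"))  # 90
print(recommended_timeout(60))         # 29 (capped by API Gateway)
```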

Update function timeout:

```bash
# Increase timeout
aws lambda update-function-configuration \
  --function-name my-function \
  --timeout 30

# Increase memory (also increases CPU proportionally)
aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 1024

# CPU scales linearly with memory:
# ~1769 MB = 1 full vCPU; below that, a proportional share
# (e.g. 128 MB ~ 0.07 vCPU, 1024 MB ~ 0.58 vCPU); 10240 MB = 6 vCPUs

# For CPU-intensive work, increase memory to get more CPU
```

Handle timeout gracefully:

```python
# Python - check remaining time
import json

def lambda_handler(event, context):
    remaining_ms = context.get_remaining_time_in_millis()

    if remaining_ms < 1000:  # Less than 1 second remaining
        return {
            'statusCode': 504,
            'body': json.dumps('Function timeout imminent')
        }

    try:
        # Set an internal timeout slightly before the Lambda timeout
        internal_timeout = max(remaining_ms - 500, 1000) / 1000
        result = process_with_timeout(event, timeout=internal_timeout)
        return result
    except TimeoutError:
        return {
            'statusCode': 504,
            'body': json.dumps('Processing timed out')
        }
```

```javascript
// Node.js - track remaining time
exports.handler = async (event, context) => {
  const remainingMs = context.getRemainingTimeInMillis();

  if (remainingMs < 1000) {
    throw new Error('Timeout imminent');
  }

  // Race the work against an internal timeout
  const timeout = Math.max(remainingMs - 500, 1000);

  const result = await Promise.race([
    processEvent(event),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), timeout)
    )
  ]);

  return result;
};
```
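The Python example above calls `process_with_timeout`, which is not a standard function. One way to implement it (a sketch, assuming the work lives in a `process_event` function) is to run the work in a thread and bound the wait with `concurrent.futures`:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=1)

def process_with_timeout(event, timeout):
    """Run process_event(event), but give up after `timeout` seconds."""
    future = _executor.submit(process_event, event)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()  # Best effort; a thread already running cannot be killed
        raise TimeoutError(f"processing exceeded {timeout}s")

# Slow stand-in for process_event, for illustration
def process_event(event):
    time.sleep(0.2)
    return {"ok": True}

print(process_with_timeout({}, timeout=1))  # {'ok': True}
```

Note the caveat in the comment: a timed-out thread keeps running in the background until the invocation ends, so this bounds your response time, not the work itself.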

### 3. Fix memory issues

Diagnose OOM kills:

```bash
# OOM-killed functions log:
# RequestId: xxx Error: Runtime exited with error: signal: killed
# Exit code 137 (128 + 9 = SIGKILL)

# Check the memory usage pattern
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "REPORT RequestId" \
  --limit 100 \
  --query 'events[].message' \
  --output text | grep -oP 'Max Memory Used: \d+' | sort | uniq -c

# If Max Memory Used is consistently > 90% of Memory Size, increase memory
```

Size memory correctly:

```python
# Memory profiling for Python
import tracemalloc

def lambda_handler(event, context):
    tracemalloc.start()

    # Your code
    result = process(event)

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"Current memory: {current / 1024 / 1024:.2f} MB")
    print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")

    # Set memory to roughly peak * 1.5 (50% buffer):
    # if peak is 200 MB, allocate at least 300 MB
    return result
```
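To turn the measured peak into an allocation, a small helper (a sketch) applies the 50% buffer and clamps to Lambda's valid 128-10240 MB range:

```python
import math

def recommended_memory(peak_mb, buffer=1.5):
    """Peak usage * buffer, clamped to Lambda's 128-10240 MB range."""
    target = math.ceil(peak_mb * buffer)
    return min(max(target, 128), 10240)

print(recommended_memory(200))   # 300
print(recommended_memory(40))    # 128 (floor)
print(recommended_memory(9000))  # 10240 (ceiling)
```

Remember that raising memory also raises CPU, so the cost-optimal setting for CPU-bound functions is often higher than the memory floor; the Lambda Power Tuning tool mentioned under Prevention explores this trade-off empirically.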

```javascript
// Memory profiling for Node.js
exports.handler = async (event) => {
  const startMem = process.memoryUsage();

  const result = await processEvent(event);

  const endMem = process.memoryUsage();
  console.log(`Heap used: ${(endMem.heapUsed / 1024 / 1024).toFixed(2)} MB`);
  console.log(`RSS: ${(endMem.rss / 1024 / 1024).toFixed(2)} MB`);

  return result;
};
```

Memory optimization patterns:

```python
# PROBLEM: loading a large file into memory
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Loads the entire 500 MB file into memory!
    obj = s3.get_object(Bucket='my-bucket', Key='large-file.csv')
    data = obj['Body'].read()  # OOM with low memory
    return process(data)

# FIX: stream processing
def lambda_handler(event, context):
    obj = s3.get_object(Bucket='my-bucket', Key='large-file.csv')

    # Stream line by line
    for line in obj['Body'].iter_lines():
        process_line(line)  # Process incrementally

    return {'statusCode': 200}

# PROBLEM: accumulating all results in memory
def lambda_handler(event, context):
    results = []
    for item in event['items']:
        results.append(process(item))  # Grows with input size
    return json.dumps(results)

# FIX: spill results to /tmp, upload, return a pointer
def lambda_handler(event, context):
    # /tmp holds 512 MB by default (configurable up to 10 GB)
    with open('/tmp/results.json', 'w') as f:
        for item in event['items']:
            f.write(json.dumps(process(item)) + '\n')

    s3.upload_file('/tmp/results.json', 'output-bucket', 'results.json')
    return {'outputLocation': 's3://output-bucket/results.json'}
```

### 4. Fix VPC connectivity issues

VPC timeout diagnosis:

```bash
# Check if the function is in a VPC
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.{SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds,VpcId:VpcId}'

# Test connectivity from Lambda - create a test function
cat > test-vpc.py << 'EOF'
import urllib.request
import socket

def lambda_handler(event, context):
    # Test internet access
    try:
        urllib.request.urlopen('https://www.google.com', timeout=5)
        internet = 'OK'
    except Exception as e:
        internet = f'FAILED: {e}'

    # Test RDS connectivity
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        s.connect(('my-rds.xxxx.us-east-1.rds.amazonaws.com', 5432))
        rds = 'OK'
    except Exception as e:
        rds = f'FAILED: {e}'

    return {'internet': internet, 'rds': rds}
EOF

# A Lambda in a private subnet needs a NAT Gateway for internet access.
# Check the subnet's route table for a route to a NAT gateway:
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-xxxx" \
  --query 'RouteTables[].Routes[?NatGatewayId != `null`]'

# If there is no NAT route, the Lambda cannot reach the internet
```
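The same check can be scripted by inspecting the `describe-route-tables` JSON for a default route to a NAT gateway (a sketch; the dictionary shape mirrors the EC2 API response):

```python
def has_nat_default_route(route_tables):
    """True if any route table routes 0.0.0.0/0 to a NAT gateway."""
    for rt in route_tables.get("RouteTables", []):
        for route in rt.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("NatGatewayId", "").startswith("nat-")):
                return True
    return False

# Shape matches `aws ec2 describe-route-tables` output
sample = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc123"},
]}]}
print(has_nat_default_route(sample))  # True
```

Pair this with `boto3`'s `ec2.describe_route_tables(...)` for each subnet the function uses; a `False` result explains hanging calls to the internet.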

Fix VPC configuration:

```hcl
# Architecture for Lambda with VPC:

# Note: Lambda ENIs never receive public IPs, so placing a function in a
# public subnet does NOT give it internet access. Use private subnets
# with a NAT Gateway.

# Recommended: Lambda in private subnets with NAT Gateways
# VPC (10.0.0.0/16)
#   Public Subnet A (10.0.1.0/24)
#     - NAT Gateway A
#   Public Subnet B (10.0.2.0/24)
#     - NAT Gateway B
#   Private Subnet A (10.0.10.0/24)
#     - Lambda ENI
#     - Route: 0.0.0.0/0 -> NAT Gateway A
#   Private Subnet B (10.0.11.0/24)
#     - Lambda ENI
#     - Route: 0.0.0.0/0 -> NAT Gateway B

# Terraform configuration
resource "aws_lambda_function" "my_function" {
  function_name = "my-function"
  role          = aws_iam_role.lambda.arn
  handler       = "index.handler"
  runtime       = "python3.9"
  timeout       = 30
  memory_size   = 512

  vpc_config {
    subnet_ids         = aws_subnet.private[*].id
    security_group_ids = [aws_security_group.lambda.id]
  }
}

resource "aws_security_group" "lambda" {
  name        = "lambda-sg"
  description = "Security group for Lambda"
  vpc_id      = aws_vpc.main.id

  # Allow outbound to RDS
  egress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }

  # Allow outbound HTTPS (for S3, external APIs)
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# NAT Gateway for internet access from private subnets
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

resource "aws_route" "private_nat" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}
```

VPC endpoints for AWS services:

```bash
# Use VPC endpoints instead of the internet for AWS services:
# no NAT Gateway needed for S3 and DynamoDB

# S3 Gateway Endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxx \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-xxxx

# DynamoDB Gateway Endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxx \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-xxxx

# Interface endpoints for other services (PrivateLink)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxx \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-xxxx \
  --security-group-ids sg-xxxx
```

### 5. Fix connection pooling

Database connection pooling:

```python
# PROBLEM: a new connection on every invocation
import psycopg2

def lambda_handler(event, context):
    conn = psycopg2.connect(
        host='my-rds.xxxx.us-east-1.rds.amazonaws.com',
        database='mydb',
        user='user',
        password='password'
    )
    # Connection overhead adds 100-500 ms per invocation
    cursor = conn.cursor()
    cursor.execute("SELECT 1")
    result = cursor.fetchone()
    conn.close()
    return result

# FIX: reuse connections via a module-level pool
from psycopg2 import pool

# Global connection pool (persists across invocations in a warm sandbox)
connection_pool = None

def init_pool():
    global connection_pool
    if connection_pool is None:
        connection_pool = pool.ThreadedConnectionPool(
            minconn=1,
            maxconn=10,
            host='my-rds.xxxx.us-east-1.rds.amazonaws.com',
            database='mydb',
            user='user',
            password='password'
        )

def lambda_handler(event, context):
    init_pool()  # Initializes once per sandbox

    conn = connection_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        return cursor.fetchone()
    finally:
        connection_pool.putconn(conn)  # Return to pool
```

```javascript
// Node.js with the RDS Data API (serverless, no pooling needed)
const { RDSDataClient, ExecuteStatementCommand } = require('@aws-sdk/client-rds-data');

const client = new RDSDataClient({});

exports.handler = async (event) => {
  const command = new ExecuteStatementCommand({
    resourceArn: 'arn:aws:rds:us-east-1:123456789012:cluster:my-cluster',
    secretArn: 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret',
    database: 'mydb',
    sql: 'SELECT 1'
  });

  const result = await client.send(command);
  return result.records;
};

// Node.js with traditional pooling
const { Pool } = require('pg');

let pool;

function getPool() {
  if (!pool) {
    pool = new Pool({
      host: process.env.DB_HOST,
      database: process.env.DB_NAME,
      user: process.env.DB_USER,
      password: process.env.DB_PASSWORD,
      max: 10,
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 5000
    });
  }
  return pool;
}

exports.handler = async (event) => {
  const client = await getPool().connect();
  try {
    const result = await client.query('SELECT 1');
    return result.rows;
  } finally {
    client.release(); // Return to pool
  }
};
```

HTTP connection pooling:

```python
# PROBLEM: a new session on every invocation
import requests

def lambda_handler(event, context):
    response = requests.get('https://api.example.com/data')
    return response.json()

# FIX: reuse a module-level session
session = None

def get_session():
    global session
    if session is None:
        session = requests.Session()
        # Configure the adapter with connection pooling
        adapter = requests.adapters.HTTPAdapter(
            pool_connections=1,
            pool_maxsize=10,
            pool_block=False,
            max_retries=3
        )
        session.mount('https://', adapter)
    return session

def lambda_handler(event, context):
    session = get_session()
    response = session.get('https://api.example.com/data', timeout=5)
    return response.json()
```

### 6. Fix cold start issues

Measure cold start:

```sql
# CloudWatch Logs Insights
fields @timestamp, @message, @requestId
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration) as avg_init,
        pct(initDuration, 50) as p50_init,
        pct(initDuration, 95) as p95_init,
        pct(initDuration, 99) as p99_init
  by bin(1h)
```

Reduce cold start:

```python
# 1. Minimize package size:
#    remove unnecessary dependencies; use Lambda Layers for large ones

# 2. Lazy initialization
import boto3

# Global - created once per sandbox, on first use
s3_client = None

def get_s3_client():
    global s3_client
    if s3_client is None:
        s3_client = boto3.client('s3')
    return s3_client

def lambda_handler(event, context):
    client = get_s3_client()
    # Use client...
```

```bash
# 3. Use provisioned concurrency for critical functions
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier PROD \
  --provisioned-concurrent-executions 10

# 4. Use lighter runtimes:
#    Python/Node.js: ~100-300 ms cold start
#    Java/.NET: often 1-5 seconds
#    Go/Rust: typically even faster

# 5. Warming pings (CloudWatch Events every 5 minutes) are an old pattern;
#    prefer provisioned concurrency
```

Optimize deployment package:

```bash
# Python - create a minimal package
# requirements.txt should list only needed packages
pip install -r requirements.txt -t ./python

# Remove unused modules
rm -rf python/boto3*     # If using Lambda's built-in boto3
rm -rf python/botocore*

# Package
zip -r function.zip python index.py

# Node.js - production dependencies only
npm install --production
npm prune
zip -r function.zip node_modules index.js

# Use Lambda Layers for shared dependencies:
# common libs in layers, function code kept small
```

### 7. Monitor Lambda metrics

CloudWatch alarms:

```bash
# Error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-errors" \
  --alarm-description "Lambda function errors" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 60 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --dimensions Name=FunctionName,Value=my-function

# Timeout alarm (percentile statistics need --extended-statistic)
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-timeouts" \
  --metric-name Duration \
  --namespace AWS/Lambda \
  --extended-statistic p95 \
  --period 300 \
  --threshold 2500 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --dimensions Name=FunctionName,Value=my-function

# Throttle alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-throttles" \
  --metric-name Throttles \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --dimensions Name=FunctionName,Value=my-function
```

Prometheus with CloudWatch exporter:

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'cloudwatch'
    static_configs:
      - targets: ['cloudwatch-exporter:9400']

# CloudWatch exporter config (separate file)
region: us-east-1
metrics:
  - aws_namespace: AWS/Lambda
    aws_metric_name: Duration
    aws_dimensions: [FunctionName]
    aws_statistics: [Average]
    aws_extended_statistics: [p95, p99]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Errors
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Throttles
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/Lambda
    aws_metric_name: ConcurrentExecutions
    aws_dimensions: [FunctionName]
    aws_statistics: [Maximum]
```

Grafana dashboard panels:

```yaml
# Duration (average; add p95/p99 series the same way)
expr: cloudwatch_aws_lambda_duration_average{function_name="my-function"}
unit: ms

# Error rate
expr: sum(rate(cloudwatch_aws_lambda_errors_sum{function_name="my-function"}[5m]))

# Throttle rate
expr: sum(rate(cloudwatch_aws_lambda_throttles_sum{function_name="my-function"}[5m]))

# Concurrent executions
expr: cloudwatch_aws_lambda_concurrentexecutions_maximum{function_name="my-function"}

# Memory utilization (%)
expr: cloudwatch_aws_lambda_max_memory_used_average / cloudwatch_aws_lambda_memory_size_average * 100
```

Prevention

  • Set timeout to p99 * 1.5 with monitoring for breaches
  • Size memory based on peak usage + 50% buffer
  • Use provisioned concurrency for latency-critical functions
  • Implement connection pooling for databases and external services
  • Use VPC endpoints for AWS services to avoid NAT Gateway
  • Stream large payloads instead of loading into memory
  • Add internal timeout checks before Lambda timeout
  • Monitor cold start duration and optimize package size
  • Set up DLQ for async invocations to capture failures
  • Use Lambda Power Tuning tool to find optimal configuration
Related Errors

  • **AccessDeniedException**: IAM role missing permissions
  • **InvalidParameterValueException**: Configuration parameter invalid
  • **TooManyRequestsException**: Account concurrency limit reached
  • **ResourceNotReadyException**: VPC resources not available
  • **ResourceConflictException**: Concurrent update to function configuration