Introduction

A Docker container is OOM killed when it exceeds its cgroup memory limit and the Linux kernel's OOM killer terminates the main process. The container exits with code 137 (128 + 9, the SIGKILL signal number), causing service disruptions, data loss, and cascading failures in production environments. Unlike application-level memory errors, OOM kills happen at the kernel level: the process is terminated immediately, with no graceful shutdown, cleanup handlers, or final logging. This guide provides deep technical troubleshooting for Docker-specific OOM scenarios, including cgroup v1 vs. v2 differences, memory accounting bugs, multi-container memory contention, Java/Node.js/Python runtime tuning, and production monitoring strategies.

Symptoms

  • docker ps shows container STATUS: Exited (137)
  • docker inspect returns "OOMKilled": true in container state
  • Container restarts frequently with increasing memory usage before each crash
  • dmesg shows kernel messages: Out of memory: Killed process <pid>
  • Container starts successfully but crashes under load or after running for hours
  • docker stats shows memory usage at or near the limit before termination
  • Host shows memory pressure with free -h showing low available memory
  • Other containers on same host experience similar OOM issues

Common Causes

  • Container memory limit set lower than application working set
  • Memory leak in application code causing unbounded growth
  • JVM/Node.js/Python runtime not configured for container memory constraints
  • Multiple containers competing for limited host memory
  • cgroup memory accounting bugs in older Docker/kernel versions
  • Memory limit not enforced due to cgroup driver misconfiguration
  • Large file processing or database queries loading too much data into memory
  • Cache without eviction policy growing unbounded
  • Traffic spike causing temporary memory surge above limit
  • Init process (PID 1) not properly reaping zombie processes

Step-by-Step Fix

### 1. Confirm OOM kill diagnosis

Verify the container was actually OOM killed:

```bash
# Check container exit code and OOM status
docker inspect <container-id> --format='{{json .State}}' | jq

# Expected output for an OOM-killed container:
# {
#   "Status": "exited",
#   "Running": false,
#   "Paused": false,
#   "Restarting": false,
#   "OOMKilled": true,
#   "Dead": false,
#   "Pid": 0,
#   "ExitCode": 137,
#   "Error": "",
#   "StartedAt": "2026-03-31T10:00:00Z",
#   "FinishedAt": "2026-03-31T10:05:00Z"
# }

# Check last 50 lines of container logs (may be truncated)
docker logs --tail 50 <container-id>

# Check container restart history
docker inspect <container-id> --format='{{.RestartCount}}'

# For containers that keep restarting, capture state quickly
watch -n 1 'docker ps -a --filter "name=<container>" --format "table {{.Names}}\t{{.Status}}\t{{.State}}"'
```

Check if limit was actually configured:

```bash
# Check memory limit (0 means no limit - uses host memory)
docker inspect <container-id> --format='{{.HostConfig.Memory}}'

# Convert bytes to human-readable
docker inspect <container-id> --format='{{.HostConfig.Memory}}' | awk '{printf "%.2f MB\n", $1/1024/1024}'

# Check memory + swap limit
docker inspect <container-id> --format='Memory: {{.HostConfig.Memory}}, Swap: {{.HostConfig.MemorySwap}}'
```
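For scripting around `docker inspect`, a small helper (illustrative; the function name is mine) renders the raw byte value and treats `0` as "no limit configured":

```python
def format_memory_limit(limit_bytes: int) -> str:
    """Render docker inspect's HostConfig.Memory; 0 means no limit configured."""
    if limit_bytes == 0:
        return "unlimited (host memory)"
    value = float(limit_bytes)
    for unit in ("B", "KiB", "MiB"):
        if value < 1024:
            return f"{value:.2f} {unit}"
        value /= 1024
    return f"{value:.2f} GiB"

print(format_memory_limit(0))           # unlimited (host memory)
print(format_memory_limit(2147483648))  # 2.00 GiB
```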

### 2. Check kernel OOM killer messages

The kernel logs the exact reason for OOM kills:

```bash
# Check dmesg for OOM killer messages
dmesg -T | grep -i "oom\|killed" | tail -30

# Check for Docker container OOMs specifically
dmesg -T | grep -E "oom|killed|memory" | grep -i docker

# Typical OOM killer output:
# [Mar31 10:05:23] Out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1536000kB, file-rss:0kB
# [Mar31 10:05:23] oom_reaper: reaped process 12345 (java), now anon-rss:0kB, file-rss:0kB

# Check the systemd journal for OOM events
journalctl -k --since "1 hour ago" | grep -i oom

# Check /var/log/messages (RHEL/CentOS) or /var/log/syslog (Debian/Ubuntu)
grep -i "oom" /var/log/syslog | tail -20
```

Analyze OOM killer output:

```bash
# Key fields from the OOM log:
# total-vm:  total virtual memory (includes shared libraries, mmap'd files)
# anon-rss:  anonymous resident memory (heap, stacks, thread stacks)
# file-rss:  file-backed resident memory (page cache, mapped files)

# If anon-rss is close to the limit, the application heap/stacks caused the OOM
# If file-rss is high, check for large file mappings or excessive caching
```
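Triage of these lines can be automated. A sketch of a parser (the regex follows the log format shown above; exact wording varies slightly between kernel versions):

```python
import re

# Matches the kernel's OOM-kill summary line (format per mm/oom_kill.c).
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r" total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

def parse_oom_line(line: str) -> dict:
    """Extract pid, command name, and memory figures (in MB) from an OOM log line."""
    m = OOM_RE.search(line)
    if not m:
        return {}
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "total_vm_mb": int(m.group("total_vm")) / 1024,
        "anon_rss_mb": int(m.group("anon_rss")) / 1024,
    }

line = ("[Mar31 10:05:23] Out of memory: Killed process 12345 (java) "
        "total-vm:2048000kB, anon-rss:1536000kB, file-rss:0kB")
print(parse_oom_line(line))
```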

### 3. Check cgroup memory configuration

Docker uses cgroups to enforce memory limits. Verify cgroup is configured correctly:

```bash
# Find the container's main PID
CONTAINER_PID=$(docker inspect <container-id> --format='{{.State.Pid}}')

# For cgroup v1 (most common on older hosts)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes

# For cgroup v2 (newer systems; path shown for the cgroupfs driver - under
# the systemd driver it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/)
cat /sys/fs/cgroup/docker/<container-id>/memory.max
cat /sys/fs/cgroup/docker/<container-id>/memory.current

# Check the memory.stat breakdown (cgroup v1)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.stat
# Key fields:
# cache:         page cache (can be reclaimed under pressure)
# rss:           resident set size (actual process memory, cannot be reclaimed)
# mapped_file:   memory-mapped files
# inactive_file: reclaimable file cache
# active_file:   active file cache
```

Verify cgroup driver configuration:

```bash
# Check the Docker cgroup driver
docker info | grep -i "cgroup driver"

# Should match the system configuration:
# - systemd:  most common, recommended
# - cgroupfs: legacy, can cause issues on systemd-based systems

# A mismatch can cause memory limits not to be enforced.
# Fix: configure Docker to use the correct driver in /etc/docker/daemon.json:
# {
#   "exec-opts": ["native.cgroupdriver=systemd"]
# }
```

### 4. Analyze container memory usage patterns

Monitor memory in real-time:

```bash
# Snapshot memory usage
docker stats --no-stream <container-id>

# Output:
# CONTAINER ID   NAME    CPU %   MEM USAGE / LIMIT   MEM %    NET I/O       BLOCK I/O   PIDS
# abc123def456   myapp   2.5%    1.8GiB / 2GiB       90.00%   1.2GB/800MB   500MB/0B    45

# Get memory usage over time (requires external monitoring)
# Install cAdvisor for container metrics (the image is now published as
# gcr.io/cadvisor/cadvisor; google/cadvisor is deprecated)
docker run -d \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:latest

# Query the cAdvisor API
curl http://localhost:8080/api/v1.3/docker/<container-id>
```

Check memory inside container:

```bash
# Exec into the running container
docker exec -it <container-id> bash

# Check cgroup memory limit (cgroup v1)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# Check cgroup memory usage
cat /sys/fs/cgroup/memory/memory.usage_in_bytes

# Check detailed memory stats
cat /sys/fs/cgroup/memory/memory.stat

# For cgroup v2
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.current

# Check the main process's memory (PID 1 inside the container;
# /proc/self would show the cat/grep process instead)
grep -E "VmSize|VmRSS|VmData|VmStk" /proc/1/status

# Check memory-mapped files
head -20 /proc/1/maps
```

### 5. Increase container memory limit appropriately

Set limits based on actual workload requirements:

```bash
# Run container with memory limits
docker run -d \
  --name=myapp \
  --memory=2g \
  --memory-swap=2g \
  --memory-reservation=1g \
  myapp:latest

# Memory flags:
# --memory:             hard limit (container is OOM killed if exceeded)
# --memory-swap:        total memory + swap (set equal to --memory to disable swap)
# --memory-reservation: soft limit (kernel tries to keep usage below this under pressure)
```

Equivalent limits in Docker Compose (the `deploy.resources` keys require Compose v2 or Swarm mode):

```yaml
# docker-compose.yml
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
          cpus: '0.5'
```
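`--memory-swap` is a frequent source of confusion because it is the *total* of memory plus swap, not the swap amount. A sketch of Docker's documented semantics (the function name is mine):

```python
def effective_swap(memory, memory_swap):
    """Swap available to a container, per Docker's --memory/--memory-swap rules.

    memory_swap is the TOTAL of memory + swap:
      0 (unset)  -> swap defaults to the same amount as --memory
      -1         -> unlimited swap
      == memory  -> swap disabled
    """
    if memory_swap == -1:
        return "unlimited"
    if memory_swap == 0:
        return memory  # default: container can swap up to another --memory worth
    return memory_swap - memory

GiB = 1024 ** 3
print(effective_swap(2 * GiB, 2 * GiB))  # 0 -> swap disabled
print(effective_swap(2 * GiB, 0))        # defaults to 2 GiB of swap
print(effective_swap(2 * GiB, -1))       # unlimited
```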

Memory limit guidelines by workload:

| Workload Type | Minimum | Recommended | Maximum |
|---------------|---------|-------------|---------|
| Java Spring Boot | 1G | 2-4G | 8G |
| Node.js API | 256M | 512M-1G | 2G |
| Python Flask/FastAPI | 256M | 512M-1G | 2G |
| Go Microservice | 128M | 256M-512M | 1G |
| Redis Cache | 512M | 1-2G | 4G |
| PostgreSQL | 512M | 1-4G | 16G |
| Nginx Proxy | 64M | 128M-256M | 512M |
| Sidecar (Envoy) | 128M | 256M-512M | 1G |
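To turn an observed peak into a concrete `--memory` value, one rough approach is peak usage times a headroom factor, rounded up to a clean boundary (the 1.5x headroom and 256 MB rounding here are conventions, not Docker requirements):

```python
import math

def recommended_limit_mb(peak_usage_mb, headroom=1.5, round_to_mb=256):
    """Suggest a --memory value in MB: observed peak x headroom,
    rounded up to the next round_to_mb boundary."""
    raw = peak_usage_mb * headroom
    return math.ceil(raw / round_to_mb) * round_to_mb

print(recommended_limit_mb(1300))       # 1950 raw -> rounds up to 2048
print(recommended_limit_mb(400, 2.0))   # 800 raw -> rounds up to 1024
```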

### 6. Configure Java for container memory

Java applications need explicit heap configuration for containers:

```bash
# Java 10+ (container-aware by default)
docker run -d \
  --memory=2g \
  --name=java-app \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0" \
  myapp:java11

# Java 8 (container support backported in 8u191+; the flag makes it explicit)
docker run -d \
  --memory=2g \
  --name=java-app \
  -e JAVA_TOOL_OPTIONS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0" \
  myapp:java8

# Rough memory breakdown with a 2G container and MaxRAMPercentage=75:
# Heap (-Xmx):     1.5G  (75% of 2G)
# Metaspace:       256M  (class metadata)
# Code Cache:      240M  (JIT-compiled code)
# Thread Stacks:   64M   (256 threads x 256KB)
# Direct Buffers:  64M   (NIO, Netty)
# GC Structures:   64M   (G1 regions, card tables)
# Total:           ~2G

# For memory-constrained containers, reduce the percentage:
# MaxRAMPercentage=65.0 leaves more room for non-heap memory
```
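The heap/non-heap split above can be sanity-checked with a back-of-envelope calculation (a sketch; the function name is mine):

```python
def java_memory_split(container_limit_mb, max_ram_percentage):
    """Heap vs. everything-else under -XX:MaxRAMPercentage.
    The JVM sizes the heap from the cgroup limit; the remainder must cover
    metaspace, code cache, thread stacks, direct buffers, and GC structures."""
    heap = container_limit_mb * max_ram_percentage / 100
    return {"heap_mb": heap, "non_heap_budget_mb": container_limit_mb - heap}

print(java_memory_split(2048, 75.0))  # heap 1536 MB, 512 MB left for non-heap
```

If the non-heap budget looks tight for your thread count and native buffers, lower `MaxRAMPercentage` rather than raising the container limit first.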

Java heap dump on OOM:

```bash
# Enable heap dump on OOM in the Dockerfile
# (note: this fires on java.lang.OutOfMemoryError, not on a cgroup OOM kill;
# size the heap below the container limit so the JVM error fires first)
ENV JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

# Mount a volume to persist the heap dump
docker run -d \
  --memory=2g \
  -v /var/log/app:/tmp \
  -e JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof" \
  myapp:java11

# Copy the heap dump from a crashed container
docker cp <container-id>:/tmp/heapdump.hprof ./heapdump.hprof

# Analyze with Eclipse MAT or VisualVM
```

### 7. Configure Node.js for container memory

Node.js V8 heap needs explicit sizing:

```bash
# Set max old space size (~75% of container memory, in MB)
docker run -d \
  --memory=2g \
  --name=node-app \
  -e NODE_OPTIONS="--max-old-space-size=1536" \
  myapp:node

# Or in the Dockerfile
ENV NODE_OPTIONS="--max-old-space-size=1536"

# Node.js memory breakdown with a 2G container:
# V8 Old Space:    1.5G     (configurable via --max-old-space-size)
# V8 New Space:    ~16MB    (short-lived objects)
# Code Space:      ~64MB    (JIT-compiled code)
# Map Space:       ~16MB    (hidden classes / object shapes)
# External Memory: variable (Buffers, TypedArrays)
# Native Heap:     variable (C++ objects, handles)
```
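Rather than hard-coding 1536, an entrypoint can derive the value from the container's own cgroup limit. A sketch of the arithmetic (reading `/sys/fs/cgroup/memory.max` for v2, or `memory.limit_in_bytes` for v1, is left to the caller):

```python
def max_old_space_mb(cgroup_limit_bytes, fraction=0.75):
    """Value for --max-old-space-size (in MB) from the container's cgroup
    limit, leaving the remainder for V8's other spaces, Buffers, and
    the native heap."""
    return int(cgroup_limit_bytes * fraction / (1024 * 1024))

print(max_old_space_mb(2 * 1024**3))  # 2 GiB limit -> 1536
```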

Node.js memory profiling:

```bash
# Profile memory with Clinic.js
npm install -g clinic

# Run with clinic doctor
clinic doctor -- node app.js

# Or use the built-in inspector
docker run -d \
  --memory=2g \
  -e NODE_OPTIONS="--inspect=0.0.0.0:9229" \
  -p 9229:9229 \
  myapp:node

# Connect Chrome DevTools via chrome://inspect,
# take a heap snapshot, and analyze retained objects
```

### 8. Check for memory leaks

Identify if application has memory leak vs. insufficient limit:

```bash
# Monitor the memory growth pattern over time
while true; do
  docker stats --no-stream <container-id> --format "table {{.MemUsage}}"
  sleep 30
done

# Memory leak pattern:
# - Memory grows steadily even under constant load
# - Memory doesn't return to baseline after a traffic spike
# - Growth continues until OOMKilled

# Healthy pattern:
# - Memory stable under constant load
# - Spikes during traffic, returns to baseline
# - GC effectively reclaims memory
```
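The leak-vs-healthy distinction above can be made mechanical by fitting a line to periodic memory samples. A rough heuristic (the slope threshold is a judgment call, not a standard):

```python
def looks_like_leak(samples, min_slope_mb_per_min=1.0):
    """Crude leak heuristic: least-squares slope of (minutes, MB) samples.
    Steady growth under constant load suggests a leak; a flat or
    mean-reverting series suggests the limit is simply too low."""
    n = len(samples)
    if n < 2:
        return False
    sx = sum(t for t, _ in samples)
    sy = sum(m for _, m in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * m for t, m in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope >= min_slope_mb_per_min

leaking = [(0, 500), (10, 560), (20, 615), (30, 680)]  # steady ~6 MB/min growth
stable = [(0, 500), (10, 520), (20, 505), (30, 515)]   # oscillates around baseline
print(looks_like_leak(leaking), looks_like_leak(stable))
```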

Application-level leak detection:

```bash
# Java - generate a heap dump (PID 1 is the JVM inside the container)
docker exec <container-id> jcmd 1 GC.heap_dump /tmp/heap.hprof
docker cp <container-id>:/tmp/heap.hprof ./heap.hprof
# Analyze with Eclipse MAT

# Node.js - generate a heap snapshot on signal
# (start the process with --heapsnapshot-signal=SIGUSR2, Node 12+;
# note SIGUSR1 activates the inspector instead)
docker exec <container-id> kill -USR2 1
# Snapshot is written to the process working directory

# Python - use tracemalloc
docker exec <container-id> python -c "
import tracemalloc
tracemalloc.start()
# ... run workload ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)
"
```

Common memory leak patterns:

```python
# Pattern 1: Unbounded cache
cache = {}  # Grows forever

def get_data(key):
    if key not in cache:
        cache[key] = load_data(key)
    return cache[key]

# Fix: Use an LRU cache with a maximum size
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_data(key):
    return load_data(key)
```

```java
// Pattern 2: Static collection leak
public class DataCache {
    private static final List<Object> cache = new ArrayList<>();

    public void add(Object obj) {
        cache.add(obj);  // Never removed!
    }
}

// Fix: Use a bounded cache with eviction (Guava shown here)
private static final Cache<String, Object> cache = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .expireAfterWrite(1, TimeUnit.HOURS)
    .build();
```

```python
# Pattern 3: Unclosed resources
def process_files(files):
    for f in files:
        stream = open(f, 'r')
        data = stream.read()
        # If an exception occurs here, the stream is never closed!

# Fix: Use a context manager
def process_files(files):
    for f in files:
        with open(f, 'r') as stream:
            data = stream.read()
```

### 9. Check multi-container memory contention

Multiple containers competing for limited host memory:

```bash
# Check host memory
free -h

# Output:
#                total   used    free    shared  buff/cache  available
# Mem:           15Gi    8.0Gi   4.0Gi   200Mi   3.0Gi       6.5Gi
# Swap:          2Gi     0B      2.0Gi

# If used > 90%, the host is under memory pressure

# Check memory usage for all containers
docker stats --no-stream

# Check each container's configured memory limit
docker ps --format '{{.ID}}' | xargs -I {} docker inspect {} --format '{{.Name}}: {{.HostConfig.Memory}}'

# Sum memory limits across all containers
docker ps --format '{{.ID}}' | xargs -I {} docker inspect {} --format '{{.HostConfig.Memory}}' | \
  awk '{sum+=$1} END {printf "Total reserved: %.2f GB\n", sum/1024/1024/1024}'
```

Set container memory limits to prevent contention:

```bash
# Reserve memory for the host system
# Rule: leave 20-30% of host memory for the OS and overhead

# Example: 16GB host
# - Reserve 4GB for the OS (25%)
# - Available for containers: 12GB

# Set individual container limits
docker run -d --memory=2g app1   # 2GB
docker run -d --memory=2g app2   # 2GB
docker run -d --memory=2g app3   # 2GB
docker run -d --memory=2g app4   # 2GB
docker run -d --memory=2g app5   # 2GB
docker run -d --memory=2g app6   # 2GB
# Total: 12GB (leaves 4GB for the host)
```
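The budgeting arithmetic above, as a reusable sketch (the even split is a simplification; in practice, weight the shares by workload):

```python
def per_container_budget_gb(host_gb, reserve_fraction, n_containers):
    """Split host memory among containers after reserving a share for the OS
    (the 20-30% rule above)."""
    return host_gb * (1 - reserve_fraction) / n_containers

print(per_container_budget_gb(16, 0.25, 6))  # 2.0 GB each, 4 GB left for the host
```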

### 10. Configure OOM kill priority

Docker containers can set OOM score to influence kill order:

```bash
# Set the OOM score adjustment (higher = more likely to be killed)
docker run -d \
  --memory=2g \
  --oom-score-adj=500 \
  myapp:latest

# oom_score_adj range: -1000 to 1000
# - -1000: never OOM killed (reserve for truly critical services)
# -     0: default priority
# -  1000: killed first (use for expendable jobs)

# Check the current OOM score adjustment
cat /proc/$(docker inspect --format '{{.State.Pid}}' <container>)/oom_score_adj

# Use case: the database should be killed less readily than the web tier
docker run -d --memory=4g --oom-score-adj=-500 postgres
docker run -d --memory=2g --oom-score-adj=0 nginx
docker run -d --memory=1g --oom-score-adj=500 batch-job
```
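To see why the batch job dies first even though postgres uses far more memory, here is a deliberately simplified model of the kernel's badness calculation (the real heuristic in `mm/oom_kill.c` also counts swap and page tables):

```python
def oom_badness(mem_fraction, oom_score_adj):
    """Simplified model: roughly (memory used / memory available) * 1000,
    shifted by oom_score_adj. The highest score is killed first;
    -1000 (OOM_SCORE_ADJ_MIN) exempts the process entirely."""
    if oom_score_adj == -1000:
        return 0
    score = round(mem_fraction * 1000) + oom_score_adj
    return max(score, 0)

print(oom_badness(0.40, -500))  # postgres: 400 - 500 -> clamped to 0
print(oom_badness(0.15, 0))     # nginx: 150
print(oom_badness(0.10, 500))   # batch-job: 600 -> killed first
```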

### 11. Handle init container and sidecar OOM

Multi-container pods with init containers:

```bash
# Init containers share host memory but have separate limits.
# If the init container is OOM killed, the main container never starts.

# Check init container status
docker inspect <container> --format='{{json .State}}'

# Set an appropriate limit for the init container
docker run -d \
  --name=init \
  --memory=512m \
  myapp:init

# Then the main container (note: docker run has no --depends-on flag;
# ordering like this belongs in Compose via depends_on, or in a start script)
docker run -d \
  --name=main \
  --memory=2g \
  myapp:main
```

### 12. Implement memory monitoring and alerting

Set up proactive monitoring:

```yaml
# Docker metrics with Prometheus and cAdvisor
# docker-compose.yml
version: '3'
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

Prometheus alerting rules:

```yaml
# alerting_rules.yml
groups:
  - name: docker_memory
    rules:
      - alert: ContainerMemoryHigh
        expr: |
          container_memory_usage_bytes /
          container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 85%"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: ContainerMemoryCritical
        expr: |
          container_memory_usage_bytes /
          container_spec_memory_limit_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} memory above 95%"
          description: "Memory usage is {{ $value | humanizePercentage }} - OOM imminent"

      # Uses cAdvisor's OOM event counter
      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} was OOM killed"
```

### 13. Configure Docker daemon memory settings

Global Docker memory configuration:

```bash
# Edit the Docker daemon configuration: /etc/docker/daemon.json
# (note: there is no daemon-level "default-memory" option; per-container
# limits must be set with --memory or in Compose files)
{
  "default-shm-size": "512m",
  "oom-score-adjust": -500,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# "oom-score-adjust" protects the Docker daemon itself from the OOM killer
# (deprecated in recent Docker releases)

# Restart the Docker daemon
sudo systemctl restart docker

# Verify the configuration took effect
docker info
```

### 14. Handle Docker overlay2 memory pressure

Overlay2 storage driver can consume memory:

```bash
# Check overlay2 disk usage (metadata for many layers also costs kernel memory)
df -h /var/lib/docker/overlay2

# A large number of layers increases memory overhead
docker image ls --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Clean up unused images and layers
docker system prune -a

# Inspect the overlay2 directory
ls -la /var/lib/docker/overlay2/ | head -20

# If overlay2 is corrupted, recreate the Docker data directory
# WARNING: This deletes all containers and images!
sudo systemctl stop docker
sudo mv /var/lib/docker /var/lib/docker.bak
sudo systemctl start docker
```

Prevention

  • Set memory limits to 1.5-2x normal usage based on load testing
  • Configure runtime heap (JVM, Node.js, Python) for container constraints
  • Leave 20-30% of container memory for non-heap allocations
  • Implement memory monitoring with Prometheus/cAdvisor
  • Set alerts at 80% and 95% memory usage
  • Use bounded caches with eviction policies (LRU, TTL)
  • Stream large files instead of loading into memory
  • Profile memory usage before production deployment
  • Document memory requirements in deployment guides
  • Test OOM scenarios in staging environment
Related Errors

  • **Exit Code 137**: Container killed by SIGKILL (usually OOM)
  • **Exit Code 139**: Container killed by SIGSEGV (segmentation fault)
  • **Cannot start container**: Memory limit too low or cgroup error
  • **Container killed on OOM**: Host OOM killer terminated container
  • **Memory limit exceeded**: Container exceeded configured memory limit