Introduction

A Docker container is OOM killed when it exceeds its cgroup memory limit and the Linux kernel's OOM killer terminates the main process. The container exits with code 137 (128 + 9, the SIGKILL signal number), causing service disruptions, data loss, and cascading failures in production environments. Unlike application-level memory errors, OOM kills happen at the kernel level: the process is terminated immediately, with no graceful shutdown, cleanup handlers, or final logging. This guide provides deep technical troubleshooting for Docker-specific OOM scenarios, including cgroup v1 vs. v2 differences, memory accounting bugs, multi-container memory contention, Java/Node.js/Python runtime tuning, and production monitoring strategies.

Symptoms

  • docker ps shows container STATUS: Exited (137)
  • docker inspect returns "OOMKilled": true in container state
  • Container restarts frequently with increasing memory usage before each crash
  • dmesg shows kernel messages: Out of memory: Killed process <pid>
  • Container starts successfully but crashes under load or after running for hours
  • docker stats shows memory usage at or near the limit before termination
  • Host shows memory pressure with free -h showing low available memory
  • Other containers on same host experience similar OOM issues

Common Causes

  • Container memory limit set lower than application working set
  • Memory leak in application code causing unbounded growth
  • JVM/Node.js/Python runtime not configured for container memory constraints
  • Multiple containers competing for limited host memory
  • cgroup memory accounting bugs in older Docker/kernel versions
  • Memory limit not enforced due to cgroup driver misconfiguration
  • Large file processing or database queries loading too much data into memory
  • Cache without eviction policy growing unbounded
  • Traffic spike causing temporary memory surge above limit
  • Init process (PID 1) not properly reaping zombie processes

Step-by-Step Fix

### 1. Confirm OOM kill diagnosis

Verify the container was actually OOM killed:

```bash
# Check container exit code and OOM status
docker inspect <container-id> --format='{{json .State}}' | jq

# Expected output for an OOM-killed container:
# {
#   "Status": "exited",
#   "Running": false,
#   "Paused": false,
#   "Restarting": false,
#   "OOMKilled": true,
#   "Dead": false,
#   "Pid": 0,
#   "ExitCode": 137,
#   "Error": "",
#   "StartedAt": "2026-03-31T10:00:00Z",
#   "FinishedAt": "2026-03-31T10:05:00Z"
# }

# Check last 50 lines of container logs (may be truncated)
docker logs --tail 50 <container-id>

# Check container restart history
docker inspect <container-id> --format='{{.RestartCount}}'

# For containers that keep restarting, capture state quickly
watch -n 1 'docker ps -a --filter "name=<container>" --format "table {{.Names}}\t{{.Status}}\t{{.State}}"'
```

Check if limit was actually configured:

```bash
# Check memory limit (0 means no limit - uses host memory)
docker inspect <container-id> --format='{{.HostConfig.Memory}}'

# Convert bytes to human-readable
docker inspect <container-id> --format='{{.HostConfig.Memory}}' | awk '{printf "%.2f MB\n", $1/1024/1024}'

# Check memory + swap limit
docker inspect <container-id> --format='Memory: {{.HostConfig.Memory}}, Swap: {{.HostConfig.MemorySwap}}'
```
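For scripting around `docker inspect`, a small helper (illustrative; the function name is mine) renders the raw byte value and treats `0` as "no limit configured":

```python
def format_memory_limit(limit_bytes: int) -> str:
    """Render docker inspect's HostConfig.Memory; 0 means no limit configured."""
    if limit_bytes == 0:
        return "unlimited (host memory)"
    value = float(limit_bytes)
    for unit in ("B", "KiB", "MiB"):
        if value < 1024:
            return f"{value:.2f} {unit}"
        value /= 1024
    return f"{value:.2f} GiB"

print(format_memory_limit(0))           # unlimited (host memory)
print(format_memory_limit(2147483648))  # 2.00 GiB
```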

### 2. Check kernel OOM killer messages

The kernel logs the exact reason for OOM kills:

```bash
# Check dmesg for OOM killer messages
dmesg -T | grep -i "oom\|killed" | tail -30

# Check for Docker container OOMs specifically
dmesg -T | grep -E "oom|killed|memory" | grep -i docker

# Typical OOM killer output:
# [Mar31 10:05:23] Out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1536000kB, file-rss:0kB
# [Mar31 10:05:23] oom_reaper: reaped process 12345 (java), now anon-rss:0kB, file-rss:0kB

# Check the systemd journal for OOM events
journalctl -k --since "1 hour ago" | grep -i oom

# Check /var/log/messages (RHEL/CentOS) or /var/log/syslog (Debian/Ubuntu)
grep -i "oom" /var/log/syslog | tail -20
```

Analyze OOM killer output:

```bash
# Key fields from the OOM log:
# total-vm:  total virtual memory (includes shared libraries, mmap'd files)
# anon-rss:  anonymous resident memory (heap, stacks, thread stacks)
# file-rss:  file-backed resident memory (page cache, mapped files)

# If anon-rss is close to the limit, the application heap/stacks caused the OOM
# If file-rss is high, check for large file mappings or excessive caching
```
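Triage of these lines can be automated. A sketch of a parser (the regex follows the log format shown above; exact wording varies slightly between kernel versions):

```python
import re

# Matches the kernel's OOM-kill summary line (format per mm/oom_kill.c).
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r" total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

def parse_oom_line(line: str) -> dict:
    """Extract pid, command name, and memory figures (in MB) from an OOM log line."""
    m = OOM_RE.search(line)
    if not m:
        return {}
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "total_vm_mb": int(m.group("total_vm")) / 1024,
        "anon_rss_mb": int(m.group("anon_rss")) / 1024,
    }

line = ("[Mar31 10:05:23] Out of memory: Killed process 12345 (java) "
        "total-vm:2048000kB, anon-rss:1536000kB, file-rss:0kB")
print(parse_oom_line(line))
```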

### 3. Check cgroup memory configuration

Docker uses cgroups to enforce memory limits. Verify cgroup is configured correctly:

```bash
# Find the container's main PID
CONTAINER_PID=$(docker inspect <container-id> --format='{{.State.Pid}}')

# For cgroup v1 (most common on older hosts)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes

# For cgroup v2 (newer systems; path shown for the cgroupfs driver - under
# the systemd driver it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/)
cat /sys/fs/cgroup/docker/<container-id>/memory.max
cat /sys/fs/cgroup/docker/<container-id>/memory.current

# Check the memory.stat breakdown (cgroup v1)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.stat
# Key fields:
# cache:         page cache (can be reclaimed under pressure)
# rss:           resident set size (actual process memory, cannot be reclaimed)
# mapped_file:   memory-mapped files
# inactive_file: reclaimable file cache
# active_file:   active file cache
```

Verify cgroup driver configuration:

```bash
# Check the Docker cgroup driver
docker info | grep -i "cgroup driver"

# Should match the system configuration:
# - systemd:  most common, recommended
# - cgroupfs: legacy, can cause issues on systemd-based systems

# A mismatch can cause memory limits not to be enforced.
# Fix: configure Docker to use the correct driver in /etc/docker/daemon.json:
# {
#   "exec-opts": ["native.cgroupdriver=systemd"]
# }
```

### 4. Analyze container memory usage patterns

Monitor memory in real-time:

```bash
# Snapshot memory usage
docker stats --no-stream <container-id>

# Output:
# CONTAINER ID   NAME    CPU %   MEM USAGE / LIMIT   MEM %    NET I/O       BLOCK I/O   PIDS
# abc123def456   myapp   2.5%    1.8GiB / 2GiB       90.00%   1.2GB/800MB   500MB/0B    45

# Get memory usage over time (requires external monitoring)
# Install cAdvisor for container metrics (the image is now published as
# gcr.io/cadvisor/cadvisor; google/cadvisor is deprecated)
docker run -d \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:latest

# Query the cAdvisor API
curl http://localhost:8080/api/v1.3/docker/<container-id>
```

Check memory inside container:

```bash
# Exec into the running container
docker exec -it <container-id> bash

# Check cgroup memory limit (cgroup v1)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# Check cgroup memory usage
cat /sys/fs/cgroup/memory/memory.usage_in_bytes

# Check detailed memory stats
cat /sys/fs/cgroup/memory/memory.stat

# For cgroup v2
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.current

# Check the main process's memory (PID 1 inside the container;
# /proc/self would show the cat/grep process instead)
grep -E "VmSize|VmRSS|VmData|VmStk" /proc/1/status

# Check memory-mapped files
head -20 /proc/1/maps
```

### 5. Increase container memory limit appropriately

Set limits based on actual workload requirements:

```bash
# Run container with memory limits
docker run -d \
  --name=myapp \
  --memory=2g \
  --memory-swap=2g \
  --memory-reservation=1g \
  myapp:latest

# Memory flags:
# --memory:             hard limit (container is OOM killed if exceeded)
# --memory-swap:        total memory + swap (set equal to --memory to disable swap)
# --memory-reservation: soft limit (kernel tries to keep usage below this under pressure)
```

Equivalent limits in Docker Compose (the `deploy.resources` keys require Compose v2 or Swarm mode):

```yaml
# docker-compose.yml
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
          cpus: '0.5'
```
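`--memory-swap` is a frequent source of confusion because it is the *total* of memory plus swap, not the swap amount. A sketch of Docker's documented semantics (the function name is mine):

```python
def effective_swap(memory, memory_swap):
    """Swap available to a container, per Docker's --memory/--memory-swap rules.

    memory_swap is the TOTAL of memory + swap:
      0 (unset)  -> swap defaults to the same amount as --memory
      -1         -> unlimited swap
      == memory  -> swap disabled
    """
    if memory_swap == -1:
        return "unlimited"
    if memory_swap == 0:
        return memory  # default: container can swap up to another --memory worth
    return memory_swap - memory

GiB = 1024 ** 3
print(effective_swap(2 * GiB, 2 * GiB))  # 0 -> swap disabled
print(effective_swap(2 * GiB, 0))        # defaults to 2 GiB of swap
print(effective_swap(2 * GiB, -1))       # unlimited
```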

Memory limit guidelines by workload:

| Workload Type | Minimum | Recommended | Maximum |
|---------------|---------|-------------|---------|
| Java Spring Boot | 1G | 2-4G | 8G |
| Node.js API | 256M | 512M-1G | 2G |
| Python Flask/FastAPI | 256M | 512M-1G | 2G |
| Go Microservice | 128M | 256M-512M | 1G |
| Redis Cache | 512M | 1-2G | 4G |
| PostgreSQL | 512M | 1-4G | 16G |
| Nginx Proxy | 64M | 128M-256M | 512M |
| Sidecar (Envoy) | 128M | 256M-512M | 1G |
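To turn an observed peak into a concrete `--memory` value, one rough approach is peak usage times a headroom factor, rounded up to a clean boundary (the 1.5x headroom and 256 MB rounding here are conventions, not Docker requirements):

```python
import math

def recommended_limit_mb(peak_usage_mb, headroom=1.5, round_to_mb=256):
    """Suggest a --memory value in MB: observed peak x headroom,
    rounded up to the next round_to_mb boundary."""
    raw = peak_usage_mb * headroom
    return math.ceil(raw / round_to_mb) * round_to_mb

print(recommended_limit_mb(1300))       # 1950 raw -> rounds up to 2048
print(recommended_limit_mb(400, 2.0))   # 800 raw -> rounds up to 1024
```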

### 6. Configure Java for container memory

Java applications need explicit heap configuration for containers:

```bash
# Java 10+ (container-aware by default)
docker run -d \
  --memory=2g \
  --name=java-app \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0" \
  myapp:java11

# Java 8 (container support backported in 8u191+; the flag makes it explicit)
docker run -d \
  --memory=2g \
  --name=java-app \
  -e JAVA_TOOL_OPTIONS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0" \
  myapp:java8

# Rough memory breakdown with a 2G container and MaxRAMPercentage=75:
# Heap (-Xmx):     1.5G  (75% of 2G)
# Metaspace:       256M  (class metadata)
# Code Cache:      240M  (JIT-compiled code)
# Thread Stacks:   64M   (256 threads x 256KB)
# Direct Buffers:  64M   (NIO, Netty)
# GC Structures:   64M   (G1 regions, card tables)
# Total:           ~2G

# For memory-constrained containers, reduce the percentage:
# MaxRAMPercentage=65.0 leaves more room for non-heap memory
```
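The heap/non-heap split above can be sanity-checked with a back-of-envelope calculation (a sketch; the function name is mine):

```python
def java_memory_split(container_limit_mb, max_ram_percentage):
    """Heap vs. everything-else under -XX:MaxRAMPercentage.
    The JVM sizes the heap from the cgroup limit; the remainder must cover
    metaspace, code cache, thread stacks, direct buffers, and GC structures."""
    heap = container_limit_mb * max_ram_percentage / 100
    return {"heap_mb": heap, "non_heap_budget_mb": container_limit_mb - heap}

print(java_memory_split(2048, 75.0))  # heap 1536 MB, 512 MB left for non-heap
```

If the non-heap budget looks tight for your thread count and native buffers, lower `MaxRAMPercentage` rather than raising the container limit first.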

Java heap dump on OOM:

```bash
# Enable heap dump on OOM in the Dockerfile
# (note: this fires on java.lang.OutOfMemoryError, not on a cgroup OOM kill;
# size the heap below the container limit so the JVM error fires first)
ENV JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

# Mount a volume to persist the heap dump
docker run -d \
  --memory=2g \
  -v /var/log/app:/tmp \
  -e JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof" \
  myapp:java11

# Copy the heap dump from a crashed container
docker cp <container-id>:/tmp/heapdump.hprof ./heapdump.hprof

# Analyze with Eclipse MAT or VisualVM
```

### 7. Configure Node.js for container memory

Node.js V8 heap needs explicit sizing:

```bash
# Set max old space size (~75% of container memory, in MB)
docker run -d \
  --memory=2g \
  --name=node-app \
  -e NODE_OPTIONS="--max-old-space-size=1536" \
  myapp:node

# Or in the Dockerfile
ENV NODE_OPTIONS="--max-old-space-size=1536"

# Node.js memory breakdown with a 2G container:
# V8 Old Space:    1.5G     (configurable via --max-old-space-size)
# V8 New Space:    ~16MB    (short-lived objects)
# Code Space:      ~64MB    (JIT-compiled code)
# Map Space:       ~16MB    (hidden classes / object shapes)
# External Memory: variable (Buffers, TypedArrays)
# Native Heap:     variable (C++ objects, handles)
```
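Rather than hard-coding 1536, an entrypoint can derive the value from the container's own cgroup limit. A sketch of the arithmetic (reading `/sys/fs/cgroup/memory.max` for v2, or `memory.limit_in_bytes` for v1, is left to the caller):

```python
def max_old_space_mb(cgroup_limit_bytes, fraction=0.75):
    """Value for --max-old-space-size (in MB) from the container's cgroup
    limit, leaving the remainder for V8's other spaces, Buffers, and
    the native heap."""
    return int(cgroup_limit_bytes * fraction / (1024 * 1024))

print(max_old_space_mb(2 * 1024**3))  # 2 GiB limit -> 1536
```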

Node.js memory profiling:

```bash
# Profile memory with Clinic.js
npm install -g clinic

# Run with clinic doctor
clinic doctor -- node app.js

# Or use the built-in inspector
docker run -d \
  --memory=2g \
  -e NODE_OPTIONS="--inspect=0.0.0.0:9229" \
  -p 9229:9229 \
  myapp:node

# Connect Chrome DevTools via chrome://inspect,
# take a heap snapshot, and analyze retained objects
```

### 8. Check for memory leaks

Identify if application has memory leak vs. insufficient limit:

```bash
# Monitor the memory growth pattern over time
while true; do
  docker stats --no-stream <container-id> --format "table {{.MemUsage}}"
  sleep 30
done

# Memory leak pattern:
# - Memory grows steadily even under constant load
# - Memory doesn't return to baseline after a traffic spike
# - Growth continues until OOMKilled

# Healthy pattern:
# - Memory stable under constant load
# - Spikes during traffic, returns to baseline
# - GC effectively reclaims memory
```
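The leak-vs-healthy distinction above can be made mechanical by fitting a line to periodic memory samples. A rough heuristic (the slope threshold is a judgment call, not a standard):

```python
def looks_like_leak(samples, min_slope_mb_per_min=1.0):
    """Crude leak heuristic: least-squares slope of (minutes, MB) samples.
    Steady growth under constant load suggests a leak; a flat or
    mean-reverting series suggests the limit is simply too low."""
    n = len(samples)
    if n < 2:
        return False
    sx = sum(t for t, _ in samples)
    sy = sum(m for _, m in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * m for t, m in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope >= min_slope_mb_per_min

leaking = [(0, 500), (10, 560), (20, 615), (30, 680)]  # steady ~6 MB/min growth
stable = [(0, 500), (10, 520), (20, 505), (30, 515)]   # oscillates around baseline
print(looks_like_leak(leaking), looks_like_leak(stable))
```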

Application-level leak detection:

```bash
# Java - generate a heap dump (PID 1 is the JVM inside the container)
docker exec <container-id> jcmd 1 GC.heap_dump /tmp/heap.hprof
docker cp <container-id>:/tmp/heap.hprof ./heap.hprof
# Analyze with Eclipse MAT

# Node.js - generate a heap snapshot on signal
# (start the process with --heapsnapshot-signal=SIGUSR2, Node 12+;
# note SIGUSR1 activates the inspector instead)
docker exec <container-id> kill -USR2 1
# Snapshot is written to the process working directory

# Python - use tracemalloc
docker exec <container-id> python -c "
import tracemalloc
tracemalloc.start()
# ... run workload ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)
"
```

Common memory leak patterns:

```python
# Pattern 1: Unbounded cache
cache = {}  # Grows forever

def get_data(key):
    if key not in cache:
        cache[key] = load_data(key)
    return cache[key]

# Fix: Use an LRU cache with a maximum size
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_data(key):
    return load_data(key)
```

```java
// Pattern 2: Static collection leak
public class DataCache {
    private static final List<Object> cache = new ArrayList<>();

    public void add(Object obj) {
        cache.add(obj);  // Never removed!
    }
}

// Fix: Use a bounded cache with eviction (Guava shown here)
private static final Cache<String, Object> cache = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .expireAfterWrite(1, TimeUnit.HOURS)
    .build();
```

```python
# Pattern 3: Unclosed resources
def process_files(files):
    for f in files:
        stream = open(f, 'r')
        data = stream.read()
        # If an exception occurs here, the stream is never closed!

# Fix: Use a context manager
def process_files(files):
    for f in files:
        with open(f, 'r') as stream:
            data = stream.read()
```

### 9. Check multi-container memory contention

Multiple containers competing for limited host memory:

```bash
# Check host memory
free -h

# Output:
#                total   used    free    shared  buff/cache  available
# Mem:           15Gi    8.0Gi   4.0Gi   200Mi   3.0Gi       6.5Gi
# Swap:          2Gi     0B      2.0Gi

# If used > 90%, the host is under memory pressure

# Check memory usage for all containers
docker stats --no-stream

# Check each container's configured memory limit
docker ps --format '{{.ID}}' | xargs -I {} docker inspect {} --format '{{.Name}}: {{.HostConfig.Memory}}'

# Sum memory limits across all containers
docker ps --format '{{.ID}}' | xargs -I {} docker inspect {} --format '{{.HostConfig.Memory}}' | \
  awk '{sum+=$1} END {printf "Total reserved: %.2f GB\n", sum/1024/1024/1024}'
```

Set container memory limits to prevent contention:

```bash
# Reserve memory for the host system
# Rule: leave 20-30% of host memory for the OS and overhead

# Example: 16GB host
# - Reserve 4GB for the OS (25%)
# - Available for containers: 12GB

# Set individual container limits
docker run -d --memory=2g app1   # 2GB
docker run -d --memory=2g app2   # 2GB
docker run -d --memory=2g app3   # 2GB
docker run -d --memory=2g app4   # 2GB
docker run -d --memory=2g app5   # 2GB
docker run -d --memory=2g app6   # 2GB
# Total: 12GB (leaves 4GB for the host)
```
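The budgeting arithmetic above, as a reusable sketch (the even split is a simplification; in practice, weight the shares by workload):

```python
def per_container_budget_gb(host_gb, reserve_fraction, n_containers):
    """Split host memory among containers after reserving a share for the OS
    (the 20-30% rule above)."""
    return host_gb * (1 - reserve_fraction) / n_containers

print(per_container_budget_gb(16, 0.25, 6))  # 2.0 GB each, 4 GB left for the host
```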

### 10. Configure OOM kill priority

Docker containers can set OOM score to influence kill order:

```bash
# Set the OOM score adjustment (higher = more likely to be killed)
docker run -d \
  --memory=2g \
  --oom-score-adj=500 \
  myapp:latest

# oom_score_adj range: -1000 to 1000
# - -1000: never OOM killed (reserve for truly critical services)
# -     0: default priority
# -  1000: killed first (use for expendable jobs)

# Check the current OOM score adjustment
cat /proc/$(docker inspect --format '{{.State.Pid}}' <container>)/oom_score_adj

# Use case: the database should be killed less readily than the web tier
docker run -d --memory=4g --oom-score-adj=-500 postgres
docker run -d --memory=2g --oom-score-adj=0 nginx
docker run -d --memory=1g --oom-score-adj=500 batch-job
```
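To see why the batch job dies first even though postgres uses far more memory, here is a deliberately simplified model of the kernel's badness calculation (the real heuristic in `mm/oom_kill.c` also counts swap and page tables):

```python
def oom_badness(mem_fraction, oom_score_adj):
    """Simplified model: roughly (memory used / memory available) * 1000,
    shifted by oom_score_adj. The highest score is killed first;
    -1000 (OOM_SCORE_ADJ_MIN) exempts the process entirely."""
    if oom_score_adj == -1000:
        return 0
    score = round(mem_fraction * 1000) + oom_score_adj
    return max(score, 0)

print(oom_badness(0.40, -500))  # postgres: 400 - 500 -> clamped to 0
print(oom_badness(0.15, 0))     # nginx: 150
print(oom_badness(0.10, 500))   # batch-job: 600 -> killed first
```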

### 11. Handle init container and sidecar OOM

Multi-container pods with init containers:

```bash
# Init containers share host memory but have separate limits.
# If the init container is OOM killed, the main container never starts.

# Check init container status
docker inspect <container> --format='{{json .State}}'

# Set an appropriate limit for the init container
docker run -d \
  --name=init \
  --memory=512m \
  myapp:init

# Then the main container (note: docker run has no --depends-on flag;
# ordering like this belongs in Compose via depends_on, or in a start script)
docker run -d \
  --name=main \
  --memory=2g \
  myapp:main
```

### 12. Implement memory monitoring and alerting

Set up proactive monitoring:

```yaml
# Docker metrics with Prometheus and cAdvisor
# docker-compose.yml
version: '3'
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

Prometheus alerting rules:

```yaml
# alerting_rules.yml
groups:
  - name: docker_memory
    rules:
      - alert: ContainerMemoryHigh
        expr: |
          container_memory_usage_bytes /
          container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 85%"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: ContainerMemoryCritical
        expr: |
          container_memory_usage_bytes /
          container_spec_memory_limit_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} memory above 95%"
          description: "Memory usage is {{ $value | humanizePercentage }} - OOM imminent"

      # Uses cAdvisor's OOM event counter
      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} was OOM killed"
```

### 13. Configure Docker daemon memory settings

Global Docker memory configuration:

```bash
# Edit the Docker daemon configuration: /etc/docker/daemon.json
# (note: there is no daemon-level "default-memory" option; per-container
# limits must be set with --memory or in Compose files)
{
  "default-shm-size": "512m",
  "oom-score-adjust": -500,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# "oom-score-adjust" protects the Docker daemon itself from the OOM killer
# (deprecated in recent Docker releases)

# Restart the Docker daemon
sudo systemctl restart docker

# Verify the configuration took effect
docker info
```

### 14. Handle Docker overlay2 memory pressure

Overlay2 storage driver can consume memory:

```bash
# Check overlay2 disk usage (metadata for many layers also costs kernel memory)
df -h /var/lib/docker/overlay2

# A large number of layers increases memory overhead
docker image ls --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Clean up unused images and layers
docker system prune -a

# Inspect the overlay2 directory
ls -la /var/lib/docker/overlay2/ | head -20

# If overlay2 is corrupted, recreate the Docker data directory
# WARNING: This deletes all containers and images!
sudo systemctl stop docker
sudo mv /var/lib/docker /var/lib/docker.bak
sudo systemctl start docker
```

Prevention

  • Set memory limits to 1.5-2x normal usage based on load testing
  • Configure runtime heap (JVM, Node.js, Python) for container constraints
  • Leave 20-30% of container memory for non-heap allocations
  • Implement memory monitoring with Prometheus/cAdvisor
  • Set alerts at 80% and 95% memory usage
  • Use bounded caches with eviction policies (LRU, TTL)
  • Stream large files instead of loading into memory
  • Profile memory usage before production deployment
  • Document memory requirements in deployment guides
  • Test OOM scenarios in staging environment
Related Errors

  • **Exit Code 137**: Container killed by SIGKILL (usually OOM)
  • **Exit Code 139**: Container killed by SIGSEGV (segmentation fault)
  • **Cannot start container**: Memory limit too low or cgroup error
  • **Container killed on OOM**: Host OOM killer terminated container
  • **Memory limit exceeded**: Container exceeded configured memory limit