Introduction
The Prometheus TSDB uses a write-ahead log (WAL) to ensure durability of recent metric samples before they are compacted into blocks. An unclean shutdown -- such as a power loss, OOM kill, or forced process termination -- can leave the WAL in a corrupted state. When Prometheus restarts, it cannot replay the corrupted WAL and fails to start, causing a complete monitoring outage.
Symptoms
- Prometheus fails to start with `WAL corruption` or `unexpected end of WAL segment` errors
- Prometheus logs show `corruption after segment` followed by a crash loop
- `prometheus_tsdb_wal_corruptions_total` increases after each restart attempt
- Scrape targets show no data being ingested since the unclean shutdown
- Error message: `WAL corruption detected at segment 00001234, offset 45678: unexpected EOF`
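To pull the affected segment number out of a noisy log, a small filter along these lines may help. It is only a sketch: it assumes log lines follow the `segment 00001234` wording of the sample error message, which can differ between Prometheus versions, and `wal_corrupt_segments` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print each distinct
# segment number that appears on a line mentioning "corrupt".
# Assumes the "segment 00001234" wording from the sample error message.
wal_corrupt_segments() {
  grep -i 'corrupt' \
    | grep -oE 'segment[ =]*[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -u
}
```

One possible use is `journalctl -u prometheus --no-pager -n 200 | wal_corrupt_segments`, which prints one segment number per line, or nothing if no matching line is found.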
Common Causes
- Server power loss or forced reboot while Prometheus was writing WAL segments
- OOM killer terminating Prometheus mid-write to the WAL
- Disk I/O error or filesystem corruption affecting the WAL directory
- Container runtime killing Prometheus with SIGKILL instead of graceful SIGTERM
- Storage volume detached while Prometheus is running (cloud environment)
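To narrow down which of these causes applies, it can help to check whether the kernel recorded an OOM kill. The sketch below assumes the process is named `prometheus` and that the kernel log uses the usual `Killed process <pid> (<name>)` wording. `prometheus_oom_killed` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and exit 0 if it
# records the OOM killer terminating a process named "prometheus".
# Assumes the kernel wording "Killed process <pid> (<name>)".
prometheus_oom_killed() {
  grep -qE 'Killed process [0-9]+ \(prometheus\)'
}
```

For example, `journalctl -k --no-pager | prometheus_oom_killed && echo "OOM kill recorded"` reads the kernel messages from the journal. A silent result does not rule out the other causes listed above.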
Step-by-Step Fix
1. Confirm WAL corruption from Prometheus logs: Identify the corrupted segment.
```bash
journalctl -u prometheus --no-pager -n 50 | grep -iE "wal|corrupt"
```
2. Attempt automatic WAL repair with promtool: Use the built-in repair tool while Prometheus is stopped. Check `promtool tsdb --help` first, since not every promtool build offers a `repair` subcommand.
```bash
promtool tsdb repair /var/lib/prometheus/metrics2
```
3. If repair fails, truncate the corrupted WAL segment: Remove the corrupted data (accepting recent sample loss).
```bash
# Identify the last good WAL segment
ls -la /var/lib/prometheus/metrics2/wal/
# Remove the corrupted segment (the last one)
rm /var/lib/prometheus/metrics2/wal/00001234
```
4. Delete the checkpoint directory if also corrupted: Clean up checkpoint files.
```bash
rm -rf /var/lib/prometheus/metrics2/wal/checkpoint.*
```
5. Restart Prometheus and verify WAL replay completes: Confirm the TSDB starts successfully.
```bash
systemctl start prometheus
journalctl -u prometheus -f | grep -iE "wal replay|TSDB started"
```
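Deleting a segment cannot be undone, so a more cautious variant of the truncation step is to move the file aside instead. The sketch below assumes Prometheus is stopped and that WAL segments are regular files with eight-digit names. `last_wal_segment` and `quarantine_wal_segment` are hypothetical helper names, not part of any Prometheus tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the path of the highest-numbered segment
# file in a WAL directory. Assumes segment names are exactly eight digits;
# checkpoint.* directories are skipped because they are not regular files.
last_wal_segment() {
  local wal_dir=$1
  find "$wal_dir" -maxdepth 1 -type f \
    -name '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' \
    | sort | tail -n 1
}

# Hypothetical helper: move a segment into a backup directory rather
# than deleting it, so it can be restored or inspected later.
quarantine_wal_segment() {
  local segment=$1 backup_dir=$2
  mkdir -p "$backup_dir"
  mv -- "$segment" "$backup_dir/"
}
```

A possible invocation is `quarantine_wal_segment "$(last_wal_segment /var/lib/prometheus/metrics2/wal)" /var/lib/prometheus/wal-backup`, where the backup path is an arbitrary example. The highest-numbered segment is not necessarily the corrupted one, so compare it with the segment named in the log before moving anything.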
Prevention
- Configure Prometheus as a systemd service with `ExecStop` for graceful shutdown
- Set up UPS or graceful shutdown scripts for physical servers
- Use `Restart=on-failure` with `RestartSec=30s` to allow disk to stabilize after crash
- Monitor `prometheus_tsdb_wal_fsync_duration_seconds` to detect slow disk writes
- Size Prometheus memory appropriately to prevent OOM kills during high ingestion
- Consider running Prometheus on reliable storage (SSD with power loss protection) for WAL durability