Introduction

The Prometheus TSDB uses a write-ahead log (WAL) to ensure durability of recent metric samples before they are compacted into blocks. An unclean shutdown, such as a power loss, OOM kill, or forced process termination, can leave the WAL in a corrupted state. If Prometheus can neither replay nor repair the corrupted WAL when it restarts, it fails to start, causing a complete monitoring outage.

Symptoms

  • Prometheus fails to start with WAL corruption or unexpected end of WAL segment errors
  • Prometheus logs show a "corruption after segment" message, followed by a crash loop
  • prometheus_tsdb_wal_corruptions_total increases after each restart attempt
  • Scrape targets show no data being ingested since the unclean shutdown
  • Error message: WAL corruption detected at segment 00001234, offset 45678: unexpected EOF
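
To find out which segment the logs complain about, a small filter can pull the segment number out of the corruption messages. This is a minimal sketch: `corrupted_segments` is a hypothetical helper name, and it assumes the log lines name the segment as "segment <digits>", as in the sample error message above. Real messages may be worded differently depending on the Prometheus version.

```bash
# Hypothetical helper: read log lines on stdin and print each distinct WAL
# segment number that appears on a line mentioning corruption.
# Assumes messages name the segment as "segment <digits>".
corrupted_segments() {
  grep -i 'corrupt' \
    | grep -Eio 'segment [0-9]+' \
    | awk '{ print $2 }' \
    | sort -u
}
```

A possible invocation, assuming Prometheus runs as the systemd unit `prometheus`: `journalctl -u prometheus --no-pager -n 200 | corrupted_segments`.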

Common Causes

  • Server power loss or forced reboot while Prometheus was writing WAL segments
  • OOM killer terminating Prometheus mid-write to the WAL
  • Disk I/O error or filesystem corruption affecting the WAL directory
  • Container runtime killing Prometheus with SIGKILL instead of graceful SIGTERM
  • Storage volume detached while Prometheus is running (typically in cloud environments)
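
To check whether the OOM killer was involved, the kernel log can be searched for kill records that mention Prometheus. The filter below is a rough sketch: `find_oom_kills` is a hypothetical helper name, and the patterns assume the usual Linux kernel wording ("Out of memory", "Killed process", "oom-kill"), which can vary between kernel versions.

```bash
# Hypothetical helper: read kernel log lines on stdin and print those that
# record an OOM kill of the named process (default: prometheus).
find_oom_kills() {
  grep -Ei 'out of memory|oom-kill|killed process' | grep -i -- "${1:-prometheus}"
}
```

For example: `journalctl -k --since "2 days ago" | find_oom_kills`. No output means no matching kill record was found in that time window, which does not rule out the other causes listed above.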

Step-by-Step Fix

  1. Confirm WAL corruption from Prometheus logs: Identify the corrupted segment.
     ```bash
     # -E enables alternation, so "|" matches either word
     journalctl -u prometheus --no-pager -n 50 | grep -Ei "wal|corrupt"
     ```
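  2. Let Prometheus attempt its automatic WAL repair: Prometheus tries to repair a corrupted WAL on its own when it opens the TSDB at startup, so restart it once and check the logs for the outcome.
     ```bash
     systemctl restart prometheus
     journalctl -u prometheus --no-pager -n 100 | grep -Ei "repair|corrupt"
     ```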
  3. If repair fails, remove the corrupted WAL segment: Stop Prometheus, then delete the corrupted data (accepting the loss of recent samples).
     ```bash
     # Make sure Prometheus is not writing to the WAL
     systemctl stop prometheus
     # Identify the last good WAL segment
     ls -la /var/lib/prometheus/metrics2/wal/
     # Remove the corrupted segment (the last one) and any segments numbered after it
     rm /var/lib/prometheus/metrics2/wal/00001234
     ```
  4. Delete the checkpoint directory if it is also corrupted: Clean up the checkpoint files only when the logs show that the checkpoint itself cannot be read, because its contents are discarded too.
     ```bash
     rm -rf /var/lib/prometheus/metrics2/wal/checkpoint.*
     ```
  5. Restart Prometheus and verify that WAL replay completes: Confirm the TSDB starts successfully.
     ```bash
     systemctl start prometheus
     # -E enables alternation, so either message matches
     journalctl -u prometheus -f | grep -Ei "wal replay|TSDB started"
     ```
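
A more cautious variant of the segment removal step is to move the corrupted segment, and everything numbered after it, into a quarantine directory instead of deleting it, so the files stay available for later inspection. The sketch below assumes Prometheus is already stopped and that segment files have purely numeric, zero-padded names such as 00001234. `segment_number` and `quarantine_wal_segments` are hypothetical helpers, not part of Prometheus, and the quarantine path in the usage line is an arbitrary example.

```bash
# Print a zero-padded segment name as a plain decimal number (00001234 -> 1234).
segment_number() {
  printf '%s\n' "$1" | sed -e 's/^0*//' -e 's/^$/0/'
}

# Hypothetical helper (not part of Prometheus): move the segment named $2 and
# every higher-numbered segment from WAL directory $1 into directory $3.
# Assumes Prometheus is stopped and segment files have purely numeric names.
quarantine_wal_segments() {
  local wal_dir="$1" first_bad="$2" dest="$3" seg name
  case "$first_bad" in
    ''|*[!0-9]*) echo "segment must be numeric: $first_bad" >&2; return 2 ;;
  esac
  mkdir -p "$dest" || return 1
  for seg in "$wal_dir"/*; do
    name="$(basename "$seg")"
    # Skip checkpoint directories and anything else that is not a regular file.
    [ -f "$seg" ] || continue
    # Skip files whose names are not purely numeric.
    case "$name" in
      *[!0-9]*) continue ;;
    esac
    if [ "$(segment_number "$name")" -ge "$(segment_number "$first_bad")" ]; then
      mv -- "$seg" "$dest/" || return 1
      echo "moved $name"
    fi
  done
}
```

Example usage, run as a user with write access to the data directory: `quarantine_wal_segments /var/lib/prometheus/metrics2/wal 00001234 /var/lib/prometheus/wal-quarantine`. Checkpoint directories are left in place because they are not regular files.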

Prevention

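  • Configure Prometheus as a systemd service that shuts down gracefully: stop it with SIGTERM (via KillSignal or ExecStop) and set TimeoutStopSec high enough that systemd does not escalate to SIGKILL mid-write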
  • Set up UPS or graceful shutdown scripts for physical servers
  • Use Restart=on-failure with RestartSec=30s to allow the disk to stabilize after a crash
  • Monitor prometheus_tsdb_wal_fsync_duration_seconds to detect slow disk writes
  • Size Prometheus memory appropriately to prevent OOM kills during high ingestion
  • Consider running Prometheus on reliable storage (SSD with power loss protection) for WAL durability
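
The systemd-related points above can be collected in a drop-in file. The following is a sketch rather than a recommended configuration: `write_prometheus_dropin` and the file name `10-graceful-stop.conf` are hypothetical, and the 120-second stop timeout is an assumed value that should be sized to how long your instance takes to shut down. Restart=on-failure and RestartSec=30s are taken from the list above.

```bash
# Hypothetical helper: write a systemd drop-in with graceful-stop and restart
# settings into the directory given as $1.
write_prometheus_dropin() {
  local dropin_dir="$1"
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/10-graceful-stop.conf" <<'EOF'
[Service]
# Stop with SIGTERM and wait before systemd escalates to SIGKILL.
KillSignal=SIGTERM
# Assumed value; adjust to the observed shutdown time.
TimeoutStopSec=120s
Restart=on-failure
RestartSec=30s
EOF
}
```

Assuming the unit is named `prometheus`, it could be applied as root with `write_prometheus_dropin /etc/systemd/system/prometheus.service.d` followed by `systemctl daemon-reload`.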