Fix Prometheus WAL Corruption After Unclean Shutdown

Introduction

The Prometheus TSDB uses a write-ahead log (WAL) to keep recent metric samples durable before they are compacted into blocks. An unclean shutdown, such as a power loss, OOM kill, or forced process termination, can leave the WAL corrupted. On restart, Prometheus replays the WAL; if it hits a corrupted segment that it cannot repair, it fails to start, causing a complete monitoring outage.

Symptoms

Prometheus fails to start with "WAL corruption" or "unexpected end of WAL segment" errors
Prometheus logs show "corruption after segment", followed by a crash loop
prometheus_tsdb_wal_corruptions_total increases after each restart attempt
Scrape targets show no data being ingested since the unclean shutdown
Error message: WAL corruption detected at segment 00001234, offset 45678: unexpected EOF
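
To act on an error like the one above, it helps to pull the segment number and offset out of the log line. The following is a minimal sketch that assumes the message format shown in the last symptom; `parse_wal_corruption` is a hypothetical helper name, and the wording of the message may differ between Prometheus versions.

```bash
# Read log lines on stdin and print "<segment> <offset>" for each line that
# matches the "segment N, offset M" pattern from the example error message.
# Assumption: the log format matches the sample shown in the Symptoms list.
parse_wal_corruption() {
  sed -nE 's/.*segment ([0-9]+), offset ([0-9]+).*/\1 \2/p'
}
```

For example, piping `journalctl -u prometheus --no-pager -n 200` into `parse_wal_corruption` would list each reported segment and offset, one pair per line.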

Common Causes

Server power loss or forced reboot while Prometheus was writing WAL segments
OOM killer terminating Prometheus mid-write to the WAL
Disk I/O error or filesystem corruption affecting the WAL directory
Container runtime killing Prometheus with SIGKILL instead of graceful SIGTERM
Storage volume detached while Prometheus is running (cloud environment)
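
To narrow down which cause applies, the kernel log can be checked for OOM-killer activity. This is a rough sketch: `find_oom_kills` is a hypothetical helper, the default process name `prometheus` is an assumption, and the exact kernel message text varies by kernel version.

```bash
# Filter kernel log lines on stdin for OOM-killer events that mention a process.
# Assumption: the process appears in the kernel log as "prometheus";
# pass a different name as the first argument if yours differs.
find_oom_kills() {
  local proc="${1:-prometheus}"
  grep -iE 'out of memory|oom-kill|killed process' | grep -i -- "$proc"
}
```

A typical invocation would be `journalctl -k --no-pager | find_oom_kills`. No output suggests the OOM killer was not involved, at least within the retained kernel log.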

Step-by-Step Fix

1. Confirm WAL corruption from Prometheus logs: Identify the corrupted segment.
```bash
# -E enables alternation so "|" matches either term
journalctl -u prometheus --no-pager -n 50 | grep -iE "wal|corrupt"
```
2. Attempt automatic WAL repair with promtool: Use the built-in repair tool. Prometheus also attempts its own WAL repair during startup, and not every promtool build provides a repair subcommand, so check the output of promtool tsdb --help before relying on this step.
```bash
promtool tsdb repair /var/lib/prometheus/metrics2
```
3. If repair fails, truncate the corrupted WAL segment: Stop Prometheus first, then remove the corrupted data (accepting recent sample loss).
```bash
# Stop Prometheus so nothing writes to the WAL while you edit it
systemctl stop prometheus
# List the segments to identify the last good one
ls -la /var/lib/prometheus/metrics2/wal/
# Remove the corrupted segment reported in the logs (here, the last one)
rm /var/lib/prometheus/metrics2/wal/00001234
```
4. Delete the checkpoint directory if also corrupted: Clean up checkpoint files.
```bash
rm -rf /var/lib/prometheus/metrics2/wal/checkpoint.*
```
5. Restart Prometheus and verify WAL replay completes: Confirm the TSDB starts successfully.
```bash
systemctl start prometheus
journalctl -u prometheus -f | grep -iE "wal replay|TSDB started"
```
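
Deleting a segment is irreversible, so a more cautious variant of the truncation step is to move the segment aside first. The sketch below is one way to do that, not an official procedure: `quarantine_last_segment` is a hypothetical helper, and it assumes segments are files with purely numeric, zero-padded names directly inside the WAL directory, as in the example path above.

```bash
# Move the highest-numbered WAL segment into a backup directory instead of
# deleting it. Stop Prometheus before running this.
# Usage: quarantine_last_segment <wal_dir> <backup_dir>
quarantine_last_segment() {
  local wal_dir="$1" backup_dir="$2" last
  # Zero-padded numeric names sort correctly in lexical order;
  # checkpoint.* directories are skipped by -type f and the numeric filter.
  last="$(find "$wal_dir" -maxdepth 1 -type f -name '[0-9]*' \
            | grep -E '/[0-9]+$' | sort | tail -n 1)"
  if [ -z "$last" ]; then
    echo "no WAL segment found in $wal_dir" >&2
    return 1
  fi
  mkdir -p "$backup_dir" && mv -- "$last" "$backup_dir"/ \
    && echo "moved $(basename "$last")"
}
```

For instance, `quarantine_last_segment /var/lib/prometheus/metrics2/wal /var/tmp/wal-backup` would move only the newest segment (the backup path here is illustrative). Once Prometheus has started cleanly, the backup can be removed.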

Prevention

Configure Prometheus as a systemd service with ExecStop for graceful shutdown
Set up UPS or graceful shutdown scripts for physical servers
Use Restart=on-failure with RestartSec=30s to give the disk time to stabilize after a crash
Monitor prometheus_tsdb_wal_fsync_duration_seconds to detect slow disk writes
Size Prometheus memory appropriately to prevent OOM kills during high ingestion
Consider running Prometheus on reliable storage (SSD with power loss protection) for WAL durability
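
The systemd-related items above could be captured in a drop-in file. This is a sketch under assumptions: `write_prometheus_dropin` and the file name `graceful-stop.conf` are hypothetical, the `TimeoutStopSec` value is illustrative, and it relies on `KillSignal=SIGTERM` with a stop timeout rather than an explicit `ExecStop` command.

```bash
# Write a systemd drop-in with the restart and stop settings described above.
# Usage: write_prometheus_dropin [drop-in directory]
# Assumption: the unit is named prometheus.service.
write_prometheus_dropin() {
  local dir="${1:-/etc/systemd/system/prometheus.service.d}"
  mkdir -p "$dir"
  cat > "$dir/graceful-stop.conf" <<'EOF'
[Service]
# Stop with SIGTERM so Prometheus can shut down cleanly
KillSignal=SIGTERM
# Time allowed for shutdown before systemd escalates to SIGKILL (illustrative value)
TimeoutStopSec=120
# Restart after a crash, waiting 30s so the disk can settle first
Restart=on-failure
RestartSec=30s
EOF
}
```

After writing the file as root, `systemctl daemon-reload` makes systemd pick up the change; `systemctl cat prometheus` should then show the drop-in alongside the main unit.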
