Prometheus refuses to start, or you're seeing strange query results with gaps where data should exist. The logs show corruption-related errors. TSDB corruption is serious, but with the right approach you can often recover most of your data.

Understanding TSDB Corruption

Prometheus stores metrics in a time series database (TSDB) using a write-ahead log (WAL) and block storage. Corruption can occur in the WAL, the index files, or the chunk files themselves. Common causes include:

  • Hard shutdowns or power failures
  • Disk I/O errors or filesystem corruption
  • Running out of disk space during compaction
  • Hardware failures
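For orientation, here is a rough sketch of what a healthy Prometheus 2.x data directory contains (the block name and path are examples, not values from your system):

```shell
# Rough layout of a Prometheus 2.x data directory (names are examples):
DATA_DIR="/var/lib/prometheus"
ls "$DATA_DIR"
# 01HXYZ.../    - persisted blocks: chunks/, index, meta.json, tombstones
# chunks_head/  - memory-mapped head chunks
# wal/          - write-ahead log segments (000001, 000002, ...) plus checkpoints
# lock          - file lock held by the running Prometheus
```

Knowing which of these three areas (blocks, head chunks, WAL) the errors point at determines which recovery strategy below applies.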

Error patterns you might see:

```
ts=2024-01-15T10:23:45.123Z caller=db.go:789 level=error msg="Opening storage failed" err="corruption after 45321 bytes in /data/wal/000001"

ts=2024-01-15T10:23:45.123Z caller=head.go:456 level=error msg="loading WAL segments" err="WAL segment has invalid checksum"

panic: runtime error: invalid memory address or nil pointer dereference
```

Initial Assessment

First, assess the extent of the damage:

```bash
# Check Prometheus logs for corruption messages
# (-E enables extended regex so the | alternation works)
journalctl -u prometheus -n 500 | grep -iE "corrupt|error|failed|wal"

# If running in Kubernetes
kubectl logs prometheus-server-0 -n monitoring --previous | grep -iE "corrupt|error|wal"

# Check disk space and filesystem health
df -h /var/lib/prometheus
dmesg | grep -iE "disk|error|i/o"

# List WAL segments and blocks
ls -lah /var/lib/prometheus/wal/
ls -lah /var/lib/prometheus/
```

Recovery Strategy 1: Clean WAL Recovery

If Prometheus can't start because of WAL corruption, you can truncate the corrupted portion of the WAL. This will lose some recent data but may allow Prometheus to start.

```bash
# Stop Prometheus first
systemctl stop prometheus
# or for Kubernetes (a pod named prometheus-server-0 implies a StatefulSet)
kubectl scale statefulset prometheus-server --replicas=0 -n monitoring

# Check WAL segment status
ls -la /var/lib/prometheus/wal/

# Use promtool to inspect the TSDB blocks
promtool tsdb list /var/lib/prometheus/
```

Option A: Use promtool to repair the TSDB

```bash
# Create a backup first
cp -r /var/lib/prometheus /var/lib/prometheus.backup

# Attempt repair with promtool
promtool tsdb repair /var/lib/prometheus

# Check output for success or errors
```

Option B: Manually truncate corrupted WAL segments

```bash
# Identify corrupted segments by checking file sizes and logs
# Corrupted segments often have abnormal sizes or won't read properly

# Move corrupted segments aside (keeping them for potential later recovery)
mkdir -p /var/lib/prometheus/wal_corrupted_backup
mv /var/lib/prometheus/wal/000123 /var/lib/prometheus/wal_corrupted_backup/

# Start Prometheus to see if it can recover
systemctl start prometheus

# Check if it starts successfully
systemctl status prometheus
journalctl -u prometheus -f
```

Recovery Strategy 2: Block Recovery

If the corruption is in the block storage rather than the WAL, the approach is different.

```bash
# List all blocks
promtool tsdb list /var/lib/prometheus

# Dump samples in a specific time range to check block integrity
promtool tsdb dump /var/lib/prometheus --min-time=1705276800000 --max-time=1705363200000
```

If you find a corrupted block:

```bash
# Backup the corrupted block
mkdir -p /var/lib/prometheus/blocks_corrupted_backup
mv /var/lib/prometheus/01HXYZ123456789ABCDEF /var/lib/prometheus/blocks_corrupted_backup/

# The block will need to be regenerated from other sources
# or you'll lose the data in that time range
```
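If the metrics for the lost range exist somewhere else (a federated Prometheus, a remote store, or an earlier export), one option is promtool's backfill support, which builds new blocks from an OpenMetrics-format file. A minimal sketch, assuming you already have such an export on hand (the file name is hypothetical):

```shell
# Backfill blocks from an OpenMetrics export (file name is hypothetical).
EXPORT_FILE="exported-metrics.om"
DATA_DIR="/var/lib/prometheus"

# promtool writes new blocks into DATA_DIR; Prometheus picks them up on restart.
promtool tsdb create-blocks-from openmetrics "$EXPORT_FILE" "$DATA_DIR"
```

Run this while Prometheus is stopped, then restart and verify the range is queryable again.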

Recovery Strategy 3: Restore from Snapshot

If you have a snapshot, this is the cleanest recovery path:

```bash
# Stop Prometheus
systemctl stop prometheus

# Move existing data aside (after ensuring backup)
mv /var/lib/prometheus /var/lib/prometheus.corrupted

# Restore from snapshot (recreate the data directory first)
mkdir -p /var/lib/prometheus
tar -xzf prometheus-snapshot-2024-01-10.tar.gz -C /var/lib/prometheus

# Start Prometheus
systemctl start prometheus
```

Recovery Strategy 4: Start Fresh with Data Recovery

If the corruption is severe and you have remote storage:

```bash
# Stop Prometheus
systemctl stop prometheus

# Archive the corrupted data for later analysis
tar -czf prometheus-corrupted-$(date +%Y%m%d).tar.gz /var/lib/prometheus

# Clear local data
rm -rf /var/lib/prometheus/*

# If remote_read is configured, historical data stays queryable from remote
# storage; otherwise, you start fresh
systemctl start prometheus
```

Using Remote Storage for Recovery

If you're using Thanos, Cortex, or another remote storage system, you can recover access to historical data:

```yaml
# prometheus.yml - enable remote_read so queries can reach historical data
remote_read:
  - url: "http://thanos-query:19192/api/v1/read"
    read_recent: true
```
Note that remote_read does not rewrite the local TSDB; it fans queries out to the remote store at query time, so the historical data becomes queryable through Prometheus again even though it no longer lives on local disk.
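To confirm the remote path is actually serving history, query a range that predates anything in the local TSDB. A sketch, assuming Prometheus on localhost:9090 and jq installed (the timestamps are arbitrary example values):

```shell
# Query a range older than the local data; with remote_read working,
# results still come back. (URL and timestamps are example values.)
PROM_URL="http://localhost:9090"
curl -s "$PROM_URL/api/v1/query_range?query=up&start=1704672000&end=1704758400&step=300" \
  | jq '.status'
```

If this returns `"success"` with non-empty results for a range your local disk no longer holds, remote read is wired up correctly.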

Checking Data Integrity After Recovery

Once Prometheus starts, verify data integrity:

```bash
# Check Prometheus is healthy
curl http://localhost:9090/-/healthy
curl http://localhost:9090/-/ready

# Query for recent data to verify continuity
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'

# Check for gaps in data
curl -s 'http://localhost:9090/api/v1/query_range?query=up&start=1705276800&end=1705363200&step=60' | jq '.data.result[].values | length'

# Check TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data'
```

Advanced: Manual WAL Recovery

For advanced users who need to recover as much data as possible:

```bash
# Use promtool to read WAL entries
promtool tsdb WAL list /var/lib/prometheus/wal

# Check specific segments
for segment in /var/lib/prometheus/wal/*; do
  echo "Checking $segment"
  promtool tsdb WAL verify "$segment" 2>&1
done

# Extract series from valid segments
promtool tsdb WAL verify /var/lib/prometheus/wal --write-to=/var/lib/prometheus/recovered/
```

Prevention Strategies

Prevention is far better than recovery. Implement these practices:

1. Regular Backups

```bash
#!/bin/bash
# backup-prometheus.sh
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR="/backups/prometheus"
DATA_DIR="/var/lib/prometheus"

# Create a snapshot via the admin API (requires --web.enable-admin-api)
SNAPSHOT=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Back up the snapshot directory rather than the live data directory
tar -czf "$BACKUP_DIR/prometheus-$DATE.tar.gz" -C "$DATA_DIR/snapshots" "$SNAPSHOT"

# Clean up old backups (keep 7 days)
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +7 -delete
```

2. Use Remote Storage

```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 10

# This ensures data is replicated to remote storage
```

3. Proper Shutdown Procedures

```bash
# Use SIGTERM for graceful shutdown
kill -TERM $(pidof prometheus)

# Wait for clean shutdown with timeout
timeout 30 sh -c 'while kill -0 $(pidof prometheus); do sleep 1; done'
```

4. Resource Limits and Monitoring

```yaml
# Retention and storage path are command-line flags, not prometheus.yml settings:
#   --storage.tsdb.path=/var/lib/prometheus
#   --storage.tsdb.retention.time=15d

# Set appropriate resource limits (Kubernetes container spec)
resources:
  requests:
    memory: 4Gi
  limits:
    memory: 8Gi
```

5. Disk Health Monitoring

```yaml
# Add to your alerting rules
# (use increase() so the alert clears instead of firing forever
# once the counter is nonzero)
groups:
  - name: prometheus_health
    rules:
      - alert: PrometheusTSDBCorruption
        expr: increase(prometheus_tsdb_reloads_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus TSDB corruption detected"
```

Verification Checklist

After any recovery operation:

  • [ ] Prometheus starts without errors
  • [ ] /metrics endpoint is accessible
  • [ ] Recent data is queryable
  • [ ] No gaps in critical metrics
  • [ ] Alerts are firing appropriately
  • [ ] Remote write (if configured) is working
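The first few checks can be scripted. A minimal sketch, assuming jq is installed and Prometheus is listening on localhost:9090:

```shell
#!/bin/bash
# Sketch: automate the post-recovery checks (assumes jq and a
# Prometheus instance on localhost:9090).
PROM_URL="http://localhost:9090"
FAILED=0

curl -sf "$PROM_URL/-/healthy" > /dev/null || { echo "health check failed"; FAILED=1; }
curl -sf "$PROM_URL/-/ready" > /dev/null || { echo "readiness check failed"; FAILED=1; }

# At least one target should be reporting up
COUNT=$(curl -s "$PROM_URL/api/v1/query?query=up" | jq '.data.result | length')
if [ "${COUNT:-0}" -gt 0 ] 2>/dev/null; then
  echo "up series found: $COUNT"
else
  echo "no up series returned"
  FAILED=1
fi
```

The remaining items (gap checks on critical metrics, alert and remote-write verification) depend on your specific setup and are better checked by hand or in your existing monitoring-of-monitoring.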

TSDB corruption is one of the most serious issues you can face with Prometheus. The key is having good backups and remote storage configured before you need them. If you do encounter corruption, start with the least destructive recovery option and only escalate if necessary.