Introduction
The Prometheus TSDB keeps recent metric samples in memory in the head block, which is periodically compacted into persistent on-disk blocks. When the disk hosting the TSDB data directory fills up, Prometheus cannot write new chunks or WAL segments, causing scrape failures and data gaps. For a monitoring system, this is one of the most critical failure modes.
Symptoms
- Prometheus logs show `TSDB has not been able to persist any blocks` or `disk full`
- Target scrape pages show targets going down with `context deadline exceeded`
- The `/api/v1/status/tsdb` endpoint returns errors
- Grafana dashboards show gaps in metric data during the disk-full period
- Error message: `Failed to write chunks to disk: no space left on device`
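The symptoms above usually coincide with the data-directory filesystem sitting at or near 100% usage. A minimal sketch to confirm that from a shell, assuming a GNU/Linux host with coreutils `df` (the `disk_above_pct` helper name is ours, not a Prometheus tool):

```shell
#!/usr/bin/env bash
# Return success (0) if the filesystem hosting $1 is at or above $2 percent full.
disk_above_pct() {
  local path="$1" threshold="$2" used
  used=$(df --output=pcent "$path" | tail -1 | tr -dc '0-9')
  [ "$used" -ge "$threshold" ]
}

# Example: warn if the disk under the Prometheus data directory is over 85% full
if [ -d /var/lib/prometheus ] && disk_above_pct /var/lib/prometheus 85; then
  echo "disk nearly full: expect 'no space left on device' errors"
fi
```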
Common Causes
- Retention period configured longer than disk capacity can support
- High cardinality metrics creating excessive time series and chunk files
- Disk not expanded after increasing scrape targets or retention period
- Compaction failing silently, preventing old blocks from being cleaned up
- Other processes consuming disk space on the same partition as Prometheus data
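For the last cause, a per-directory summary shows where the space on the shared partition actually went. A sketch, assuming GNU `du` and `sort` (`top_consumers` is our own helper name):

```shell
#!/usr/bin/env bash
# List the N largest first-level entries under a directory,
# staying on one filesystem (-x) so other mounts don't skew the totals.
top_consumers() {
  local dir="$1" n="${2:-10}"
  du -x -d1 "$dir" 2>/dev/null | sort -rh | head -n "$n"
}

# Example: what else shares the partition with the Prometheus data directory?
top_consumers /var/lib 5
```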
Step-by-Step Fix
1. **Check disk usage and the Prometheus data directory size** to identify the immediate cause:

   ```bash
   df -h /var/lib/prometheus
   du -sh /var/lib/prometheus/metrics2/*
   ```

2. **Reduce the retention period temporarily** to trigger block cleanup and free disk space immediately:

   ```bash
   # Edit Prometheus startup flags
   --storage.tsdb.retention.time=3d
   # Update the systemd unit or startup script, then restart
   systemctl restart prometheus
   ```

3. **Force a TSDB snapshot and clean old blocks** to create a point-in-time backup before removing old data:

   ```bash
   # Trigger a snapshot via the admin API (requires --web.enable-admin-api)
   curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
   # Remove the oldest block directories manually if needed (stop Prometheus first)
   ls -dt /var/lib/prometheus/metrics2/01* | tail -20 | xargs rm -rf
   ```

4. **Identify high-cardinality metrics contributing to disk usage** to find the worst offenders:

   ```bash
   # Top 10 metrics by series count
   curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
   ```

5. **Expand the disk or configure retention limits** as a permanent fix:

   ```bash
   # Set retention appropriate to the disk capacity
   --storage.tsdb.retention.time=15d
   --storage.tsdb.retention.size=50GB
   ```
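The manual `rm` in step 3 can be made a little safer by always keeping a fixed number of the newest block directories. A sketch (the function name and keep-count policy are ours; stop Prometheus and take a snapshot before deleting anything):

```shell
#!/usr/bin/env bash
# Delete the oldest TSDB block directories under $1, keeping the newest $2.
# Block directories are ULID-named and currently start with "01".
# DESTRUCTIVE: run only with Prometheus stopped and a snapshot taken.
prune_oldest_blocks() {
  local data_dir="$1" keep="$2"
  ls -dt "$data_dir"/01* 2>/dev/null | tail -n +$((keep + 1)) | xargs -r rm -rf
}
```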
Prevention
- Set `--storage.tsdb.retention.size` in addition to time-based retention to cap disk usage
- Monitor disk usage with alerts at 70% and 85% capacity
- Identify and reduce high-cardinality metrics before they cause disk issues
- Size the disk to handle at least 3x the daily ingestion rate at the desired retention period
- Use remote write to offload long-term storage to systems like Thanos or Cortex
- Cap per-target cardinality with `sample_limit` in the scrape configuration
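The 3x sizing rule above can be turned into a back-of-the-envelope calculator. A sketch under stated assumptions: the default of ~2 bytes per sample is a commonly cited post-compression average, `required_disk_gb` is a name we made up, and the live samples/sec figure can be read from `rate(prometheus_tsdb_head_samples_appended_total[1h])`.

```shell
#!/usr/bin/env bash
# Rough disk requirement in GB: samples/sec × bytes/sample × 86400 s/day
# × retention days × 3 (safety factor for WAL and compaction overhead).
required_disk_gb() {
  local samples_per_sec="$1" retention_days="$2" bytes_per_sample="${3:-2}"
  echo $(( samples_per_sec * bytes_per_sample * 86400 * retention_days * 3 / 1024 / 1024 / 1024 ))
}

# Example: 100k samples/sec at 15d retention
required_disk_gb 100000 15
```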