Introduction

The Prometheus TSDB keeps recent metric samples in an in-memory head block (backed by a write-ahead log), which is periodically compacted into persistent on-disk blocks. When the disk hosting the TSDB data directory fills up, Prometheus can no longer write the WAL or persist blocks, causing scrape failures and data gaps. For a monitoring system, this is one of the most critical failure modes.

Symptoms

  • Prometheus logs show errors such as "TSDB has not been able to persist any blocks" or "disk full"
  • Target scrape pages show targets going down with context deadline exceeded
  • /api/v1/status/tsdb endpoint returns errors
  • Grafana dashboards show gaps in metric data during the disk-full period
  • Error message: Failed to write chunks to disk: no space left on device

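To confirm this failure mode quickly, the symptoms above can be checked from a shell. A minimal sketch, assuming a systemd unit named `prometheus` and data under `/var/lib/prometheus` (both assumptions; adjust for your deployment):

```shell
#!/bin/sh
# Quick triage for a suspected disk-full Prometheus.

# 1. Confirm the error in the logs (assumes systemd journal):
#    journalctl -u prometheus | grep -i "no space left on device"

# 2. Check the partition:
#    df -h /var/lib/prometheus

# Helper: given a df-style "Use%" value, decide whether the disk is
# effectively full for Prometheus. Compaction needs free headroom, so
# the 95% threshold here is a conservative assumption, not a hard limit.
disk_is_critical() {
    pct=${1%\%}          # strip a trailing % if present
    [ "$pct" -ge 95 ]
}

if disk_is_critical "97%"; then
    echo "disk critical"    # prints: disk critical
fi
```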
Common Causes

  • Retention period configured longer than disk capacity can support
  • High cardinality metrics creating excessive time series and chunk files
  • Disk not expanded after increasing scrape targets or retention period
  • Compaction failing silently, preventing old blocks from being cleaned up
  • Other processes consuming disk space on the same partition as Prometheus data

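The first cause above (retention configured beyond disk capacity) can be sanity-checked with back-of-envelope arithmetic. A sketch assuming the commonly cited average of ~2 bytes per compressed sample; the ingestion rate below is a placeholder you would normally take from `rate(prometheus_tsdb_head_samples_appended_total[1h])`:

```shell
#!/bin/sh
# Rough disk sizing: needed bytes ≈ samples/sec × bytes/sample × retention seconds.
samples_per_sec=100000    # placeholder; measure your own ingestion rate
bytes_per_sample=2        # assumed average for compressed samples
retention_days=15

needed_bytes=$(( samples_per_sec * bytes_per_sample * retention_days * 86400 ))
needed_gib=$(( needed_bytes / 1024 / 1024 / 1024 ))
echo "Approx. disk needed: ${needed_gib} GiB"    # prints: Approx. disk needed: 241 GiB
```

If the result exceeds the partition size, no amount of cleanup will keep the disk from refilling; reduce retention or expand the disk.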
Step-by-Step Fix

  1. Check disk usage and the Prometheus data directory size to identify the immediate cause:

```bash
df -h /var/lib/prometheus
du -sh /var/lib/prometheus/metrics2/*
```

  2. Reduce the retention period temporarily to trigger block cleanup and free disk space immediately:

```bash
# Edit the Prometheus startup flags
--storage.tsdb.retention.time=3d
# Then restart to apply the new retention
systemctl restart prometheus
```

  3. Force a TSDB snapshot and clean old blocks. Create a point-in-time snapshot, then remove old data:

```bash
# Trigger a snapshot via the admin API (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Remove the oldest blocks manually if needed
# (caution: stop Prometheus first to avoid corrupting the TSDB)
ls -dt /var/lib/prometheus/metrics2/01* | tail -20 | xargs rm -rf
```

  4. Identify high-cardinality metrics contributing to disk usage and find the worst offenders:

```bash
# Top 10 metrics by series count (the status endpoint returns the top 10)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```

  5. Expand the disk or configure retention limits as a permanent fix:

```bash
# Set retention appropriate to the disk capacity
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=50GB
```
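After applying the steps above, verify recovery and pick a size cap that leaves headroom. A sketch: the health endpoints are standard Prometheus HTTP paths, while the 20% reserve is an assumption to leave room for the WAL and compaction:

```shell
#!/bin/sh
# Verify Prometheus is serving again (run against a live instance):
#   curl -sf http://localhost:9090/-/healthy && echo "prometheus healthy"
#   curl -sf http://localhost:9090/-/ready   && echo "prometheus ready"

# Derive a retention.size flag from available space, reserving 20% headroom
# because the WAL and compaction temporarily consume extra disk.
avail_gb=100    # placeholder; e.g. from: df -BG --output=avail /var/lib/prometheus
retention_size_gb=$(( avail_gb * 80 / 100 ))
echo "--storage.tsdb.retention.size=${retention_size_gb}GB"
```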

Prevention

  • Set --storage.tsdb.retention.size in addition to time-based retention to cap disk usage
  • Monitor disk usage with alerts at 70% and 85% capacity
  • Identify and reduce high-cardinality metrics before they cause disk issues
  • Size the disk to hold the expected data volume at the desired retention period, with headroom (at least 3x the daily ingestion rate) for WAL and compaction growth
  • Use remote write to offload long-term storage to systems like Thanos or Cortex
  • Implement metric cardinality limits using per-scrape sample_limit and label_limit settings in scrape configs
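The 70%/85% thresholds suggested above can be expressed as an alert and mirrored in an ad-hoc check. A sketch: the PromQL in the comment assumes node_exporter's filesystem metrics are available, and the shell helper reproduces the same threshold logic:

```shell
#!/bin/sh
# Alerting expression (assuming node_exporter filesystem metrics):
#   (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 70
# with a second rule at > 85 for the critical tier.

# Same threshold logic for ad-hoc checks: classify used/total into a tier.
usage_level() {
    used=$1 total=$2
    pct=$(( used * 100 / total ))
    if [ "$pct" -ge 85 ]; then echo critical
    elif [ "$pct" -ge 70 ]; then echo warning
    else echo ok
    fi
}

usage_level 90 100    # prints: critical
```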