Introduction

The Prometheus TSDB keeps recent metric samples in an in-memory head block (backed by a write-ahead log), which is periodically compacted into persistent on-disk blocks. When the disk hosting the TSDB data directory fills up, Prometheus can no longer write the WAL or persist blocks, causing scrape failures and data gaps. For a monitoring system, this is one of the most critical failure modes.

Symptoms

  • Prometheus logs show errors such as "TSDB has not been able to persist any blocks" or "disk full"
  • Target scrape pages show targets going down with context deadline exceeded
  • /api/v1/status/tsdb endpoint returns errors
  • Grafana dashboards show gaps in metric data during the disk-full period
  • Error message: Failed to write chunks to disk: no space left on device

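To confirm this failure mode quickly, the symptoms above can be checked from a shell. A minimal sketch, assuming a systemd unit named `prometheus` and data under `/var/lib/prometheus` (both assumptions; adjust for your deployment):

```shell
#!/bin/sh
# Quick triage for a suspected disk-full Prometheus.

# 1. Confirm the error in the logs (assumes systemd journal):
#    journalctl -u prometheus | grep -i "no space left on device"

# 2. Check the partition:
#    df -h /var/lib/prometheus

# Helper: given a df-style "Use%" value, decide whether the disk is
# effectively full for Prometheus. Compaction needs free headroom, so
# the 95% threshold here is a conservative assumption, not a hard limit.
disk_is_critical() {
    pct=${1%\%}          # strip a trailing % if present
    [ "$pct" -ge 95 ]
}

if disk_is_critical "97%"; then
    echo "disk critical"    # prints: disk critical
fi
```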
Common Causes

  • Retention period configured longer than disk capacity can support
  • High cardinality metrics creating excessive time series and chunk files
  • Disk not expanded after increasing scrape targets or retention period
  • Compaction failing silently, preventing old blocks from being cleaned up
  • Other processes consuming disk space on the same partition as Prometheus data

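The first cause above (retention configured beyond disk capacity) can be sanity-checked with back-of-envelope arithmetic. A sketch assuming the commonly cited average of ~2 bytes per compressed sample; the ingestion rate below is a placeholder you would normally take from `rate(prometheus_tsdb_head_samples_appended_total[1h])`:

```shell
#!/bin/sh
# Rough disk sizing: needed bytes ≈ samples/sec × bytes/sample × retention seconds.
samples_per_sec=100000    # placeholder; measure your own ingestion rate
bytes_per_sample=2        # assumed average for compressed samples
retention_days=15

needed_bytes=$(( samples_per_sec * bytes_per_sample * retention_days * 86400 ))
needed_gib=$(( needed_bytes / 1024 / 1024 / 1024 ))
echo "Approx. disk needed: ${needed_gib} GiB"    # prints: Approx. disk needed: 241 GiB
```

If the result exceeds the partition size, no amount of cleanup will keep the disk from refilling; reduce retention or expand the disk.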
Step-by-Step Fix

  1. Check disk usage and the Prometheus data directory size to identify the immediate cause:

```bash
df -h /var/lib/prometheus
du -sh /var/lib/prometheus/metrics2/*
```

  2. Reduce the retention period temporarily to trigger block cleanup and free disk space immediately:

```bash
# Edit the Prometheus startup flags
--storage.tsdb.retention.time=3d
# Then restart to apply the new retention
systemctl restart prometheus
```

  3. Force a TSDB snapshot and clean old blocks. Create a point-in-time snapshot, then remove old data:

```bash
# Trigger a snapshot via the admin API (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Remove the oldest blocks manually if needed
# (caution: stop Prometheus first to avoid corrupting the TSDB)
ls -dt /var/lib/prometheus/metrics2/01* | tail -20 | xargs rm -rf
```

  4. Identify high-cardinality metrics contributing to disk usage and find the worst offenders:

```bash
# Top 10 metrics by series count (the status endpoint returns the top 10)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```

  5. Expand the disk or configure retention limits as a permanent fix:

```bash
# Set retention appropriate to the disk capacity
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=50GB
```
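After applying the steps above, verify recovery and pick a size cap that leaves headroom. A sketch: the health endpoints are standard Prometheus HTTP paths, while the 20% reserve is an assumption to leave room for the WAL and compaction:

```shell
#!/bin/sh
# Verify Prometheus is serving again (run against a live instance):
#   curl -sf http://localhost:9090/-/healthy && echo "prometheus healthy"
#   curl -sf http://localhost:9090/-/ready   && echo "prometheus ready"

# Derive a retention.size flag from available space, reserving 20% headroom
# because the WAL and compaction temporarily consume extra disk.
avail_gb=100    # placeholder; e.g. from: df -BG --output=avail /var/lib/prometheus
retention_size_gb=$(( avail_gb * 80 / 100 ))
echo "--storage.tsdb.retention.size=${retention_size_gb}GB"
```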

Prevention

  • Set --storage.tsdb.retention.size in addition to time-based retention to cap disk usage
  • Monitor disk usage with alerts at 70% and 85% capacity
  • Identify and reduce high-cardinality metrics before they cause disk issues
  • Size the disk to hold the expected data volume at the desired retention period, with headroom (at least 3x the daily ingestion rate) for WAL and compaction growth
  • Use remote write to offload long-term storage to systems like Thanos or Cortex
  • Implement metric cardinality limits using per-scrape sample_limit and label_limit settings in scrape configs
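The 70%/85% thresholds suggested above can be expressed as an alert and mirrored in an ad-hoc check. A sketch: the PromQL in the comment assumes node_exporter's filesystem metrics are available, and the shell helper reproduces the same threshold logic:

```shell
#!/bin/sh
# Alerting expression (assuming node_exporter filesystem metrics):
#   (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 70
# with a second rule at > 85 for the critical tier.

# Same threshold logic for ad-hoc checks: classify used/total into a tier.
usage_level() {
    used=$1 total=$2
    pct=$(( used * 100 / total ))
    if [ "$pct" -ge 85 ]; then echo critical
    elif [ "$pct" -ge 70 ]; then echo warning
    else echo ok
    fi
}

usage_level 90 100    # prints: critical
```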