Introduction

Prometheus stores one time series per unique combination of metric name and label values. When a metric includes a label with unbounded values -- such as user IDs, request IDs, IP addresses, or URLs -- each unique value creates a new time series. This cardinality explosion quickly exhausts Prometheus memory and disk, causing scrape failures and out-of-memory crashes.
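The multiplicative effect is easy to underestimate: the worst-case series count is the product of the distinct-value counts of every label on a metric. A quick sketch with hypothetical label counts:

```python
# Worst-case series count is the product of distinct values per label.
bounded = {"method": 5, "status": 8, "path": 40}   # bounded, enumerated labels
unbounded = {**bounded, "user_id": 100_000}        # one unbounded label added

def series_count(labels):
    total = 1
    for distinct_values in labels.values():
        total *= distinct_values
    return total

print(series_count(bounded))    # 1600 series: manageable
print(series_count(unbounded))  # 160000000 series: explosion
```

A single unbounded label turns 1,600 series into 160 million; this is why one bad label can take down an otherwise healthy Prometheus.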

Symptoms

  • Prometheus memory usage grows rapidly, eventually triggering OOM killer
  • prometheus_tsdb_head_series metric shows exponential time series growth
  • Scrape duration increases as the number of time series overwhelms the TSDB
  • Query performance degrades, with simple queries taking tens of seconds
  • Error message: mmap: Cannot allocate memory or TSDB head chunk mmap: no space left on device
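To confirm the second symptom quickly, you can read the `prometheus_tsdb_head_series` gauge straight out of Prometheus's own `/metrics` endpoint. A minimal sketch of parsing the text exposition format (in practice you would fetch the body from `http://localhost:9090/metrics`; the sample below is illustrative):

```python
# Extract the prometheus_tsdb_head_series gauge from a scraped /metrics body.
def head_series(exposition: str) -> float:
    for line in exposition.splitlines():
        if line.startswith("prometheus_tsdb_head_series "):
            return float(line.split()[1])
    raise ValueError("prometheus_tsdb_head_series not found")

sample = """\
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 2.413561e+06
"""
print(head_series(sample))  # 2413561.0
```

Sampling this value periodically and plotting it is the fastest way to see whether series growth is linear (normal churn) or runaway.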

Common Causes

  • Metric exposing user_id, request_id, or client_ip as a label value
  • Application generating unique labels per HTTP request path with query parameters
  • Exporter not filtering high-cardinality labels from upstream metrics
  • Label values containing timestamps or UUIDs, creating a new series per event
  • Missing label allowlist/denylist on exporters that forward all labels
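Request paths with embedded IDs are a frequent offender in the list above; the fix is to normalize each path to its route template before using it as a label value. A sketch (the patterns and placeholders are illustrative, not a complete ruleset):

```python
import re

# Collapse high-cardinality path segments to placeholders before the
# path is used as a label value. Order matters: match UUIDs before digits.
NORMALIZERS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def normalize_path(path: str) -> str:
    path = path.split("?", 1)[0]  # query parameters are never label-safe
    for pattern, placeholder in NORMALIZERS:
        path = pattern.sub(placeholder, path)
    return path

print(normalize_path("/users/12345/orders?page=2"))  # /users/:id/orders
```

Many HTTP frameworks expose the matched route template directly (e.g. `/users/{id}`); prefer that over regex normalization when available.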

Step-by-Step Fix

  1. Identify the highest-cardinality metrics: find which metrics are creating the most series.

     ```bash
     curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
     ```

  2. Check which labels are causing the explosion: examine label cardinality.

     ```bash
     curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.labelValueCountByLabelName[:10]'
     ```

  3. Drop the problematic label using metric relabeling: remove the high-cardinality label at scrape time.

     ```yaml
     metric_relabel_configs:
       # Drop the entire metric
       - source_labels: [__name__]
         regex: "http_request_duration.*"
         action: drop
       # Or drop just the problematic labels
       - regex: "user_id|request_id|client_ip"
         action: labeldrop
     ```

  4. Fix the application instrumentation: remove the high-cardinality label at the source.

     ```java
     // WRONG: high cardinality -- one series per user
     Counter.builder("http_requests_total")
         .tag("user_id", userId)
         .register(registry);

     // CORRECT: use bounded labels
     Counter.builder("http_requests_total")
         .tag("method", method)
         .tag("status", status)
         .register(registry);
     ```
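The JSON returned by the status endpoint in steps 1 and 2 can also be post-processed without jq, which is handy in automation. A sketch that ranks labels by distinct-value count (the payload below is abbreviated sample data; the live endpoint already returns entries sorted, but sorting defensively costs nothing):

```python
import json

# Rank labels by distinct-value count from /api/v1/status/tsdb output.
def top_labels(payload: str, n: int = 10):
    stats = json.loads(payload)["data"]["labelValueCountByLabelName"]
    return sorted(stats, key=lambda entry: entry["value"], reverse=True)[:n]

sample = json.dumps({
    "status": "success",
    "data": {"labelValueCountByLabelName": [
        {"name": "user_id", "value": 184321},
        {"name": "instance", "value": 42},
        {"name": "path", "value": 9150},
    ]},
})
print(top_labels(sample, 2))  # user_id and path lead by a wide margin
```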

  5. Set series limits to prevent future explosions: cap how many series a misbehaving target can introduce.

     ```yaml
     # Prometheus has no global series limit, but you can:
     #  - set sample_limit per scrape job so oversized scrapes fail fast
     #  - use recording rules to aggregate before series counts grow
     #  - enforce limits in a remote-write backend (Cortex/Thanos)
     # Alert on series count growth rate:
     - alert: HighCardinalityGrowth
       expr: rate(prometheus_tsdb_head_series[1h]) > 100
     ```
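The recording-rule mitigation deserves a concrete shape: the rule below (names are illustrative) pre-aggregates requests down to bounded dimensions, so dashboards and downstream remote-write can use the cheap aggregate instead of the raw high-cardinality metric.

```yaml
groups:
  - name: cardinality_aggregation
    rules:
      # Keep only bounded dimensions; all other labels are summed away.
      - record: job:http_requests_total:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))
```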

Prevention

  • Define and enforce a label cardinality policy: only use labels with bounded, enumerated values
  • Use labeldrop in metric_relabel_configs to strip known high-cardinality labels at scrape time
  • Monitor prometheus_tsdb_head_series growth rate and alert on sudden increases
  • Review all new metric instrumentation in code reviews for cardinality risk
  • Use histograms with explicit bucket boundaries instead of per-request latency labels
  • Implement automated cardinality testing in CI that flags metrics with unbounded label values
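The CI check in the last bullet can be a simple lint over a test scrape of `/metrics`: flag label values that look like UUIDs, timestamps, or long numeric IDs. A minimal sketch (the patterns, thresholds, and line format handled are illustrative, not exhaustive):

```python
import re

# Label values matching these patterns are almost certainly unbounded.
SUSPECT = [
    re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-"),  # UUID prefix
    re.compile(r"^\d{10,}$"),                  # epoch timestamp / numeric ID
]

def risky_labels(exposition: str):
    """Return (metric, label, value) triples whose values look unbounded."""
    findings = []
    # Matches lines like: http_requests_total{user_id="123",method="GET"} 3
    for metric, labels in re.findall(r'^(\w+)\{([^}]*)\}', exposition, re.M):
        for label, value in re.findall(r'(\w+)="([^"]*)"', labels):
            if any(p.match(value) for p in SUSPECT):
                findings.append((metric, label, value))
    return findings

sample = 'http_requests_total{user_id="1700000000123",method="GET"} 3\n'
print(risky_labels(sample))
```

Run this against a scrape of the application in CI and fail the build when it returns any findings, so unbounded labels are caught before they reach production Prometheus.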