Introduction

Prometheus recording rules precompute expensive queries and store the results as new time series. When a recording rule takes longer to evaluate than the configured evaluation interval or timeout, Prometheus logs a timeout error and the recorded metric is not updated. This causes dashboards and alerts depending on the recorded metric to show stale data.
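For reference, a minimal recording rule group looks like this (the group and metric names here are illustrative, not taken from any specific deployment):

```yaml
# Prometheus rule file (loaded via rule_files in prometheus.yml)
groups:
  - name: example-recordings
    interval: 30s          # how often this group is evaluated
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Every rule in a group is evaluated sequentially, so one slow rule delays the whole group.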

Symptoms

  • Prometheus logs show "rule evaluation took too long" or "context deadline exceeded" messages
  • Recording rule group duration exceeds the evaluation interval in /api/v1/rules
  • Dashboards using recorded metrics show flatlined or stale values
  • prometheus_rule_evaluation_duration_seconds metric exceeds expected thresholds
  • Error message: "rule group took 45.2s to evaluate, rule evaluation interval is 30s"
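The symptoms above can be confirmed directly in PromQL, since Prometheus exports per-group timing metrics:

```promql
# Rule groups whose last evaluation took longer than their configured interval
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
```

Any series returned by this query identifies a group that is falling behind.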

Common Causes

  • Recording rule query scans too many time series due to broad label matchers
  • High cardinality in the input metrics causing the aggregation to process millions of series
  • Recording rule evaluation interval set too short for the query complexity
  • Storage backend slow due to disk I/O contention or compaction activity
  • Rule depends on subqueries with long range vectors that require scanning historical data
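To determine which cause applies, measure the cardinality of the rule's inputs first. A common PromQL pattern (the metric name is illustrative):

```promql
# Number of active series matched by the rule's input selector
count(http_requests_total)

# Top 10 metrics by series count across the server
# (expensive; run ad hoc, not as a recording rule)
topk(10, count by (__name__) ({__name__=~".+"}))
```

If the input selector matches hundreds of thousands of series, query optimization and relabeling (steps 2 and 4 below) will help most.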

Step-by-Step Fix

  1. Identify the slow recording rules: Check rule evaluation metrics. Note that the group-level field in /api/v1/rules is evaluationTime, a number of seconds, so compare it numerically rather than against a string like "30s".

```bash
# List rule groups whose last evaluation took longer than 30 seconds
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[] | select(.evaluationTime > 30) | {name: .name, duration: .evaluationTime}'
```

  2. Optimize the recording rule query: Reduce the query scope and cardinality.

```yaml
# BEFORE: broad query scanning all series
# record: job:http_requests_total:rate5m
# expr: rate(http_requests_total[5m])

# AFTER: filter by a specific label set and aggregate down to one series per job
record: job:http_requests_total:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"2.."}[5m]))
```

  3. Increase the evaluation interval for complex rule groups: Give the rule more time.

```yaml
groups:
  - name: complex-recordings
    interval: 60s
    rules:
      - record: job:complex_metric:avg
        expr: avg by (job) (complex_metric)
```

  4. Reduce cardinality of input metrics: Add metric relabeling to drop unnecessary label combinations at scrape time.

```yaml
metric_relabel_configs:
  # Keep only http_requests_total from this scrape (all other series are dropped)
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: keep
  # labeldrop matches label NAMES: remove the instance and pod labels
  - regex: "instance|pod"
    action: labeldrop
```

  5. Monitor rule evaluation performance: Set up alerting for slow evaluations. The timing metric carries a rule_group label identifying the group.

```yaml
- alert: RecordingRuleSlow
  expr: prometheus_rule_group_last_duration_seconds > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Recording rule group {{ $labels.rule_group }} is slow"
```

Prevention

  • Keep recording rule queries simple with explicit label selectors and sum by clauses
  • Set recording rule evaluation intervals based on the p99 evaluation duration plus a 50% buffer
  • Monitor prometheus_rule_group_last_duration_seconds and alert when it approaches the evaluation interval
  • Use topk() or bottomk() in recording rules to limit output series count
  • Avoid recording rules that depend on subqueries with range vectors longer than necessary
  • Regularly review recording rules and remove those no longer used by dashboards or alerts
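The monitoring recommendation above can be implemented as a ratio alert, so a single rule covers groups with different intervals (the 80% threshold is a suggested starting point, not a standard):

```yaml
- alert: RuleGroupNearInterval
  expr: prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rule group {{ $labels.rule_group }} is using over 80% of its evaluation interval"
```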