Introduction
Prometheus recording rules precompute expensive queries and store the results as new time series. When a recording rule takes longer to evaluate than the configured evaluation interval or timeout, Prometheus logs a timeout error and the recorded metric is not updated. This causes dashboards and alerts depending on the recorded metric to show stale data.
Symptoms
- Prometheus logs show
recording rule evaluation took too longorcontext deadline exceeded - Recording rule group duration exceeds the evaluation interval in
/api/v1/rules - Dashboards using recorded metrics show flatlined or stale values
prometheus_rule_evaluation_duration_secondsmetric exceeds expected thresholds- Error message:
rule group took 45.2s to evaluate, rule evaluation interval is 30s
Common Causes
- Recording rule query scans too many time series due to broad label matchers
- High cardinality in the input metrics causing the aggregation to process millions of series
- Recording rule evaluation interval set too short for the query complexity
- Storage backend slow due to disk I/O contention or compaction activity
- Rule depends on subqueries with long range vectors that require scanning historical data
Step-by-Step Fix
- 1.Identify the slow recording rules: Check rule evaluation metrics.
- 2.```bash
- 3.curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.evaluationDuration > "30s") | {name: .name, duration: .evaluationDuration}'
- 4.
` - 5.Optimize the recording rule query: Reduce the query scope and cardinality.
- 6.```yaml
- 7.# BEFORE: Broad query scanning all series
- 8.# record: job:http_requests_total:rate5m
- 9.# expr: rate(http_requests_total[5m])
# AFTER: Filter by specific label set record: job:http_requests_total:rate5m expr: sum by (job) (rate(http_requests_total{status=~"2.."}[5m])) ```
- 1.Increase the evaluation interval for complex rule groups: Give the rule more time.
- 2.```yaml
- 3.groups:
- 4.- name: complex-recordings
- 5.interval: 60s
- 6.rules:
- 7.- record: job:complex_metric:avg
- 8.expr: avg by (job) (complex_metric)
- 9.
` - 10.Reduce cardinality of input metrics: Add metric relabeling to drop unnecessary label combinations.
- 11.```yaml
- 12.metric_relabel_configs:
- 13.- source_labels: [__name__]
- 14.regex: "http_requests_total"
- 15.action: keep
- 16.- regex: "instance|pod"
- 17.action: labeldrop
- 18.
` - 19.Monitor rule evaluation performance: Set up alerting for slow evaluations.
- 20.```yaml
- 21.- alert: RecordingRuleSlow
- 22.expr: prometheus_rule_group_last_duration_seconds > 20
- 23.for: 5m
- 24.labels:
- 25.severity: warning
- 26.annotations:
- 27.summary: "Recording rule group {{ $labels.name }} is slow"
- 28.
`
Prevention
- Keep recording rule queries simple with explicit label selectors and
sum byclauses - Set recording rule evaluation intervals based on the p99 evaluation duration plus a 50% buffer
- Monitor
prometheus_rule_group_last_duration_secondsand alert when it approaches the evaluation interval - Use
topk()orbottomk()in recording rules to limit output series count - Avoid recording rules that depend on subqueries with range vectors longer than necessary
- Regularly review recording rules and remove those no longer used by dashboards or alerts