Introduction

Prometheus recording rules precompute expensive queries and store the results as new time series. When a recording rule takes longer to evaluate than the configured evaluation interval or timeout, Prometheus logs a timeout error and the recorded metric is not updated. This causes dashboards and alerts depending on the recorded metric to show stale data.
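For reference, a minimal recording rule group looks like this (the group and metric names here are illustrative, not taken from any specific deployment):

```yaml
# Prometheus rule file (loaded via rule_files in prometheus.yml)
groups:
  - name: example-recordings
    interval: 30s          # how often this group is evaluated
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Every rule in a group is evaluated sequentially, so one slow rule delays the whole group.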

Symptoms

  • Prometheus logs show "rule evaluation took too long" or "context deadline exceeded" messages
  • Recording rule group duration exceeds the evaluation interval in /api/v1/rules
  • Dashboards using recorded metrics show flatlined or stale values
  • prometheus_rule_evaluation_duration_seconds metric exceeds expected thresholds
  • Error message: "rule group took 45.2s to evaluate, rule evaluation interval is 30s"
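The symptoms above can be confirmed directly in PromQL, since Prometheus exports per-group timing metrics:

```promql
# Rule groups whose last evaluation took longer than their configured interval
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
```

Any series returned by this query identifies a group that is falling behind.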

Common Causes

  • Recording rule query scans too many time series due to broad label matchers
  • High cardinality in the input metrics causing the aggregation to process millions of series
  • Recording rule evaluation interval set too short for the query complexity
  • Storage backend slow due to disk I/O contention or compaction activity
  • Rule depends on subqueries with long range vectors that require scanning historical data
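To determine which cause applies, measure the cardinality of the rule's inputs first. A common PromQL pattern (the metric name is illustrative):

```promql
# Number of active series matched by the rule's input selector
count(http_requests_total)

# Top 10 metrics by series count across the server
# (expensive; run ad hoc, not as a recording rule)
topk(10, count by (__name__) ({__name__=~".+"}))
```

If the input selector matches hundreds of thousands of series, query optimization and relabeling (steps 2 and 4 below) will help most.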

Step-by-Step Fix

  1. Identify the slow recording rules: Check rule evaluation metrics. Note that the group-level field in /api/v1/rules is evaluationTime, a number of seconds, so compare it numerically rather than against a string like "30s".

```bash
# List rule groups whose last evaluation took longer than 30 seconds
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[] | select(.evaluationTime > 30) | {name: .name, duration: .evaluationTime}'
```

  2. Optimize the recording rule query: Reduce the query scope and cardinality.

```yaml
# BEFORE: broad query scanning all series
# record: job:http_requests_total:rate5m
# expr: rate(http_requests_total[5m])

# AFTER: filter by a specific label set and aggregate down to one series per job
record: job:http_requests_total:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"2.."}[5m]))
```

  3. Increase the evaluation interval for complex rule groups: Give the rule more time.

```yaml
groups:
  - name: complex-recordings
    interval: 60s
    rules:
      - record: job:complex_metric:avg
        expr: avg by (job) (complex_metric)
```

  4. Reduce cardinality of input metrics: Add metric relabeling to drop unnecessary label combinations at scrape time.

```yaml
metric_relabel_configs:
  # Keep only http_requests_total from this scrape (all other series are dropped)
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: keep
  # labeldrop matches label NAMES: remove the instance and pod labels
  - regex: "instance|pod"
    action: labeldrop
```

  5. Monitor rule evaluation performance: Set up alerting for slow evaluations. The timing metric carries a rule_group label identifying the group.

```yaml
- alert: RecordingRuleSlow
  expr: prometheus_rule_group_last_duration_seconds > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Recording rule group {{ $labels.rule_group }} is slow"
```

Prevention

  • Keep recording rule queries simple with explicit label selectors and sum by clauses
  • Set recording rule evaluation intervals based on the p99 evaluation duration plus a 50% buffer
  • Monitor prometheus_rule_group_last_duration_seconds and alert when it approaches the evaluation interval
  • Use topk() or bottomk() in recording rules to limit output series count
  • Avoid recording rules that depend on subqueries with range vectors longer than necessary
  • Regularly review recording rules and remove those no longer used by dashboards or alerts
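The monitoring recommendation above can be implemented as a ratio alert, so a single rule covers groups with different intervals (the 80% threshold is a suggested starting point, not a standard):

```yaml
- alert: RuleGroupNearInterval
  expr: prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rule group {{ $labels.rule_group }} is using over 80% of its evaluation interval"
```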