The Problem
Prometheus logs show recording rule evaluation errors:
```
level=error ts=2026-04-04T04:20:15.789Z caller=manager.go:456 component="rule manager" msg="Error evaluating rule" rule="job:http_requests:rate5m" err="vector contains metrics with the same name but duplicate labels"
level=error ts=2026-04-04T04:20:15.790Z caller=manager.go:457 component="rule manager" err="unknown metric name \"http_requests_total_5m_rate\""
```

Recording rules pre-compute frequently needed or computationally expensive expressions. Errors here cascade into broken dashboards and alerts.
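For reference, recording rules live in rule files that are wired into Prometheus via `rule_files` in the main config; the file name below is illustrative:

```yaml
# prometheus.yml (excerpt) - file name is illustrative
global:
  evaluation_interval: 15s   # default interval for rule groups that don't set their own

rule_files:
  - recording_rules.yml
```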
Diagnosis
Check Rule Evaluation Status
```promql
# Recording rule evaluation time (summary, with quantile labels)
prometheus_rule_evaluation_duration_seconds

# Failed rule evaluations
prometheus_rule_evaluation_failures_total

# Failure rate per rule group
rate(prometheus_rule_evaluation_failures_total[5m]) > 0
```
Check Rule Groups
```bash
# List all recording rules with their health
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "recording") | {name: .name, health: .health, lastError: .lastError}'

# Show only rules with errors
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.lastError != null and .lastError != "") | {name: .name, error: .lastError}'
```
View Prometheus Logs
```bash
# Check for rule evaluation errors
journalctl -u prometheus --since "1 hour ago" | grep -i "rule evaluation"
```

Solutions
1. Fix Syntax Errors
Common syntax issues in recording rules:
```yaml
# recording_rules.yml
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # WRONG: "by" cannot modify rate() directly
      # - record: job:http_requests:rate5m
      #   expr: rate(http_requests_total[5m]) by (job)

      # CORRECT: aggregate with sum by (...)
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```
Validate configuration:
```bash
# Check recording rules syntax
promtool check rules recording_rules.yml

# Check the full config, including referenced rule files
promtool check config prometheus.yml
```
2. Fix Label Collisions
When recording rules produce labels that conflict:
```yaml
groups:
  - name: service_metrics
    rules:
      # Problem: keeps every label, so output can collide with existing series
      # - record: http_requests:rate5m
      #   expr: rate(http_requests_total[5m])

      # Solution: explicitly drop high-cardinality labels
      - record: service:http_requests:rate5m
        expr: sum without (instance, pod) (rate(http_requests_total[5m]))

      # Or keep only the labels you need
      - record: job:http_requests:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))
```
3. Fix Non-Existent Metrics
Recording rules referencing metrics that don't exist yet:
```yaml
groups:
  - name: derived_metrics
    rules:
      # Problem: source metric may not exist at startup
      - record: app:request_rate
        expr: rate(app_requests_total[5m])

      # Solution: fall back to a zero vector when the source is absent
      - record: app:request_rate
        expr: |
          rate(app_requests_total[5m]) or vector(0)

      # Or default a ratio to 1 with "or on() vector(...)"
      - record: app:availability
        expr: |
          (sum(rate(http_requests_total{status!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])))
          or on() vector(1)
```
4. Handle Stale Data
Recording rules with missing data gaps:
```yaml
groups:
  - name: availability
    rules:
      # Problem: returns no data when there is no traffic
      - record: app:availability_ratio
        expr: |
          sum(rate(http_requests_total{status="200"}[5m]))
            / sum(rate(http_requests_total[5m]))

      # Solution: clamp to at most 1 and fill gaps with a default
      - record: app:availability_ratio
        expr: |
          clamp_max(
            sum(rate(http_requests_total{status="200"}[5m]))
              / sum(rate(http_requests_total[5m])),
            1
          )
          or on() vector(0)
```
5. Fix Evaluation Order
Rules that depend on other recording rules must be in order:
```yaml
groups:
  - name: derived_metrics
    # Rules in a group are evaluated sequentially: earlier results are available to later rules
    rules:
      # First: base rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Second: depends on the first
      - record: job:http_requests:rate5m:increase_1h
        expr: increase(job:http_requests:rate5m[1h])

  # Separate group for independent calculations
  - name: error_metrics
    rules:
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
```
6. Optimize Performance
Slow recording rules impact Prometheus:
```yaml
groups:
  - name: optimized_rules
    interval: 30s  # Reduce evaluation frequency
    rules:
      # Slow: high-cardinality query every 15s
      # - record: pod:http_requests:rate5m
      #   expr: sum by (pod) (rate(http_requests_total[5m]))

      # Better: evaluate less often and drop high-cardinality labels
      - record: namespace:http_requests:rate5m
        expr: sum without (pod, container) (rate(http_requests_total[5m]))
```
Monitor rule performance:
```promql
# Slow rule evaluations (99th percentile of the duration summary)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 1

# Rule groups taking > 1s to evaluate
prometheus_rule_group_last_duration_seconds > 1
```
Verification
Verify Recording Rules Work
```bash
# Check the rule exists and has data
curl -s 'http://localhost:9090/api/v1/query?query=job:http_requests:rate5m' | jq '.data.result'
```

```promql
# Verify recording rule output
{__name__="job:http_requests:rate5m"}

# Compare with the source expression
sum by (job) (rate(http_requests_total[5m])) == job:http_requests:rate5m
```
Check Rule Status
```bash
# All rules should report health "ok"
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "recording") | {name: .name, health: .health}'
```

Prevention
Add monitoring for recording rules:
```yaml
groups:
  - name: rule_health
    rules:
      - alert: RecordingRuleEvaluationSlow
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rules evaluating slowly"
          description: "99th percentile rule evaluation time is {{ $value }}s"

      - alert: RecordingRuleMissing
        expr: absent(job:http_requests:rate5m)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rule job:http_requests:rate5m is missing"

      - alert: RecordingRuleStale
        expr: time() - timestamp(job:http_requests:rate5m) > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rule job:http_requests:rate5m is stale"
```
Best Practices
1. Naming Convention: Use the `level:metric:operations` format (e.g., `job:http_requests:rate5m`)
2. Keep Rules Simple: One calculation per rule
3. Document Dependencies: Comment rules that depend on other recording rules
4. Monitor Performance: Track evaluation duration
5. Test Changes: Use `promtool test rules` to validate before deploying
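`promtool test rules` runs unit tests against synthetic series. A minimal sketch, assuming `recording_rules.yml` defines the `job:http_requests:rate5m` rule from earlier (file names and values are illustrative):

```yaml
# recording_rules_test.yml - run with: promtool test rules recording_rules_test.yml
rule_files:
  - recording_rules.yml

tests:
  - interval: 1m
    input_series:
      # 60 requests per minute = 1 request/second
      - series: 'http_requests_total{job="api", instance="a"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{job="api"}'
            value: 1
```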