The Problem

Prometheus logs show recording rule evaluation errors:

```bash
level=error ts=2026-04-04T04:20:15.789Z caller=manager.go:456 component="rule manager" msg="Error evaluating rule" rule="job:http_requests:rate5m" err="vector contains metrics with the same name but duplicate labels"
level=error ts=2026-04-04T04:20:15.790Z caller=manager.go:457 component="rule manager" err="unknown metric name \"http_requests_total_5m_rate\""
```

Recording rules pre-compute frequently needed or computationally expensive expressions. Errors here cascade into broken dashboards and alerts.
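For reference, recording rule files are loaded through prometheus.yml, and the global evaluation interval controls how often rule groups run (the file path below is illustrative):

```yaml
# prometheus.yml (excerpt)
global:
  evaluation_interval: 30s   # default interval for rule groups
rule_files:
  - recording_rules.yml
```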

Diagnosis

Check Rule Evaluation Status

```promql
# Recording rule evaluation time
prometheus_rule_evaluation_duration_seconds

# Failed rule evaluations
rate(prometheus_rule_evaluation_failures_total[5m])

# Last evaluation duration per rule group
prometheus_rule_group_last_duration_seconds
```

Check Rule Groups

```bash
# List all recording rules with status
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "recording") | {name: .name, health: .health, lastError: .lastError}'

# Check for errors
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.lastError != null and .lastError != "") | {name: .name, error: .lastError}'
```

View Prometheus Logs

```bash
# Check for rule evaluation errors
journalctl -u prometheus --since "1 hour ago" | grep -i "rule evaluation"
```

Solutions

1. Fix Syntax Errors

Common syntax issues in recording rules:

```yaml
# recording_rules.yml
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # WRONG: "by" cannot be attached to rate() directly
      # - record: job:http_requests:rate5m
      #   expr: rate(http_requests_total[5m]) by (job)

      # CORRECT: wrap the rate in an aggregation
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Validate configuration:

```bash
# Check recording rules syntax
promtool check rules recording_rules.yml

# Check full config (also validates referenced rule files)
promtool check config prometheus.yml
```

2. Fix Label Collisions

When recording rules produce labels that conflict:

```yaml
groups:
  - name: service_metrics
    rules:
      # Problem: output series can collide once the metric is renamed
      # - record: http_requests:rate5m
      #   expr: rate(http_requests_total[5m])

      # Solution: aggregate away the labels that vary
      - record: service:http_requests:rate5m
        expr: sum without (instance, pod) (rate(http_requests_total[5m]))

      # Or keep only specific labels
      - record: job:http_requests:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))
```

3. Fix Non-Existent Metrics

Recording rules referencing metrics that don't exist yet:

```yaml
groups:
  - name: derived_metrics
    rules:
      # Problem: the source metric may not exist at startup
      # - record: app:request_rate
      #   expr: rate(app_requests_total[5m])

      # Solution: fall back to a zero vector when no data exists
      - record: app:request_rate
        expr: |
          rate(app_requests_total[5m]) or vector(0)

      # Or default a ratio to 1 when there is no traffic
      - record: app:availability
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) or on() vector(1)
```

4. Handle Stale Data

Recording rules with missing data gaps:

```yaml
groups:
  - name: availability
    rules:
      # Problem: returns no data when there is no traffic
      # - record: app:availability_ratio
      #   expr: sum(rate(http_requests_total{status="200"}[5m])) / sum(rate(http_requests_total[5m]))

      # Solution: clamp the ratio and fill gaps with a default
      - record: app:availability_ratio
        expr: |
          clamp_max(
            sum(rate(http_requests_total{status="200"}[5m]))
            /
            sum(rate(http_requests_total[5m])),
            1
          ) or on() vector(0)
```

5. Fix Evaluation Order

Rules that depend on other recording rules must be in order:

```yaml
groups:
  - name: derived_metrics
    # Rules in a group are evaluated sequentially, so earlier
    # results are available to later rules
    rules:
      # First: base metric
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Second: depends on the first (avg_over_time, since the
      # recorded rate is a gauge, not a counter)
      - record: job:http_requests:rate5m:avg_1h
        expr: avg_over_time(job:http_requests:rate5m[1h])

  # Separate group for independent calculations
  - name: error_metrics
    rules:
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
```

6. Optimize Performance

Slow recording rules impact Prometheus:

```yaml
groups:
  - name: optimized_rules
    interval: 30s  # Reduce evaluation frequency
    rules:
      # Slow: high-cardinality per-pod aggregation every 15s
      # - record: pod:http_requests:rate5m
      #   expr: sum by (pod) (rate(http_requests_total[5m]))

      # Better: evaluate less often and drop high-cardinality labels
      - record: namespace:http_requests:rate5m
        expr: sum without (pod, container) (rate(http_requests_total[5m]))
```

Monitor rule performance:

```promql
# Rule groups whose last evaluation took more than 1s
prometheus_rule_group_last_duration_seconds > 1

# Groups whose evaluation time approaches their interval
prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds > 0.5
```

Verification

Verify Recording Rules Work

```bash
# Check rule exists and has data
curl -s 'http://localhost:9090/api/v1/query?query=job:http_requests:rate5m' | jq '.data.result'
```

```promql
# Verify recording rule output
{__name__="job:http_requests:rate5m"}

# Compare with the source expression
sum by (job) (rate(http_requests_total[5m])) == job:http_requests:rate5m
```

Check Rule Status

```bash
# All rules should report health "ok"
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "recording") | {name: .name, health: .health}'
```
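The same jq filter can back a CI gate. The sketch below runs it against an inline sample payload rather than a live server; in real use, replace the sample with the curl call against your Prometheus:

```shell
# Hypothetical CI check: count recording rules whose health is not "ok".
# The sample stands in for: curl -s http://localhost:9090/api/v1/rules
sample='{"data":{"groups":[{"rules":[{"type":"recording","name":"job:http_requests:rate5m","health":"ok"}]}]}}'
unhealthy=$(echo "$sample" | jq '[.data.groups[].rules[] | select(.type == "recording" and .health != "ok")] | length')
if [ "$unhealthy" -eq 0 ]; then
  echo "all recording rules healthy"
else
  echo "$unhealthy unhealthy recording rules" >&2
  exit 1
fi
```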

Prevention

Add monitoring for recording rules:

```yaml
groups:
  - name: rule_health
    rules:
      - alert: RecordingRuleEvaluationSlow
        expr: prometheus_rule_group_last_duration_seconds > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rules evaluating slowly"
          description: "Rule group {{ $labels.rule_group }} last took {{ $value }}s to evaluate"

      - alert: RecordingRuleMissing
        expr: absent(job:http_requests:rate5m)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rule job:http_requests:rate5m is missing"

      - alert: RecordingRuleStale
        expr: time() - timestamp(job:http_requests:rate5m) > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Recording rule job:http_requests:rate5m is stale"
```

Best Practices

1. Naming Convention: Use the level:metric:operations format (e.g., job:http_requests:rate5m)
2. Keep Rules Simple: One calculation per rule
3. Document Dependencies: Comment rules that depend on others
4. Monitor Performance: Track evaluation duration
5. Test Changes: Use promtool test rules to validate
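Testing changes can look like this in practice: promtool test rules takes a test file that feeds synthetic series into the rule file and asserts on the output. A sketch, assuming the job:http_requests:rate5m rule from earlier (file name and series are illustrative):

```yaml
# recording_rules_test.yml (run with: promtool test rules recording_rules_test.yml)
rule_files:
  - recording_rules.yml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # counter grows by 60 every 30s, i.e. 2 requests/s
      - series: 'http_requests_total{job="api", instance="a"}'
        values: '0+60x20'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{job="api"}'
            value: 2
```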