The Problem

Prometheus alerting rules are failing to evaluate or fire correctly. You see errors like:

```bash
level=error ts=2026-04-04T03:15:45.234Z caller=manager.go:567 component="rule manager" msg="Error evaluating rule" rule="HighErrorRate" err="unexpected token \"}\" in template"
level=error ts=2026-04-04T03:15:45.235Z caller=manager.go:568 component="rule manager" err="template: :1: bad character U+002D '-'"
level=warn ts=2026-04-04T03:15:46.123Z caller="alertmanager.go:234" msg="notify retry cancelled" err="context deadline exceeded"
```

Alerting rule errors mean your critical alerts aren't firing, creating monitoring blind spots.

Diagnosis

Check Alert States

```promql
# Current active alerts
ALERTS{alertstate="firing"}

# Pending alerts (waiting for 'for' duration)
ALERTS{alertstate="pending"}

# Rule evaluation totals
prometheus_rule_evaluations_total
```

View Rule Errors

```bash
# Check all alerting rules and their status
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "alerting") | {alert: .name, state: .state, lastError: .lastError}'

# Check for failed evaluations
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.lastError != null)'
```

Check Prometheus Logs

```bash
# Alert evaluation errors
journalctl -u prometheus --since "1 hour ago" | grep -i "alert"

# Template errors
journalctl -u prometheus --since "1 hour ago" | grep -i "template"
```

Solutions

1. Fix Expression Syntax

Alert expressions with syntax errors:

```yaml
# alert_rules.yml
groups:
  - name: application_alerts
    rules:
      # WRONG: missing 'for' duration, and dividing unaggregated rates
      # produces a per-series ratio with mismatched label sets
      # - alert: HighErrorRate
      #   expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1

      # CORRECT: aggregate both sides and add a 'for' duration
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
```

Validate rules:

```bash
# Check alert rules syntax
promtool check rules alert_rules.yml

# Test the expression in the Prometheus UI first:
# go to /graph and run the expr to verify it returns what you expect
```

2. Fix Template Errors

Annotation/label templates with errors:

```yaml
groups:
  - name: alert_templates
    rules:
      # WRONG: Alertmanager-style accessors in rule annotations
      # - alert: InstanceDown
      #   expr: up == 0
      #   annotations:
      #     summary: "Instance {{ .Labels.instance }} is down"  # wrong accessor
      #     description: "Value: {{ .Value }}"                  # not available here

      # CORRECT: rule annotations use $labels and $value
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: |
            Instance {{ $labels.instance }} of job {{ $labels.job }}
            has been down for more than 5 minutes.
            Current value: {{ $value }}

      # Using template functions correctly ($value here is a ratio,
      # so humanizePercentage is the right formatter)
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```

3. Fix Label Access Issues

Common label access mistakes:

```yaml
groups:
  - name: label_access
    rules:
      # WRONG: the lowercase .labels accessor does not exist
      # - alert: PodCrashLooping
      #   expr: rate(kube_pod_container_status_restarts_total[1h]) > 5
      #   annotations:
      #     summary: "Pod {{ .labels.pod }} crash looping"  # wrong

      # CORRECT: use the $labels variable
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last hour"

      # Accessing external labels
      - alert: GlobalAlert
        expr: up == 0
        annotations:
          summary: "Instance {{ $labels.instance }} down in {{ $externalLabels.cluster }}"
```

4. Fix 'for' Duration Issues

Alerts with incorrect timing:

```yaml
groups:
  - name: duration_alerts
    rules:
      # Problem: a 'for' duration this short causes alert flapping
      # - alert: HighCPU
      #   expr: rate(process_cpu_seconds_total[1m]) > 0.8
      #   for: 10s  # Too short!
      #   labels:
      #     severity: critical

      # Solution: use an appropriate 'for' duration
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total[1m]) > 0.8
        for: 5m  # Wait 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}"
```

Note that `{{ .StartsAt }}` is only available in Alertmanager notification templates, not in rule annotations; resolved notifications are likewise an Alertmanager concern, enabled via `send_resolved` on the receiver rather than anything in the rule itself.

5. Handle Missing Metrics

Alerts that fail when metrics are missing:

```yaml
groups:
  - name: metric_missing
    rules:
      # Problem: the alert silently stops evaluating if the metric disappears
      # - alert: NoTraffic
      #   expr: rate(http_requests_total[5m]) == 0
      #   # Returns no data if http_requests_total doesn't exist

      # Solution: union with absent() so missing data still produces a sample
      - alert: NoTraffic
        expr: |
          rate(http_requests_total[5m]) == 0
            or
          (absent(http_requests_total) * 0)
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "No HTTP traffic detected"

      # Alert when the metric is completely absent
      - alert: MetricMissing
        expr: absent(http_requests_total)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Metric http_requests_total is missing"
          description: "The metric http_requests_total has not been scraped for 5 minutes"
```
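If you prefer to express the absence window inside the query rather than with `for`, `absent_over_time()` (available since Prometheus 2.16) is an alternative sketch of the same check:

```yaml
# Fires as soon as the metric has been absent for the whole 10m window
- alert: MetricMissing
  expr: absent_over_time(http_requests_total[10m])
  labels:
    severity: warning
  annotations:
    summary: "Metric http_requests_total absent for 10 minutes"
```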

6. Fix Alertmanager Integration

Alerts firing but not notifying:

```yaml
# prometheus.yml - check the Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```

Verify connectivity:

```bash
# Test connection to Alertmanager
curl -s http://alertmanager:9093/api/v2/status | jq .

# Check if alerts are being sent
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state == "firing")'
```
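When alerts are firing in Prometheus but never reach a receiver, Prometheus's own notification metrics narrow down where the pipeline breaks. These queries use standard self-instrumentation metric names; the exact set can vary by Prometheus version:

```promql
# Notifications sent to each discovered Alertmanager
rate(prometheus_notifications_sent_total[5m])

# Errors while sending notifications (should be 0)
rate(prometheus_notifications_errors_total[5m])

# Notifications dropped, e.g. because the queue filled up
rate(prometheus_notifications_dropped_total[5m])
```

If `sent_total` is flat while alerts are firing, Prometheus never attempted delivery (check the `alerting:` config); if `errors_total` rises, delivery is attempted but failing (check network and Alertmanager health).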

Verification

Test Alert Manually

```promql
# Run the alert expression directly
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1

# Check if the alert would fire
ALERTS{alertname="HighErrorRate"}
```

```bash
# Verify alert status via the API
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'
```

Test Template Rendering

```bash
# Run rule unit tests with promtool; note that 'promtool test rules'
# takes a test file, which in turn references the rule files under test
promtool test rules alert_rules_test.yml

# Check a specific alert
curl -s 'http://localhost:9090/api/v1/query?query=ALERTS%7Balertname%3D%22HighErrorRate%22%7D' | jq .
```
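A minimal unit-test file for `promtool test rules` might look like the sketch below. The filename, series values, and timings are illustrative assumptions; adjust them to your rule file and thresholds:

```yaml
# alert_rules_test.yml (hypothetical filename)
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # counters growing at ~20 errors/min vs ~10 successes/min,
      # so the error ratio is well above the 0.1 threshold
      - series: 'http_requests_total{status="500", instance="app:8080"}'
        values: '0+20x10'
      - series: 'http_requests_total{status="200", instance="app:8080"}'
        values: '0+10x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running `promtool test rules alert_rules_test.yml` then verifies both the expression logic and the 'for' behavior without waiting on live data.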

Prevention

Add alerting for alert system health:

```yaml
groups:
  - name: alert_system_health
    rules:
      - alert: AlertingRulesFailed
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alerting rule evaluation failed"
          description: "{{ $value }} rule evaluations have failed in the last 5 minutes"

      - alert: AlertmanagerUnreachable
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No Alertmanager instances discovered"

      - alert: AlertNotificationFailed
        expr: rate(prometheus_notifications_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification failures detected"
```

Best Practices

  1. Test expressions first: run the expr in the Prometheus UI before adding it to a rule file
  2. Use consistent naming: alert names should be descriptive (e.g. HighErrorRate, not Alert1)
  3. Add a 'for' duration: prevent flapping with appropriate wait times
  4. Include runbook links: add a runbook_url annotation
  5. Set severity consistently: critical, warning, info
  6. Test templates: verify annotations render correctly with real data
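Pulled together, a rule following these practices might look like this sketch (the runbook URL is a placeholder, not a real endpoint):

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m])) > 0.1
  for: 5m                    # practice 3: avoid flapping
  labels:
    severity: critical       # practice 5: consistent severity levels
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"  # practice 4 (placeholder)
```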