The Problem

Prometheus alerting rules are failing to evaluate or fire correctly. You see errors like:

```bash
level=error ts=2026-04-04T03:15:45.234Z caller=manager.go:567 component="rule manager" msg="Error evaluating rule" rule="HighErrorRate" err="unexpected token \"}\" in template"
level=error ts=2026-04-04T03:15:45.235Z caller=manager.go:568 component="rule manager" err="template: :1: bad character U+002D '-'"
level=warn ts=2026-04-04T03:15:46.123Z caller="alertmanager.go:234" msg="notify retry cancelled" err="context deadline exceeded"
```

Alerting rule errors mean your critical alerts aren't firing, creating monitoring blind spots.

Diagnosis

Check Alert States

```promql
# Current active alerts
ALERTS{alertstate="firing"}

# Pending alerts (waiting for 'for' duration)
ALERTS{alertstate="pending"}

# Rule evaluation totals
prometheus_rule_evaluations_total
```

View Rule Errors

```bash
# Check all alerting rules and their status
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "alerting") | {alert: .name, state: .state, lastError: .lastError}'

# Check for failed evaluations
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.lastError != null)'
```

Check Prometheus Logs

```bash
# Alert evaluation errors
journalctl -u prometheus --since "1 hour ago" | grep -i "alert"

# Template errors
journalctl -u prometheus --since "1 hour ago" | grep -i "template"
```

Solutions

1. Fix Expression Syntax

Alert expressions with syntax errors:

```yaml
# alert_rules.yml
groups:
  - name: application_alerts
    rules:
      # WRONG: missing 'for' duration, and dividing unaggregated rates
      # produces a per-series ratio with mismatched label sets
      # - alert: HighErrorRate
      #   expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1

      # CORRECT: aggregate both sides and add a 'for' duration
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
```

Validate rules:

```bash
# Check alert rules syntax
promtool check rules alert_rules.yml

# Test the expression in the Prometheus UI first:
# go to /graph and run the expr to verify it returns what you expect
```

2. Fix Template Errors

Annotation/label templates with errors:

```yaml
groups:
  - name: alert_templates
    rules:
      # WRONG: Alertmanager-style accessors in rule annotations
      # - alert: InstanceDown
      #   expr: up == 0
      #   annotations:
      #     summary: "Instance {{ .Labels.instance }} is down"  # wrong accessor
      #     description: "Value: {{ .Value }}"                  # not available here

      # CORRECT: rule annotations use $labels and $value
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: |
            Instance {{ $labels.instance }} of job {{ $labels.job }}
            has been down for more than 5 minutes.
            Current value: {{ $value }}

      # Using template functions correctly ($value here is a ratio,
      # so humanizePercentage is the right formatter)
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```

3. Fix Label Access Issues

Common label access mistakes:

```yaml
groups:
  - name: label_access
    rules:
      # WRONG: the lowercase .labels accessor does not exist
      # - alert: PodCrashLooping
      #   expr: rate(kube_pod_container_status_restarts_total[1h]) > 5
      #   annotations:
      #     summary: "Pod {{ .labels.pod }} crash looping"  # wrong

      # CORRECT: use the $labels variable
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last hour"

      # Accessing external labels
      - alert: GlobalAlert
        expr: up == 0
        annotations:
          summary: "Instance {{ $labels.instance }} down in {{ $externalLabels.cluster }}"
```

4. Fix 'for' Duration Issues

Alerts with incorrect timing:

```yaml
groups:
  - name: duration_alerts
    rules:
      # Problem: a 'for' duration this short causes alert flapping
      # - alert: HighCPU
      #   expr: rate(process_cpu_seconds_total[1m]) > 0.8
      #   for: 10s  # Too short!
      #   labels:
      #     severity: critical

      # Solution: use an appropriate 'for' duration
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total[1m]) > 0.8
        for: 5m  # Wait 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}"
```

Note that `{{ .StartsAt }}` is only available in Alertmanager notification templates, not in rule annotations; resolved notifications are likewise an Alertmanager concern, enabled via `send_resolved` on the receiver rather than anything in the rule itself.

5. Handle Missing Metrics

Alerts that fail when metrics are missing:

```yaml
groups:
  - name: metric_missing
    rules:
      # Problem: the alert silently stops evaluating if the metric disappears
      # - alert: NoTraffic
      #   expr: rate(http_requests_total[5m]) == 0
      #   # Returns no data if http_requests_total doesn't exist

      # Solution: union with absent() so missing data still produces a sample
      - alert: NoTraffic
        expr: |
          rate(http_requests_total[5m]) == 0
            or
          (absent(http_requests_total) * 0)
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "No HTTP traffic detected"

      # Alert when the metric is completely absent
      - alert: MetricMissing
        expr: absent(http_requests_total)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Metric http_requests_total is missing"
          description: "The metric http_requests_total has not been scraped for 5 minutes"
```
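If you prefer to express the absence window inside the query rather than with `for`, `absent_over_time()` (available since Prometheus 2.16) is an alternative sketch of the same check:

```yaml
# Fires as soon as the metric has been absent for the whole 10m window
- alert: MetricMissing
  expr: absent_over_time(http_requests_total[10m])
  labels:
    severity: warning
  annotations:
    summary: "Metric http_requests_total absent for 10 minutes"
```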

6. Fix Alertmanager Integration

Alerts firing but not notifying:

```yaml
# prometheus.yml - check the Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```

Verify connectivity:

```bash
# Test connection to Alertmanager
curl -s http://alertmanager:9093/api/v2/status | jq .

# Check if alerts are being sent
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state == "firing")'
```
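When alerts are firing in Prometheus but never reach a receiver, Prometheus's own notification metrics narrow down where the pipeline breaks. These queries use standard self-instrumentation metric names; the exact set can vary by Prometheus version:

```promql
# Notifications sent to each discovered Alertmanager
rate(prometheus_notifications_sent_total[5m])

# Errors while sending notifications (should be 0)
rate(prometheus_notifications_errors_total[5m])

# Notifications dropped, e.g. because the queue filled up
rate(prometheus_notifications_dropped_total[5m])
```

If `sent_total` is flat while alerts are firing, Prometheus never attempted delivery (check the `alerting:` config); if `errors_total` rises, delivery is attempted but failing (check network and Alertmanager health).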

Verification

Test Alert Manually

```promql
# Run the alert expression directly
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1

# Check if the alert would fire
ALERTS{alertname="HighErrorRate"}
```

```bash
# Verify alert status via the API
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'
```

Test Template Rendering

```bash
# Run rule unit tests with promtool; note that 'promtool test rules'
# takes a test file, which in turn references the rule files under test
promtool test rules alert_rules_test.yml

# Check a specific alert
curl -s 'http://localhost:9090/api/v1/query?query=ALERTS%7Balertname%3D%22HighErrorRate%22%7D' | jq .
```
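A minimal unit-test file for `promtool test rules` might look like the sketch below. The filename, series values, and timings are illustrative assumptions; adjust them to your rule file and thresholds:

```yaml
# alert_rules_test.yml (hypothetical filename)
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # counters growing at ~20 errors/min vs ~10 successes/min,
      # so the error ratio is well above the 0.1 threshold
      - series: 'http_requests_total{status="500", instance="app:8080"}'
        values: '0+20x10'
      - series: 'http_requests_total{status="200", instance="app:8080"}'
        values: '0+10x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running `promtool test rules alert_rules_test.yml` then verifies both the expression logic and the 'for' behavior without waiting on live data.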

Prevention

Add alerting for alert system health:

```yaml
groups:
  - name: alert_system_health
    rules:
      - alert: AlertingRulesFailed
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alerting rule evaluation failed"
          description: "{{ $value }} rule evaluations have failed in the last 5 minutes"

      - alert: AlertmanagerUnreachable
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No Alertmanager instances discovered"

      - alert: AlertNotificationFailed
        expr: rate(prometheus_notifications_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification failures detected"
```

Best Practices

  1. Test expressions first: run the expr in the Prometheus UI before adding it to a rule file
  2. Use consistent naming: alert names should be descriptive (e.g. HighErrorRate, not Alert1)
  3. Add a 'for' duration: prevent flapping with appropriate wait times
  4. Include runbook links: add a runbook_url annotation
  5. Set severity consistently: critical, warning, info
  6. Test templates: verify annotations render correctly with real data
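Pulled together, a rule following these practices might look like this sketch (the runbook URL is a placeholder, not a real endpoint):

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m])) > 0.1
  for: 5m                    # practice 3: avoid flapping
  labels:
    severity: critical       # practice 5: consistent severity levels
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"  # practice 4 (placeholder)
```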