Alerts are firing in Prometheus, but you're not receiving notifications. This critical gap means your incident response is compromised. Let's systematically diagnose and fix Alertmanager notification failures.

Understanding the Problem

Alertmanager notification failures can occur at several points:

  • Alert routing and grouping
  • Receiver configuration
  • Network connectivity to notification services
  • Authentication with external services
  • Template rendering issues

Common error patterns:

```
notify retry for *slack.Notifier: unexpected status code 404
notify retry for *email.Email: dial tcp: lookup smtp.gmail.com: no such host
notify retry for *pagerduty.PagerDuty: unexpected status code 401
```

Initial Diagnosis

Start by checking Alertmanager's status and logs:

```bash
# Check the Alertmanager UI
# Navigate to http://alertmanager:9093

# Check Alertmanager status via API
curl -s http://localhost:9093/api/v2/status | jq '.'

# View active alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | {labels: .labels, status: .status}'

# Check Alertmanager logs (grep needs -E for alternation)
kubectl logs -l app=alertmanager -n monitoring | grep -iE "notify|error|failed"

# Or for systemd
journalctl -u alertmanager -f | grep -iE "notify|error"
```
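Alertmanager also exports its own delivery counters on `/metrics`, which make failures visible without log spelunking. A minimal sketch of filtering them; the sample output below is illustrative, and in a live setup you would pipe in `curl -s http://localhost:9093/metrics` instead:

```shell
# Filter Alertmanager's Prometheus metrics for non-zero notification failures.
# Sample metrics are piped in for illustration; use curl against :9093/metrics in practice.
cat <<'EOF' | awk '$1 ~ /^alertmanager_notifications_failed_total/ && $NF > 0 {print $1}'
alertmanager_notifications_total{integration="slack"} 42
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="slack"} 7
EOF
```

Any line this prints tells you which integration is failing, which narrows the sections below to check first.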

Common Cause 1: Slack Notification Failures

Slack is one of the most common notification channels, and failures usually stem from webhook URL issues or permission problems.

Error patterns:

```
notify retry for *slack.Notifier: unexpected status code 404
notify retry for *slack.Notifier: invalid_auth
```

Diagnosis:

```bash
# Test the Slack webhook directly
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert from Alertmanager"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# Check Alertmanager configuration
curl -s http://localhost:9093/api/v2/status | jq '.config.original'

# Look for Slack-specific errors in logs
kubectl logs -l app=alertmanager -n monitoring | grep -i slack
```
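Slack's webhook API returns short plain-text error bodies that map fairly directly to causes. A hypothetical helper (`explain_slack_error` is a sketch, not a real tool) to keep handy while debugging:

```shell
# Hypothetical helper: translate a Slack webhook response body into a likely cause.
explain_slack_error() {
  case "$1" in
    no_service|no_team)         echo "webhook URL is wrong or the webhook was revoked" ;;
    invalid_token|invalid_auth) echo "token or credentials are invalid; regenerate them" ;;
    channel_not_found|channel_is_archived) echo "channel missing, archived, or app not invited" ;;
    *)                          echo "unrecognized error: $1" ;;
  esac
}

# e.g. explain_slack_error "$(curl -s -X POST --data '{"text":"hi"}' "$WEBHOOK_URL")"
explain_slack_error invalid_auth
```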

Solution:

Verify and update Slack configuration:

```yaml
# alertmanager.yml
route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }}
          {{ end }}
          {{ end }}
```

If using Slack App tokens:

```yaml
receivers:
  - name: 'slack-app'
    slack_configs:
      - api_url: 'https://slack.com/api/chat.postMessage'
        channel: '#alerts'
        http_config:
          authorization:
            type: Bearer
            credentials_file: '/etc/alertmanager/slack-token'
```

Note that `api_url` and `api_url_file` are mutually exclusive; with a bot token you set `api_url` to `chat.postMessage` and supply the token via `http_config`.

Test the configuration:

```bash
# Validate the Alertmanager config
amtool check-config alertmanager.yml

# Reload configuration
curl -X POST http://localhost:9093/-/reload
```

Common Cause 2: Email Notification Failures

Email delivery issues are common due to SMTP authentication and network problems.

Error patterns:

```
notify retry for *email.Email: dial tcp: lookup smtp.gmail.com: no such host
notify retry for *email.Email: 535 5.7.8 Username and Password not accepted
```

Diagnosis:

```bash
# Test SMTP connectivity
telnet smtp.gmail.com 587
# Then type: EHLO localhost
# STARTTLS
# etc.

# Or use openssl
openssl s_client -connect smtp.gmail.com:587 -starttls smtp

# Check DNS resolution
nslookup smtp.gmail.com
dig smtp.gmail.com

# Check Alertmanager logs for SMTP errors (grep needs -E for alternation)
grep -iE "smtp|email|dial" /var/log/alertmanager/alertmanager.log
```
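To chase a 535 by hand inside the `openssl s_client` session above, it helps to build the `AUTH PLAIN` credential string yourself: SMTP expects base64 of `\0username\0password`. A sketch with placeholder credentials:

```shell
# Build an SMTP AUTH PLAIN string: NUL, username, NUL, password, base64-encoded.
# The credentials are placeholders; substitute your real account and app password.
printf '\0%s\0%s' 'your-email@gmail.com' 'your-app-password' | base64
```

Paste the output after `AUTH PLAIN ` in the SMTP session; a 535 response with known-good credentials points at the account (e.g. app passwords not enabled) rather than at Alertmanager.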

Solution:

Update email configuration with correct SMTP settings:

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@yourdomain.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@yourdomain.com'
        send_resolved: true
        html: '{{ template "email.html" . }}'
```

For services requiring app passwords:

```bash
# Gmail requires app-specific passwords
# Generate at: https://myaccount.google.com/apppasswords

# Store securely in a Kubernetes secret
kubectl create secret generic alertmanager-smtp \
  --from-literal=password='your-app-password' \
  -n monitoring
```

Mount the secret and use it:

```yaml
# In alertmanager.yml
global:
  smtp_auth_password_file: '/etc/alertmanager/smtp-password'

# In the Kubernetes deployment
volumeMounts:
  - name: smtp-secret
    mountPath: /etc/alertmanager/smtp-password
    subPath: password
```

Common Cause 3: PagerDuty Integration Issues

PagerDuty integration failures usually involve API key issues or routing problems.

Error patterns:

```
notify retry for *pagerduty.PagerDuty: unexpected status code 401
notify retry for *pagerduty.PagerDuty: Invalid Routing Key
```

Diagnosis:

```bash
# Test the PagerDuty Events API directly
# (Events API v2 authenticates via the routing_key in the body; no auth header needed)
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "your-routing-key",
    "event_action": "trigger",
    "dedup_key": "test-alert",
    "payload": {
      "summary": "Test alert from Alertmanager",
      "severity": "critical",
      "source": "Alertmanager"
    }
  }'

# Check PagerDuty configuration
amtool config show | grep -A 20 pagerduty
```
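Hand-written JSON in a curl `-d` argument is easy to misquote; generating the Events API v2 payload with jq rules that out. A sketch, with `your-routing-key` as a placeholder:

```shell
# Generate a well-formed Events API v2 payload; pipe it to curl with -d @-.
jq -n --arg rk 'your-routing-key' '{
  routing_key: $rk,
  event_action: "trigger",
  dedup_key: "test-alert",
  payload: {
    summary: "Test alert from Alertmanager",
    severity: "critical",
    source: "Alertmanager"
  }
}'
```

For example: `jq -n ... | curl -X POST https://events.pagerduty.com/v2/enqueue -H 'Content-Type: application/json' -d @-`.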

Solution:

Correct the PagerDuty configuration. Use `service_key` with the legacy "Prometheus" (Events API v1) integration type, and let Alertmanager pick the matching default URL:

```yaml
# alertmanager.yml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-integration-key'
        severity: critical
        class: 'deployment'
        group: 'production'
        component: 'application'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
```

For Events API V2:

```yaml
receivers:
  - name: 'pagerduty-v2'
    pagerduty_configs:
      - routing_key: 'your-routing-key'
        severity: '{{ .CommonLabels.severity }}'
        class: '{{ .CommonLabels.alertname }}'
        component: '{{ .CommonLabels.component }}'
        group: '{{ .CommonLabels.job }}'
```

Note that PagerDuty only accepts `critical`, `error`, `warning`, or `info` as a severity, so template it from a label carrying one of those values rather than from `.Status` (which renders as `firing`/`resolved`).

Common Cause 4: Webhook Delivery Failures

Custom webhooks can fail due to network problems, authentication errors, or malformed payloads.

Error patterns:

```
notify retry for *webhook.Notifier: Post "https://webhook.example.com/": dial tcp: i/o timeout
notify retry for *webhook.Notifier: unexpected status code 500
```

Diagnosis:

```bash
# Test the webhook endpoint directly
curl -X POST https://webhook.example.com/alerts \
  -H 'Content-Type: application/json' \
  -d '{"test": true}'

# Check webhook configuration
curl -s http://localhost:9093/api/v2/status | jq '.config.original' | grep -A 20 webhook

# View which receiver each alert group is routed to
curl -s http://localhost:9093/api/v2/alerts/groups | jq '.[].receiver'
```
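It also helps to know what your endpoint actually receives: Alertmanager POSTs a JSON body (webhook format version 4) with a top-level `alerts` array. A sketch of parsing a captured payload; the sample below is abbreviated:

```shell
# Parse a captured Alertmanager webhook payload (abbreviated sample piped in).
cat <<'EOF' | jq -r '.alerts[] | "\(.status)\t\(.labels.alertname)"'
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "HighCPU", "severity": "critical"},
     "annotations": {"description": "CPU above 90%"}}
  ]
}
EOF
```

If your receiver returns 500 on real alerts but 200 on `{"test": true}`, the mismatch is almost always between this payload shape and what the receiver expects.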

Solution:

Configure webhook correctly:

```yaml
# alertmanager.yml
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'https://webhook.example.com/alerts'
        send_resolved: true
        http_config:
          basic_auth:
            username: 'alertmanager'
            password: 'webhook-password'
          tls_config:
            insecure_skip_verify: false
        max_alerts: 100
```

For custom payloads:

```yaml
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://api.example.com/incidents'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: 'your-api-token'
```

Common Cause 5: Routing and Grouping Issues

Sometimes alerts fire but don't reach the intended receiver due to routing misconfiguration.

Error pattern: Alerts appear in Alertmanager UI but aren't delivered to any receiver.

Diagnosis:

```bash
# Check current route configuration
amtool config show

# Test route matching
amtool config routes test --config.file=alertmanager.yml alertname=HighCPU severity=critical

# Check alert status in Alertmanager
amtool alert query

# View silence rules
amtool silence query
```

Solution:

Verify and fix routing:

```yaml
# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts go to Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 5m
      repeat_interval: 12h

    # Database alerts need special handling
    - match_re:
        service: ^(mysql|postgres|redis)$
      receiver: 'database-team'
      routes:
        - match:
            severity: critical
          receiver: 'database-pagerduty'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```
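A recurring routing mistake is a route pointing at a receiver that was renamed or never defined; `amtool check-config` catches this, but a rough sketch of the same check with awk is shown below. It only handles the simple single-quoted style used in this article, and the sample config deliberately omits the `pagerduty-critical` receiver:

```shell
# List receivers referenced by routes but never defined under receivers:.
# Rough sketch; only handles the simple single-quoted YAML style shown above.
cat <<'EOF' > /tmp/alertmanager-sample.yml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
receivers:
  - name: 'default'
EOF

awk -F"'" '
  /^ *receiver:/ { used[$2] = 1 }
  /^ *- name:/   { defined[$2] = 1 }
  END { for (r in used) if (!(r in defined)) print "undefined receiver: " r }
' /tmp/alertmanager-sample.yml
```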

Common Cause 6: Template Rendering Errors

Invalid notification templates can cause failures.

Error patterns:

```
template: email:1: unexpected EOF
template: slack:2: undefined variable ".Alerts"
```

Diagnosis:

```bash
# Test template rendering
amtool template render --template.glob='/etc/alertmanager/templates/*.tmpl' \
  --template.text='{{ template "slack.title" . }}'

# Validate the config, which also parses the template files it references
amtool check-config alertmanager.yml
```

Solution:

Fix template syntax:

```yaml
# In alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```

```
{{/* templates/slack.tmpl */}}
{{ define "slack.title" }}
{{ if eq .Status "firing" }}:fire: {{ .Alerts.Firing | len }} alerts firing
{{ else if eq .Status "resolved" }}:checkered_flag: {{ .Alerts.Resolved | len }} alerts resolved
{{ end }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Description:* {{ .Annotations.description }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if eq .Status "resolved" }}*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}{{ end }}
{{ end }}
{{ end }}
```

Common Cause 7: Alertmanager Silences

Active silences can block notifications unexpectedly.

Diagnosis:

```bash
# List all silences
amtool silence query

# Check silences via API
curl -s http://localhost:9093/api/v2/silences | jq '.[] | {id: .id, matchers: .matchers, createdBy: .createdBy, comment: .comment}'

# Check if a specific alert is silenced
amtool silence query alertname=HighCPU
```

Solution:

```bash
# Remove unwanted silences
amtool silence expire <silence-id>

# Or via API
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>
```
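When auditing many silences, filter for the active ones before expiring anything. A sketch over a captured `/api/v2/silences` response; the sample below is abbreviated but matches the `status.state` field the API returns:

```shell
# List IDs of active silences from a captured /api/v2/silences response.
# In a live setup: curl -s http://localhost:9093/api/v2/silences | jq -r '...'
cat <<'EOF' | jq -r '.[] | select(.status.state == "active") | .id'
[
  {"id": "abc-123", "status": {"state": "active"},  "comment": "maintenance window"},
  {"id": "def-456", "status": {"state": "expired"}, "comment": "old incident"}
]
EOF
```

Each ID this prints can be fed straight into `amtool silence expire`.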

Verification

After fixing, verify notifications are working:

```bash
# Send a test alert
amtool alert add alertname=TestAlert severity=warning \
  --annotation=summary="Test notification" \
  --generator-url="http://localhost:9090/graph"

# Check the alert was received
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.labels.alertname=="TestAlert")'

# Check the notification was sent (look at logs)
kubectl logs -l app=alertmanager -n monitoring --tail=100 | grep -iE "TestAlert|notify"

# Verify which receiver the test alert routes to (same labels as the alert above)
amtool config routes test alertname=TestAlert severity=warning
```

Prevention

Monitor Alertmanager health:

```yaml
# Prometheus alerting rules for Alertmanager
groups:
  - name: alertmanager_health
    rules:
      - alert: AlertmanagerConfigInconsistent
        expr: count_values("config_hash", alertmanager_config_hash) BY (cluster) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager configurations are inconsistent"

      - alert: AlertmanagerNotificationFailed
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager notification failures detected"

      - alert: AlertmanagerSilenced
        expr: ALERTS{alertstate="firing", alertname="SilencedAlert"} > 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Alerts are being silenced"
```

Notification failures are usually configuration or connectivity issues. Start by testing the notification channel directly, then verify Alertmanager's configuration and logs.