Alerts are firing in Prometheus, but you're not receiving notifications. This critical gap means your incident response is compromised. Let's systematically diagnose and fix Alertmanager notification failures.
Understanding the Problem
Alertmanager notification failures can occur at several points:
- Alert routing and grouping
- Receiver configuration
- Network connectivity to notification services
- Authentication with external services
- Template rendering issues
Common error patterns:
```
notify retry for *slack.Notifier: unexpected status code 404
notify retry for *email.Email: dial tcp: lookup smtp.gmail.com: no such host
notify retry for *pagerduty.PagerDuty: unexpected status code 401
```
Initial Diagnosis
Start by checking Alertmanager's status and logs:
```bash
# Check the Alertmanager UI
# Navigate to http://alertmanager:9093

# Check Alertmanager status via the API
curl -s http://localhost:9093/api/v2/status | jq '.'

# View active alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | {labels: .labels, status: .status}'

# Check Alertmanager logs (-E enables the alternation pattern)
kubectl logs -l app=alertmanager -n monitoring | grep -iE "notify|error|failed"

# Or for systemd
journalctl -u alertmanager -f | grep -iE "notify|error"
```
Common Cause 1: Slack Notification Failures
Slack is one of the most common notification channels, and failures usually stem from webhook URL issues or permission problems.
Error pattern:
```
notify retry for *slack.Notifier: unexpected status code 404
notify retry for *slack.Notifier: invalid_auth
```
Diagnosis:
```bash
# Test the Slack webhook directly
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert from Alertmanager"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# Check the Alertmanager configuration
curl -s http://localhost:9093/api/v2/status | jq '.config.original'

# Look for Slack-specific errors in logs
kubectl logs -l app=alertmanager -n monitoring | grep -i slack
```
Solution:
Verify and update Slack configuration:
```yaml
# alertmanager.yml
route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }}
          {{ end }}
          {{ end }}
```
If using Slack App tokens (note that `api_url` and `api_url_file` are mutually exclusive, so the token belongs in `http_config` only):

```yaml
receivers:
  - name: 'slack-app'
    slack_configs:
      - api_url: 'https://slack.com/api/chat.postMessage'
        channel: '#alerts'
        http_config:
          authorization:
            type: Bearer
            credentials_file: '/etc/alertmanager/slack-token'
```

Test the configuration:
```bash
# Validate the Alertmanager config
amtool check-config alertmanager.yml

# Reload the configuration
curl -X POST http://localhost:9093/-/reload
```
Common Cause 2: Email Notification Failures
Email delivery issues are common due to SMTP authentication and network problems.
Error pattern:
```
notify retry for *email.Email: dial tcp: lookup smtp.gmail.com: no such host
notify retry for *email.Email: 535 5.7.8 Username and Password not accepted
```
Diagnosis:
```bash
# Test SMTP connectivity
telnet smtp.gmail.com 587
# Then type: EHLO localhost
# STARTTLS
# etc.

# Or use openssl
openssl s_client -connect smtp.gmail.com:587 -starttls smtp

# Check DNS resolution
nslookup smtp.gmail.com
dig smtp.gmail.com

# Check Alertmanager logs for SMTP errors (-E enables the alternation)
grep -iE "smtp|email|dial" /var/log/alertmanager/alertmanager.log
```
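The same handshake can be scripted. This Python sketch (the smarthost name you pass in is a placeholder for your own) checks DNS resolution and STARTTLS negotiation without authenticating, failing in roughly the same places that produce Alertmanager's "no such host" and TLS errors:

```python
import smtplib
import socket

def check_smtp(host: str, port: int = 587) -> str:
    """Resolve the smarthost, connect, and negotiate STARTTLS.

    Raises socket.gaierror on DNS failure (Alertmanager's "no such host")
    and OSError/smtplib errors on connection or protocol problems.
    """
    socket.getaddrinfo(host, port)           # DNS check fails first, as in the logs
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.ehlo()
        if smtp.has_extn("starttls"):        # server advertises STARTTLS
            smtp.starttls()
            smtp.ehlo()
            return "ok: STARTTLS negotiated"
        return "warn: server does not offer STARTTLS"
```

Run it as `check_smtp("smtp.gmail.com", 587)` from the same host Alertmanager runs on, so you test the network path that actually matters.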
Solution:
Update email configuration with correct SMTP settings:
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@yourdomain.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@yourdomain.com'
        send_resolved: true
        html: '{{ template "email.html" . }}'
```
For services requiring app passwords:
```bash
# Gmail requires app-specific passwords
# Generate one at: https://myaccount.google.com/apppasswords

# Store it securely in a Kubernetes secret
kubectl create secret generic alertmanager-smtp \
  --from-literal=password='your-app-password' \
  -n monitoring
```
Mount the secret and use it:
```yaml
# In alertmanager.yml
global:
  smtp_auth_password_file: '/etc/alertmanager/smtp-password'

# In the Kubernetes deployment
volumeMounts:
  - name: smtp-secret
    mountPath: /etc/alertmanager/smtp-password
    subPath: password
```
Common Cause 3: PagerDuty Integration Issues
PagerDuty integration failures usually involve API key issues or routing problems.
Error pattern:
```
notify retry for *pagerduty.PagerDuty: unexpected status code 401
notify retry for *pagerduty.PagerDuty: Invalid Routing Key
```
Diagnosis:
```bash
# Test the PagerDuty Events API v2 directly
# (the enqueue endpoint authenticates via routing_key; no auth header needed)
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "your-routing-key",
    "event_action": "trigger",
    "dedup_key": "test-alert",
    "payload": {
      "summary": "Test alert from Alertmanager",
      "severity": "critical",
      "source": "Alertmanager"
    }
  }'

# Check the PagerDuty configuration
amtool config show | grep -A 20 pagerduty
```
Solution:
Correct PagerDuty configuration:
```yaml
# alertmanager.yml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-integration-key'  # legacy Events API v1 key
        severity: critical
        class: 'deployment'
        group: 'production'
        component: 'application'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
```

For Events API V2:
```yaml
receivers:
  - name: 'pagerduty-v2'
    pagerduty_configs:
      - routing_key: 'your-routing-key'
        # must render to one of: critical, warning, error, info
        severity: '{{ .CommonLabels.severity }}'
        class: '{{ .CommonLabels.alertname }}'
        component: '{{ .CommonLabels.component }}'
        group: '{{ .CommonLabels.job }}'
```

Common Cause 4: Webhook Delivery Failures
Custom webhooks can fail due to network issues, authentication problems, or payload format issues.
Error pattern:
```
notify retry for *webhook.Notifier: Post "https://webhook.example.com/": dial tcp: i/o timeout
notify retry for *webhook.Notifier: unexpected status code 500
```
Diagnosis:
```bash
# Test the webhook endpoint directly
curl -X POST https://webhook.example.com/alerts \
  -H 'Content-Type: application/json' \
  -d '{"test": true}'

# Check the webhook configuration
curl -s http://localhost:9093/api/v2/status | jq '.config.original' | grep -A 20 webhook

# View which receivers the alert groups are routed to
curl -s http://localhost:9093/api/v2/alerts/groups | jq '.[].receiver'
```
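Payload-format failures are easier to debug when you know the JSON shape Alertmanager POSTs to webhooks (payload version 4). Below is a sketch of a parser a receiver might run on each delivery, with a trimmed sample payload; the `summarize` helper is hypothetical, not part of any library:

```python
import json

# Trimmed example of the version-4 payload Alertmanager sends to webhooks.
SAMPLE = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook",
  "groupLabels": {"alertname": "HighCPU"},
  "commonLabels": {"alertname": "HighCPU", "severity": "critical"},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighCPU", "instance": "node1"},
      "annotations": {"description": "CPU above 90%"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
""")

def summarize(payload: dict) -> str:
    """One-line summary a webhook receiver might log per delivery."""
    firing = [a for a in payload["alerts"] if a["status"] == "firing"]
    name = payload["groupLabels"].get("alertname", "<none>")
    return f'{payload["status"].upper()} {name}: {len(firing)} firing alert(s)'

print(summarize(SAMPLE))  # FIRING HighCPU: 1 firing alert(s)
```

If your endpoint returns 500 on real traffic but works with `{"test": true}`, the receiver is almost certainly choking on this structure rather than on connectivity.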
Solution:
Configure webhook correctly:
```yaml
# alertmanager.yml
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'https://webhook.example.com/alerts'
        send_resolved: true
        max_alerts: 100
        http_config:
          basic_auth:
            username: 'alertmanager'
            password: 'webhook-password'
          tls_config:
            insecure_skip_verify: false
```

For bearer-token authentication:
```yaml
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://api.example.com/incidents'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: 'your-api-token'
```

Common Cause 5: Routing and Grouping Issues
Sometimes alerts fire but don't reach the intended receiver due to routing misconfiguration.
Error pattern: Alerts appear in Alertmanager UI but aren't delivered to any receiver.
Diagnosis:
```bash
# Check the current route configuration
amtool config show

# Test route matching
amtool config routes test --config.file=alertmanager.yml alertname=HighCPU severity=critical

# Check alert status in Alertmanager
amtool alert query

# View silence rules
amtool silence query
```
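When reasoning about why an alert missed its intended receiver, it can help to model the route traversal directly. This toy Python matcher mimics Alertmanager's depth-first, first-matching-child-wins routing with fallback to the parent receiver; it deliberately ignores `match_re`, `matchers`, and `continue`, so treat it as a mental model, not the real algorithm:

```python
def find_receiver(route: dict, labels: dict) -> str:
    """Return the receiver a label set would be routed to.

    Simplified model: a child matches when every key in its 'match'
    block equals the alert's label; the first matching child wins and
    is descended into; otherwise the current node's receiver applies.
    """
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return find_receiver(child, labels)
    return route["receiver"]

# Mirrors the shape of the routing tree used elsewhere in this guide
# (receiver names are illustrative).
tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}
print(find_receiver(tree, {"alertname": "HighCPU", "severity": "critical"}))  # pagerduty-critical
print(find_receiver(tree, {"alertname": "DiskFull", "severity": "info"}))     # default
```

The common routing bug is visible in miniature here: an alert whose `severity` label is misspelled or absent silently falls through to `default`, which may have no working receiver.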
Solution:
Verify and fix routing:
```yaml
# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts go to Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 5m
      repeat_interval: 12h

    # Database alerts need special handling
    - match_re:
        service: ^(mysql|postgres|redis)$
      receiver: 'database-team'
      routes:
        - match:
            severity: critical
          receiver: 'database-pagerduty'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```
Common Cause 6: Template Rendering Errors
Invalid notification templates can cause failures.
Error pattern:
```
template: email:1: unexpected EOF
template: slack:2: undefined variable ".Alerts"
```
Diagnosis:
```bash
# Test template rendering
# (flag names assume a recent amtool; confirm with `amtool template render --help`)
amtool template render --template.glob='/etc/alertmanager/templates/*.tmpl' \
  --template.text='{{ template "slack.title" . }}'

# amtool check-config also parses the templates referenced by the config
amtool check-config alertmanager.yml
```
Solution:
Fix template syntax:
```yaml
# In alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```

```
# templates/slack.tmpl
{{ define "slack.title" }}
{{ if eq .Status "firing" }}
:fire: {{ .Alerts.Firing | len }} alerts firing
{{ else if eq .Status "resolved" }}
:checkered_flag: {{ .Alerts.Resolved | len }} alerts resolved
{{ end }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Description:* {{ .Annotations.description }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if eq .Status "resolved" }}
*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
{{ end }}
```
Common Cause 7: Alertmanager Silences
Active silences can block notifications unexpectedly.
Diagnosis:
```bash
# List all silences
amtool silence query

# Check silences via the API
curl -s http://localhost:9093/api/v2/silences | jq '.[] | {id: .id, matchers: .matchers, createdBy: .createdBy, comment: .comment}'

# Check whether a specific alert is silenced
amtool silence query alertname=HighCPU
```
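Silence matching itself is easy to model: a silence suppresses an alert only when every one of its matchers matches the alert's labels. The sketch below works against the matcher shape `/api/v2/silences` returns (`name`, `value`, `isRegex`), and deliberately ignores the `isEqual`/negation field for brevity:

```python
import re

def is_silenced(alert_labels: dict, silence_matchers: list) -> bool:
    """True if every matcher in a silence matches the alert's labels.

    An absent label is treated as the empty string, as Alertmanager does.
    Negated (isEqual: false) matchers are omitted from this sketch.
    """
    for m in silence_matchers:
        actual = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], actual):
                return False
        elif actual != m["value"]:
            return False
    return True

matchers = [{"name": "alertname", "value": "HighCPU", "isRegex": False},
            {"name": "instance", "value": "node[0-9]+", "isRegex": True}]
print(is_silenced({"alertname": "HighCPU", "instance": "node7"}, matchers))  # True
print(is_silenced({"alertname": "HighCPU", "instance": "db1"}, matchers))    # False
```

A broad regex matcher like `instance=~".*"` combined with a common label is a frequent cause of "silenced more than intended".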
Solution:
```bash
# Remove unwanted silences
amtool silence expire <silence-id>

# Or via the API
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>
```
Verification
After fixing, verify notifications are working:
```bash
# Send a test alert
amtool alert add alertname=TestAlert severity=warning \
  --annotation=summary="Test notification" \
  --generator-url="http://localhost:9090/graph"

# Check the alert was received
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.labels.alertname=="TestAlert")'

# Check the notification was sent (look at the logs; -E enables the alternation)
kubectl logs -l app=alertmanager -n monitoring --tail=100 | grep -iE "TestAlert|notify"

# Verify where the test alert is routed (same labels as the alert above)
amtool config routes test alertname=TestAlert severity=warning
```
Prevention
Monitor Alertmanager health:
```yaml
# Prometheus alerting rules for Alertmanager
groups:
  - name: alertmanager_health
    rules:
      - alert: AlertmanagerConfigInconsistent
        expr: count_values("config_hash", alertmanager_config_hash) by (cluster) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager configurations are inconsistent"

      - alert: AlertmanagerNotificationFailed
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager notification failures detected"

      - alert: AlertmanagerSilenced
        expr: ALERTS{alertstate="firing", alertname="SilencedAlert"} > 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Alerts are being silenced"
```
Notification failures are usually configuration or connectivity issues. Start by testing the notification channel directly, then verify Alertmanager's configuration and logs.