Introduction

Alertmanager routes alerts to notification receivers including webhooks for custom integrations. When the webhook endpoint is unreachable, returns errors, or times out, Alertmanager retries with exponential backoff. If all retries are exhausted, the alert notification is lost, leaving the operations team unaware of active incidents.
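For reference, Alertmanager POSTs a JSON body like the following (abridged; alert names and labels here are illustrative) to the configured webhook URL. The receiver must reply with a 2xx status, otherwise the delivery is counted as failed and retried:

```json
{
  "version": "4",
  "groupKey": "{}:{alertname=\"HighCPU\"}",
  "status": "firing",
  "receiver": "custom-webhook",
  "groupLabels": { "alertname": "HighCPU" },
  "commonLabels": { "alertname": "HighCPU", "severity": "critical" },
  "commonAnnotations": { "summary": "CPU usage above 90%" },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighCPU", "instance": "node1" },
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph"
    }
  ]
}
```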

Symptoms

  • Alertmanager logs show webhook notification failed or context deadline exceeded
  • alertmanager_notifications_failed_total metric increases for the webhook receiver
  • Alerts fire but no notifications arrive at the downstream system
  • Webhook endpoint shows no incoming requests from Alertmanager
  • Error message: error="Post "http://webhook:8080/alerts": dial tcp: connection refused"

Common Causes

  • Webhook endpoint crashed or was redeployed without Alertmanager configuration update
  • Network policy or firewall blocking Alertmanager from reaching the webhook URL
  • Webhook returning non-2xx HTTP status codes, causing Alertmanager to mark delivery as failed
  • Webhook endpoint TLS certificate expired or self-signed without CA in Alertmanager truststore
  • Alertmanager retry queue full due to persistent webhook failures, dropping new notifications

Step-by-Step Fix

  1. Check Alertmanager notification failure metrics: identify the failing webhook.

     ```bash
     curl -s http://alertmanager:9093/metrics | grep alertmanager_notifications_failed_total
     ```

  2. Test webhook endpoint connectivity from Alertmanager: verify the endpoint is reachable.

     ```bash
     # From the Alertmanager pod or server
     curl -v -X POST http://webhook:8080/alerts \
       -H "Content-Type: application/json" \
       -d '{"version":"4","status":"firing","alerts":[]}'
     ```

  3. Check webhook endpoint logs for errors: determine whether requests are arriving and failing.

     ```bash
     kubectl logs -l app=webhook-receiver --tail=50
     ```

  4. Update the Alertmanager webhook configuration: fix the webhook URL or add TLS configuration.

     ```yaml
     # alertmanager.yml
     receivers:
       - name: 'custom-webhook'
         webhook_configs:
           - url: 'http://webhook-receiver:8080/alerts'
             send_resolved: true
             http_config:
               tls_config:
                 ca_file: /etc/alertmanager/ca.pem
     ```

  5. Reload the Alertmanager configuration: apply the updated configuration.

     ```bash
     curl -X POST http://alertmanager:9093/-/reload
     # Or send SIGHUP
     kill -HUP $(pidof alertmanager)
     ```
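If the existing receiver is suspect, a throwaway receiver can help isolate the problem: point Alertmanager at it and watch whether payloads arrive. The following is a minimal sketch in Python; the port (8080), path (`/alerts`), and log format are assumptions, not requirements of Alertmanager.

```python
# Minimal sketch of an Alertmanager-compatible webhook receiver.
# Port 8080 and path /alerts are assumptions for this example.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload: dict) -> str:
    """Return a one-line summary of an Alertmanager webhook payload."""
    alerts = payload.get("alerts", [])
    names = [a.get("labels", {}).get("alertname", "unknown") for a in alerts]
    return f"{payload.get('status', 'unknown')}: {len(alerts)} alert(s) {names}"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alerts":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(summarize(payload))
        # Answer 2xx quickly; any non-2xx status makes Alertmanager
        # count the delivery as failed and retry it.
        self.send_response(200)
        self.end_headers()


# To run:  HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```

Because the handler replies 200 unconditionally, any delivery gap observed with this receiver in place points at networking or Alertmanager configuration rather than the receiver itself.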

Prevention

  • Expose a health check endpoint on the webhook receiver and monitor it independently; Alertmanager does not probe receivers before sending notifications
  • Configure multiple notification receivers for critical alerts (webhook + email + PagerDuty)
  • Monitor alertmanager_notifications_failed_total and alert on sustained failure rates
  • Tune group_interval and repeat_interval for your webhook's reliability; Alertmanager's per-notification retry backoff is internal and not user-configurable
  • Deploy webhook receivers with high availability and auto-scaling to handle alert storms
  • Test end-to-end alert delivery regularly using synthetic alerts in staging environments
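As a sketch of the monitoring suggestion above, a Prometheus alerting rule on the failure counter might look like the following; the group name, threshold, and `for` duration are illustrative and should be tuned to your environment:

```yaml
# prometheus-rules.yml (illustrative names and thresholds)
groups:
  - name: alertmanager-delivery
    rules:
      - alert: AlertmanagerWebhookFailing
        expr: rate(alertmanager_notifications_failed_total{integration="webhook"}[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager webhook notifications are failing"
```

Route this alert to a receiver other than the failing webhook (for example email or PagerDuty), or it will never be seen.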