Introduction

Grafana sends alert notifications to Slack via incoming webhooks. During an alert storm -- when many alerts fire simultaneously -- Grafana can exceed Slack's rate limit of 1 message per second per webhook URL. Rate-limited notifications are dropped, meaning critical alerts never reach the team, exactly when they are most needed.
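The arithmetic is unforgiving: at roughly one message per second, a burst of hundreds of alerts saturates the webhook almost immediately. A minimal Python sketch of the drop behavior (illustrative only, not Grafana's actual delivery code; the limiter is a toy model of Slack's limit):

```python
from dataclasses import dataclass

@dataclass
class OnePerSecondLimiter:
    """Toy model of Slack's ~1 msg/sec webhook limit (illustrative only)."""
    last_sent: float = float("-inf")

    def allow(self, now: float) -> bool:
        # Permit a message only if a full second has passed since the last one.
        if now - self.last_sent >= 1.0:
            self.last_sent = now
            return True
        return False

def simulate_storm(n_alerts: int, window_seconds: float) -> tuple[int, int]:
    """Fire n_alerts evenly over window_seconds; return (delivered, dropped)."""
    limiter = OnePerSecondLimiter()
    delivered = 0
    for i in range(n_alerts):
        t = i * (window_seconds / n_alerts)
        if limiter.allow(t):
            delivered += 1
    return delivered, n_alerts - delivered

# 200 alerts in 10 seconds: only 10 get through, 190 are dropped.
print(simulate_storm(200, 10.0))
```

In other words, a 200-alert storm over 10 seconds loses 95% of its notifications, which is why grouping and retry (below in the fix steps) matter.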

Symptoms

  • Slack channel stops receiving Grafana alert notifications during incident
  • Grafana logs show Failed to send webhook to Slack: 429 Too Many Requests
  • alerting_notification_sent_total metric shows fewer sent than fired alerts
  • Slack returns rate_limited in the webhook response body
  • Alert notifications resume after the storm subsides, but alerts fired during the gap are lost

Common Causes

  • Alert storm from a widespread infrastructure issue firing hundreds of alerts simultaneously
  • Slack webhook rate limit of 1 message per second per channel exceeded
  • No notification deduplication or grouping configured in Grafana alert rules
  • Multiple alert rules targeting the same Slack channel without coordination
  • Grafana alert evaluation interval too short, generating notifications faster than Slack can process

Step-by-Step Fix

  1. Check Grafana alert notification logs for rate limiting: Confirm the failure cause.

     ```bash
     journalctl -u grafana-server | grep -iE "slack|429|rate.limit" | tail -20
     ```

  2. Configure alert notification grouping: Group related alerts into a single notification.

     ```yaml
     # Alertmanager routing config (if using external Alertmanager)
     route:
       receiver: slack
       group_by: ['alertname', 'severity', 'namespace']
       group_wait: 30s
       group_interval: 5m
       repeat_interval: 4h
     ```
  3. Rely on Grafana's notification retry handling: Grafana's embedded Alertmanager retries failed notification deliveries automatically; give evaluations enough headroom so retried sends are not cut short. (Note that the [smtp] timeout setting affects only email notifications, not Slack.)

     ```ini
     # grafana.ini
     [unified_alerting]
     evaluation_timeout = 30s
     ```
  4. Implement alert deduplication at the rule level: Reduce notification volume.
     • Set the alert rule evaluation interval to 1m or longer.
     • Use a "Pending for" period to filter out transient alerts.
     • Consolidate multiple similar alert rules into one.

  5. Add a secondary notification channel for critical alerts: Ensure alerts are not lost. Configure PagerDuty or email as a backup channel, and route critical alerts to both Slack and PagerDuty.
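The retry idea behind step 3 can be sketched as a generic retry-on-429 loop. This is a hypothetical helper for illustration, not Grafana's internal code; `post` and `sleep` are injectable stand-ins so the backoff logic is visible on its own:

```python
import time
from typing import Callable

def send_with_retry(post: Callable[[dict], int], payload: dict,
                    max_attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Post a payload, retrying with exponential backoff on HTTP 429.

    `post` returns an HTTP status code; `sleep` is injectable for testing.
    """
    for attempt in range(max_attempts):
        status = post(payload)
        if status == 200:
            return True
        if status != 429:
            return False  # Non-rate-limit errors are not retried here.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # Back off 1s, 2s, 4s, ...
    return False
```

With a fake `post` that returns 429 twice and then 200, the call succeeds after backing off for 1s and then 2s; a real sender would also honor Slack's Retry-After header when present.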

Prevention

  • Configure alert grouping with appropriate group_wait and group_interval to batch notifications
  • Design alert routing with Slack's webhook rate limit in mind -- at most 60 notifications per minute per channel
  • Use separate Slack channels for different alert severity levels to distribute load
  • Implement alert storm detection that temporarily aggregates alerts during high-volume periods
  • Monitor notification delivery success rate and alert when the failure rate exceeds 5%
  • Use Grafana's notification policies to route alerts through multiple channels for critical severities
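The delivery-success monitoring suggested above could be expressed as a Prometheus alerting rule along these lines. The metric names `grafana_alerting_notification_sent_total` and `grafana_alerting_notification_failed_total` and their labels are assumptions based on Grafana's exported metrics; verify them against your instance's /metrics endpoint before using this:

```yaml
# Sketch of a Prometheus rule; metric names are assumptions, check /metrics.
groups:
  - name: notification-delivery
    rules:
      - alert: SlackNotificationFailureRateHigh
        expr: |
          sum(rate(grafana_alerting_notification_failed_total{type="slack"}[5m]))
            /
          sum(rate(grafana_alerting_notification_sent_total{type="slack"}[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of Slack alert notifications are failing"
```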