Introduction

Prometheus uses Kubernetes service discovery to automatically find and scrape targets based on pod, service, endpoint, and node resources. When Prometheus makes too many API requests -- due to frequent resource changes, many discovery configurations, or a small cluster API rate limit -- the Kubernetes API server throttles the requests. This causes Prometheus to have stale target lists and miss newly created or terminated pods.
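A minimal sketch of such a discovery configuration (job name and namespace are illustrative); restricting the watched namespaces is one of the simplest ways to cut API traffic:

```yaml
# Scrape config using Kubernetes pod discovery.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["monitoring"]   # limit the watch scope to reduce API load
```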

Symptoms

  • Prometheus logs show Kubernetes service discovery errors such as "failed to list" with 429 Too Many Requests
  • Newly created pods are not discovered as scrape targets
  • Terminated pods remain in the target list, showing as DOWN
  • Kubernetes API server logs show rate limit exceeded for the Prometheus service account
  • prometheus_sd_kubernetes_http_request_total shows increasing rate of 429 responses
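To catch the last symptom automatically, a rule along these lines can alert on 429s from service discovery (the status_code label name is an assumption; verify it against your Prometheus version):

```yaml
# Prometheus rule file: fire when Kubernetes SD requests are being throttled.
groups:
  - name: sd-throttling
    rules:
      - alert: KubernetesSDThrottled
        expr: rate(prometheus_sd_kubernetes_http_request_total{status_code="429"}[5m]) > 0
        for: 10m
        annotations:
          summary: Kubernetes API is throttling Prometheus service discovery
```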

Common Causes

  • Large number of pods and services causing frequent API watch re-lists
  • Multiple Prometheus instances each running their own service discovery
  • Overly frequent configuration reloads or short service discovery refresh intervals causing repeated list calls
  • Kubernetes API server rate limits (APF - API Priority and Fairness) too restrictive
  • Watch connections dropping frequently due to network issues, forcing expensive re-lists

Step-by-Step Fix

  1. Confirm API rate limiting from Prometheus logs: Verify 429 errors.

     ```bash
     kubectl logs -n monitoring prometheus-prometheus-0 | grep -E "429|rate limit"
     ```

  2. Check Kubernetes API server rate limit metrics: Verify the throttling is API server-side.

     ```bash
     kubectl get --raw /metrics | grep apiserver_request_total | grep 'code="429"'
     ```

  3. Reduce scrape and evaluation frequency: Longer intervals do not change discovery API calls directly, but they lower overall load and churn.

     ```yaml
     # Prometheus Operator configuration (Helm values)
     prometheus:
       prometheusSpec:
         scrapeInterval: 30s
         evaluationInterval: 30s
     ```

  4. Grant RBAC permissions for efficient watch-based discovery: Without the watch verb, Prometheus falls back to repeated, expensive list calls.

     ```yaml
     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRole
     metadata:
       name: prometheus
     rules:
       - apiGroups: [""]
         resources: ["pods", "endpoints", "services", "nodes"]
         verbs: ["get", "list", "watch"]
     ```

  5. Reduce duplicated watch load: Where consumers only need object-state metrics, deploy kube-state-metrics so a single component watches the API on their behalf.

     ```bash
     # Deploy kube-state-metrics to serve object-state metrics from one watcher
     helm install kube-state-metrics prometheus-community/kube-state-metrics -n monitoring
     ```
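If the API server's default APF settings are the bottleneck, a dedicated priority level for the Prometheus service account can keep its list/watch requests from being starved. A sketch, assuming the service account is prometheus in the monitoring namespace (object names and share values are illustrative):

```yaml
# Dedicated APF priority level for monitoring traffic.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: monitoring
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        queueLengthLimit: 50
        handSize: 4
---
# Route Prometheus discovery requests to that priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: prometheus-sd
spec:
  priorityLevelConfiguration:
    name: monitoring
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: prometheus
            namespace: monitoring
      resourceRules:
        - apiGroups: [""]
          resources: ["pods", "endpoints", "services", "nodes"]
          verbs: ["list", "watch"]
          clusterScope: true
          namespaces: ["*"]
```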

Prevention

  • Use Kubernetes APF (API Priority and Fairness) to allocate sufficient priority level for Prometheus
  • Deploy a single Prometheus instance per cluster for service discovery and aggregate its data via federation or remote write, instead of having every instance run its own discovery
  • Monitor prometheus_sd_kubernetes_http_request_total for 429 response rates
  • Ensure Prometheus has proper RBAC with watch permissions (not just list and get)
  • Watch Prometheus's own service discovery metrics (such as prometheus_sd_kubernetes_events_total) to spot discovery churn early
  • Keep the number of distinct kubernetes_sd_configs jobs to a minimum by using role selectors
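Role selectors let one broad discovery job replace several narrow ones by filtering server-side. A sketch, assuming workloads carry a common label (the label value here is hypothetical):

```yaml
# One endpoints discovery job filtered by label instead of many separate jobs.
kubernetes_sd_configs:
  - role: endpoints
    selectors:
      - role: endpoints
        label: "app.kubernetes.io/part-of=my-platform"
```

Because the selector is applied by the API server, Prometheus receives and caches fewer objects, which also shrinks the cost of any watch re-list.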