## Introduction
Prometheus uses Kubernetes service discovery to automatically find and scrape targets based on pod, service, endpoint, and node resources. When Prometheus makes too many API requests (due to frequent resource changes, many discovery configurations, or a restrictive API server rate limit), the Kubernetes API server throttles those requests. As a result, Prometheus serves stale target lists and misses newly created or terminated pods.
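For context, this is the kind of discovery configuration involved: a minimal pod-discovery scrape job. The job name and the `prometheus.io/scrape` annotation convention are illustrative, not required:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod          # watch pod resources via the Kubernetes API
    relabel_configs:
      # Keep only pods that opt in via an annotation (convention, not built-in)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Each `kubernetes_sd_configs` entry maintains its own watch connections to the API server, which is why the number of distinct jobs matters for API load.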
## Symptoms

- Prometheus logs show `kubernetes: failed to list` errors with `429 Too Many Requests`
- Newly created pods are not discovered as scrape targets
- Terminated pods remain in the target list, showing as DOWN
- Kubernetes API server logs show `rate limit exceeded` for the Prometheus service account
- `prometheus_sd_kubernetes_http_request_total` shows an increasing rate of 429 responses
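The last symptom can be checked directly in Prometheus. A sketch of the query, assuming the metric exposes the HTTP status via a `status_code` label (label names can differ across Prometheus versions, so verify against your `/metrics` output):

```promql
# Rate of throttled (429) Kubernetes service discovery requests over 5 minutes
sum(rate(prometheus_sd_kubernetes_http_request_total{status_code="429"}[5m]))
```

A sustained non-zero value here confirms the API server is throttling discovery requests.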
## Common Causes

- Large number of pods and services causing frequent API watch re-lists
- Multiple Prometheus instances each running their own service discovery
- Short `--discovery.reloader` interval causing repeated API calls
- Kubernetes API server rate limits (APF, API Priority and Fairness) too restrictive
- Watch connections dropping frequently due to network issues, forcing expensive re-lists
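If restrictive APF settings are the cause, one server-side remedy is a dedicated FlowSchema that maps the Prometheus service account to a higher priority level. The sketch below assumes a service account named `prometheus` in the `monitoring` namespace and the built-in `workload-high` priority level; the API version depends on your Kubernetes release (`flowcontrol.apiserver.k8s.io/v1` is GA as of 1.29):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: prometheus-sd
spec:
  # Route matching requests to the built-in workload-high priority level
  priorityLevelConfiguration:
    name: workload-high
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: prometheus       # assumption: your Prometheus service account
            namespace: monitoring  # assumption: your monitoring namespace
      resourceRules:
        - apiGroups: [""]
          resources: ["pods", "services", "endpoints", "nodes"]
          verbs: ["get", "list", "watch"]
          clusterScope: true
          namespaces: ["*"]
```

Verify which flow schema Prometheus's requests actually match with `kubectl get flowschemas` before and after applying.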
## Step-by-Step Fix

1. Confirm API rate limiting from Prometheus logs: verify 429 errors.

   ```bash
   kubectl logs -n monitoring prometheus-prometheus-0 | grep -E "429|rate limit"
   ```

2. Check Kubernetes API server rate limit metrics: verify the throttling is API server-side.

   ```bash
   kubectl get --raw /metrics | grep apiserver_request_total | grep "429"
   ```

3. Reduce service discovery refresh frequency: limit the rate of API calls.

   ```yaml
   # Prometheus Operator configuration
   prometheus:
     prometheusSpec:
       scrapeInterval: 30s
       evaluationInterval: 30s
   ```

4. Grant RBAC permissions for efficient watch-based discovery: ensure Prometheus uses watches, not repeated polls. Without `watch` permission, Prometheus falls back to frequent full lists, which is far more expensive for the API server.

   ```yaml
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus
   rules:
     - apiGroups: [""]
       resources: ["pods", "endpoints", "services", "nodes"]
       verbs: ["get", "list", "watch"]
   ```

5. Use a single shared service discovery instance: deploy a kube-state-metrics based approach so each Prometheus does not run its own discovery against the API server.

   ```bash
   # Deploy kube-state-metrics to reduce per-Prometheus SD load
   helm install kube-state-metrics prometheus-community/kube-state-metrics -n monitoring
   ```
## Prevention

- Use Kubernetes APF (API Priority and Fairness) to allocate a sufficient priority level for Prometheus
- Deploy a single Prometheus instance for service discovery and share targets via federation
- Monitor `prometheus_sd_kubernetes_http_request_total` for 429 response rates
- Ensure Prometheus has proper RBAC with `watch` permissions (not just `list` and `get`)
- Consider using `--enable-feature=extra-scrape-metrics` for detailed SD metrics
- Keep the number of distinct `kubernetes_sd_configs` jobs to a minimum by using `role` selectors
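With the federation approach, downstream Prometheus instances scrape the central instance's `/federate` endpoint instead of running their own Kubernetes service discovery. A sketch of the downstream scrape config; the target address `prometheus-sd.monitoring.svc:9090` and the match expression are assumptions for illustration:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep labels from the source Prometheus
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'         # assumption: federate all jobs; narrow in practice
    static_configs:
      - targets:
          - prometheus-sd.monitoring.svc:9090  # assumption: central SD instance
```

Only the central instance then holds watch connections to the Kubernetes API, so API load no longer scales with the number of Prometheus replicas.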