Introduction

Prometheus uses Kubernetes service discovery to automatically find and scrape targets based on pod, service, endpoint, and node resources. When Prometheus makes too many API requests -- due to frequent resource changes, many discovery configurations, or a small cluster API rate limit -- the Kubernetes API server throttles the requests. This causes Prometheus to have stale target lists and miss newly created or terminated pods.
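A minimal sketch of such a discovery configuration (job name and namespace are illustrative); restricting the watched namespaces is one of the simplest ways to cut API traffic:

```yaml
# Scrape config using Kubernetes pod discovery.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["monitoring"]   # limit the watch scope to reduce API load
```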

Symptoms

  • Prometheus logs show Kubernetes service discovery errors such as "failed to list" with 429 Too Many Requests
  • Newly created pods are not discovered as scrape targets
  • Terminated pods remain in the target list, showing as DOWN
  • Kubernetes API server logs show rate limit exceeded for the Prometheus service account
  • prometheus_sd_kubernetes_http_request_total shows increasing rate of 429 responses
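To catch the last symptom automatically, a rule along these lines can alert on 429s from service discovery (the status_code label name is an assumption; verify it against your Prometheus version):

```yaml
# Prometheus rule file: fire when Kubernetes SD requests are being throttled.
groups:
  - name: sd-throttling
    rules:
      - alert: KubernetesSDThrottled
        expr: rate(prometheus_sd_kubernetes_http_request_total{status_code="429"}[5m]) > 0
        for: 10m
        annotations:
          summary: Kubernetes API is throttling Prometheus service discovery
```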

Common Causes

  • Large number of pods and services causing frequent API watch re-lists
  • Multiple Prometheus instances each running their own service discovery
  • Overly frequent configuration reloads or short service discovery refresh intervals causing repeated list calls
  • Kubernetes API server rate limits (APF - API Priority and Fairness) too restrictive
  • Watch connections dropping frequently due to network issues, forcing expensive re-lists

Step-by-Step Fix

  1. Confirm API rate limiting from Prometheus logs: Verify 429 errors.

     ```bash
     kubectl logs -n monitoring prometheus-prometheus-0 | grep -E "429|rate limit"
     ```

  2. Check Kubernetes API server rate limit metrics: Verify the throttling is API server-side.

     ```bash
     kubectl get --raw /metrics | grep apiserver_request_total | grep 'code="429"'
     ```

  3. Reduce scrape and evaluation frequency: Longer intervals do not change discovery API calls directly, but they lower overall load and churn.

     ```yaml
     # Prometheus Operator configuration (Helm values)
     prometheus:
       prometheusSpec:
         scrapeInterval: 30s
         evaluationInterval: 30s
     ```

  4. Grant RBAC permissions for efficient watch-based discovery: Without the watch verb, Prometheus falls back to repeated, expensive list calls.

     ```yaml
     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRole
     metadata:
       name: prometheus
     rules:
       - apiGroups: [""]
         resources: ["pods", "endpoints", "services", "nodes"]
         verbs: ["get", "list", "watch"]
     ```

  5. Reduce duplicated watch load: Where consumers only need object-state metrics, deploy kube-state-metrics so a single component watches the API on their behalf.

     ```bash
     # Deploy kube-state-metrics to serve object-state metrics from one watcher
     helm install kube-state-metrics prometheus-community/kube-state-metrics -n monitoring
     ```
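If the API server's default APF settings are the bottleneck, a dedicated priority level for the Prometheus service account can keep its list/watch requests from being starved. A sketch, assuming the service account is prometheus in the monitoring namespace (object names and share values are illustrative):

```yaml
# Dedicated APF priority level for monitoring traffic.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: monitoring
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        queueLengthLimit: 50
        handSize: 4
---
# Route Prometheus discovery requests to that priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: prometheus-sd
spec:
  priorityLevelConfiguration:
    name: monitoring
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: prometheus
            namespace: monitoring
      resourceRules:
        - apiGroups: [""]
          resources: ["pods", "endpoints", "services", "nodes"]
          verbs: ["list", "watch"]
          clusterScope: true
          namespaces: ["*"]
```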

Prevention

  • Use Kubernetes APF (API Priority and Fairness) to allocate sufficient priority level for Prometheus
  • Deploy a single Prometheus instance per cluster for service discovery and aggregate its data via federation or remote write, instead of having every instance run its own discovery
  • Monitor prometheus_sd_kubernetes_http_request_total for 429 response rates
  • Ensure Prometheus has proper RBAC with watch permissions (not just list and get)
  • Watch Prometheus's own service discovery metrics (such as prometheus_sd_kubernetes_events_total) to spot discovery churn early
  • Keep the number of distinct kubernetes_sd_configs jobs to a minimum by using role selectors
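Role selectors let one broad discovery job replace several narrow ones by filtering server-side. A sketch, assuming workloads carry a common label (the label value here is hypothetical):

```yaml
# One endpoints discovery job filtered by label instead of many separate jobs.
kubernetes_sd_configs:
  - role: endpoints
    selectors:
      - role: endpoints
        label: "app.kubernetes.io/part-of=my-platform"
```

Because the selector is applied by the API server, Prometheus receives and caches fewer objects, which also shrinks the cost of any watch re-list.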