Introduction
Prometheus federation allows a central Prometheus server to scrape selected metrics from other Prometheus instances through their `/federate` endpoint. When an upstream (federated) server is under heavy load, carries a large number of time series, or hits disk I/O bottlenecks, the federation scrape can exceed the configured `scrape_timeout`. The result is missing aggregated metrics at the central level and gaps in global dashboards.
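Concretely, a federation scrape is just an HTTP GET against the upstream's `/federate` endpoint with one or more `match[]` series selectors. A minimal sketch of the URL the central server requests (the host, port, and matcher here are illustrative assumptions, not values from your environment):

```bash
# Build the federation scrape URL the central server would request.
# Host, port, and matcher are illustrative assumptions.
match='{job=~"critical.*"}'
# Percent-encode the selector for the query string using jq's @uri filter
encoded=$(printf '%s' "$match" | jq -rR @uri)
url="http://upstream-prom:9090/federate?match[]=${encoded}"
echo "$url"
```

The more series the selector matches, the longer the upstream takes to serve this request, which is exactly where the timeouts discussed below originate.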
Symptoms
- Central Prometheus shows federation targets as DOWN with timeout errors
- Federation scrape duration exceeds the configured `scrape_timeout`
- `prometheus_target_sync_length_seconds` shows high values for federation jobs
- Upstream Prometheus CPU and disk I/O spike during federation scrape windows
- Error message: `server returned HTTP status 503 Service Unavailable: query timed out`
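The first symptom usually surfaces in the central server's targets API. As a sketch, here is an abbreviated, fabricated `/api/v1/targets` response and the `jq` filter used throughout this guide to isolate the failing federation targets:

```bash
# Abbreviated, fabricated sample of the central server's /api/v1/targets
# response; only the fields used for triage are included.
cat > /tmp/targets.json <<'EOF'
{"data":{"activeTargets":[
  {"scrapePool":"federation","scrapeUrl":"http://upstream-prom-1:9090/federate",
   "health":"down","lastError":"context deadline exceeded"},
  {"scrapePool":"node","scrapeUrl":"http://app-1:9100/metrics",
   "health":"up","lastError":""}
]}}
EOF
# Keep only federation targets and print their health and last error
jq -r '.data.activeTargets[]
       | select(.scrapePool == "federation")
       | "\(.scrapeUrl) \(.health): \(.lastError)"' /tmp/targets.json
# → http://upstream-prom-1:9090/federate down: context deadline exceeded
```

A `lastError` of `context deadline exceeded` is the client-side twin of the 503 above: the central server gave up waiting before the upstream finished serving the federation query.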
Common Causes
- Upstream Prometheus serving federation requests is overloaded with local scrape and query load
- Federation query matches too many time series, causing slow remote read on the upstream
- Network latency between central and upstream Prometheus servers
- Upstream Prometheus performing TSDB compaction during federation scrape window
- `scrape_timeout` set too low for the federation query complexity and upstream load
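To test the "too many time series" hypothesis, ask the upstream how many series a federation matcher actually selects, e.g. with the instant query `count({job=~"critical.*"})` against its `/api/v1/query` endpoint. The JSON below is a fabricated example of that API's response shape, used to show the extraction offline:

```bash
# Fabricated example of an /api/v1/query response for
# count({job=~"critical.*"}) — an instant-vector value is [timestamp, "<count>"].
cat > /tmp/series_count.json <<'EOF'
{"status":"success","data":{"resultType":"vector",
 "result":[{"metric":{},"value":[1700000000,"250000"]}]}}
EOF
# Pull out the series count; counts in the hundreds of thousands are a
# red flag for a single federation scrape
jq -r '.data.result[0].value[1]' /tmp/series_count.json   # → 250000
```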
Step-by-Step Fix
1. Check federation scrape performance metrics: Identify the slow federation targets.

```bash
curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | {url: .scrapeUrl, health: .health, lastError: .lastError}'
```

2. Increase scrape timeout for federation jobs: Give federation more time to complete. (Prometheus requires `scrape_timeout` to be no greater than `scrape_interval`.)

```yaml
scrape_configs:
  - job_name: 'federation'
    scrape_interval: 1m
    scrape_timeout: 45s
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"critical.*"}'
    static_configs:
      - targets: ['upstream-prom-1:9090', 'upstream-prom-2:9090']
```

3. Reduce federation query scope: Only federate necessary metrics to reduce upstream load.

```yaml
params:
  'match[]':
    - '{__name__=~"up|node_cpu.*|node_memory.*|http_requests.*"}'
```

4. Optimize upstream Prometheus for federation read load: Improve upstream query performance.

```bash
# Check upstream TSDB health (number of series in the head block)
curl -s http://upstream-prom:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'
# Consider increasing upstream resources if the series count is large
```

5. Verify federation scrape recovery: Confirm the federation targets are healthy.

```bash
curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | .health'
```
Prevention
- Use selective metric federation with explicit `match[]` parameters rather than scraping all metrics
- Set `scrape_timeout` to at least 2x the expected federation scrape duration based on historical data
- Deploy dedicated read replicas for federation to avoid impacting primary Prometheus query performance
- Monitor federation scrape duration and alert when it approaches the timeout threshold
- Consider using Thanos or Cortex for horizontal scalability instead of native federation
- Stagger federation scrape times across upstream instances to avoid simultaneous load spikes
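The monitoring item above can be sketched as a Prometheus alerting rule. The group name, alert name, 80% threshold, and the 45s timeout it hard-codes are illustrative assumptions; substitute your own job label and configured `scrape_timeout`:

```yaml
groups:
  - name: federation-health
    rules:
      - alert: FederationScrapeNearTimeout
        # Fires when a federation scrape uses more than 80% of an
        # assumed 45s scrape_timeout for a sustained period
        expr: scrape_duration_seconds{job="federation"} > 0.8 * 45
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape of {{ $labels.instance }} is approaching its scrape_timeout"
```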