Introduction

Prometheus federation allows a central Prometheus server to scrape selected time series from other (upstream) Prometheus instances via their /federate endpoints. When an upstream server is under heavy load, holds a large number of time series, or is experiencing disk I/O bottlenecks, the federation scrape can exceed the configured timeout. The result is missing aggregated metrics at the central level and gaps in global dashboards.
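For reference, a federation scrape is an ordinary HTTP GET against the upstream's /federate endpoint with one or more match[] selectors; the hostname and selector below are placeholders:

```
http://upstream-prom:9090/federate?match[]={job="node"}
```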

Symptoms

  • Central Prometheus shows federation targets as DOWN with timeout errors
  • scrape_duration_seconds for the federation job on the central server approaches or exceeds the configured scrape_timeout
  • Upstream Prometheus CPU and disk I/O spike during federation scrape windows
  • Error message: server returned HTTP status 503 Service Unavailable: query timed out
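To quantify the symptom, compare the central server's scrape duration against its timeout; the query below assumes the federation job is named `federation`:

```promql
scrape_duration_seconds{job="federation"}
```

If this value sits close to the configured scrape_timeout, the next successful scrape is likely to time out.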

Common Causes

  • Upstream Prometheus serving federation requests is overloaded with local scrape and query load
  • Federation query matches too many time series, causing slow remote read on the upstream
  • Network latency between central and upstream Prometheus servers
  • Upstream Prometheus performing TSDB compaction during federation scrape window
  • scrape_timeout set too low for the federation query complexity and upstream load
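To check whether the federation selector matches too many series, run a count on the upstream server (the selector below is the example match[] used later in this runbook):

```promql
count({job=~"critical.*"})
```

Counts in the hundreds of thousands or more are a strong sign the federation scope needs to be narrowed.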

Step-by-Step Fix

  1. Check federation scrape performance metrics: identify the slow federation targets.

     ```bash
     curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | {url: .scrapeUrl, health: .health, lastError: .lastError}'
     ```

  2. Increase the scrape timeout for federation jobs: give federation more time to complete. Note that Prometheus requires scrape_timeout to be less than or equal to scrape_interval.

     ```yaml
     scrape_configs:
       - job_name: 'federation'
         scrape_interval: 1m
         scrape_timeout: 45s
         metrics_path: '/federate'
         honor_labels: true  # preserve the original job/instance labels from the upstream
         params:
           'match[]':
             - '{job=~"critical.*"}'
         static_configs:
           - targets: ['upstream-prom-1:9090', 'upstream-prom-2:9090']
     ```

  3. Reduce the federation query scope: federate only the metrics you need to reduce upstream load.

     ```yaml
     params:
       'match[]':
         - '{__name__=~"up|node_cpu.*|node_memory.*|http_requests.*"}'
     ```

  4. Optimize the upstream Prometheus for federation read load: improve upstream query performance.

     ```bash
     # Check upstream TSDB health (number of in-memory head series)
     curl -s http://upstream-prom:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'
     # If the series count is very high, consider increasing upstream resources
     ```

  5. Verify federation scrape recovery: confirm the federation targets are healthy.

     ```bash
     curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | .health'
     ```

Prevention

  • Use selective metric federation with explicit match[] parameters rather than scraping all metrics
  • Set scrape_timeout to at least 2x the expected federation scrape duration based on historical data
  • Deploy dedicated read replicas for federation to avoid impacting primary Prometheus query performance
  • Monitor federation scrape duration and alert when it approaches the timeout threshold
  • Consider using Thanos or Cortex for horizontal scalability instead of native federation
  • Stagger federation scrape times across upstream instances to avoid simultaneous load spikes
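As a sketch of the monitoring recommendation above (assuming the federation job is named `federation` and a 45s scrape_timeout), an alerting rule can fire when scrape duration approaches the timeout:

```yaml
groups:
  - name: federation-health
    rules:
      - alert: FederationScrapeNearTimeout
        # Fires when the federation scrape takes more than 80% of the 45s timeout
        expr: scrape_duration_seconds{job="federation"} > 0.8 * 45
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape duration is approaching the 45s scrape_timeout"
```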