Introduction

Prometheus federation allows a central Prometheus server to scrape selected time series from other (upstream) Prometheus instances via their /federate endpoints. When an upstream server is under heavy load, holds a large number of time series, or is experiencing disk I/O bottlenecks, the federation scrape can exceed the configured timeout. The result is missing aggregated metrics at the central level and gaps in global dashboards.
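For reference, a federation scrape is an ordinary HTTP GET against the upstream's /federate endpoint with one or more match[] selectors; the hostname and selector below are placeholders:

```
http://upstream-prom:9090/federate?match[]={job="node"}
```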

Symptoms

  • Central Prometheus shows federation targets as DOWN with timeout errors
  • scrape_duration_seconds for the federation job on the central server approaches or exceeds the configured scrape_timeout
  • Upstream Prometheus CPU and disk I/O spike during federation scrape windows
  • Error message: server returned HTTP status 503 Service Unavailable: query timed out
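To quantify the symptom, compare the central server's scrape duration against its timeout; the query below assumes the federation job is named `federation`:

```promql
scrape_duration_seconds{job="federation"}
```

If this value sits close to the configured scrape_timeout, the next successful scrape is likely to time out.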

Common Causes

  • Upstream Prometheus serving federation requests is overloaded with local scrape and query load
  • Federation query matches too many time series, causing slow remote read on the upstream
  • Network latency between central and upstream Prometheus servers
  • Upstream Prometheus performing TSDB compaction during federation scrape window
  • scrape_timeout set too low for the federation query complexity and upstream load
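To check whether the federation selector matches too many series, run a count on the upstream server (the selector below is the example match[] used later in this runbook):

```promql
count({job=~"critical.*"})
```

Counts in the hundreds of thousands or more are a strong sign the federation scope needs to be narrowed.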

Step-by-Step Fix

  1. Check federation scrape performance metrics: identify the slow federation targets.

     ```bash
     curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | {url: .scrapeUrl, health: .health, lastError: .lastError}'
     ```

  2. Increase the scrape timeout for federation jobs: give federation more time to complete. Note that Prometheus requires scrape_timeout to be less than or equal to scrape_interval.

     ```yaml
     scrape_configs:
       - job_name: 'federation'
         scrape_interval: 1m
         scrape_timeout: 45s
         metrics_path: '/federate'
         honor_labels: true  # preserve the original job/instance labels from the upstream
         params:
           'match[]':
             - '{job=~"critical.*"}'
         static_configs:
           - targets: ['upstream-prom-1:9090', 'upstream-prom-2:9090']
     ```

  3. Reduce the federation query scope: federate only the metrics you need to reduce upstream load.

     ```yaml
     params:
       'match[]':
         - '{__name__=~"up|node_cpu.*|node_memory.*|http_requests.*"}'
     ```

  4. Optimize the upstream Prometheus for federation read load: improve upstream query performance.

     ```bash
     # Check upstream TSDB health (number of in-memory head series)
     curl -s http://upstream-prom:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'
     # If the series count is very high, consider increasing upstream resources
     ```

  5. Verify federation scrape recovery: confirm the federation targets are healthy.

     ```bash
     curl -s http://central-prom:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapePool == "federation") | .health'
     ```

Prevention

  • Use selective metric federation with explicit match[] parameters rather than scraping all metrics
  • Set scrape_timeout to at least 2x the expected federation scrape duration based on historical data
  • Deploy dedicated read replicas for federation to avoid impacting primary Prometheus query performance
  • Monitor federation scrape duration and alert when it approaches the timeout threshold
  • Consider using Thanos or Cortex for horizontal scalability instead of native federation
  • Stagger federation scrape times across upstream instances to avoid simultaneous load spikes
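As a sketch of the monitoring recommendation above (assuming the federation job is named `federation` and a 45s scrape_timeout), an alerting rule can fire when scrape duration approaches the timeout:

```yaml
groups:
  - name: federation-health
    rules:
      - alert: FederationScrapeNearTimeout
        # Fires when the federation scrape takes more than 80% of the 45s timeout
        expr: scrape_duration_seconds{job="federation"} > 0.8 * 45
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape duration is approaching the 45s scrape_timeout"
```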