The Problem
Prometheus federation is failing to scrape metrics from subordinate Prometheus instances. You see errors like:
```
level=error ts=2026-04-04T00:00:18.234Z caller=scrape.go:1234 component="scrape manager" target=http://prometheus-dc1:9090/federate msg="Scrape failed" err="Get \"http://prometheus-dc1:9090/federate\": context deadline exceeded"
level=warn ts=2026-04-04T00:00:19.345Z caller=scrape.go:1235 msg="Error scraping federate endpoint" err="server returned HTTP status 400 Bad Request"
level=error ts=2026-04-04T00:00:20.456Z caller=scrape.go:1236 msg="Federation returned no data"
```

Federation errors break global monitoring views and prevent aggregating metrics across datacenters or clusters.
Diagnosis
Check Federation Endpoint
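```bash
# Test federation endpoint directly.
# -G with --data-urlencode percent-encodes the selector and keeps curl
# from interpreting [] and {} in the URL as glob patterns.
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode "match[]={__name__=~'.+'}"

# With specific matchers
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode "match[]={job='node-exporter'}" \
  --data-urlencode "match[]={job='kubelet'}"

# Check response headers only (body discarded)
curl -sS -G -o /dev/null -D - "http://prometheus-dc1:9090/federate" \
  --data-urlencode "match[]=up"
```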
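Selectors contain characters such as `{`, `"`, and `=~` that are not safe in a raw query string, and curl additionally treats `[]` and `{}` in a URL as glob patterns. As a rough sketch, the helper below builds a percent-encoded `/federate` URL in plain shell, which can be handy for tools that have no equivalent of `--data-urlencode`. The function names `urlencode` and `federate_url` are made up for this example, and the code assumes ASCII-only selectors.

```bash
# urlencode STRING
# Percent-encode every character except the RFC 3986 unreserved set.
# Assumption: input is ASCII, which covers typical PromQL selectors.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}         # everything after the first character
    c=${s%"$rest"}      # the first character itself
    case "$c" in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s' "$out"
}

# federate_url BASE_URL MATCHER...
# Print a /federate URL with one encoded match[] parameter per matcher.
federate_url() {
  base=$1
  shift
  url=$base/federate
  sep='?'
  for m in "$@"; do
    url=$url${sep}match%5B%5D=$(urlencode "$m")
    sep='&'
  done
  printf '%s\n' "$url"
}
```

For example, `curl -s "$(federate_url http://prometheus-dc1:9090 '{job="node-exporter"}' up)" | head -20` requests both selectors without any quoting surprises.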
Check Federation Metrics
```promql
# Federation scrape success
up{job="federation-dc1"}

# Federation scrape duration
scrape_duration_seconds{job="federation-dc1"}

# Samples received from federation
scrape_samples_scraped{job="federation-dc1"}

# Scrapes rejected for exceeding sample_limit.
# This is a server-wide counter on the global Prometheus, so it does not
# carry the federation job label.
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
```
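A scrape duration is only meaningful relative to the configured `scrape_timeout`: a federation job that regularly uses most of its timeout is one slow response away from `context deadline exceeded`. The sketch below turns the two numbers into a percentage. The function name `scrape_headroom` and the 80% warning mark are assumptions chosen for illustration, not Prometheus defaults.

```bash
# scrape_headroom DURATION_SECONDS TIMEOUT_SECONDS
# Prints the share of the timeout used by the scrape as a whole percentage.
# Returns 1 once usage reaches the assumed 80% warning mark, 2 on bad input.
scrape_headroom() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    if (t <= 0) exit 2              # a timeout must be positive
    pct = int(d / t * 100)          # truncate to a whole percentage
    print pct
    exit (pct >= 80 ? 1 : 0)
  }'
}
```

With a `scrape_duration_seconds` value of 21.4 and a 25s timeout, `scrape_headroom 21.4 25` prints `85` and returns non-zero, which suggests either raising the timeout or federating fewer series.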
Check Network Connectivity
```bash
# Test basic connectivity
curl -s http://prometheus-dc1:9090/-/healthy

# Check DNS resolution
nslookup prometheus-dc1

# Test from within container
kubectl exec -it prometheus-global -- curl -s http://prometheus-dc1:9090/metrics
```
Solutions
1. Fix Federation URL Configuration
Incorrect federation configuration:
```yaml
# prometheus.yml - Global/Aggregating Prometheus
scrape_configs:
  # WRONG: Missing metrics_path and match[] parameter,
  # so this scrapes /metrics instead of /federate
  # - job_name: 'federation-dc1'
  #   honor_labels: true
  #   static_configs:
  #     - targets: ['prometheus-dc1:9090']

  # CORRECT: Proper federation configuration
  - job_name: 'federation-dc1'
    scrape_interval: 30s
    scrape_timeout: 25s
    honor_labels: true  # Keep original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # Recording rules
        - '{job="node-exporter"}'
        - '{job="kubelet"}'
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
        labels:
          datacenter: 'dc1'
```
Apply and reload:
```bash
# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Or restart
systemctl restart prometheus
```
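A reload with a broken configuration is rejected and Prometheus keeps running the old one, which is easy to miss. A minimal sketch of a safer reload, assuming `promtool` and `curl` are on the PATH and the lifecycle endpoint is enabled; the function name `reload_prometheus` and the default URL are illustrative only:

```bash
# reload_prometheus CONFIG_FILE [BASE_URL]
# Validate the config with promtool first and only then request a reload.
reload_prometheus() {
  config=$1
  base=${2:-http://localhost:9090}
  if ! promtool check config "$config"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero when the server answers with an HTTP error.
  curl -fsS -X POST "$base/-/reload"
}
```

Typical use would be `reload_prometheus /etc/prometheus/prometheus.yml`, with the path adjusted to wherever the configuration lives.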
2. Fix Timeout Issues
Federation scraping takes too long:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    scrape_interval: 60s
    scrape_timeout: 55s  # Must not exceed scrape_interval
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

Reduce data volume:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate pre-aggregated metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        # Don't federate high-cardinality raw metrics
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

3. Fix Label Conflicts
Labels getting overwritten:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true  # CRITICAL: Keeps original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
        labels:
          datacenter: 'dc1'  # Add datacenter label
    metric_relabel_configs:
      # Ensure datacenter label is set
      - target_label: datacenter
        replacement: dc1
```

Handle external labels on source Prometheus:
```yaml
# On subordinate Prometheus (prometheus-dc1)
global:
  external_labels:
    datacenter: 'dc1'
    cluster: 'production-dc1'
```

4. Fix 400/500 Errors
Server returning errors:
```bash
# Check for error details in response
# (-G with --data-urlencode avoids curl globbing on [] and {})
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode "match[]={__name__='up'}"

# Common causes:
# - Invalid match[] parameter
# - Too many matchers
# - Memory issues on source Prometheus
```
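When several selectors are configured, it helps to test them one at a time and see which one the source rejects. The following is a small sketch under the assumption that `curl` and `mktemp` are available; `federate_status` is a hypothetical helper name, and the exact error text in the response body depends on the Prometheus version.

```bash
# federate_status BASE_URL MATCHER
# Prints "<http_status> <first line of the response body>" for one selector
# and returns non-zero unless the status is 200.
federate_status() {
  body=$(mktemp) || return 1
  code=$(curl -sS -G -o "$body" -w '%{http_code}' \
    --data-urlencode "match[]=$2" "$1/federate")
  printf '%s %s\n' "$code" "$(head -n 1 "$body")"
  rm -f "$body"
  [ "$code" = 200 ]
}
```

Running `federate_status http://prometheus-dc1:9090 '{job="node-exporter"}'` for each configured matcher narrows a 400 down to the offending selector.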
Fix match parameter:
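```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    metrics_path: '/federate'
    params:
      'match[]':
        # Each entry must be an instant vector selector
        - '{job="node-exporter"}'
        - '{__name__=~"job:.*"}'
        - 'up'  # A bare metric name is a valid selector
        # NOT: '{}' (no non-empty matcher) or expressions such as 'rate(up[5m])'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

5. Handle Sample Limits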
Too many samples from federation:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    sample_limit: 100000  # Increase limit
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

Reduce federated metrics:
```yaml
# On subordinate Prometheus, create recording rules
groups:
  - name: federation_exports
    rules:
      # Pre-aggregate for federation
      - record: federation:node_cpu:rate5m
        expr: sum by (datacenter, instance) (rate(node_cpu_seconds_total[5m]))
      - record: federation:node_memory:usage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
Then federate only aggregated metrics:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"federation:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

6. Fix Authentication
Federation requiring authentication:
```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'

    # Basic auth
    basic_auth:
      username: federation_user
      password: secure_password  # or use password_file to keep the secret out of this file

    # TLS config
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key

    static_configs:
      - targets: ['prometheus-dc1:9090']
```
Verification
Test Federation Query
```bash
# Direct test of federation
curl -sG --data-urlencode "match[]={job='node-exporter'}" \
  "http://prometheus-dc1:9090/federate" | head -20

# Check global Prometheus receives data
# (--data-urlencode stops curl from stripping the {} as a glob pattern)
curl -sG --data-urlencode 'query=up{datacenter="dc1"}' \
  'http://localhost:9090/api/v1/query' | jq .
```
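Looking at the first 20 lines shows that data flows, but not how much of it there is. To see which metric names dominate a federation response, the text output can be summarised with standard tools. This is a sketch only; `federate_summary` is a hypothetical name, and it assumes the usual text exposition format with one sample per line.

```bash
# federate_summary [TOP_N]
# Reads text exposition format on stdin and prints "<count> <metric_name>"
# for the TOP_N metric names with the most series (default 10).
federate_summary() {
  awk '
    /^#/ || /^[[:space:]]*$/ { next }    # skip comment lines and blank lines
    {
      name = $0
      sub(/[{ ].*/, "", name)            # cut at the first "{" or space
      count[name]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}
```

Piping a federation response into `federate_summary 5` lists the five largest contributors, which are the first candidates for replacement by recording rules.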
Verify Labels Preserved
```promql
# Check datacenter label is present
count by (datacenter) ({__name__=~".+"})

# Check original job labels
count by (job) ({datacenter="dc1"})
```
Check Metrics Flow
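```promql
# Federation scrape success
up{job="federation-dc1"} == 1

# Samples are arriving (scrape_samples_scraped is a gauge, so compare it directly instead of using rate())
scrape_samples_scraped{job="federation-dc1"} > 0

# No failed scrapes in the last 5 minutes
min_over_time(up{job="federation-dc1"}[5m]) == 1
```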
Prevention
Add monitoring for federation:
```yaml
groups:
  - name: federation_alerts
    rules:
      - alert: FederationDown
        expr: up{job=~"federation-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Federation target {{ $labels.instance }} is down"
          description: "Cannot scrape federation endpoint for {{ $labels.job }}"

      - alert: FederationSlow
        expr: scrape_duration_seconds{job=~"federation-.*"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape is slow"
          description: "Scrape duration for {{ $labels.job }} is {{ $value }}s"

      - alert: FederationSampleLimit
        expr: scrape_samples_scraped{job=~"federation-.*"} > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation returning many samples"
          description: "{{ $labels.job }} returned {{ $value }} samples"

      - alert: FederationDataStale
        # Samples older than the default 5m lookback window drop out of instant
        # queries, so the threshold and "for" must stay well below 300s in total.
        expr: time() - timestamp(up{datacenter=~".+"}) > 120
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Federated data is stale"
          description: "No data from {{ $labels.datacenter }} for {{ $value }}s"
```
Federation Architecture Best Practices
1. Federate pre-aggregated data: Use recording rules on subordinate Prometheus instances
2. Use honor_labels: Always set `honor_labels: true`
3. Add external labels: Set `datacenter`, `cluster`, or `region` labels
4. Limit matchers: Only federate needed metrics
5. Increase timeouts: Federation often needs longer timeouts
6. Monitor federation: Alert on scrape failures and stale data