The Problem

Prometheus federation is failing to scrape metrics from subordinate Prometheus instances. You see errors like:

bash
level=error ts=2026-04-04T00:00:18.234Z caller=scrape.go:1234 component="scrape manager" target=http://prometheus-dc1:9090/federate msg="Scrape failed" err="Get \"http://prometheus-dc1:9090/federate\": context deadline exceeded"
level=warn ts=2026-04-04T00:00:19.345Z caller=scrape.go:1235 msg="Error scraping federate endpoint" err="server returned HTTP status 400 Bad Request"
level=error ts=2026-04-04T00:00:20.456Z caller=scrape.go:1236 msg="Federation returned no data"

Federation errors break global monitoring views and prevent aggregating metrics across datacenters or clusters.

Diagnosis

Check Federation Endpoint

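```bash
# Test federation endpoint directly.
# -G sends the data as a GET query string; --data-urlencode escapes the selector,
# so curl does not glob the [] and {} and the + is not decoded as a space.
curl -v -G 'http://prometheus-dc1:9090/federate' \
  --data-urlencode 'match[]={__name__=~".+"}'

# With specific matchers
curl -v -G 'http://prometheus-dc1:9090/federate' \
  --data-urlencode 'match[]={job="node-exporter"}' \
  --data-urlencode 'match[]={job="kubelet"}'

# Check response headers (discard the body, print headers to stdout)
curl -sS -G -o /dev/null -D - 'http://prometheus-dc1:9090/federate' \
  --data-urlencode 'match[]=up'
```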

Check Federation Metrics

```promql
# Federation scrape success
up{job="federation-dc1"}

# Federation scrape duration
scrape_duration_seconds{job="federation-dc1"}

# Samples received from federation
scrape_samples_scraped{job="federation-dc1"}

# Scrapes rejected for exceeding sample_limit. This counter describes the
# aggregating Prometheus itself, so it carries that server's own job label,
# not "federation-dc1".
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
```

Check Network Connectivity

```bash
# Test basic connectivity
curl -s http://prometheus-dc1:9090/-/healthy

# Check DNS resolution
nslookup prometheus-dc1

# Test from within container
kubectl exec -it prometheus-global -- curl -s http://prometheus-dc1:9090/metrics
```

Solutions

1. Fix Federation URL Configuration

Incorrect federation configuration:

```yaml
# prometheus.yml - Global/Aggregating Prometheus
scrape_configs:
  # WRONG: Missing metrics_path and match[] parameter
  # - job_name: 'federation-dc1'
  #   honor_labels: true
  #   static_configs:
  #     - targets: ['prometheus-dc1:9090']

  # CORRECT: Proper federation configuration
  - job_name: 'federation-dc1'
    scrape_interval: 30s
    scrape_timeout: 25s
    honor_labels: true  # Keep original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # Recording rules
        - '{job="node-exporter"}'
        - '{job="kubelet"}'
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
        labels:
          datacenter: 'dc1'
```

Apply and reload:

```bash
# Reload configuration (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Or restart
systemctl restart prometheus
```
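If the new file fails to load, Prometheus keeps running the previous configuration, so it can help to validate before reloading. A minimal sketch, assuming promtool is installed next to Prometheus and the lifecycle API is enabled; the function name safe_reload, the PROMTOOL override variable, and the default URL are illustrative choices, not part of Prometheus:

```bash
# Hypothetical helper: validate the config with promtool, then ask Prometheus to reload.
# Usage: safe_reload /etc/prometheus/prometheus.yml [base_url]
safe_reload() {
  config=$1
  base_url=${2:-http://localhost:9090}

  # PROMTOOL may point at a non-default promtool binary (assumption of this sketch).
  if ! "${PROMTOOL:-promtool}" check config "$config"; then
    echo "config check failed, not reloading: $config" >&2
    return 1
  fi

  # -f makes curl exit non-zero on an HTTP error status.
  curl -fsS -X POST "$base_url/-/reload"
}
```

Called as safe_reload /etc/prometheus/prometheus.yml, it stops before the reload request whenever promtool reports an error.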

2. Fix Timeout Issues

Federation scraping takes too long:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    scrape_interval: 60s
    scrape_timeout: 55s  # Must not exceed scrape_interval
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
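The interval and timeout relationship can be checked before editing the file. This is a rough sketch that only understands single-unit durations such as 55s, 2m, or 1h; the function names to_seconds and timeout_fits are hypothetical:

```bash
# Convert a single-unit duration (s, m, or h) to seconds.
to_seconds() {
  d=$1
  n=${d%[smh]}       # strip the one-letter unit
  unit=${d#"$n"}     # whatever remains is the unit
  case $n in
    ''|*[!0-9]*) echo "unsupported duration: $d" >&2; return 1 ;;
  esac
  case $unit in
    s) echo "$n" ;;
    m) echo $((n * 60)) ;;
    h) echo $((n * 3600)) ;;
    *) echo "unsupported duration: $d" >&2; return 1 ;;
  esac
}

# Succeed when the timeout does not exceed the interval.
# Usage: timeout_fits <scrape_interval> <scrape_timeout>
timeout_fits() {
  interval=$(to_seconds "$1") || return 2
  timeout=$(to_seconds "$2") || return 2
  [ "$timeout" -le "$interval" ]
}
```

For the job above, timeout_fits 60s 55s succeeds, while timeout_fits 30s 1m fails.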

Reduce data volume:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate pre-aggregated metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        # Don't federate high-cardinality raw metrics
    static_configs:
      - targets: ['prometheus-dc1:9090']

3. Fix Label Conflicts

Labels getting overwritten:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true  # CRITICAL: Keeps original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
        labels:
          datacenter: 'dc1'  # Add datacenter label
    metric_relabel_configs:
      # Force datacenter=dc1 on every federated series, overwriting any
      # datacenter value exposed by the source
      - target_label: datacenter
        replacement: dc1

Set external labels on the source Prometheus; the /federate endpoint attaches them to every series it exposes:

yaml
# On subordinate Prometheus (prometheus-dc1)
global:
  external_labels:
    datacenter: 'dc1'
    cluster: 'production-dc1'
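To confirm the source really exposes these labels, one option is to scan its /federate output for series that lack one. A rough sketch, assuming the comma-separated label layout of the text exposition format; series_missing_label is a hypothetical name, and a label value that itself contains the search text would fool it:

```bash
# Read /federate text output on stdin and print every sample line
# that does not carry the given label.
# Usage: ... | series_missing_label datacenter
series_missing_label() {
  label=$1
  # Drop comments and blank lines, then keep only samples without label="...".
  grep -v -e '^#' -e '^[[:space:]]*$' | grep -v -e "[{,]${label}=\""
}
```

For example, curl -s -G 'http://prometheus-dc1:9090/federate' --data-urlencode 'match[]=up' | series_missing_label datacenter prints nothing when every series is labeled. Note that grep then exits non-zero, which matters under set -e.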

4. Fix 400/500 Errors

Server returning errors:

```bash
# Check for error details in response
curl -v -G 'http://prometheus-dc1:9090/federate' \
  --data-urlencode 'match[]={__name__="up"}'

# Common causes:
# - Invalid match[] parameter
# - Too many matchers
# - Memory issues on source Prometheus
```

Fix match parameter:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    metrics_path: '/federate'
    params:
      'match[]':
        # Use proper label matchers
        - '{job="node-exporter"}'
        - '{__name__=~"job:.*"}'
        # NOT: 'rate(up[5m])' or 'up[5m]' (each match[] must be an instant
        # vector selector; a bare metric name such as 'up' is valid)

5. Handle Sample Limits

Too many samples from federation:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    sample_limit: 100000  # Raise if scrapes are being rejected; 0 (the default) means no limit
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
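Before raising the limit, it is worth seeing which metrics account for the volume. A small sketch that counts samples per metric name in /federate text output; samples_per_metric is a hypothetical name:

```bash
# Read /federate text output on stdin and print "<count> <metric>"
# per metric name, largest first.
samples_per_metric() {
  awk '
    /^#/ || NF == 0 { next }      # skip comments and blank lines
    {
      name = $1
      sub(/[{].*/, "", name)      # strip the label set, keep the metric name
      count[name]++
    }
    END { for (name in count) print count[name], name }
  ' | sort -rn
}
```

For example, curl -s -G 'http://prometheus-dc1:9090/federate' --data-urlencode 'match[]={job="node-exporter"}' | samples_per_metric | head lists the heaviest metrics for that matcher.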

Reduce federated metrics:

```yaml
# On subordinate Prometheus, create recording rules
groups:
  - name: federation_exports
    rules:
      # Pre-aggregate for federation
      - record: federation:node_cpu:rate5m
        expr: sum by (datacenter, instance) (rate(node_cpu_seconds_total[5m]))

      - record: federation:node_memory:usage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```

Then federate only aggregated metrics:

yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"federation:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']

6. Fix Authentication

Federation requiring authentication:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'

    # Basic auth
    basic_auth:
      username: federation_user
      password: secure_password

    # TLS config
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key

    static_configs:
      - targets: ['prometheus-dc1:9090']
```

Verification

Test Federation Query

```bash
# Direct test of federation
curl -s -G 'http://prometheus-dc1:9090/federate' \
  --data-urlencode 'match[]={job="node-exporter"}' | head -20

# Check global Prometheus receives data
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{datacenter="dc1"}' | jq .
```

Verify Labels Preserved

```promql
# Check datacenter label is present
count by (datacenter) ({__name__=~".+"})

# Check original job labels
count by (job) ({datacenter="dc1"})
```

Check Metrics Flow

```promql
# Federation scrape success
up{job="federation-dc1"} == 1

# Samples scraped
scrape_samples_scraped{job="federation-dc1"}

# Samples still present after metric relabeling (should be non-zero)
scrape_samples_post_metric_relabeling{job="federation-dc1"} > 0
```
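A successful scrape does not prove the data is fresh, since federated samples keep the timestamps assigned by the source. A minimal sketch that reports the age of the newest sample in /federate text output, assuming each sample line ends with a millisecond timestamp; newest_sample_age is a hypothetical name, and the optional argument overrides the current time:

```bash
# Read /federate text output on stdin and print the age in seconds
# of the newest sample.
# Usage: ... | newest_sample_age [now_epoch_seconds]
newest_sample_age() {
  now=${1:-$(date +%s)}
  awk -v now="$now" '
    /^#/ || NF < 3 { next }       # skip comments and lines without a timestamp
    { ts = $NF / 1000; if (ts > newest) newest = ts }
    END {
      if (newest == 0) { print "no samples" > "/dev/stderr"; exit 1 }
      printf "%d\n", now - newest
    }
  '
}
```

For example, curl -s -G 'http://prometheus-dc1:9090/federate' --data-urlencode 'match[]=up' | newest_sample_age prints a small number when the source is scraping normally.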

Prevention

Add monitoring for federation:

```yaml
groups:
  - name: federation_alerts
    rules:
      - alert: FederationDown
        expr: up{job=~"federation-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Federation target {{ $labels.instance }} is down"
          description: "Cannot scrape federation endpoint for {{ $labels.job }}"

      - alert: FederationSlow
        expr: scrape_duration_seconds{job=~"federation-.*"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape is slow"
          description: "Scrape duration for {{ $labels.job }} is {{ $value }}s"

      - alert: FederationSampleLimit
        expr: scrape_samples_scraped{job=~"federation-.*"} > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation returning many samples"
          description: "{{ $labels.job }} returned {{ $value }} samples"

      - alert: FederationDataStale
        # Samples older than the default 5m lookback window drop out of query
        # results, so the threshold plus "for" must stay under 5 minutes.
        expr: time() - timestamp(up{datacenter=~".+"}) > 120
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Federated data is stale"
          description: "No data from {{ $labels.datacenter }} for {{ $value }}s"
```

Federation Architecture Best Practices

  1. Federate pre-aggregated data: Use recording rules on subordinate Prometheus instances
  2. Use honor_labels: Always set honor_labels: true
  3. Add external labels: Set datacenter, cluster, or region labels
  4. Limit matchers: Only federate needed metrics
  5. Increase timeouts: Federation often needs longer timeouts
  6. Monitor federation: Alert on scrape failures and stale data