Fix Prometheus Federation Error

The Problem

Prometheus federation is failing to scrape metrics from subordinate Prometheus instances. You see errors like:

```text
level=error ts=2026-04-04T00:00:18.234Z caller=scrape.go:1234 component="scrape manager" target=http://prometheus-dc1:9090/federate msg="Scrape failed" err="Get \"http://prometheus-dc1:9090/federate\": context deadline exceeded"
level=warn ts=2026-04-04T00:00:19.345Z caller=scrape.go:1235 msg="Error scraping federate endpoint" err="server returned HTTP status 400 Bad Request"
level=error ts=2026-04-04T00:00:20.456Z caller=scrape.go:1236 msg="Federation returned no data"
```

Federation errors break global monitoring views and prevent aggregating metrics across datacenters or clusters.

Diagnosis

Check Federation Endpoint

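```bash
# Test federation endpoint directly.
# -G sends the --data-urlencode values as a URL-encoded query string, which
# also keeps curl from treating [] and {} as URL glob patterns.
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode 'match[]={__name__=~".+"}'

# With specific matchers
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode 'match[]={job="node-exporter"}' \
  --data-urlencode 'match[]={job="kubelet"}'

# Check response headers only (body is discarded)
curl -s -G -o /dev/null -D - "http://prometheus-dc1:9090/federate" \
  --data-urlencode 'match[]=up'
```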
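Selectors passed in `match[]` contain braces, quotes, and brackets, all of which need percent-encoding when they are placed in a URL by hand. The following is a minimal Python sketch of building such a URL; the helper name `federate_url` is hypothetical and only the standard library is used.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: Iterable[str]) -> str:
    """Build a /federate URL with every match[] selector percent-encoded."""
    # A list of pairs lets the same key ("match[]") repeat once per selector.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Host name taken from the examples in this guide.
    print(federate_url("http://prometheus-dc1:9090", ['{job="node-exporter"}', "up"]))
```

The resulting string can be pasted into a browser or passed to any HTTP client without further quoting.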

Check Federation Metrics

```promql
# Federation scrape success
up{job="federation-dc1"}

# Federation scrape duration
scrape_duration_seconds{job="federation-dc1"}

# Samples received from federation
scrape_samples_scraped{job="federation-dc1"}

# Scrapes rejected for exceeding sample_limit. The global Prometheus reports
# this counter about itself, so it carries no federation job label.
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
```

Check Network Connectivity

```bash
# Test basic connectivity
curl -s http://prometheus-dc1:9090/-/healthy

# Check DNS resolution
nslookup prometheus-dc1

# Test from within container
kubectl exec -it prometheus-global -- curl -s http://prometheus-dc1:9090/metrics
```

Solutions

1. Fix Federation URL Configuration

Incorrect federation configuration:

```yaml
# prometheus.yml - Global/Aggregating Prometheus
scrape_configs:
  # WRONG: Missing match[] parameter
  # - job_name: 'federation-dc1'
  #   honor_labels: true
  #   static_configs:
  #     - targets: ['prometheus-dc1:9090']

  # CORRECT: Proper federation configuration
  - job_name: 'federation-dc1'
    scrape_interval: 30s
    scrape_timeout: 25s
    honor_labels: true  # Keep original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # Recording rules
        - '{job="node-exporter"}'
        - '{job="kubelet"}'
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
        labels:
          datacenter: 'dc1'
```

Apply and reload:

```bash
# Reload configuration
curl -X POST http://localhost:9090/-/reload

# Or restart
systemctl restart prometheus
```

2. Fix Timeout Issues

Federation scraping takes too long:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    scrape_interval: 60s
    scrape_timeout: 55s  # Must not exceed scrape_interval
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```
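Prometheus rejects a scrape job whose `scrape_timeout` is longer than its `scrape_interval`, so it can help to check the pair before reloading. The sketch below is a simplified illustration in Python; the function names `parse_duration` and `timeout_fits` are hypothetical, and the parser covers only the plain unit suffixes (`ms`, `s`, `m`, `h`, `d`, `w`, `y`).

```python
import re

# Seconds per unit for Prometheus-style duration strings.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|[smhdwy])")


def parse_duration(text: str) -> float:
    """Convert a duration such as '55s' or '1m30s' to seconds."""
    position, total = 0, 0.0
    for match in _PART.finditer(text):
        if match.start() != position:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def timeout_fits(scrape_interval: str, scrape_timeout: str) -> bool:
    """Return True when the timeout is no longer than the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


if __name__ == "__main__":
    print(timeout_fits("60s", "55s"))
```

`promtool check config` performs the authoritative validation; this check only mirrors the one rule discussed here.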

Reduce data volume:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate pre-aggregated metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        # Don't federate high-cardinality raw metrics
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

3. Fix Label Conflicts

Labels getting overwritten:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true  # CRITICAL: Keeps original labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
        labels:
          datacenter: 'dc1'  # Add datacenter label
    metric_relabel_configs:
      # Ensure datacenter label is set
      - target_label: datacenter
        replacement: dc1
```

Handle external labels on source Prometheus:

```yaml
# On subordinate Prometheus (prometheus-dc1)
global:
  external_labels:
    datacenter: 'dc1'
    cluster: 'production-dc1'
```
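To make the effect of `honor_labels` concrete, here is a simplified Python model of how label conflicts are resolved when a scraped sample and the scrape target both define the same label. It is a sketch of the documented behavior, not Prometheus code, and the function name `merge_labels` is hypothetical.

```python
from typing import Dict


def merge_labels(scraped: Dict[str, str], target: Dict[str, str], honor_labels: bool) -> Dict[str, str]:
    """Combine a scraped sample's labels with the labels attached to the target."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            # No conflict: the target label is simply added.
            merged[name] = value
        elif not honor_labels:
            # Conflict with honor_labels: false. The target label wins and the
            # scraped value is kept under an "exported_" prefix.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # Conflict with honor_labels: true. The scraped value is left in place.
    return merged


if __name__ == "__main__":
    scraped = {"job": "node-exporter", "instance": "node1:9100"}
    target = {"job": "federation-dc1", "instance": "prometheus-dc1:9090", "datacenter": "dc1"}
    print(merge_labels(scraped, target, honor_labels=True))
    print(merge_labels(scraped, target, honor_labels=False))
```

With `honor_labels: false`, every federated series ends up with `job="federation-dc1"` and the original job moves to `exported_job`, which is why dashboards that filter on the original `job` value stop matching.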

4. Fix 400/500 Errors

Server returning errors:

```bash
# Check for error details in response
# (-G with --data-urlencode avoids curl globbing on [] and {})
curl -v -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode 'match[]={__name__="up"}'

# Common causes:
# - Invalid match[] parameter
# - Too many matchers
# - Memory issues on source Prometheus
```

Fix match parameter:

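```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    metrics_path: '/federate'
    params:
      'match[]':
        # Each entry must be an instant vector selector
        - '{job="node-exporter"}'
        - '{__name__=~"job:.*"}'
        - 'up'  # A bare metric name is a valid selector
        # NOT: 'sum(up)' or 'rate(up[5m])' (expressions are rejected)
        # NOT: '{__name__=~".*"}' (needs a matcher that cannot match the empty string)
```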

5. Handle Sample Limits

Too many samples from federation:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    sample_limit: 100000  # Increase limit
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```
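Before picking a value for `sample_limit`, it helps to know roughly how many samples a `/federate` response contains. The Python sketch below counts sample lines in text-format output that has already been saved or fetched; the function names are hypothetical. Prometheus applies the limit after metric relabeling, so this count is only an upper-bound estimate when relabeling drops series.

```python
def count_samples(exposition: str) -> int:
    """Count sample lines in text-format output, skipping blanks and # comments."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def within_limit(exposition: str, sample_limit: int) -> bool:
    """Return True if the response fits the limit; 0 means no limit is set."""
    return sample_limit == 0 or count_samples(exposition) <= sample_limit


if __name__ == "__main__":
    body = (
        "# TYPE up untyped\n"
        'up{job="node-exporter",instance="node1:9100"} 1 1712188818234\n'
        'up{job="kubelet",instance="node1:10250"} 1 1712188818234\n'
    )
    print(count_samples(body), within_limit(body, 100000))
```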

Reduce federated metrics:

```yaml
# On subordinate Prometheus, create recording rules
groups:
  - name: federation_exports
    rules:
      # Pre-aggregate for federation
      - record: federation:node_cpu:rate5m
        expr: sum by (datacenter, instance) (rate(node_cpu_seconds_total[5m]))

      - record: federation:node_memory:usage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```

Then federate only aggregated metrics:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"federation:.*"}'
    static_configs:
      - targets: ['prometheus-dc1:9090']
```

6. Fix Authentication

Federation requiring authentication:

```yaml
scrape_configs:
  - job_name: 'federation-dc1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'

    # Basic auth
    basic_auth:
      username: federation_user
      password: secure_password

    # TLS config
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key

    static_configs:
      - targets: ['prometheus-dc1:9090']
```

Verification

Test Federation Query

```bash
# Direct test of federation
# (-G with --data-urlencode avoids curl globbing on [] and {})
curl -s -G "http://prometheus-dc1:9090/federate" \
  --data-urlencode 'match[]={job="node-exporter"}' | head -20

# Check global Prometheus receives data
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=up{datacenter="dc1"}' | jq .
```
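When this check runs from a script rather than by eye, the JSON returned by `/api/v1/query` can be inspected directly. The following Python sketch counts the series in a response body that has already been fetched; the function name `series_count` is hypothetical.

```python
import json


def series_count(response_body: str) -> int:
    """Return how many series an /api/v1/query response contains."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        # Error responses carry a human-readable message in "error".
        raise ValueError(payload.get("error", "query failed"))
    return len(payload["data"]["result"])


if __name__ == "__main__":
    body = (
        '{"status":"success","data":{"resultType":"vector","result":['
        '{"metric":{"__name__":"up","datacenter":"dc1"},"value":[1712188818.234,"1"]}'
        "]}}"
    )
    print(series_count(body))
```

A count of zero for `up{datacenter="dc1"}` means the global Prometheus has no recent federated samples carrying that label.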

Verify Labels Preserved

```promql
# Check datacenter label is present
count by (datacenter) ({__name__=~".+"})

# Check original job labels
count by (job) ({datacenter="dc1"})
```

Check Metrics Flow

```promql
# Federation scrape success
up{job="federation-dc1"} == 1

# Samples scraped
scrape_samples_scraped{job="federation-dc1"}

# Last scrape returned data (scrape_samples_scraped is a gauge, so compare
# it directly instead of taking a rate)
scrape_samples_scraped{job="federation-dc1"} > 0
```

Prevention

Add monitoring for federation:

```yaml
groups:
  - name: federation_alerts
    rules:
      - alert: FederationDown
        expr: up{job=~"federation-.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Federation target {{ $labels.instance }} is down"
          description: "Cannot scrape federation endpoint for {{ $labels.job }}"

      - alert: FederationSlow
        expr: scrape_duration_seconds{job=~"federation-.*"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation scrape is slow"
          description: "Scrape duration for {{ $labels.job }} is {{ $value }}s"

      - alert: FederationSampleLimit
        expr: scrape_samples_scraped{job=~"federation-.*"} > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federation returning many samples"
          description: "{{ $labels.job }} returned {{ $value }} samples"

      - alert: FederationDataStale
        expr: time() - timestamp(up{datacenter=~".+"}) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Federated data is stale"
          description: "No data from {{ $labels.datacenter }} for {{ $value }}s"
```

Federation Architecture Best Practices

1. Federate pre-aggregated data: Use recording rules on subordinate Prometheus instances
2. Use honor_labels: Always set honor_labels: true
3. Add external labels: Set datacenter, cluster, or region labels
4. Limit matchers: Only federate needed metrics
5. Increase timeouts: Federation often needs longer timeouts
6. Monitor federation: Alert on scrape failures and stale data
