Introduction

Prometheus remote write backlog occurs when the local Prometheus server cannot ship samples to the remote storage backend as fast as it ingests them, so the send queue builds up. This leads to memory exhaustion, dropped samples, and monitoring gaps. Remote write is critical for long-term storage, multi-cluster aggregation, and global observability. When the backlog grows unchecked, Prometheus may crash or stop scraping, creating blind spots in monitoring. Federation failures similarly prevent metric aggregation across Prometheus instances, breaking centralized dashboards and alerts.

Symptoms

  • Prometheus logs show remote write backlog or dropped samples warnings
  • prometheus_remote_storage_highest_timestamp_in_seconds lagging behind current time
  • prometheus_remote_storage_queue_highest_sent_timestamp_seconds not advancing
  • Metrics show prometheus_remote_storage_samples_dropped_total increasing
  • Remote storage (Thanos, Cortex, M3) shows ingestion gaps
  • Federation targets show DOWN status despite being healthy
  • Issue appears after traffic increase, new metrics added, or network degradation

Common Causes

  • Network latency between Prometheus and remote storage exceeds write deadline
  • Remote storage backend throttling ingestion (rate limits, capacity limits)
  • Insufficient Prometheus resources (memory, CPU) for queue processing
  • Series cardinality explosion (too many unique label combinations)
  • Remote write queue configuration too conservative
  • SSL/TLS handshake overhead for each batch
  • Federation scrape timeout too short for large metric sets
  • Remote storage schema changes or API incompatibility

Step-by-Step Fix

### 1. Check remote write queue metrics

Analyze queue health with Prometheus metrics:

```promql
# Check queue size (should be near 0)
prometheus_remote_storage_queue_length

# Check samples pending send
prometheus_remote_storage_shards_to_send

# Check dropped samples
rate(prometheus_remote_storage_samples_dropped_total[5m])

# Check send latency
prometheus_remote_storage_sent_batch_duration_seconds

# Check timestamp lag
prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds

# Ideal: lag < 30 seconds
# Warning: lag > 60 seconds
# Critical: lag > 300 seconds
```
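The lag thresholds above are easy to encode in a watchdog script. A minimal sketch of the classifier only — in practice the lag value would come from the Prometheus HTTP API:

```python
def lag_severity(lag_seconds: float) -> str:
    """Classify remote-write timestamp lag using the runbook thresholds:
    <= 60s ok, > 60s warning, > 300s critical."""
    if lag_seconds > 300:
        return "critical"
    if lag_seconds > 60:
        return "warning"
    return "ok"

print(lag_severity(15))   # ok
print(lag_severity(90))   # warning
print(lag_severity(600))  # critical
```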

Query current state:

```bash
# Query Prometheus API
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=prometheus_remote_storage_queue_length" | jq

# Check a specific queue (if multiple remotes are configured)
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=prometheus_remote_storage_queue_length{queue=\"thanos\"}" | jq
```

### 2. Check remote write configuration

Verify remote write settings:

```yaml
# prometheus.yml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 30s
    write_relabel_configs: []
    metadata_config:
      send: true
      send_interval: 1m
    queue_config:
      capacity: 10000             # Samples per shard
      max_shards: 50              # Max parallel senders
      min_shards: 1               # Minimum shards
      max_samples_per_send: 5000  # Batch size
      batch_send_deadline: 5s     # Max time before send
      max_retries: 3              # Retry count
      retry_on_http_429: true     # Handle rate limits
```

Tune queue configuration:

```yaml
# For high-throughput environments
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 60s             # Increase timeout
    queue_config:
      capacity: 25000               # More buffer
      max_shards: 100               # More parallelism
      min_shards: 10                # Faster ramp-up
      max_samples_per_send: 10000   # Larger batches
      batch_send_deadline: 10s      # More time per batch
      max_retries: 5
      retry_on_http_429: true
```

Key parameters:

  • capacity: Samples buffered per shard (increase for burst tolerance)
  • max_shards: Parallel senders (increase for throughput)
  • max_samples_per_send: Batch size (larger = more efficient but higher latency)
  • batch_send_deadline: Time budget per send (must exceed network RTT + processing)
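To sanity-check a queue_config against your ingest rate, you can estimate the drain ceiling these parameters imply. A rough sketch — it assumes full batches and that every batch waits out its full deadline, so real throughput is usually higher:

```python
def worst_case_throughput(max_shards: int, max_samples_per_send: int,
                          batch_send_deadline_s: float) -> float:
    """Samples/s the queue can still drain if every batch waits its deadline:
    shards * batch_size / deadline. Below your ingest rate = backlog grows."""
    return max_shards * max_samples_per_send / batch_send_deadline_s

# Conservative config: 50 shards, 5000 samples/batch, 5s deadline
print(worst_case_throughput(50, 5000, 5))     # 50000.0 samples/s
# Tuned config: 100 shards, 10000 samples/batch, 10s deadline
print(worst_case_throughput(100, 10000, 10))  # 100000.0 samples/s
```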

### 3. Check network connectivity and latency

Network issues cause send delays:

```bash
# Check latency to remote storage
ping -c 10 thanos-receive.example.com

# Check TCP connection time
time nc -zv thanos-receive.example.com 19291

# Check for packet loss
mtr -n -c 20 thanos-receive.example.com

# Check TLS handshake time
time openssl s_client -connect thanos-receive.example.com:19291 </dev/null

# Check bandwidth availability (requires an iperf3 server running on the target)
iperf3 -c thanos-receive.example.com -t 30
```

If latency is high:

```yaml
# Increase timeout to accommodate latency
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 120s  # For high-latency links

    # Or use relabeling to reduce data volume
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop  # Drop verbose Go and process metrics
```

### 4. Check series cardinality

High cardinality causes backlog:

```bash
# Check series count per metric
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

# Or via PromQL
count({__name__=~".+"}) by (__name__)

# Check top-cardinality metrics (entries have "name" and "value" fields)
curl -s http://localhost:9090/api/v1/status/tsdb | jq \
  '.data.seriesCountByMetricName | sort_by(-.value) | .[0:20]'

# Count metric names matching a prefix
curl -s http://localhost:9090/api/v1/label/__name__/values | jq \
  '.data | map(select(startswith("kube_"))) | length'
```

Reduce cardinality with relabeling:

```yaml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    write_relabel_configs:
      # Drop high-cardinality pod UID
      - source_labels: [pod]
        regex: '.*'
        replacement: 'dropped'
        target_label: pod
        action: replace

      # Drop verbose metrics
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop

      # Keep only production metrics
      - source_labels: [environment]
        regex: 'production'
        action: keep

      # Aggregate instance labels
      - source_labels: [instance]
        regex: '(.*):.*'
        replacement: '${1}'
        target_label: instance
        action: replace
```
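It pays to verify a relabel regex locally before deploying it. A sketch of the instance rule above — note Python's re module is close to, but not identical to, the RE2 engine Prometheus uses, and Prometheus anchors the regex fully:

```python
import re

def relabel_instance(value: str) -> str:
    """Mimic: regex '(.*):.*' with replacement '${1}' on the instance label,
    i.e. strip the port so scrapes of one host collapse into one value."""
    m = re.fullmatch(r'(.*):.*', value)  # fullmatch = Prometheus's implicit ^...$
    return m.group(1) if m else value

print(relabel_instance("10.0.0.5:9100"))  # 10.0.0.5
print(relabel_instance("nodename"))       # nodename (no port: unchanged)
```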

### 5. Check remote storage backend health

Verify ingestion capacity:

```bash
# Thanos receive check
curl -s http://thanos-receive:19291/-/healthy
curl -s http://thanos-receive:19291/-/ready

# Cortex check
curl -s http://cortex:8080/ready

# M3DB check
curl -s http://m3db-coordinator:7201/health

# Check ingestion rate
curl -s http://thanos-receive:19291/metrics | grep thanos_receive_ingested_samples

# Check storage capacity
curl -s http://thanos-receive:19291/metrics | grep thanos_store_bucket_objects_size_bytes
```

Check for backend throttling:

```bash
# Check for failed sends (429 Too Many Requests responses count here)
curl -s http://prometheus:9090/metrics | grep prometheus_remote_storage_samples_failed_total

# Check rate limit headers
curl -I https://thanos-receive:19291/api/v1/receive \
  -X POST -d '' 2>/dev/null | grep -i "x-ratelimit"

# If rate limited, reduce write rate or increase backend capacity
```
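Prometheus handles 429s itself when retry_on_http_429 is set, backing off between attempts. The shape of that behavior can be sketched — illustrative only; the real client derives its delays from the queue_config backoff settings, not these numbers:

```python
def backoff_schedule(min_backoff_s: float = 1.0, max_backoff_s: float = 30.0,
                     attempts: int = 5) -> list[float]:
    """Doubling backoff clamped to a maximum, the usual shape for
    retrying sends after a rate-limited or failed response."""
    delays, delay = [], min_backoff_s
    for _ in range(attempts):
        delays.append(min(delay, max_backoff_s))
        delay *= 2
    return delays

print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```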

### 6. Check Prometheus resource allocation

Insufficient resources cause processing delays:

```bash
# Check Prometheus memory usage
kubectl top pod prometheus-k8s-0 -n monitoring

# Check Prometheus flags
kubectl exec prometheus-k8s-0 -n monitoring -- ps aux | grep prometheus

# Key flags to look for:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
#   --query.max-concurrency=50
#   --web.read-timeout=5m

# Check Go runtime metrics
curl -s http://localhost:9090/metrics | grep -E "go_gc_duration|go_memstats"
```

Increase resources:

```yaml
# Kubernetes resource limits
resources:
  requests:
    memory: 4Gi
    cpu: 2
  limits:
    memory: 8Gi
    cpu: 4

# Prometheus flags for high-throughput
args:
  - --storage.tsdb.max-block-duration=4h
  - --storage.tsdb.min-block-duration=2h
  - --storage.tsdb.retention.time=15d
  - --query.max-concurrency=100
```

### 7. Check federation configuration

Federation scrape issues:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-pods"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-federation:9090'
    scrape_timeout: 30s
    scrape_interval: 30s
```

Optimize federation:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate recording rules and essential metrics
        - '{job="kubernetes-pods"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"kube_pod_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus-federation:9090'
    scrape_timeout: 60s    # Increase for large metric sets
    scrape_interval: 60s   # Reduce frequency if overloaded
    sample_limit: 100000   # Prevent runaway scrapes
```

### 8. Implement write relabeling for data reduction

Reduce data volume before sending:

```yaml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    write_relabel_configs:
      # Drop debug metrics
      - source_labels: [level]
        regex: 'debug'
        action: drop

      # Keep only critical namespaces
      - source_labels: [namespace]
        regex: 'production|staging'
        action: keep

      # Drop per-container series (keep pod-level aggregates)
      - source_labels: [container]
        regex: '.+'
        action: drop

      # Aggregate short-lived job names
      - source_labels: [job]
        regex: '.*-.*-[a-f0-9]{5}'
        replacement: 'aggregated-job'
        target_label: job
        action: replace

      # Drop probe and scrape meta-metrics
      - source_labels: [__name__]
        regex: 'probe_.*|scrape_.*'
        action: drop
```
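The job-aggregation rule above is worth testing against real job names before rollout. A quick sketch — the sample names are made up, and Python's re only approximates Prometheus's fully-anchored RE2 matching:

```python
import re

# Short-lived jobs carrying a 5-char hex hash suffix, per the relabel rule
HASH_JOB = re.compile(r'.*-.*-[a-f0-9]{5}')

def aggregate_job(job: str) -> str:
    """Collapse hash-suffixed job names into a single 'aggregated-job' value."""
    return 'aggregated-job' if HASH_JOB.fullmatch(job) else job

print(aggregate_job("cleanup-cron-a3f9e"))  # aggregated-job
print(aggregate_job("api-server"))          # api-server (unchanged)
```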

### 9. Set up backlog monitoring and alerting

Alert on backlog growth:

```yaml
# Prometheus alerting rules
groups:
  - name: remote_write
    rules:
      - alert: PrometheusRemoteWriteBacklog
        expr: prometheus_remote_storage_queue_length > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus remote write backlog growing"
          description: "Queue length is {{ $value }}, samples may be dropped"

      - alert: PrometheusRemoteWriteDroppingSamples
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus dropping remote write samples"
          description: "{{ $value }} samples/s being dropped"

      - alert: PrometheusRemoteWriteTimestampLag
        expr: |
          prometheus_remote_storage_highest_timestamp_in_seconds -
          prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus remote write timestamp lag exceeds 5 minutes"
          description: "Lag is {{ $value }} seconds"
```

### 10. Enable debug logging for remote write

Diagnose send failures:

```yaml
# Prometheus's log level is set via command-line flags, not in prometheus.yml
args:
  - --log.level=debug
  - --log.format=logfmt
```

Check logs:

```bash
# Filter remote write logs
kubectl logs prometheus-k8s-0 -n monitoring | grep -i "remote_write"

# Look for specific errors
kubectl logs prometheus-k8s-0 -n monitoring | grep -E "failed|error|timeout|backlog"

# Note: Prometheus has no runtime log-level toggle; restart it with
# --log.level=debug to get verbose output, then revert when done
```

Prevention

  • Size remote write queue for 2x normal throughput
  • Monitor queue length and timestamp lag continuously
  • Implement write relabeling to reduce cardinality
  • Set remote_timeout to 3x p99 network RTT
  • Use retry_on_http_429 for rate-limited backends
  • Configure sample_limit on federation scrapes
  • Test remote write failover scenarios regularly
  • Document backend capacity limits and scaling procedures

Common Error Messages

  • **Server returned HTTP 429**: Remote storage rate limiting
  • **Server returned HTTP 503**: Remote storage unavailable
  • **Context deadline exceeded**: Write timeout too short
  • **Connection refused**: Network connectivity lost