## Introduction
Prometheus remote write backlog occurs when the local Prometheus server cannot send metrics to the remote storage backend fast enough, causing a queue to build up. This leads to memory exhaustion, samples being dropped, and monitoring gaps. Remote write is critical for long-term storage, multi-cluster aggregation, and global observability. When backlog grows unchecked, Prometheus may crash or stop scraping, creating blind spots in monitoring. Federation failures similarly prevent metric aggregation across Prometheus instances, breaking centralized dashboards and alerts.
## Symptoms
- Prometheus logs show `remote write backlog` or `dropped samples` warnings
- `prometheus_remote_storage_highest_timestamp_in_seconds` lagging behind the current time
- `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` not advancing
- `prometheus_remote_storage_samples_dropped_total` increasing
- Remote storage (Thanos, Cortex, M3) shows ingestion gaps
- Federation targets show `DOWN` status despite being healthy
- Issue appears after traffic increase, new metrics added, or network degradation
## Common Causes
- Network latency between Prometheus and remote storage exceeds write deadline
- Remote storage backend throttling ingestion (rate limits, capacity limits)
- Insufficient Prometheus resources (memory, CPU) for queue processing
- Series cardinality explosion (too many unique label combinations)
- Remote write queue configuration too conservative
- SSL/TLS handshake overhead for each batch
- Federation scrape timeout too short for large metric sets
- Remote storage schema changes or API incompatibility
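Several of these causes can be distinguished from Prometheus's own remote write metrics before digging deeper. A few hedged starting points (metric names are from recent Prometheus 2.x releases and vary slightly across versions):

```promql
# Sends rejected by the backend (throttling, capacity limits)
rate(prometheus_remote_storage_samples_failed_total[5m])

# Transient network errors show up as retries rather than failures
rate(prometheus_remote_storage_samples_retried_total[5m])

# Shard count pinned at its ceiling suggests max_shards is the bottleneck
prometheus_remote_storage_shards >= prometheus_remote_storage_shards_max
```

If failures dominate, look at the backend first (causes 2 and 7); if retries dominate, look at the network (causes 1 and 5); if shards are maxed out, look at queue tuning and cardinality (causes 3, 4, and 6).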
## Step-by-Step Fix
### 1. Check remote write queue metrics
Analyze queue health with Prometheus metrics:
```promql
# Check queue size (should be near 0)
prometheus_remote_storage_queue_length

# Check samples pending send
prometheus_remote_storage_shards_to_send

# Check dropped samples
rate(prometheus_remote_storage_samples_dropped_total[5m])

# Check send latency
prometheus_remote_storage_sent_batch_duration_seconds

# Check timestamp lag
prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds

# Ideal:    lag < 30 seconds
# Warning:  lag > 60 seconds
# Critical: lag > 300 seconds
```
Query current state:
```bash
# Query Prometheus API
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=prometheus_remote_storage_queue_length" | jq

# Check multiple queues (if configured)
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode "query=prometheus_remote_storage_queue_length{queue=\"thanos\"}" | jq
```
### 2. Check remote write configuration
Verify remote write settings:
```yaml
# prometheus.yml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 30s
    write_relabel_configs: []
    metadata_config:
      send: true
      send_interval: 1m
    queue_config:
      capacity: 10000             # Samples per shard
      max_shards: 50              # Max parallel senders
      min_shards: 1               # Minimum shards
      max_samples_per_send: 5000  # Batch size
      batch_send_deadline: 5s     # Max time before send
      min_backoff: 30ms           # Initial retry backoff
      max_backoff: 5s             # Maximum retry backoff
      retry_on_http_429: true     # Handle rate limits
```
Tune queue configuration:
```yaml
# For high-throughput environments
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 60s             # Increase timeout
    queue_config:
      capacity: 25000               # More buffer
      max_shards: 100               # More parallelism
      min_shards: 10                # Faster ramp-up
      max_samples_per_send: 10000   # Larger batches
      batch_send_deadline: 10s      # More time per batch
      min_backoff: 100ms            # Retry backoff window
      max_backoff: 10s
      retry_on_http_429: true
```
Key parameters:
- `capacity`: Samples buffered per shard (increase for burst tolerance)
- `max_shards`: Parallel senders (increase for throughput)
- `max_samples_per_send`: Batch size (larger is more efficient but adds latency)
- `batch_send_deadline`: Time budget per send (must exceed network RTT plus processing)
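Together these parameters bound the sustained send rate: roughly `max_shards × max_samples_per_send` samples per send round. As a back-of-the-envelope check (the numbers below are illustrative assumptions, not recommendations), the rate the queue can sustain if every send round takes about `batch_send_deadline` is:

```shell
#!/bin/sh
# Rough sustained-throughput estimate: each shard ships one full batch
# per batch_send_deadline. All values below are illustrative.
max_shards=100
max_samples_per_send=10000
batch_send_deadline_s=10

echo $(( max_shards * max_samples_per_send / batch_send_deadline_s ))
# samples/second the queue can sustain under these assumptions;
# if the ingest rate exceeds this, the backlog can only grow
```

Real sends are usually faster than the deadline, so treat this as a conservative lower bound when comparing against your ingest rate.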
### 3. Check network connectivity and latency
Network issues cause send delays:
```bash
# Check latency to remote storage
ping -c 10 thanos-receive.example.com

# Check TCP connection time
time nc -zv thanos-receive.example.com 19291

# Check for packet loss
mtr -n -c 20 thanos-receive.example.com

# Check TLS handshake time
time openssl s_client -connect thanos-receive.example.com:19291 </dev/null

# Check bandwidth (requires an iperf3 server listening on the target)
iperf3 -c thanos-receive.example.com -p 19291 -t 30
```
If latency is high:
```yaml
# Increase timeout to accommodate latency
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    remote_timeout: 120s  # For high-latency links

    # Or use relabeling to reduce data volume
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop  # Drop verbose Go metrics
```
### 4. Check series cardinality
High cardinality causes backlog:
```bash
# Check total series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

# Or via PromQL
count({__name__=~".+"}) by (__name__)

# Check top-cardinality metrics
curl -s http://localhost:9090/api/v1/status/tsdb | jq \
  '.data.seriesCountByMetricName | sort_by(-.value) | .[0:20]'

# Find high-cardinality labels
curl -s http://localhost:9090/api/v1/label/__name__/values | jq \
  'map(select(startswith("kube_"))) | length'
```
Reduce cardinality with relabeling:
```yaml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    write_relabel_configs:
      # Drop high-cardinality pod UID
      - source_labels: [pod]
        regex: '.*'
        replacement: 'dropped'
        target_label: pod
        action: replace

      # Drop verbose metrics
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop

      # Keep only production metrics
      - source_labels: [environment]
        regex: 'production'
        action: keep

      # Aggregate instance labels
      - source_labels: [instance]
        regex: '(.*):.*'
        replacement: '${1}'
        target_label: instance
        action: replace
```
### 5. Check remote storage backend health
Verify ingestion capacity:
```bash
# Thanos receive check
curl -s http://thanos-receive:19291/-/healthy
curl -s http://thanos-receive:19291/-/ready

# Cortex check
curl -s http://cortex:8080/ready

# M3DB check
curl -s http://m3db-coordinator:7201/health

# Check ingestion rate
curl -s http://thanos-receive:19291/metrics | grep thanos_receive_ingested_samples

# Check storage capacity
curl -s http://thanos-receive:19291/metrics | grep thanos_store_bucket_objects_size_bytes
```
Check for backend throttling:
```bash
# Check for failed sends (429 Too Many Requests, 5xx)
curl -s http://prometheus:9090/metrics | grep prometheus_remote_storage_samples_failed_total

# Check rate limit headers
curl -I https://thanos-receive:19291/api/v1/receive \
  -X POST -d '' 2>/dev/null | grep -i "x-ratelimit"

# If rate limited, reduce write rate or increase backend capacity
```
### 6. Check Prometheus resource allocation
Insufficient resources cause processing delays:
```bash
# Check Prometheus memory usage
kubectl top pod prometheus-k8s-0 -n monitoring

# Check Prometheus flags
kubectl exec prometheus-k8s-0 -n monitoring -- ps aux | grep prometheus

# Key flags:
# --storage.tsdb.retention.time=15d
# --storage.tsdb.retention.size=50GB
# --query.max-concurrency=50
# --web.read-timeout=5m

# Check Go runtime metrics
curl -s http://localhost:9090/metrics | grep -E "go_gc_duration|go_memstats"
```
Increase resources:
```yaml
# Kubernetes resource limits
resources:
  requests:
    memory: 4Gi
    cpu: 2
  limits:
    memory: 8Gi
    cpu: 4

# Prometheus flags for high throughput
args:
  - --storage.tsdb.max-block-duration=4h
  - --storage.tsdb.min-block-duration=2h
  - --storage.tsdb.retention.time=15d
  - --query.max-concurrency=100
```
### 7. Check federation configuration
Federation scrape issues:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-pods"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-federation:9090'
    scrape_timeout: 30s
    scrape_interval: 30s
```
Optimize federation:
```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate recording rules and essential metrics
        - '{job="kubernetes-pods"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"kube_pod_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus-federation:9090'
    scrape_timeout: 60s    # Increase for large metric sets
    scrape_interval: 60s   # Reduce frequency if overloaded
    sample_limit: 100000   # Prevent runaway scrapes
```
### 8. Implement write relabeling for data reduction
Reduce data volume before sending:
```yaml
remote_write:
  - url: https://thanos-receive:19291/api/v1/receive
    write_relabel_configs:
      # Drop debug metrics
      - source_labels: [level]
        regex: 'debug'
        action: drop

      # Keep only critical namespaces
      - source_labels: [namespace]
        regex: 'production|staging'
        action: keep

      # Drop the per-container label, keeping pod-level series
      - regex: 'container'
        action: labeldrop

      # Aggregate short-lived job names
      - source_labels: [job]
        regex: '.*-.*-[a-f0-9]{5}'
        replacement: 'aggregated-job'
        target_label: job
        action: replace

      # Drop probe and scrape meta-metrics
      - source_labels: [__name__]
        regex: 'probe_.*|scrape_.*'
        action: drop
```
### 9. Set up backlog monitoring and alerting
Alert on backlog growth:
```yaml
# Prometheus alerting rules
groups:
  - name: remote_write
    rules:
      - alert: PrometheusRemoteWriteBacklog
        expr: prometheus_remote_storage_queue_length > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus remote write backlog growing"
          description: "Queue length is {{ $value }}, samples may be dropped"

      - alert: PrometheusRemoteWriteDroppingSamples
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus dropping remote write samples"
          description: "{{ $value }} samples/s being dropped"

      - alert: PrometheusRemoteWriteTimestampLag
        expr: |
          prometheus_remote_storage_highest_timestamp_in_seconds -
          prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus remote write timestamp lag exceeds 5 minutes"
          description: "Lag is {{ $value }} seconds"
```
### 10. Enable debug logging for remote write
Diagnose send failures:
```yaml
# The log level cannot be set in prometheus.yml; it is a command-line flag
args:
  - --log.level=debug
  - --log.format=logfmt
```
Check logs:
```bash
# Filter remote write logs
kubectl logs prometheus-k8s-0 -n monitoring | grep -i "remote_write"

# Look for specific errors
kubectl logs prometheus-k8s-0 -n monitoring | grep -E "failed|error|timeout|backlog"

# Note: Prometheus has no signal to toggle verbosity at runtime;
# changing --log.level requires a restart
```
## Prevention
- Size remote write queue for 2x normal throughput
- Monitor queue length and timestamp lag continuously
- Implement write relabeling to reduce cardinality
- Set `remote_timeout` to 3x p99 network RTT
- Use `retry_on_http_429` for rate-limited backends
- Configure `sample_limit` on federation scrapes
- Test remote write failover scenarios regularly
- Document backend capacity limits and scaling procedures
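The queue-sizing point above can be turned into simple arithmetic: to ride out a backend outage of a given length without dropping samples, the buffered samples must cover `ingest rate × outage duration`, spread across the shards. A sketch with made-up numbers (substitute your own rates; this ignores WAL-based buffering, which gives additional headroom):

```shell
#!/bin/sh
# Per-shard capacity needed to absorb a backend outage without drops.
# All values below are illustrative assumptions.
ingest_rate=50000   # samples/s being ingested
outage_s=60         # seconds of backend outage to absorb
max_shards=50       # from queue_config

echo $(( ingest_rate * outage_s / max_shards ))
# per-shard buffer needed; compare against queue_config.capacity
```

If the result is well above your configured `capacity`, either raise `capacity`, raise `max_shards`, or accept that outages longer than the implied window will drop data.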
## Related Errors
- **Server returned HTTP 429**: Remote storage rate limiting
- **Server returned HTTP 503**: Remote storage unavailable
- **Context deadline exceeded**: Write timeout too short
- **Connection refused**: Network connectivity lost