The Problem

You have two Prometheus instances running in HA mode, but you're experiencing issues:

  • Duplicate alerts firing from both instances
  • Inconsistent data between the two Prometheus servers
  • Alertmanager receiving alerts from both without deduplication
```
level=warn ts=2026-04-04T23:55:12.345Z caller=alerting.go:234 msg="Alert already exists" alert="HighCPU" instance="prometheus-1"
level=error ts=2026-04-04T23:55:13.456Z caller=alerting.go:235 msg="Duplicate alert source" sources="prometheus-0,prometheus-1"
```

HA pair issues cause alert noise, data gaps, and unreliable monitoring.

Diagnosis

Check External Labels

```bash
# Check external labels on each Prometheus. The config endpoint returns
# the configuration as a single YAML string under .data.yaml
curl -s http://prometheus-0:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
curl -s http://prometheus-1:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
```

Check Alertmanager Connections

```bash
# Verify both Prometheus instances are connected to Alertmanager
# (the v2 API returns the status object directly, with no .data wrapper)
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'

# Check alert silences and inhibition rules
curl -s http://alertmanager:9093/api/v2/silences | jq .
```

Check for Duplicate Alerts

```promql
# Count firing alerts per replica (run against a global view such as
# Thanos Query; external labels are not visible in a replica's own queries)
count by (prometheus) (ALERTS{alertstate="firing"})

# Firing alerts that are missing the replica label entirely
ALERTS{alertstate="firing", prometheus=""}
```

Check Data Consistency

```bash
# PromQL cannot address a specific server from inside a query (the @
# modifier takes timestamps, not instance names), so query each instance
# directly and compare the results
curl -s 'http://prometheus-0:9090/api/v1/query?query=up{job="node-exporter"}' | jq '.data.result'
curl -s 'http://prometheus-1:9090/api/v1/query?query=up{job="node-exporter"}' | jq '.data.result'

# Compare the scrape timestamps reported by each instance
curl -s 'http://prometheus-0:9090/api/v1/query?query=timestamp(up{job="node-exporter"})' | jq '.data.result'
curl -s 'http://prometheus-1:9090/api/v1/query?query=timestamp(up{job="node-exporter"})' | jq '.data.result'
```

Solutions

1. Configure External Labels

Each Prometheus instance must have unique external labels:

```yaml
# prometheus-0 configuration (prometheus.yml)
global:
  external_labels:
    prometheus: 'prometheus-0'
    cluster: 'production'
    replica: '0'
---
# prometheus-1 configuration
global:
  external_labels:
    prometheus: 'prometheus-1'
    cluster: 'production'
    replica: '1'
```

These labels identify each replica's data in remote storage and global queries. For alerting, the replica label is dropped before alerts reach Alertmanager, so that both replicas' alerts collapse into one.
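The effect can be pictured with a tiny shell sketch: two alerts that differ only in the replica label become identical once that label is ignored. The label strings and the `strip` helper are illustrative, not Alertmanager internals:

```shell
# Label sets as each replica would emit them (illustrative)
a0='alertname=HighCPU,cluster=production,prometheus=prometheus-0'
a1='alertname=HighCPU,cluster=production,prometheus=prometheus-1'

# Ignore the replica label, as happens when it is dropped before
# alerts reach Alertmanager
strip() { echo "$1" | sed 's/,prometheus=[^,]*//'; }

if [ "$(strip "$a0")" = "$(strip "$a1")" ]; then
  echo "deduplicated: one notification"
fi
```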

2. Configure Alertmanager for Deduplication

Alertmanager deduplicates alerts whose label sets are identical. The per-replica label must therefore never reach it, and must not appear in `group_by`:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  # Do NOT group by the per-replica label ('prometheus'): alerts from both
  # replicas must fall into the same group to be deduplicated
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://notification-service/webhook'

# Inhibition rules suppress related alerts; they are not the HA
# deduplication mechanism
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']
```

The `group_by` must not include the per-replica label; if it does, each replica's alerts land in separate groups and every alert fires twice.
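To keep the replica label in stored data but out of Alertmanager, each Prometheus can drop it just before alerts are sent. A sketch, assuming the replica label is named `prometheus` as in the previous section:

```yaml
# prometheus.yml (identical on both replicas)
alerting:
  alert_relabel_configs:
    # Drop the per-replica label so both replicas emit identical alerts
    - action: labeldrop
      regex: prometheus
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```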

3. Deduplicate via Thanos/VictoriaMetrics

For long-term storage deduplication:

```bash
# Thanos deduplicates at query time: point Thanos Query at both replicas
# (or at Thanos Receive) and tell it which label identifies a replica.
# Flag names may vary slightly between Thanos versions.
thanos query \
  --query.replica-label=prometheus \
  --endpoint=thanos-receive-0:10901 \
  --endpoint=thanos-receive-1:10901
```

Or use VictoriaMetrics:

```bash
# VictoriaMetrics deduplication: set on vmstorage and vmselect in cluster
# mode (or on the single-node binary), not on vminsert
victoria-metrics -dedup.minScrapeInterval=15s
```

Configure Prometheus remote write:

```yaml
# Both Prometheus instances
remote_write:
  - url: "https://thanos-receive:19291/api/v1/write"
    # external_labels from the global section are attached automatically,
    # so ensure they are set as shown above
```

4. Fix Scrape Configuration Differences

Both Prometheus instances should have identical scrape configurations:

```bash
# Compare configurations
diff prometheus-0.yml prometheus-1.yml

# Or via API
curl -s http://prometheus-0:9090/api/v1/status/config > config-0.json
curl -s http://prometheus-1:9090/api/v1/status/config > config-1.json
diff config-0.json config-1.json
```

Ensure identical configs:

```yaml
# Shared configuration file for both instances
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# The only difference between the instances should be external_labels.
# Use separate files for external labels, or template them in.
```

Using environment variables. Note that Prometheus does not expand environment variables in its configuration file, so the file must be rendered before startup, for example with `envsubst`:

```yaml
# prometheus.yml.tpl -- template, rendered to prometheus.yml at startup
global:
  external_labels:
    prometheus: '${PROMETHEUS_REPLICA}'
    cluster: 'production'
```

```bash
# Render the template, then start Prometheus
export PROMETHEUS_REPLICA=prometheus-0
envsubst < prometheus.yml.tpl > prometheus.yml
prometheus --config.file=prometheus.yml
```
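If `envsubst` is unavailable, plain `sed` does the same job. A self-contained sketch; the `__REPLICA__` placeholder and file names are illustrative:

```shell
# Write a template with a placeholder for the per-replica label
cat > prometheus.yml.tpl <<'EOF'
global:
  external_labels:
    prometheus: '__REPLICA__'
    cluster: 'production'
EOF

# Render a concrete config for this replica
PROMETHEUS_REPLICA=prometheus-0
sed "s/__REPLICA__/${PROMETHEUS_REPLICA}/" prometheus.yml.tpl > prometheus.yml

# Show the rendered label line
grep 'prometheus:' prometheus.yml
```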

5. Handle Alert Evaluation Timing

Alerts may fire at different times due to timing differences:

```yaml
# prometheus.yml -- keep evaluation timing identical on both replicas
global:
  evaluation_interval: 30s   # Same on both

# rules file -- use consistent 'for' durations
groups:
  - name: application_alerts
    rules:
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total[1m]) > 0.8
        for: 5m   # Should be well above the scrape_interval
```
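A quick sanity check on the timing above, using plain shell arithmetic with the numbers from this section:

```shell
# With evaluation_interval=30s and for: 5m, the alert condition must hold
# for 300/30 = 10 consecutive evaluations before firing, so both replicas
# fire within roughly one evaluation_interval of each other
evaluation_interval=30        # seconds
for_duration=300              # seconds (5m)
evaluations_needed=$(( for_duration / evaluation_interval ))
echo "$evaluations_needed"    # → 10
```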

6. Configure Kubernetes HA

For Kubernetes deployments:

```yaml
# prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  replicas: 2
  serviceName: prometheus
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/data'
            # Prometheus has no command-line flag for external labels;
            # set them in prometheus.yml, rendered per pod from POD_NAME
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```
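One way to get a per-pod external label without such a flag is to render the config in an init container from the injected pod name. A sketch with volume mounts omitted and paths illustrative (the Prometheus Operator avoids this entirely by injecting a `prometheus_replica` external label itself):

```yaml
      initContainers:
        - name: render-config
          image: busybox
          # Kubernetes expands $(POD_NAME) in command/args before the
          # shell runs, so the pod name lands in the rendered config
          command:
            - sh
            - -c
            - sed "s/__REPLICA__/$(POD_NAME)/" /templates/prometheus.yml.tpl > /etc/prometheus/prometheus.yml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```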

Service configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: ClusterIP
  ports:
    - port: 9090
  selector:
    app: prometheus
---
# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: prometheus-headless
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - port: 9090
  selector:
    app: prometheus
```

7. Alertmanager HA Configuration

Alertmanager clustering is configured with command-line flags, not in alertmanager.yml:

```bash
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.gossip-interval=10s \
  --cluster.peer-timeout=30s
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 2
  serviceName: alertmanager
  template:
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--cluster.listen-address=0.0.0.0:9094'
            # StatefulSet pod DNS via the headless service
            - '--cluster.peer=alertmanager-0.alertmanager:9094'
            - '--cluster.peer=alertmanager-1.alertmanager:9094'
          ports:
            - containerPort: 9093   # API / UI
            - containerPort: 9094   # cluster gossip
```

Verification

Check Alert Deduplication

```bash
# Verify alerts from both sources
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | {labels: .labels, fingerprint: .fingerprint}'

# Check Alertmanager cluster status
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'
```

Verify External Labels

External labels are not attached to results of a replica's own queries, so read them from the configuration endpoint:

```bash
curl -s http://prometheus-0:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
curl -s http://prometheus-1:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
```

```promql
# Through a global view (e.g. Thanos Query), each replica should be visible
count by (prometheus) (up)
```

Check Data Consistency

```bash
# Compare sample counts
curl -s 'http://prometheus-0:9090/api/v1/query?query=count({__name__=~".+"})' | jq '.data.result[0].value'
curl -s 'http://prometheus-1:9090/api/v1/query?query=count({__name__=~".+"})' | jq '.data.result[0].value'
```
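The two numbers can then be compared mechanically. An offline sketch: the JSON payloads are illustrative stand-ins for the responses from the curl commands above, and `extract` is a hypothetical helper:

```shell
# Saved API responses from each replica (illustrative values)
count0='{"data":{"result":[{"value":[1700000000,"123456"]}]}}'
count1='{"data":{"result":[{"value":[1700000000,"123456"]}]}}'

# Pull the count out with sed to keep the sketch dependency-free
extract() { echo "$1" | sed 's/.*,"\([0-9]*\)".*/\1/'; }

if [ "$(extract "$count0")" = "$(extract "$count1")" ]; then
  echo "replicas consistent"
else
  echo "replicas diverged"
fi
```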

Prevention

Add monitoring for HA pair:

```yaml
groups:
  - name: ha_pair_alerts
    rules:
      # Evaluate these through a global view (e.g. Thanos Query): external
      # labels such as 'prometheus' are not visible to a replica's own queries
      - alert: PrometheusReplicaMissingExternalLabel
        expr: absent(up{prometheus=~".+"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus replica missing external label"
          description: "Prometheus is missing the 'prometheus' external label required for HA"

      - alert: AlertmanagerHADown
        expr: min(alertmanager_cluster_members) < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager cluster degraded"
          description: "Only {{ $value }} Alertmanager cluster member(s) are healthy"

      - alert: PrometheusHAConfigMismatch
        # Both replicas should see the same number of targets
        expr: |
          min(count by (prometheus) (up)) != max(count by (prometheus) (up))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus HA configuration mismatch"

      - alert: DuplicateAlertSources
        # After deduplication, each alert should come from a single source
        expr: count by (alertname) (count by (alertname, prometheus) (ALERTS{alertstate="firing"})) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Duplicate alerts detected"
          description: "Alert {{ $labels.alertname }} firing from multiple sources"
```