The Problem

You have two Prometheus instances running in HA mode, but you're experiencing issues:

  • Duplicate alerts firing from both instances
  • Inconsistent data between the two Prometheus servers
  • Alertmanager receiving alerts from both without deduplication
```
level=warn ts=2026-04-04T23:55:12.345Z caller=alerting.go:234 msg="Alert already exists" alert="HighCPU" instance="prometheus-1"
level=error ts=2026-04-04T23:55:13.456Z caller=alerting.go:235 msg="Duplicate alert source" sources="prometheus-0,prometheus-1"
```

HA pair issues cause alert noise, data gaps, and unreliable monitoring.

Diagnosis

Check External Labels

```bash
# Check external labels on each Prometheus. The config endpoint returns
# the configuration as a single YAML string under .data.yaml
curl -s http://prometheus-0:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
curl -s http://prometheus-1:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
```

Check Alertmanager Connections

```bash
# Verify both Prometheus instances are connected to Alertmanager
# (the v2 API returns the status object directly, with no .data wrapper)
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'

# Check alert silences and inhibition rules
curl -s http://alertmanager:9093/api/v2/silences | jq .
```

Check for Duplicate Alerts

```promql
# Count firing alerts per replica (run against a global view such as
# Thanos Query; external labels are not visible in a replica's own queries)
count by (prometheus) (ALERTS{alertstate="firing"})

# Firing alerts that are missing the replica label entirely
ALERTS{alertstate="firing", prometheus=""}
```

Check Data Consistency

```bash
# PromQL cannot address a specific server from inside a query (the @
# modifier takes timestamps, not instance names), so query each instance
# directly and compare the results
curl -s 'http://prometheus-0:9090/api/v1/query?query=up{job="node-exporter"}' | jq '.data.result'
curl -s 'http://prometheus-1:9090/api/v1/query?query=up{job="node-exporter"}' | jq '.data.result'

# Compare the scrape timestamps reported by each instance
curl -s 'http://prometheus-0:9090/api/v1/query?query=timestamp(up{job="node-exporter"})' | jq '.data.result'
curl -s 'http://prometheus-1:9090/api/v1/query?query=timestamp(up{job="node-exporter"})' | jq '.data.result'
```

Solutions

1. Configure External Labels

Each Prometheus instance must have unique external labels:

```yaml
# prometheus-0 configuration (prometheus.yml)
global:
  external_labels:
    prometheus: 'prometheus-0'
    cluster: 'production'
    replica: '0'
---
# prometheus-1 configuration
global:
  external_labels:
    prometheus: 'prometheus-1'
    cluster: 'production'
    replica: '1'
```

These labels identify each replica's data in remote storage and global queries. For alerting, the replica label is dropped before alerts reach Alertmanager, so that both replicas' alerts collapse into one.
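The effect can be pictured with a tiny shell sketch: two alerts that differ only in the replica label become identical once that label is ignored. The label strings and the `strip` helper are illustrative, not Alertmanager internals:

```shell
# Label sets as each replica would emit them (illustrative)
a0='alertname=HighCPU,cluster=production,prometheus=prometheus-0'
a1='alertname=HighCPU,cluster=production,prometheus=prometheus-1'

# Ignore the replica label, as happens when it is dropped before
# alerts reach Alertmanager
strip() { echo "$1" | sed 's/,prometheus=[^,]*//'; }

if [ "$(strip "$a0")" = "$(strip "$a1")" ]; then
  echo "deduplicated: one notification"
fi
```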

2. Configure Alertmanager for Deduplication

Alertmanager deduplicates alerts whose label sets are identical. The per-replica label must therefore never reach it, and must not appear in `group_by`:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  # Do NOT group by the per-replica label ('prometheus'): alerts from both
  # replicas must fall into the same group to be deduplicated
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://notification-service/webhook'

# Inhibition rules suppress related alerts; they are not the HA
# deduplication mechanism
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']
```

The `group_by` must not include the per-replica label; if it does, each replica's alerts land in separate groups and every alert fires twice.
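To keep the replica label in stored data but out of Alertmanager, each Prometheus can drop it just before alerts are sent. A sketch, assuming the replica label is named `prometheus` as in the previous section:

```yaml
# prometheus.yml (identical on both replicas)
alerting:
  alert_relabel_configs:
    # Drop the per-replica label so both replicas emit identical alerts
    - action: labeldrop
      regex: prometheus
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```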

3. Deduplicate via Thanos/VictoriaMetrics

For long-term storage deduplication:

```bash
# Thanos deduplicates at query time: point Thanos Query at both replicas
# (or at Thanos Receive) and tell it which label identifies a replica.
# Flag names may vary slightly between Thanos versions.
thanos query \
  --query.replica-label=prometheus \
  --endpoint=thanos-receive-0:10901 \
  --endpoint=thanos-receive-1:10901
```

Or use VictoriaMetrics:

```bash
# VictoriaMetrics deduplication: set on vmstorage and vmselect in cluster
# mode (or on the single-node binary), not on vminsert
victoria-metrics -dedup.minScrapeInterval=15s
```

Configure Prometheus remote write:

```yaml
# Both Prometheus instances
remote_write:
  - url: "https://thanos-receive:19291/api/v1/write"
    # external_labels from the global section are attached automatically,
    # so ensure they are set as shown above
```

4. Fix Scrape Configuration Differences

Both Prometheus instances should have identical scrape configurations:

```bash
# Compare configurations
diff prometheus-0.yml prometheus-1.yml

# Or via API
curl -s http://prometheus-0:9090/api/v1/status/config > config-0.json
curl -s http://prometheus-1:9090/api/v1/status/config > config-1.json
diff config-0.json config-1.json
```

Ensure identical configs:

```yaml
# Shared configuration file for both instances
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# The only difference between the instances should be external_labels.
# Use separate files for external labels, or template them in.
```

Using environment variables. Note that Prometheus does not expand environment variables in its configuration file, so the file must be rendered before startup, for example with `envsubst`:

```yaml
# prometheus.yml.tpl -- template, rendered to prometheus.yml at startup
global:
  external_labels:
    prometheus: '${PROMETHEUS_REPLICA}'
    cluster: 'production'
```

```bash
# Render the template, then start Prometheus
export PROMETHEUS_REPLICA=prometheus-0
envsubst < prometheus.yml.tpl > prometheus.yml
prometheus --config.file=prometheus.yml
```
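If `envsubst` is unavailable, plain `sed` does the same job. A self-contained sketch; the `__REPLICA__` placeholder and file names are illustrative:

```shell
# Write a template with a placeholder for the per-replica label
cat > prometheus.yml.tpl <<'EOF'
global:
  external_labels:
    prometheus: '__REPLICA__'
    cluster: 'production'
EOF

# Render a concrete config for this replica
PROMETHEUS_REPLICA=prometheus-0
sed "s/__REPLICA__/${PROMETHEUS_REPLICA}/" prometheus.yml.tpl > prometheus.yml

# Show the rendered label line
grep 'prometheus:' prometheus.yml
```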

5. Handle Alert Evaluation Timing

Alerts may fire at different times due to timing differences:

```yaml
# prometheus.yml -- keep evaluation timing identical on both replicas
global:
  evaluation_interval: 30s   # Same on both

# rules file -- use consistent 'for' durations
groups:
  - name: application_alerts
    rules:
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total[1m]) > 0.8
        for: 5m   # Should be well above the scrape_interval
```
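A quick sanity check on the timing above, using plain shell arithmetic with the numbers from this section:

```shell
# With evaluation_interval=30s and for: 5m, the alert condition must hold
# for 300/30 = 10 consecutive evaluations before firing, so both replicas
# fire within roughly one evaluation_interval of each other
evaluation_interval=30        # seconds
for_duration=300              # seconds (5m)
evaluations_needed=$(( for_duration / evaluation_interval ))
echo "$evaluations_needed"    # → 10
```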

6. Configure Kubernetes HA

For Kubernetes deployments:

```yaml
# prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  replicas: 2
  serviceName: prometheus
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/data'
            # Prometheus has no command-line flag for external labels;
            # set them in prometheus.yml, rendered per pod from POD_NAME
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```
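One way to get a per-pod external label without such a flag is to render the config in an init container from the injected pod name. A sketch with volume mounts omitted and paths illustrative (the Prometheus Operator avoids this entirely by injecting a `prometheus_replica` external label itself):

```yaml
      initContainers:
        - name: render-config
          image: busybox
          # Kubernetes expands $(POD_NAME) in command/args before the
          # shell runs, so the pod name lands in the rendered config
          command:
            - sh
            - -c
            - sed "s/__REPLICA__/$(POD_NAME)/" /templates/prometheus.yml.tpl > /etc/prometheus/prometheus.yml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```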

Service configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: ClusterIP
  ports:
    - port: 9090
  selector:
    app: prometheus
---
# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: prometheus-headless
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - port: 9090
  selector:
    app: prometheus
```

7. Alertmanager HA Configuration

Alertmanager clustering is configured with command-line flags, not in alertmanager.yml:

```bash
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.gossip-interval=10s \
  --cluster.peer-timeout=30s
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 2
  serviceName: alertmanager
  template:
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--cluster.listen-address=0.0.0.0:9094'
            # StatefulSet pod DNS via the headless service
            - '--cluster.peer=alertmanager-0.alertmanager:9094'
            - '--cluster.peer=alertmanager-1.alertmanager:9094'
          ports:
            - containerPort: 9093   # API / UI
            - containerPort: 9094   # cluster gossip
```

Verification

Check Alert Deduplication

```bash
# Verify alerts from both sources
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | {labels: .labels, fingerprint: .fingerprint}'

# Check Alertmanager cluster status
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'
```

Verify External Labels

External labels are not attached to results of a replica's own queries, so read them from the configuration endpoint:

```bash
curl -s http://prometheus-0:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
curl -s http://prometheus-1:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 3 external_labels
```

```promql
# Through a global view (e.g. Thanos Query), each replica should be visible
count by (prometheus) (up)
```

Check Data Consistency

```bash
# Compare sample counts
curl -s 'http://prometheus-0:9090/api/v1/query?query=count({__name__=~".+"})' | jq '.data.result[0].value'
curl -s 'http://prometheus-1:9090/api/v1/query?query=count({__name__=~".+"})' | jq '.data.result[0].value'
```
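The two numbers can then be compared mechanically. An offline sketch: the JSON payloads are illustrative stand-ins for the responses from the curl commands above, and `extract` is a hypothetical helper:

```shell
# Saved API responses from each replica (illustrative values)
count0='{"data":{"result":[{"value":[1700000000,"123456"]}]}}'
count1='{"data":{"result":[{"value":[1700000000,"123456"]}]}}'

# Pull the count out with sed to keep the sketch dependency-free
extract() { echo "$1" | sed 's/.*,"\([0-9]*\)".*/\1/'; }

if [ "$(extract "$count0")" = "$(extract "$count1")" ]; then
  echo "replicas consistent"
else
  echo "replicas diverged"
fi
```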

Prevention

Add monitoring for HA pair:

```yaml
groups:
  - name: ha_pair_alerts
    rules:
      # Evaluate these through a global view (e.g. Thanos Query): external
      # labels such as 'prometheus' are not visible to a replica's own queries
      - alert: PrometheusReplicaMissingExternalLabel
        expr: absent(up{prometheus=~".+"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus replica missing external label"
          description: "Prometheus is missing the 'prometheus' external label required for HA"

      - alert: AlertmanagerHADown
        expr: min(alertmanager_cluster_members) < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager cluster degraded"
          description: "Only {{ $value }} Alertmanager cluster member(s) are healthy"

      - alert: PrometheusHAConfigMismatch
        # Both replicas should see the same number of targets
        expr: |
          min(count by (prometheus) (up)) != max(count by (prometheus) (up))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus HA configuration mismatch"

      - alert: DuplicateAlertSources
        # After deduplication, each alert should come from a single source
        expr: count by (alertname) (count by (alertname, prometheus) (ALERTS{alertstate="firing"})) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Duplicate alerts detected"
          description: "Alert {{ $labels.alertname }} firing from multiple sources"
```