What's Actually Happening

A Patroni-managed PostgreSQL cluster cannot elect a leader node: every node stays in the replica or uninitialized state, leaving the cluster unavailable for writes.

The Error You'll See

No leader in cluster:

```bash
$ patronictl list

+ Cluster: postgres-cluster ----+---------+---------+----+-----------+
| Member | Host          | Role    | State   | TL | Lag in MB |
+--------+---------------+---------+---------+----+-----------+
| node-1 | 10.0.0.1:5432 | Replica | running |    |           |
| node-2 | 10.0.0.2:5432 | Replica | running |    |           |
| node-3 | 10.0.0.3:5432 | Replica | running |    |           |
+--------+---------------+---------+---------+----+-----------+

# Should show one node as Leader
```

Patroni logs:

```bash
$ journalctl -u patroni | grep -i leader

INFO: no leader node found in DCS
INFO: starting leader election race
WARNING: failed to acquire leader lock
```

DCS unavailable:

```bash
ERROR: Failed to update DCS: connection refused
```

Why This Happens

  1. DCS unavailable - etcd/Consul/Kubernetes API unreachable
  2. Network partition - nodes cannot communicate with each other
  3. All nodes down - no healthy nodes to elect
  4. Quorum loss - insufficient nodes for consensus
  5. Configuration mismatch - inconsistent cluster configuration across nodes
  6. DCS data corruption - leader lock key corrupted or missing
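
Quorum loss (cause 4) comes down to simple arithmetic: a DCS needs floor(N/2) + 1 healthy members to reach consensus, so an even-sized cluster survives no more failures than the next smaller odd one. A minimal sketch:

```shell
#!/usr/bin/env bash
# Quorum needed for consensus among N DCS members: floor(N/2) + 1.
quorum()    { echo $(( $1 / 2 + 1 )); }
# Failures the cluster survives while still holding quorum.
tolerates() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerates "$n")"
done
```

This is why a 3-member etcd cluster survives one failure but a 2-member one survives none.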

Step 1: Check Cluster Status

```bash
# List cluster members:
patronictl list

# Extended status (adds tags, pending restarts, etc.):
patronictl list -e

# Run a query through a specific node:
patronictl query postgres-cluster --member node-1 -c "SELECT 1"

# Check DCS (Distributed Configuration Store):
# For etcd:
etcdctl get /service/postgres-cluster/leader

# For Consul:
consul kv get service/postgres-cluster/leader

# For Kubernetes:
kubectl get configmap postgres-cluster-config -o yaml

# Check Patroni API on each node:
curl http://10.0.0.1:8008/patroni
curl http://10.0.0.2:8008/patroni
curl http://10.0.0.3:8008/patroni
```
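
To reduce those raw `/patroni` responses to a one-line-per-node summary without extra tooling, the role and state fields can be pulled out with sed. A sketch using a canned, abridged sample payload (in production, replace the sample with `resp=$(curl -s http://$node:8008/patroni)`):

```shell
#!/usr/bin/env bash
# Abridged sample of a /patroni response; replace with a real
# curl call against each node's REST API.
resp='{"state": "running", "role": "replica", "timeline": 4}'

role=$(echo "$resp"  | sed -n 's/.*"role": *"\([^"]*\)".*/\1/p')
state=$(echo "$resp" | sed -n 's/.*"state": *"\([^"]*\)".*/\1/p')
echo "role=$role state=$state"
```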

Step 2: Check DCS Connectivity

```bash
# Check DCS backend:

# For etcd:
etcdctl endpoint health
etcdctl endpoint status

# Check Patroni's DCS configuration (the etcd section lives at the
# top level of patroni.yml):
grep -A 5 'etcd' /etc/patroni/patroni.yml

# Example etcd config:
# etcd3:
#   host: 10.0.0.10:2379

# Test etcd connectivity:
etcdctl get /service/postgres-cluster --prefix

# For Consul:
consul members
consul kv get -recurse service/postgres-cluster

# For Kubernetes:
kubectl get endpoints
kubectl describe configmap postgres-cluster-config

# If the DCS is unavailable, fix it first:
# For etcd: restart the etcd cluster
# For Consul: check the Consul leader
# For Kubernetes: check the API server
```

Step 3: Check Node Health

```bash
# Check PostgreSQL on each node:
ssh node-1 "systemctl status postgresql"
ssh node-2 "systemctl status postgresql"
ssh node-3 "systemctl status postgresql"

# Check the Patroni process:
ssh node-1 "systemctl status patroni"

# Check PostgreSQL connectivity:
psql -h node-1 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node-2 -U postgres -c "SELECT pg_is_in_recovery();"

# Expected:
# - One node returns false (primary)
# - Others return true (replicas)
# If all return true, no primary exists

# Check Patroni logs on each node:
ssh node-1 "journalctl -u patroni -n 50"
```
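
The pg_is_in_recovery() results above can be classified mechanically: exactly one `false` is healthy, zero means no primary, more than one means split-brain. A sketch using a hypothetical sample of collected results (gather the real values with psql as shown above):

```shell
#!/usr/bin/env bash
# Hypothetical pg_is_in_recovery() results, one per node
# (t = replica, f = primary); collect these with psql in practice.
results="t f t"

primaries=0
for r in $results; do
  [ "$r" = "f" ] && primaries=$((primaries + 1))
done

case $primaries in
  0) verdict="NO PRIMARY: leader election failed" ;;
  1) verdict="OK: exactly one primary" ;;
  *) verdict="SPLIT-BRAIN: $primaries primaries" ;;
esac
echo "$verdict"
```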

Step 4: Force Leader Election

```bash
# If the cluster is healthy but has no leader, trigger an election:

# Method 1: Force a failover to a healthy, up-to-date node
# (switchover requires a running leader, so use failover here):
patronictl failover postgres-cluster --candidate node-1 --force

# Method 2: Remove a stale leader key and let an election happen:
# For etcd:
etcdctl del /service/postgres-cluster/leader

# For Consul:
consul kv delete service/postgres-cluster/leader

# For Kubernetes (Patroni keeps the lock in a <scope>-leader
# configmap or endpoint; the new leader recreates it):
kubectl delete configmap postgres-cluster-leader

# Wait 10-30 seconds for the election, then check:
patronictl list
# Should now show a leader

# Method 3: Restart Patroni on the most up-to-date node to
# re-trigger the election race:
ssh node-1 "systemctl restart patroni"
```

Step 5: Check Node Connectivity

```bash
# Check network between nodes:
ssh node-1 "ping -c 3 node-2"
ssh node-1 "ping -c 3 node-3"

# Check the PostgreSQL port:
ssh node-1 "nc -zv node-2 5432"
ssh node-1 "nc -zv node-3 5432"

# Check the Patroni API port:
ssh node-1 "nc -zv node-2 8008"

# Check the firewall:
ssh node-1 "iptables -L -n | grep 5432"

# Allow the PostgreSQL and Patroni API ports:
iptables -I INPUT -p tcp --dport 5432 -j ACCEPT
iptables -I INPUT -p tcp --dport 8008 -j ACCEPT

# Check for a network partition: nodes in different partitions
# cannot elect a leader; every node must be able to reach the
# others and the DCS.
```
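
A partition check has to cover the full mesh, not just one node's view, since a one-way failure still breaks elections. A sketch that enumerates every ordered node pair to probe (node names are placeholders; the actual ssh/nc probe is left commented out):

```shell
#!/usr/bin/env bash
nodes=(node-1 node-2 node-3)

# Every ordered pair must be reachable.
for src in "${nodes[@]}"; do
  for dst in "${nodes[@]}"; do
    [ "$src" = "$dst" ] && continue
    echo "probe $src -> $dst"
    # ssh "$src" "nc -z -w 2 $dst 8008" || echo "FAIL: $src -> $dst"
  done
done
```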

Step 6: Check Raft Consensus

```bash
# If using Patroni's built-in Raft DCS (pysyncobj, Patroni 2.x;
# deprecated in 3.x):

# Check the Raft configuration:
grep -A 5 'raft' /etc/patroni/patroni.yml

# Example:
# raft:
#   self_addr: 10.0.0.1:2222
#   partner_addrs: ['10.0.0.2:2222', '10.0.0.3:2222']

# Check Raft port connectivity:
nc -zv 10.0.0.2 2222
nc -zv 10.0.0.3 2222

# The Raft leader handles DCS operations; if Raft cannot form a
# quorum, nodes cannot coordinate. Restart Patroni on all nodes:
systemctl restart patroni

# Check logs for Raft errors:
journalctl -u patroni | grep -i raft
```

Step 7: Recover Failed Node

```bash
# If a node's PostgreSQL data directory is corrupted:

# Check the PostgreSQL data directory:
ssh node-1 "ls -la /var/lib/postgresql/data/"

# If data is missing or corrupted, stop Patroni on the node
# (its member key in the DCS expires on its own):
ssh node-1 "systemctl stop patroni"

# Reinitialize the node from the current leader:
patronictl reinit postgres-cluster node-1

# Or manually:
ssh node-1 "rm -rf /var/lib/postgresql/data/*"
ssh node-1 "pg_basebackup -h node-2 -U postgres -D /var/lib/postgresql/data"
ssh node-1 "systemctl restart patroni"

# Check the node rejoined:
patronictl list

# If all nodes are corrupted, wipe the cluster state from the DCS
# and let Patroni bootstrap a new cluster:
patronictl remove postgres-cluster
systemctl restart patroni
```

Step 8: Check Configuration

```bash
# Verify the Patroni configuration on all nodes:
cat /etc/patroni/patroni.yml

# Key settings must be consistent across nodes:
# - scope (cluster name)
# - DCS configuration
# - PostgreSQL parameters

# Scope (cluster name) must be the same on every node:
# scope: postgres-cluster

# PostgreSQL configuration must allow replication:
# postgresql:
#   parameters:
#     max_connections: 200
#     wal_level: replica
#     max_wal_senders: 10

# Check for config drift:
diff /etc/patroni/patroni.yml.node1 /etc/patroni/patroni.yml.node2

# Fix inconsistencies: copy the correct config to all nodes:
scp /etc/patroni/patroni.yml node-2:/etc/patroni/patroni.yml
ssh node-2 "systemctl restart patroni"
```
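
Eyeballing diffs does not scale past two nodes; comparing checksums does. A sketch that simulates three fetched configs with local temp files (in production, replace the printf setup with `ssh $node cat /etc/patroni/patroni.yml > /tmp/patroni-$node.yml`):

```shell
#!/usr/bin/env bash
# Simulated configs fetched from three nodes; node-3 has drifted.
printf 'scope: postgres-cluster\n' > /tmp/patroni-node-1.yml
printf 'scope: postgres-cluster\n' > /tmp/patroni-node-2.yml
printf 'scope: pgcluster\n'        > /tmp/patroni-node-3.yml

# Use the first node's checksum as the reference.
ref=$(md5sum /tmp/patroni-node-1.yml | cut -d' ' -f1)
for f in /tmp/patroni-node-2.yml /tmp/patroni-node-3.yml; do
  sum=$(md5sum "$f" | cut -d' ' -f1)
  if [ "$sum" = "$ref" ]; then
    echo "OK    $f"
  else
    echo "DRIFT $f"
  fi
done
```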

Step 9: DCS Data Recovery

```bash
# If DCS data is corrupted:

# Check DCS keys:
# For etcd:
etcdctl get /service/postgres-cluster --prefix --keys-only

# Expected keys:
# /service/postgres-cluster/leader
# /service/postgres-cluster/members/node-1
# /service/postgres-cluster/members/node-2
# /service/postgres-cluster/optime/leader
# /service/postgres-cluster/config

# If the leader key is missing or stale, do NOT create it by hand:
# Patroni writes it with a lease/TTL, and a hand-made key without
# one can wedge the cluster. Delete the stale key and let the
# election recreate it:
etcdctl del /service/postgres-cluster/leader

# If members keys are missing, Patroni recreates them on restart:
systemctl restart patroni

# If the config key is corrupted, delete it; Patroni recreates it
# from the local bootstrap configuration:
etcdctl del /service/postgres-cluster/config

# For Consul, inspect and clean up the same way:
consul kv get -recurse service/postgres-cluster
consul kv delete service/postgres-cluster/leader
```

Step 10: Monitor Cluster Health

```bash
# Create a monitoring script:
cat << 'EOF' > /usr/local/bin/monitor-patroni.sh
#!/bin/bash

echo "=== Patroni Cluster Status ==="
patronictl list

echo ""
echo "=== DCS Status ==="
etcdctl endpoint health

echo ""
echo "=== Leader Check ==="
LEADER=$(patronictl list | grep Leader | awk '{print $2}')
if [ -z "$LEADER" ]; then
  echo "ERROR: No leader in cluster!"
  # Send alert
else
  echo "OK: Leader is $LEADER"
fi

echo ""
echo "=== Node Health ==="
for node in node-1 node-2 node-3; do
  curl -s http://$node:8008/patroni | jq '.state'
done
EOF

chmod +x /usr/local/bin/monitor-patroni.sh

# Patroni exposes Prometheus metrics:
curl http://localhost:8008/metrics

# Key metrics:
# patroni_dcs_last_seen    - timestamp of the last successful DCS contact
# patroni_postgres_running - whether PostgreSQL is running on this node
# patroni_primary          - whether this node is the primary
# patroni_cluster_unlocked - 1 when no node holds the leader lock

# Alert rule (Prometheus):
# - alert: PatroniNoLeader
#   expr: max(patroni_cluster_unlocked) == 1
#   for: 1m
#   labels:
#     severity: critical
#   annotations:
#     summary: "Patroni cluster has no leader"
```

Patroni Cluster No Leader Checklist

| Check | Command | Expected |
|-------|---------|----------|
| Cluster list | `patronictl list` | Has Leader |
| DCS health | `etcdctl endpoint health` | Healthy |
| Node PostgreSQL | `systemctl status postgresql` | Running |
| DCS leader key | `etcdctl get /service/postgres-cluster/leader` | Exists |
| Network ports | `nc -zv <node> 5432` | Connected |
| Node connectivity | `ping <node>` | Reachable |
| Patroni config | `patroni.yml` | Consistent |

Verify the Fix

```bash
# After resolving leader election:

# 1. Check the cluster has a leader
patronictl list
# One node shows the Leader role

# 2. Verify the primary can write
psql -h <leader> -U postgres -c "CREATE TABLE test (id int);"
# Table created

# 3. Check replicas are syncing
psql -h node-2 -U postgres -c "SELECT pg_last_wal_receive_lsn();"
# LSN advancing

# 4. Test that failover works
patronictl switchover postgres-cluster --candidate node-2 --force
# New leader elected

# 5. Monitor the DCS
etcdctl get /service/postgres-cluster/leader
# Leader key updates

# 6. Check all nodes are healthy
patronictl list -e
# All nodes running
```

  • [Fix Etcd Leader Election Failed](/articles/fix-etcd-leader-election-failed)
  • [Fix PostgreSQL WAL Archive Stuck](/articles/fix-postgresql-wal-archive-stuck)
  • [Fix PostgreSQL Connection Limit Exceeded](/articles/fix-postgresql-connection-limit-exceeded)