## What's Actually Happening

A Patroni-managed PostgreSQL cluster cannot elect a leader node. Every node stays in the replica or uninitialized state, so the cluster accepts no writes.
## The Error You'll See

No leader in cluster:

```bash
$ patronictl list
| Member | Host          | Role    | State   | TL | Lag |
|--------|---------------|---------|---------|----|-----|
| node-1 | 10.0.0.1:5432 | Replica | running |    |     |
| node-2 | 10.0.0.2:5432 | Replica | running |    |     |
| node-3 | 10.0.0.3:5432 | Replica | running |    |     |
# Should show one node as Leader
```
Patroni logs:

```bash
$ journalctl -u patroni | grep -i leader
INFO: no leader node found in DCS
INFO: starting leader election race
WARNING: failed to acquire leader lock
```
DCS unavailable:

```bash
ERROR: Failed to update DCS: connection refused
```

## Why This Happens
1. DCS unavailable - etcd/Consul/Kubernetes API unreachable
2. Network partition - Nodes cannot communicate with each other
3. All nodes down - No healthy node available to elect
4. Quorum loss - Too few nodes for consensus
5. Configuration mismatch - Inconsistent cluster configuration across nodes
6. DCS data corruption - Leader lock key corrupted or missing
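These causes can be triaged in order with a quick script. A minimal sketch, assuming three nodes, the default Patroni REST port (8008), and etcd as the DCS; `NODES` and `ETCD` are placeholders to adjust for your cluster:

```shell
#!/bin/bash
# Quick triage sketch for the causes above: check the DCS first, then
# member reachability. NODES and ETCD are assumptions - override them
# for your environment (e.g. a different DCS client).
ETCD="${ETCD:-etcdctl}"
NODES="${NODES:-node-1 node-2 node-3}"

triage() {
  # Causes 1 and 6: is the DCS reachable at all?
  if ! $ETCD endpoint health >/dev/null 2>&1; then
    echo "DCS unreachable - fix etcd/Consul/K8s API first"
    return 1
  fi
  # Causes 2-4: are enough Patroni members answering?
  local up=0 total=0 n
  for n in $NODES; do
    total=$((total + 1))
    curl -sf "http://$n:8008/patroni" >/dev/null && up=$((up + 1))
  done
  if [ "$up" -le $((total / 2)) ]; then
    echo "only $up/$total members reachable - suspect partition or quorum loss"
    return 1
  fi
  echo "DCS and majority reachable - suspect config mismatch or stale DCS keys"
}
# Usage: triage
```

Run `triage` before working through the steps below; it tells you which layer (DCS, network, or configuration) to start with.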
## Step 1: Check Cluster Status

```bash
# List cluster members:
patronictl list

# Extended status (pending restarts, tags, scheduled actions):
patronictl list -e

# Run a query through a specific member:
patronictl query postgres-cluster --member node-1 --command "SELECT 1"

# Check the DCS (Distributed Configuration Store):
# For etcd:
etcdctl get /service/postgres-cluster/leader

# For Consul:
consul kv get service/postgres-cluster/leader

# For Kubernetes:
kubectl get configmap postgres-cluster-config -o yaml

# Check the Patroni API on each node:
curl http://10.0.0.1:8008/patroni
curl http://10.0.0.2:8008/patroni
curl http://10.0.0.3:8008/patroni
```
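Patroni's REST API answers `GET /leader` with HTTP 200 only on the node that currently holds the leader lock, which makes a leader probe scriptable. A minimal sketch; the addresses in the usage line are assumptions:

```shell
#!/bin/bash
# Probe each member's /leader endpoint: the leader answers 200, everyone
# else answers 503, so the first 200 identifies the leader. With no
# leader, every probe fails and the function returns non-zero.
find_leader() {
  local n
  for n in "$@"; do
    if curl -sf -o /dev/null "http://$n:8008/leader"; then
      echo "$n"
      return 0
    fi
  done
  echo "no leader found" >&2
  return 1
}
# Usage: find_leader 10.0.0.1 10.0.0.2 10.0.0.3
```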
## Step 2: Check DCS Connectivity

```bash
# Check the DCS backend:
# For etcd:
etcdctl endpoint health
etcdctl endpoint status

# Check Patroni's DCS configuration (the etcd/consul/kubernetes
# section of patroni.yml):
grep -A 5 -E '^(etcd3?|consul|kubernetes):' /etc/patroni/patroni.yml

# Example etcd section:
# etcd:
#   host: 10.0.0.10:2379

# Test etcd connectivity:
etcdctl get /service/postgres-cluster --prefix

# For Consul:
consul members
consul kv get -recurse service/postgres-cluster

# For Kubernetes:
kubectl get endpoints
kubectl describe configmap postgres-cluster-config

# If the DCS is unavailable, fix it first:
# - etcd: restore etcd quorum (disk, networking, member health)
# - Consul: check that Consul has a leader
# - Kubernetes: check API server availability
```
## Step 3: Check Node Health

```bash
# Check PostgreSQL on each node:
ssh node-1 "systemctl status postgresql"
ssh node-2 "systemctl status postgresql"
ssh node-3 "systemctl status postgresql"
# Note: when Patroni manages PostgreSQL directly, the postgresql unit
# may be inactive even though the database is running under Patroni

# Check the Patroni process:
ssh node-1 "systemctl status patroni"

# Check PostgreSQL connectivity:
psql -h node-1 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node-2 -U postgres -c "SELECT pg_is_in_recovery();"

# Expected:
# - Exactly one node returns false (the primary)
# - The others return true (replicas)
# If all return true, no primary exists

# Check Patroni logs on each node:
ssh node-1 "journalctl -u patroni -n 50"
```
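The `pg_is_in_recovery()` expectation can be checked mechanically. A minimal sketch; the node names in the usage comment are assumptions:

```shell
#!/bin/bash
# Count how many nodes report pg_is_in_recovery() = false ('f').
# Exactly one 'f' means a primary exists; zero confirms the no-leader
# state; more than one indicates a split-brain.
count_primaries() {
  # stdin: one pg_is_in_recovery() result ('t' or 'f') per node
  grep -c '^f$'
}
# Collect results from a live cluster:
# for n in node-1 node-2 node-3; do
#   psql -h "$n" -U postgres -Atc "SELECT pg_is_in_recovery();"
# done | count_primaries
```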
## Step 4: Force Leader Election

```bash
# If the nodes are healthy but no leader exists, force an election:

# Method 1: Trigger a manual failover to a healthy candidate
# (switchover requires a running leader; with no leader, use failover):
patronictl failover postgres-cluster --candidate node-1 --force

# Method 2: Remove a stale leader key and let an election happen:
# For etcd:
etcdctl del /service/postgres-cluster/leader

# For Consul:
consul kv delete service/postgres-cluster/leader

# For Kubernetes (Patroni tracks the leader in the <scope>-leader object):
kubectl delete configmap postgres-cluster-leader

# Wait 10-30 seconds for the election, then check:
patronictl list
# Should now show a leader

# Method 3: Restart Patroni on all nodes to restart the election loop:
systemctl restart patroni
```
## Step 5: Check Node Connectivity

```bash
# Check the network between nodes:
ssh node-1 "ping -c 3 node-2"
ssh node-1 "ping -c 3 node-3"

# Check the PostgreSQL port:
ssh node-1 "nc -zv node-2 5432"
ssh node-1 "nc -zv node-3 5432"

# Check the Patroni API port:
ssh node-1 "nc -zv node-2 8008"

# Check the firewall:
ssh node-1 "iptables -L -n | grep 5432"

# Allow the PostgreSQL and Patroni API ports:
iptables -I INPUT -p tcp --dport 5432 -j ACCEPT
iptables -I INPUT -p tcp --dport 8008 -j ACCEPT

# Check for a network partition:
# Nodes in different partitions cannot elect a leader;
# every node must be able to reach every other node and the DCS
```
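The partition check can be automated as a full-mesh probe. A minimal sketch; the node names and the ssh/nc-based `probe()` are assumptions you can swap for your own reachability test:

```shell
#!/bin/bash
# Full-mesh port check: a leader election needs every node to reach
# every other node's PostgreSQL (5432) and Patroni REST (8008) ports.
NODES="${NODES:-node-1 node-2 node-3}"

probe() {  # $1=src $2=dst $3=port; default: ssh + nc reachability check
  ssh -o BatchMode=yes -o ConnectTimeout=2 "$1" "nc -z -w 2 $2 $3"
}

check_mesh() {
  local src dst port
  for src in $NODES; do
    for dst in $NODES; do
      [ "$src" = "$dst" ] && continue
      for port in 5432 8008; do
        probe "$src" "$dst" "$port" >/dev/null 2>&1 \
          || echo "BLOCKED: $src -> $dst:$port"
      done
    done
  done
}
# Usage: check_mesh   (no output means the mesh is fully connected)
```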
## Step 6: Check Raft Consensus

```bash
# If using Patroni's built-in Raft DCS (Patroni 2.x):

# Check the Raft configuration:
grep -A 10 raft /etc/patroni/patroni.yml

# Example:
# raft:
#   self_addr: 10.0.0.1:2222
#   partner_addrs: ['10.0.0.2:2222', '10.0.0.3:2222']

# Check Raft port connectivity:
nc -zv 10.0.0.2 2222
nc -zv 10.0.0.3 2222

# The Raft leader handles DCS operations;
# if Raft fails, nodes cannot coordinate.
# Restart Patroni on all nodes:
systemctl restart patroni

# Check logs for Raft errors:
journalctl -u patroni | grep -i raft
```
## Step 7: Recover Failed Node

```bash
# If a node's PostgreSQL data directory is corrupted:

# Check the PostgreSQL data directory:
ssh node-1 "ls -la /var/lib/postgresql/data/"

# If the data is missing or corrupted, reinitialize the member from
# the leader (wipes and rebuilds its data directory; requires a
# running leader):
patronictl reinit postgres-cluster node-1

# Or manually:
ssh node-1 "systemctl stop patroni"
ssh node-1 "rm -rf /var/lib/postgresql/data/*"
ssh node-1 "pg_basebackup -h node-2 -U postgres -D /var/lib/postgresql/data -X stream"
ssh node-1 "systemctl start patroni"

# Check that the node rejoined:
patronictl list

# If the cluster state in the DCS is beyond repair, remove it and let
# Patroni bootstrap fresh (destructive; prompts for confirmation):
patronictl remove postgres-cluster
systemctl restart patroni
```
## Step 8: Check Configuration

```bash
# Verify the Patroni configuration on all nodes:
cat /etc/patroni/patroni.yml

# Key settings must be consistent across nodes:
# - scope (cluster name)
# - DCS configuration
# - PostgreSQL parameters

# Scope (cluster name) - must be identical on every node:
# scope: postgres-cluster

# PostgreSQL parameters - must allow replication:
# postgresql:
#   parameters:
#     max_connections: 200
#     wal_level: replica
#     max_wal_senders: 10

# Check for config drift between nodes:
diff <(ssh node-1 cat /etc/patroni/patroni.yml) <(ssh node-2 cat /etc/patroni/patroni.yml)

# Fix inconsistencies by copying the correct config to all nodes:
scp /etc/patroni/patroni.yml node-2:/etc/patroni/patroni.yml
ssh node-2 "systemctl restart patroni"
```
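Pairwise diffs scale poorly past two nodes; comparing checksums across all members is a compact alternative. A minimal sketch; the node names and the ssh-based `config_sum()` fetch are assumptions:

```shell
#!/bin/bash
# Detect config drift by counting distinct checksums of patroni.yml
# across all nodes. One distinct checksum means the configs match.
config_sum() {  # $1 = node name; override if you fetch configs differently
  ssh -o BatchMode=yes "$1" "md5sum /etc/patroni/patroni.yml" | awk '{print $1}'
}

drift_check() {
  local distinct
  distinct=$(for n in "$@"; do config_sum "$n"; done | sort -u | grep -c .)
  if [ "$distinct" -gt 1 ]; then
    echo "DRIFT: $distinct distinct patroni.yml versions"
    return 1
  fi
  echo "OK: configs identical"
}
# Usage: drift_check node-1 node-2 node-3
```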
## Step 9: DCS Data Recovery

```bash
# If the DCS data is corrupted:

# Check the DCS keys:
# For etcd:
etcdctl get /service/postgres-cluster --prefix --keys-only

# Expected keys:
# /service/postgres-cluster/initialize
# /service/postgres-cluster/leader
# /service/postgres-cluster/members/node-1
# /service/postgres-cluster/members/node-2
# /service/postgres-cluster/optime/leader
# /service/postgres-cluster/config

# A missing leader key is normal while no leader exists. Do NOT create
# one by hand: Patroni writes the leader key with a TTL/lease, and a
# hand-made key that never expires can block elections indefinitely.
# Instead, delete any stale leader key and let Patroni re-elect:
etcdctl del /service/postgres-cluster/leader

# If member keys are missing, Patroni recreates them on restart:
systemctl restart patroni

# If the dynamic config is corrupted, delete it;
# Patroni recreates it from the local bootstrap configuration:
etcdctl del /service/postgres-cluster/config

# For Consul:
consul kv get -recurse service/postgres-cluster
consul kv delete service/postgres-cluster/leader
```
## Step 10: Monitor Cluster Health

```bash
# Create a monitoring script:
cat << 'EOF' > /usr/local/bin/monitor-patroni.sh
#!/bin/bash
echo "=== Patroni Cluster Status ==="
patronictl list

echo ""
echo "=== DCS Status ==="
etcdctl endpoint health

echo ""
echo "=== Leader Check ==="
LEADER=$(patronictl list | grep Leader | awk '{print $2}')
if [ -z "$LEADER" ]; then
  echo "ERROR: No leader in cluster!"
  # Send alert here
else
  echo "OK: Leader is $LEADER"
fi

echo ""
echo "=== Node Health ==="
for node in node-1 node-2 node-3; do
  curl -s "http://$node:8008/patroni" | jq '.state'
done
EOF
chmod +x /usr/local/bin/monitor-patroni.sh

# Patroni exposes Prometheus metrics:
curl http://localhost:8008/metrics

# Key metrics:
# patroni_dcs_last_seen    - last successful DCS contact
# patroni_postgres_running - whether PostgreSQL is running
# patroni_cluster_unlocked - 1 when no node holds the leader lock
# patroni_primary          - whether this node is the primary
```

Alert rule:

```yaml
- alert: PatroniNoLeader
  expr: max(patroni_cluster_unlocked) == 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Patroni cluster has no leader"
```
## Patroni Cluster No Leader Checklist

| Check | Command | Expected |
|---|---|---|
| Cluster list | patronictl list | One Leader |
| DCS health | etcdctl endpoint health | Healthy |
| Patroni service | systemctl status patroni | Running |
| DCS leader key | etcdctl get /service/postgres-cluster/leader | Exists after election |
| Network ports | nc -zv <host> 5432 | Connected |
| Node connectivity | ping <host> | Reachable |
| Patroni config | patroni.yml | Consistent on all nodes |
## Verify the Fix

```bash
# After resolving leader election:

# 1. Check the cluster has a leader
patronictl list
# One node shows the Leader role

# 2. Verify the primary accepts writes
psql -h <leader> -U postgres -c "CREATE TABLE test (id int);"
# Table created

# 3. Check replicas are syncing
psql -h node-2 -U postgres -c "SELECT pg_last_wal_receive_lsn();"
# LSN advancing

# 4. Test that switchover works now that a leader exists
patronictl switchover postgres-cluster --candidate node-2
# New leader elected

# 5. Monitor the DCS
etcdctl get /service/postgres-cluster/leader
# Leader key updates

# 6. Check all nodes are healthy
patronictl list -e
# All nodes running
```
## Related Issues
- [Fix Etcd Leader Election Failed](/articles/fix-etcd-leader-election-failed)
- [Fix PostgreSQL WAL Archive Stuck](/articles/fix-postgresql-wal-archive-stuck)
- [Fix PostgreSQL Connection Limit Exceeded](/articles/fix-postgresql-connection-limit-exceeded)