What's Actually Happening

The Corosync cluster fails to form: nodes cannot communicate with each other or establish cluster membership, so the cluster never becomes quorate.

The Error You'll See

```bash
$ corosync-quorumtool -s

Quorum information
------------------
Date:             Mon Jan 1 12:00:00 2024
Quorum provider:  corosync_votequorum
Nodes configured: 3
Nodes expected:   3
Quorate:          No
```

No membership:

```bash
Membership information
----------------------
  Nodeid      Rings    Address
  No members
```

Connection error:

```bash
[TOTEM ] Token has not been received in 1180 ms
```

Authentication error:

```bash
[QB  ] error: libqb authentication failed
```

Why This Happens

  1. Network connectivity - Nodes cannot reach each other
  2. Configuration mismatch - Different corosync.conf on nodes
  3. Port blocking - Firewall blocks UDP 5405/5406
  4. Authentication failure - Wrong cluster key
  5. Multicast issues - Multicast not working
  6. Node ID conflict - Duplicate node IDs
  7. Interface binding - Wrong interface configured

Step 1: Check Corosync Status

```bash
# Check corosync service:
systemctl status corosync

# Check process:
ps aux | grep corosync

# Check quorum:
corosync-quorumtool -s

# Monitor quorum and membership (runs until interrupted with Ctrl+C):
corosync-quorumtool -m

# List nodes:
corosync-quorumtool -l

# Check corosync configuration:
cat /etc/corosync/corosync.conf

# Follow logs:
journalctl -u corosync -f

# Corosync log:
tail -f /var/log/corosync/corosync.log

# Check version:
corosync -v

# Check ring status:
corosync-cfgtool -s

# Reset redundant ring state cluster-wide once a fault has been fixed:
corosync-cfgtool -r
```

Step 2: Check Network Connectivity

```bash
# Test connectivity between nodes:
ping -c 3 node2
ping -c 3 node3

# Check corosync ports:
# UDP 5405 - Multicast/Unicast
# UDP 5406 - Quorum

# Test ports (UDP is connectionless, so "succeeded" only means
# that no ICMP rejection came back):
nc -zuv node2 5405
nc -zuv node2 5406

# Check firewall:
iptables -L -n | grep 5405
ufw status | grep 5405

# Allow corosync ports:
iptables -I INPUT -p udp --dport 5405 -j ACCEPT
iptables -I INPUT -p udp --dport 5406 -j ACCEPT

# Using ufw:
ufw allow 5405/udp
ufw allow 5406/udp

# Using firewalld:
firewall-cmd --add-port=5405-5406/udp --permanent
firewall-cmd --reload

# Check network interface:
ip addr show

# Check if interface exists:
ip link show eth1

# Check IP address:
ip addr show eth1 | grep inet

# Test multicast:
# On node2 (start the receiver first; it joins the group on eth1):
socat UDP4-RECV:5405,ip-add-membership=239.0.0.1:eth1 -

# On node1 (type a line; it should appear on node2):
socat - UDP4-DATAGRAM:239.0.0.1:5405,broadcast
```

Step 3: Check Configuration

```bash
# View configuration:
cat /etc/corosync/corosync.conf

# Compare on all nodes:
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh "$node" "cat /etc/corosync/corosync.conf"
done

# Key sections to check:
# totem    - Transport configuration
# nodelist - Node definitions
# quorum   - Quorum settings
# logging  - Log configuration

# Verify the address and ring status the running totem layer is using:
corosync-cfgtool -s

# Verify nodelist:
corosync-quorumtool -l

# Check bind address matches interface IP:
grep bindnetaddr /etc/corosync/corosync.conf

# Common issues:
# 1. Different cluster names
# 2. Different node IDs
# 3. Wrong bind address
# 4. Mismatched transport mode

# Push this node's corosync.conf to all nodes:
pcs cluster sync

# Or manually copy:
scp /etc/corosync/corosync.conf node2:/etc/corosync/
```
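
The comparison loop above prints every file and leaves the diffing to the reader. A minimal sketch of an automated check follows, assuming each node's `corosync.conf` has already been copied to a local directory (for example with `scp`). The function name `configs_match` is hypothetical and not part of corosync or pcs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given file has identical content.
# Usage: configs_match FILE [FILE...]
# Returns 0 if all files match, 1 if they differ, 2 on a usage or read error.
configs_match() {
    [ "$#" -ge 1 ] || return 2
    local sums
    # sha256sum prints "<hash>  <file>" per file and fails if a file is unreadable.
    sums=$(sha256sum -- "$@") || return 2
    # Keep only the hash column and count the distinct values.
    [ "$(printf '%s\n' "$sums" | awk '{print $1}' | sort -u | wc -l)" -eq 1 ]
}
```

For example, after `scp node2:/etc/corosync/corosync.conf /tmp/node2.conf`, running `configs_match /etc/corosync/corosync.conf /tmp/node2.conf` returns a non-zero status when the two copies differ.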

Step 4: Fix Totem Configuration

```bash
# Check the ring status reported by the running totem layer:
corosync-cfgtool -s

# Totem configuration (in /etc/corosync/corosync.conf):
totem {
  version: 2
  cluster_name: mycluster
  transport: udpu
  crypto_cipher: aes256
  crypto_hash: sha256
}

# For multicast:
totem {
  transport: udp
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    mcastaddr: 239.0.0.1
    mcastport: 5405
  }
}

# For unicast (udpu):
totem {
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    mcastport: 5405
  }
}

# Check token timeout (milliseconds).
# Increase if the network is slow, e.g. 10 seconds:
token: 10000

# Check consensus timeout (should be 1.2 * token):
consensus: 12000

# Check join timeout:
join: 60

# Check miss count:
token_retransmits_before_loss_const: 10

# Reload corosync:
pcs cluster reload corosync
# Or:
systemctl reload corosync
```
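
When the token timeout is raised, the consensus value should follow the 1.2 * token rule noted in the comments above. A small sketch of that arithmetic is below. The function name `consensus_for_token` is an assumption made for illustration, and it expects a whole number of milliseconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a consensus timeout (ms) for a token timeout (ms),
# using the "consensus should be 1.2 * token" rule of thumb.
consensus_for_token() {
    local token_ms=$1
    # Reject anything that is not a plain non-negative integer.
    case $token_ms in
        ''|*[!0-9]*) echo "usage: consensus_for_token TOKEN_MS" >&2; return 2 ;;
    esac
    # Shell arithmetic is integer-only: multiply by 12, then divide by 10.
    echo $(( token_ms * 12 / 10 ))
}
```

Calling `consensus_for_token 10000` gives the `consensus: 12000` value used in this step.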

Step 5: Check Nodelist Configuration

```bash
# Check nodelist:
corosync-quorumtool -l

# Nodelist configuration:
nodelist {
  node {
    ring0_addr: node1
    nodeid: 1
  }
  node {
    ring0_addr: node2
    nodeid: 2
  }
  node {
    ring0_addr: node3
    nodeid: 3
  }
}

# Verify node IDs are unique:
grep nodeid /etc/corosync/corosync.conf

# Check node addresses resolve (the address is the second field of "ring0_addr: <addr>"):
for addr in $(grep ring0_addr /etc/corosync/corosync.conf | awk '{print $2}'); do
  echo "Testing $addr:"
  ping -c 2 "$addr"
done

# Check DNS resolution:
nslookup node1
dig node1

# Or use IP addresses:
ring0_addr: 192.168.1.10

# Check hostname:
hostname

# Check /etc/hosts:
grep -E "node1|node2|node3" /etc/hosts

# Add missing entries:
echo "192.168.1.10 node1" >> /etc/hosts
echo "192.168.1.11 node2" >> /etc/hosts
```
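
Scanning `grep nodeid` output by eye gets error-prone once the nodelist grows. The sketch below prints only the IDs that occur more than once. It assumes one directive per line in the form `nodeid: <number>`, as in the nodelist example above. The function name `duplicate_nodeids` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every nodeid that appears more than once
# in a corosync.conf-style file (default: /etc/corosync/corosync.conf).
duplicate_nodeids() {
    local conf=${1:-/etc/corosync/corosync.conf}
    # awk strips leading whitespace, so indented "nodeid: 2" lines match.
    # sort -n groups equal IDs together; uniq -d keeps only repeated ones.
    awk '$1 == "nodeid:" {print $2}' "$conf" | sort -n | uniq -d
}
```

Empty output means every node ID is unique. Any printed value is an ID shared by at least two `node` entries.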

Step 6: Check Authentication

```bash
# Check authkey:
ls -la /etc/corosync/authkey

# Authkey should be same on all nodes:
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh "$node" "md5sum /etc/corosync/authkey"
done

# Generate new authkey:
corosync-keygen

# Copy to other nodes:
scp /etc/corosync/authkey node2:/etc/corosync/
scp /etc/corosync/authkey node3:/etc/corosync/

# Set permissions:
chmod 400 /etc/corosync/authkey
chown root:root /etc/corosync/authkey

# Or use pcs (this authenticates pcsd between the nodes;
# pcs 0.10 and later use "pcs host auth" instead):
pcs cluster auth node1 node2 node3 -u hacluster -p password

# Check crypto configuration:
grep -E "crypto_cipher|crypto_hash" /etc/corosync/corosync.conf

# For no encryption (testing only), remove or comment out the crypto
# settings, or set them to none in corosync.conf:
# crypto_cipher: none
# crypto_hash: none
```
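
The `chmod 400` step above is easy to skip on a node that received the key via `scp`. A minimal sketch of a permission check follows, assuming GNU `stat` (its `-c '%a'` format prints the octal mode). The function name `authkey_mode_ok` is hypothetical, and the sketch does not check file ownership.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the key file exists and has mode 400.
# Returns 0 for mode 400, 1 for any other mode, 2 if the file cannot be read.
authkey_mode_ok() {
    local key=${1:-/etc/corosync/authkey}
    local mode
    # GNU stat: %a prints the permission bits in octal, e.g. 400 or 644.
    mode=$(stat -c '%a' -- "$key" 2>/dev/null) || return 2
    [ "$mode" = "400" ]
}
```

Run on each node, `authkey_mode_ok || echo "fix authkey permissions"` flags a key file whose mode drifted from 400.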

Step 7: Check Quorum Configuration

```bash
# Check quorum settings:
corosync-quorumtool -s

# Quorum configuration:
quorum {
  provider: corosync_votequorum
  expected_votes: 2
  two_node: 1
  wait_for_all: 0
  last_man_standing: 1
  auto_tie_breaker: 0
}

# For 2-node cluster:
quorum {
  provider: corosync_votequorum
  two_node: 1
  expected_votes: 2
}

# For 3-node cluster:
quorum {
  provider: corosync_votequorum
  expected_votes: 3
}

# Check quorum configuration:
pcs quorum config

# Update expected votes in the live cluster:
pcs quorum expected-votes 2

# Update two_node: "pcs quorum update" does not accept two_node,
# so set it in the quorum section of corosync.conf:
#   two_node: 1

# Check last_man_standing:
grep last_man_standing /etc/corosync/corosync.conf

# Reload:
pcs cluster reload corosync
```
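
To judge whether an `expected_votes` value can ever be reached, it helps to know the threshold: votequorum requires a strict majority, that is floor(expected_votes / 2) + 1 votes, unless `two_node: 1` is set, in which case a two-node cluster stays quorate with a single vote. The sketch below covers only the plain majority rule. The function name `votes_needed_for_quorum` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: votes required for quorum under the plain majority rule.
# Does NOT model two_node, last_man_standing, or auto_tie_breaker.
votes_needed_for_quorum() {
    local expected_votes=$1
    # Integer division rounds down, so this is floor(expected / 2) + 1.
    echo $(( expected_votes / 2 + 1 ))
}
```

With `expected_votes: 3`, two nodes must be up. With `expected_votes: 2` and no `two_node` setting, both nodes must be up, which is why the 2-node example above enables `two_node`.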

Step 8: Debug Cluster Formation

```bash
# Enable debug logging:
# In corosync.conf:
logging {
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  debug: on
  timestamp: on
}

# Restart:
systemctl restart corosync

# Watch logs:
tail -f /var/log/corosync/corosync.log

# Check for specific errors (-E enables the | alternation):
grep -iE "error|fail|token" /var/log/corosync/corosync.log

# Common log messages:
# 1. "Token has not been received" - Network issue
# 2. "Member joined" - Normal
# 3. "Member left" - Node crash/network
# 4. "Not endorsing" - Quorum issue

# Check ring status:
corosync-cfgtool -s

# Expected output:
# RING ID 0
#     id      = 192.168.1.10
#     status  = ring 0 active with no faults

# Reload corosync.conf on all nodes:
corosync-cfgtool -R

# Monitor membership (runs until interrupted with Ctrl+C):
corosync-quorumtool -m

# Check node count:
corosync-quorumtool -l
```
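
The list of common log messages above can be turned into a quick triage filter. The sketch below maps a log line to one of the four causes in that list. The function name `classify_corosync_line` and the category labels are made up for illustration, and the match strings are the ones quoted in this step rather than an exhaustive list of corosync messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a single log line using the message list above.
classify_corosync_line() {
    case $1 in
        *"Token has not been received"*) echo "network" ;;
        *"Member joined"*)               echo "normal" ;;
        *"Member left"*)                 echo "node-or-network" ;;
        *"Not endorsing"*)               echo "quorum" ;;
        *)                               echo "unclassified" ;;
    esac
}
```

One way to use it is to feed it the log line by line, for example `while IFS= read -r line; do classify_corosync_line "$line"; done < /var/log/corosync/corosync.log | sort | uniq -c`, which gives a count per category.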

Step 9: Fix Common Issues

```bash
# Cluster not forming:

# 1. Check all nodes have same config:
md5sum /etc/corosync/corosync.conf
# Should match on all nodes

# 2. Check authkey matches:
md5sum /etc/corosync/authkey
# Should match on all nodes

# 3. Check network:
ping -c 3 node2
nc -zuv node2 5405

# 4. Check firewall allows ports (ufw uses a colon for port ranges):
ufw allow 5405:5406/udp

# 5. Restart corosync on all nodes:
systemctl restart corosync

# Token timeout errors:

# 1. Increase token timeout:
# In corosync.conf:
token: 20000

# 2. Check network latency:
ping -c 10 node2 | grep rtt

# 3. Check for packet loss:
ping -c 100 node2 | grep loss

# Multicast not working:

# 1. Switch to unicast:
# Change "transport: udp" to "transport: udpu"

# 2. Add unicast addresses:
nodelist {
  node {
    ring0_addr: node1
    nodeid: 1
  }
}

# Node cannot join:

# 1. Check nodeid unique
# 2. Check ring0_addr correct
# 3. Check authkey matches
# 4. Check cluster_name matches
```

Step 10: Corosync Verification Script

```bash
# Create verification script:
cat << 'EOF' > /usr/local/bin/check-corosync.sh
#!/bin/bash

echo "=== Corosync Service ==="
systemctl status corosync 2>/dev/null | head -5
# The pipe above hides systemctl's exit status, so test the state separately.
systemctl is-active --quiet corosync || echo "Service not running"

echo ""
echo "=== Process ==="
ps aux | grep corosync | grep -v grep || echo "No corosync process"

echo ""
echo "=== Quorum Status ==="
corosync-quorumtool -s 2>/dev/null || echo "Cannot get quorum status"

echo ""
echo "=== Membership ==="
# -l lists the nodes once; -m would keep monitoring and never return.
corosync-quorumtool -l 2>/dev/null || echo "Cannot get membership"

echo ""
echo "=== Ring Status ==="
corosync-cfgtool -s 2>/dev/null || echo "Cannot get ring status"

echo ""
echo "=== Configuration ==="
head -30 /etc/corosync/corosync.conf 2>/dev/null || echo "No config file"

echo ""
echo "=== Nodelist ==="
grep -A 5 "nodelist" /etc/corosync/corosync.conf 2>/dev/null || echo "No nodelist"

echo ""
echo "=== Authkey ==="
ls -la /etc/corosync/authkey 2>/dev/null || echo "No authkey"

echo ""
echo "=== Network Connectivity ==="
# The address is the second field of "ring0_addr: <addr>".
for addr in $(grep ring0_addr /etc/corosync/corosync.conf 2>/dev/null | awk '{print $2}'); do
  echo "Node: $addr"
  ping -c 2 -W 2 "$addr" 2>&1 | tail -2
  nc -zuv "$addr" 5405 2>&1 || true
done

echo ""
echo "=== Firewall ==="
iptables -L -n 2>/dev/null | grep 5405 || ufw status 2>/dev/null | grep 5405 || echo "Check firewall manually"

echo ""
echo "=== Cluster Nodes ==="
pcs status nodes 2>/dev/null || echo "pcs not available"

echo ""
echo "=== Recent Logs ==="
journalctl -u corosync --no-pager -n 10 2>/dev/null || tail -n 10 /var/log/corosync/corosync.log 2>/dev/null || echo "No logs"

echo ""
echo "=== Recommendations ==="
echo "1. Ensure all nodes can reach each other on port 5405/5406"
echo "2. Verify corosync.conf is identical on all nodes"
echo "3. Check authkey matches on all nodes"
echo "4. Allow UDP ports 5405-5406 in firewall"
echo "5. Use unicast (udpu) if multicast issues"
echo "6. Increase token timeout for slow networks"
echo "7. Check node IDs are unique"
EOF

chmod +x /usr/local/bin/check-corosync.sh

# Usage:
/usr/local/bin/check-corosync.sh
```

Corosync Cluster Formation Checklist

| Check | Expected |
|---|---|
| Service running | corosync active on all nodes |
| Network reachable | All nodes ping each other |
| Ports open | UDP 5405, 5406 allowed |
| Config match | Same corosync.conf on all nodes |
| Authkey match | Same authkey on all nodes |
| Node IDs | Unique per node |
| Quorum | Quorate after formation |

Verify the Fix

```bash
# After fixing the Corosync cluster

# 1. Check service
systemctl status corosync
# Expected: active (running)

# 2. Check quorum
corosync-quorumtool -s
# Expected: Quorate: Yes

# 3. Check membership
corosync-quorumtool -l
# Expected: all nodes listed

# 4. Check ring
corosync-cfgtool -s
# Expected: ring active with no faults

# 5. Check logs
journalctl -u corosync -f
# Expected: no errors

# 6. Check cluster
pcs status
# Expected: all nodes online
```

  • [Fix Pacemaker Resource Not Starting](/articles/fix-pacemaker-resource-not-starting)
  • [Fix Keepalived VIP Not Failover](/articles/fix-keepalived-vip-not-failover)
  • [Fix etcd Cluster Unhealthy](/articles/fix-etcd-cluster-unhealthy)