Introduction
Keepalived provides high availability by failing a virtual IP (VIP) over between nodes using the VRRP protocol. When failover breaks, the VIP does not migrate to the backup node after the primary fails, causing a service outage. Failover problems typically stem from configuration errors, network problems, a firewall blocking VRRP, or state synchronization failures.
Symptoms
Error indicators in Keepalived logs:
VRRP script failed
Lost advert after 3.205s
Ip address associated with VRID does not match
VRRP instance entering FAULT state
Gratuitous ARP not received
Observable indicators:
- VIP remains on failed primary node
- Backup node stays in BACKUP state, never becomes MASTER
- VIP not accessible after primary failure
- Network shows VIP on wrong host
- Clients unable to reach VIP
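To triage a long log quickly, the indicators listed above can be counted in one pass. The following is a minimal sketch, not part of Keepalived: `scan_keepalived_log` is a hypothetical helper, and the patterns are simply the phrases from this section, so they may need adjusting to the exact wording your Keepalived version logs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count failover-related messages in Keepalived log text
# read from stdin. Patterns are the phrases listed in this section, matched
# case-insensitively as fixed strings.
scan_keepalived_log() {
    local input pattern count
    input=$(cat)
    for pattern in \
        'VRRP script failed' \
        'Lost advert' \
        'does not match' \
        'FAULT state' \
        'Gratuitous ARP not received'
    do
        # grep -c prints 0 and exits non-zero when nothing matches
        count=$(printf '%s\n' "$input" | grep -ciF -- "$pattern" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

A typical invocation would be `journalctl -u keepalived --no-pager | scan_keepalived_log`; any non-zero count points at the matching cause in the next section.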
Common Causes
1. Firewall blocking VRRP - IP protocol 112 not allowed
2. VRID mismatch - different virtual_router_id on master and backup
3. Wrong interface - VRRP bound to an incorrect network interface
4. Authentication mismatch - auth_pass differs between nodes
5. Priority issues - backup priority higher than master
6. Script check failures - track_script marking the node as faulty
7. Network connectivity - nodes cannot reach each other
Step-by-Step Fix
Step 1: Check Keepalived Status
```bash
# Check Keepalived status on both nodes
systemctl status keepalived

# View Keepalived logs
journalctl -u keepalived -f

# Check process
ps aux | grep '[k]eepalived'

# Show recent log entries (if Keepalived logs to a file)
tail -n 50 /var/log/keepalived.log
```
Step 2: Check VIP Assignment
```bash
# On primary - check if VIP assigned
ip addr show | grep <VIP-address>

# On backup - should NOT have VIP
ip addr show | grep <VIP-address>

# Check ARP table for VIP
arp -n | grep <VIP-address>

# Show all IP addresses on the interface
ip addr show eth0
```
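A plain `grep` on the address also matches longer addresses that share the same prefix (for example, `192.168.1.10` matches `192.168.1.100`). The sketch below compares the address exactly. `has_vip` is a hypothetical helper that reads the one-line-per-address output of `ip -o -4 addr show` from stdin, so it can be run against live or saved output.

```bash
# Hypothetical helper: succeed only if the exact VIP appears in the output of
# `ip -o -4 addr show`, read from stdin.
has_vip() {
    local vip=$1
    # In `ip -o -4 addr` output, field 3 is "inet" and field 4 is ADDRESS/PREFIX.
    awk -v vip="$vip" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit found ? 0 : 1 }'
}
```

For example, `ip -o -4 addr show dev eth0 | has_vip 192.168.1.100 && echo "VIP is here"` would report whether the node currently holds the VIP; exactly one node should say so.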
Step 3: Check VRRP Advertisements
```bash
# Capture VRRP traffic (IP protocol 112)
tcpdump -i eth0 -n proto 112

# Watch for VRRP adverts
tcpdump -i eth0 -n -v vrrp

# Check if adverts received on backup
tcpdump -i eth0 -n 'host <master-ip> and proto 112'

# Check specific VRID (byte 1 of the VRRP header holds the VRID; byte 0 is version/type)
tcpdump -i eth0 -n 'vrrp[1] == 51'   # VRID 51
```
Step 4: Check Firewall Rules
```bash
# Check iptables for VRRP (protocol 112)
iptables -L -n | grep -w 112

# Or check firewalld
firewall-cmd --list-all | grep vrrp

# Add VRRP protocol rule
iptables -I INPUT -p 112 -j ACCEPT
iptables -I OUTPUT -p 112 -j ACCEPT

# For firewalld
firewall-cmd --add-protocol=vrrp --permanent
firewall-cmd --reload

# Also allow the VRRP multicast group
iptables -I INPUT -d 224.0.0.18/32 -j ACCEPT
```
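Grepping for `112` can match unrelated numbers such as ports or packet counters. As a rough alternative, the sketch below looks for an ACCEPT rule on the protocol itself. `vrrp_allowed` is a hypothetical helper; it assumes rule listings in the `-A CHAIN ... -j ACCEPT` form printed by `iptables -S`, with the protocol shown either as `112` or as `vrrp`, and it does not evaluate rule order or chain policy.

```bash
# Hypothetical helper: read `iptables -S` output on stdin and succeed if the
# given chain contains a rule that accepts IP protocol 112 (number or name).
vrrp_allowed() {
    local chain=$1
    grep -Eq -- "^-A ${chain} .*-p (112|vrrp)( .*)? -j ACCEPT( |\$)"
}
```

Usage would look like `iptables -S | vrrp_allowed INPUT || echo "no VRRP accept rule in INPUT"`. A match only shows that a rule exists; an earlier DROP rule can still shadow it.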
Step 5: Check Configuration on Both Nodes
```bash
# Show the configuration without comment lines
grep -v '^#' /etc/keepalived/keepalived.conf

# Compare both nodes - should differ only in state, priority, and router_id
diff /etc/keepalived/keepalived.conf <(ssh backup-node 'cat /etc/keepalived/keepalived.conf')
```
Step 6: Fix Configuration Mismatch
```conf
# Primary node configuration
global_defs {
    router_id LVS_PRIMARY
    vrrp_skip_check_adv_addr
}

vrrp_instance VI_1 {
    state MASTER                  # Primary is MASTER
    interface eth0
    virtual_router_id 51          # MUST match on all nodes
    priority 100                  # Primary has higher priority
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass mypassword123   # MUST match on all nodes (only the first 8 characters are used)
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}

# Backup node configuration
vrrp_instance VI_1 {
    state BACKUP                  # Backup is BACKUP
    interface eth0
    virtual_router_id 51          # SAME as primary
    priority 90                   # Lower than primary
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass mypassword123   # SAME as primary
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}
```
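A full `diff` of the two files is noisy because `state` and `priority` are expected to differ. The sketch below compares only the settings that must be identical on every node. Both function names are hypothetical, and the parsing assumes one setting per line with `#` starting a comment, so a password containing `#` would be truncated.

```bash
# Hypothetical helper: print the settings that must match on all nodes,
# one "key value" pair per line.
vrrp_shared_fields() {
    # Strip comments, then keep only the shared keys.
    sed 's/#.*//' "$1" |
        awk '$1 ~ /^(virtual_router_id|advert_int|auth_type|auth_pass)$/ { print $1, $2 }'
}

# Succeed if two config files agree on every shared field; otherwise print
# the differing lines in diff format.
vrrp_configs_match() {
    diff <(vrrp_shared_fields "$1") <(vrrp_shared_fields "$2")
}
```

With the backup's file copied locally (for example to a hypothetical `/tmp/backup.conf`), `vrrp_configs_match /etc/keepalived/keepalived.conf /tmp/backup.conf` prints nothing when the nodes agree.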
Step 7: Fix Track Script Issues
```bash
# Check track scripts
grep -A5 track_script /etc/keepalived/keepalived.conf

# Check if script returns success
/etc/keepalived/check_script.sh
echo $?   # Should return 0 for success

# Run script manually with tracing
bash -x /etc/keepalived/check_haproxy.sh
```
```conf
# Configure track script correctly
vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight -20    # Decrease priority by 20 if script fails
    fall 3        # 3 failures before applying weight
    rise 2        # 2 successes before removing weight
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    track_script {
        check_haproxy    # Reference the script
    }

    virtual_ipaddress {
        192.168.1.100
    }
}
```
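Whether a failing script actually causes failover depends on simple arithmetic: the master's priority plus the weight must end up below the backup's priority. With the values used in this guide, 100 + (-20) = 80, which is below the backup's 90. The sketch below encodes that check. Both function names are hypothetical, and the clamp to the range 1..254 is an assumption about how Keepalived bounds the effective priority.

```bash
# Hypothetical helper: effective priority after a failed track script applies
# its weight. Assumes the result is kept within 1..254.
effective_priority() {
    local base=$1 weight=$2 prio
    prio=$(( base + weight ))
    (( prio < 1 )) && prio=1
    (( prio > 254 )) && prio=254
    echo "$prio"
}

# Succeed if the master's degraded priority drops below the backup's priority,
# i.e. a script failure on the master would hand the VIP to the backup.
script_failure_triggers_failover() {
    local master=$1 weight=$2 backup=$3
    (( $(effective_priority "$master" "$weight") < backup ))
}
```

For example, `script_failure_triggers_failover 100 -5 90` fails, because 95 is still above 90 and the master would keep the VIP even with the service down.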
Step 8: Force Failover for Testing
```bash
# Stop Keepalived on primary
systemctl stop keepalived

# Check VIP moved to backup (run on backup)
ip addr show eth0 | grep 192.168.1.100

# Start Keepalived on primary
systemctl start keepalived

# VIP should return to primary after startup (unless nopreempt is configured)
```
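When the primary dies without a clean shutdown, the backup waits for its master-down timer before taking over. Under VRRPv2 (RFC 3768) that interval is `3 * advert_int + (256 - priority) / 256` seconds. The sketch below computes it; `master_down_interval` is a hypothetical helper, and the formula is the VRRPv2 one, so VRRPv3 setups will differ slightly.

```bash
# Hypothetical helper: VRRPv2 master-down interval in seconds for a backup
# with the given advert_int and priority (RFC 3768):
#   skew_time            = (256 - priority) / 256
#   master_down_interval = 3 * advert_int + skew_time
master_down_interval() {
    awk -v adv="$1" -v prio="$2" \
        'BEGIN { printf "%.3f\n", 3 * adv + (256 - prio) / 256 }'
}
```

For the backup in this guide, `master_down_interval 1 90` gives 3.648, so a hard failure of the primary should be detected in under four seconds. A clean `systemctl stop` is usually faster, because the master announces that it is giving up the role.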
Step 9: Verify the Fix
```bash
# Monitor Keepalived state changes
journalctl -u keepalived -f | grep --line-buffered -iE 'transition|entering'

# Watch VIP assignment
watch -n 1 'ip addr show eth0 | grep 192.168.1.100'

# Test from client
ping 192.168.1.100

# Monitor ARP updates
arping -I eth0 192.168.1.100
```
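For scripted verification it helps to reduce the log to a single answer: which state did the node enter last? The sketch below does that. `last_vrrp_state` is a hypothetical helper, and it assumes the `Entering <STATE> STATE` wording, which may differ between Keepalived versions.

```bash
# Hypothetical helper: print the most recent VRRP state (MASTER, BACKUP or
# FAULT) found in Keepalived log lines read from stdin. Prints nothing if no
# state change is found.
last_vrrp_state() {
    grep -oE 'Entering (MASTER|BACKUP|FAULT) STATE' | tail -n 1 | awk '{ print $2 }'
}
```

After a forced failover, `journalctl -u keepalived --no-pager | last_vrrp_state` should print `MASTER` on the backup node and `BACKUP` again once the primary has reclaimed the VIP.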
Advanced Diagnosis
Debug VRRP State
```bash
# Enable debug logging (Debian/Ubuntu path; RHEL-family systems use /etc/sysconfig/keepalived)
sed -i 's/DAEMON_ARGS="-d"/DAEMON_ARGS="-d -D"/' /etc/default/keepalived
systemctl restart keepalived

# View detailed logs
journalctl -u keepalived -n 100

# Check for specific errors (-E is required for the | alternation)
grep -iE "fault|error|mismatch" /var/log/keepalived.log
```
Check Multicast Connectivity
```bash
# Test multicast reception
ping 224.0.0.18

# Check multicast routes
ip route show | grep 224

# Add multicast route if missing
ip route add 224.0.0.0/4 dev eth0
```
Verify Gratuitous ARP
```bash
# Capture ARP announcements for the VIP (a gratuitous ARP has the VIP as both sender and target)
tcpdump -i eth0 -n -e 'arp and host 192.168.1.100'

# Send manual gratuitous ARP
arping -I eth0 -c 3 -U 192.168.1.100

# Check ARP cache on neighbors
ssh neighbor 'arp -n | grep 192.168.1.100'
```
Multiple VIPs Configuration
```conf
# Multiple VIPs in single instance
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
        192.168.1.101/24 dev eth0
    }

    virtual_ipaddress_excluded {
        192.168.1.200/32 dev lo    # Not carried in VRRP adverts, but still added and removed with the VIPs
    }
}
```
Unicast Mode (Non-Multicast)
```conf
# Use unicast if multicast blocked
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    unicast_src_ip 192.168.1.10    # This node's IP
    unicast_peer {
        192.168.1.11               # Backup node IP
    }

    virtual_ipaddress {
        192.168.1.100
    }
}
```
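In unicast mode each node must list the other node's source address as a peer, and a one-sided entry quietly breaks failover. The sketch below checks that two configs mirror each other. All three function names are hypothetical, and the parsing assumes Keepalived's `unicast_peer { ... }` block is written with the opening keyword on its own line and one address per line.

```bash
# Hypothetical helper: print the unicast_src_ip of a config file.
get_unicast_src() {
    sed 's/#.*//' "$1" | awk '$1 == "unicast_src_ip" { print $2 }'
}

# Hypothetical helper: print the addresses in the unicast_peer block, one per line.
list_unicast_peers() {
    sed 's/#.*//' "$1" | awk '
        $1 == "unicast_peer" { inside = 1; next }
        inside && /}/        { inside = 0 }
        inside && NF         { print $1 }'
}

# Succeed if each node's source address appears in the other node's peer list.
unicast_mirrored() {
    list_unicast_peers "$2" | grep -qxF -- "$(get_unicast_src "$1")" &&
    list_unicast_peers "$1" | grep -qxF -- "$(get_unicast_src "$2")"
}
```

Running `unicast_mirrored primary.conf backup.conf` on local copies of both files (hypothetical names) returns non-zero when either side is missing the other's address.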
Common Pitfalls
- VRID differs between nodes - most common cause of failover failure
- auth_pass mismatch - authentication fails and adverts are rejected
- Backup priority higher than master - backup thinks it should be master
- Wrong interface name - VRRP runs on the wrong interface
- Firewall blocking protocol 112 - VRRP adverts never arrive
- Track script weight mis-sized - the weight must push the master's priority below the backup's, otherwise no failover occurs
- Check script returns non-zero - marks the instance as FAULT, or lowers its priority when a weight is set
Best Practices
```conf
# Production Keepalived configuration
global_defs {
    router_id LVS_PRIMARY
    enable_script_security
    script_user root
    vrrp_strict                        # Follow VRRP RFC strictly; note that strict mode ignores the authentication block
    vrrp_garp_master_refresh 60
    vrrp_garp_master_refresh_repeat 2
}

vrrp_script check_service {
    script "/usr/local/bin/check_service.sh"
    interval 3
    weight -20
    fall 3
    rise 2
    user root
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass secret123
    }

    track_script {
        check_service weight -20
    }

    track_interface {
        eth0
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }

    notify_master "/usr/local/bin/vip_master.sh"
    notify_backup "/usr/local/bin/vip_backup.sh"
    notify_fault "/usr/local/bin/vip_fault.sh"
}
```
```bash
#!/bin/bash
# Check script: /usr/local/bin/check_service.sh
# Exits 0 if haproxy is active, 1 otherwise
systemctl is-active --quiet haproxy && exit 0 || exit 1
```
Related Issues
- HAProxy Backend Down
- HAProxy Health Check Failing
- F5 BIG-IP Pool Member Down
- AWS ALB Target Unhealthy