Fix Keepalived VIP Not Failing Over

Introduction

Keepalived provides high availability by moving a virtual IP (VIP) between nodes using VRRP. When failover breaks, the VIP does not migrate to the backup node after the primary fails, and the service becomes unreachable. Failover issues can result from configuration errors, network problems, a firewall blocking VRRP, or state synchronization failures.

Symptoms

Error indicators in Keepalived logs:

```text
VRRP script failed
Lost advert after 3.205s
Ip address associated with VRID does not match
VRRP instance entering FAULT state
Gratuitous ARP not received
```

Observable indicators:

- VIP remains on the failed primary node
- Backup node stays in BACKUP state, never becomes MASTER
- VIP not accessible after primary failure
- Network shows VIP on the wrong host
- Clients unable to reach the VIP
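To see at a glance which of the log indicators above are present, the patterns can be counted in one pass. The following is a minimal sketch, not a definitive tool: `count_vrrp_indicators` is a hypothetical helper name, and the patterns are simply the strings listed above, so they may need adjusting to the exact wording your Keepalived version logs.

```bash
#!/usr/bin/env bash
# count_vrrp_indicators (hypothetical helper): reads log text on stdin and
# prints "<count> <pattern>" for each indicator listed in the Symptoms section.
count_vrrp_indicators() {
    local log pattern count
    log=$(cat)   # read stdin once so every pattern scans the same text
    for pattern in \
        "VRRP script failed" \
        "Lost advert" \
        "does not match" \
        "FAULT state" \
        "Gratuitous ARP not received"
    do
        # grep -c prints the number of matching lines (0 when none match);
        # -F treats the pattern as a fixed string, -i ignores case
        count=$(printf '%s\n' "$log" | grep -ciF -- "$pattern")
        printf '%s %s\n' "$count" "$pattern"
    done
}
```

One way to use it is `journalctl -u keepalived --no-pager | count_vrrp_indicators`; a non-zero count points to the matching cause in the list that follows.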

Common Causes

1. Firewall blocking VRRP - Protocol 112 not allowed
2. VRID mismatch - Different VRID on master and backup
3. Wrong interface - VRRP on incorrect network interface
4. Authentication mismatch - auth_pass different between nodes
5. Priority issues - Backup priority higher than master
6. Script check failures - track_script marking node as fault
7. Network connectivity - Nodes cannot reach each other

Step-by-Step Fix

Step 1: Check Keepalived Status

```bash
# Check Keepalived status on both nodes
systemctl status keepalived

# Follow Keepalived logs
journalctl -u keepalived -f

# Check that the process is running
ps aux | grep '[k]eepalived'

# Show recent log entries (only if Keepalived logs to this file on your system)
tail -n 50 /var/log/keepalived.log
```

Step 2: Check VIP Assignment

```bash
# On primary - check if VIP is assigned (replace the placeholder with your VIP)
ip addr show | grep '<VIP-address>'

# On backup - should NOT have VIP
ip addr show | grep '<VIP-address>'

# Check ARP table for VIP
arp -n | grep '<VIP-address>'

# Show all IP addresses on the VRRP interface
ip addr show eth0
```
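A plain `grep` for an address can give false positives, because `192.168.1.10` is also a substring of `192.168.1.100`. The sketch below checks for an exact match instead. `has_vip` is a hypothetical helper, and it assumes the one-line-per-address output of `ip -o -4 addr show`, where the fourth field is `address/prefix`.

```bash
#!/usr/bin/env bash
# has_vip VIP (hypothetical helper): reads `ip -o -4 addr show` output on
# stdin and exits 0 only if VIP is assigned as an exact address.
has_vip() {
    local vip=$1
    awk -v vip="$vip" '
        $3 == "inet" {
            split($4, parts, "/")          # strip the /prefix length
            if (parts[1] == vip) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `ip -o -4 addr show dev eth0 | has_vip 192.168.1.100 && echo "VIP is here"` can be run on each node to confirm that exactly one of them holds the VIP.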

Step 3: Check VRRP Advertisements

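```bash
# Capture VRRP traffic (IP protocol 112)
tcpdump -n -i eth0 proto 112

# Watch for VRRP adverts with more detail
tcpdump -n -v -i eth0 vrrp

# Check if adverts from the master are received on backup
tcpdump -n -i eth0 host '<master-ip>' and proto 112

# Check a specific VRID: the VRID is the second byte of the VRRP header,
# so index 1 (byte 0 holds version and type). 51 matches the example config.
tcpdump -n -i eth0 'proto 112 and vrrp[1] == 51'
```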
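In a healthy pair only the current master sends adverts, so two senders for the same VRID usually mean the nodes cannot hear each other. The sketch below summarizes a capture into unique VRID and sender pairs. `advert_senders` is a hypothetical helper, and it assumes non-verbose `tcpdump -n` lines of the form `IP <src> > 224.0.0.18: VRRPv2, Advertisement, vrid <n>, ...`; other tcpdump versions or `-v` output may need a different parser.

```bash
#!/usr/bin/env bash
# advert_senders (hypothetical helper): reads non-verbose `tcpdump -n` output
# on stdin and prints one "<vrid> <source-ip>" line per unique sender.
advert_senders() {
    awk '
        /VRRPv[23], Advertisement/ {
            src = ""; vrid = ""
            for (i = 1; i < NF; i++) {
                if ($i == "IP" && src == "") src = $(i + 1)
                if ($i == "vrid") { vrid = $(i + 1); sub(/,$/, "", vrid) }
            }
            # print each (vrid, source) pair only the first time it is seen
            if (src != "" && vrid != "" && !seen[vrid, src]++) print vrid, src
        }
    ' | LC_ALL=C sort
}
```

Running `timeout 10 tcpdump -n -l -i eth0 proto 112 2>/dev/null | advert_senders` for a few seconds and getting two lines for one VRID would suggest checking the firewall and multicast path before touching the configuration.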

Step 4: Check Firewall Rules

```bash
# Check iptables for VRRP (protocol 112)
iptables -L -n | grep -E '112|vrrp'

# Or check firewalld
firewall-cmd --list-all | grep vrrp

# Add VRRP protocol rules
iptables -I INPUT -p 112 -j ACCEPT
iptables -I OUTPUT -p 112 -j ACCEPT

# For firewalld
firewall-cmd --add-protocol=vrrp --permanent
firewall-cmd --reload

# Also allow multicast (VRRP adverts go to 224.0.0.18)
iptables -I INPUT -d 224.0.0.0/4 -j ACCEPT
```

Step 5: Check Configuration on Both Nodes

```bash
# Show the configuration without comment lines
grep -v '^#' /etc/keepalived/keepalived.conf

# Compare both nodes - settings must match except state, priority, and router_id
diff /etc/keepalived/keepalived.conf <(ssh backup-node 'cat /etc/keepalived/keepalived.conf')
```
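A raw `diff` also flags lines that are supposed to differ. As a rough alternative, the sketch below compares only the settings that must agree and checks that the priorities differ. Both function names are hypothetical, and the parser is deliberately naive: it reads the first occurrence of each keyword, so it only suits configs with a single `vrrp_instance`, and it would mis-read an `auth_pass` containing `#` or `!`.

```bash
#!/usr/bin/env bash
# vrrp_setting FILE KEY (hypothetical helper): print the first value of KEY,
# ignoring comments (keepalived.conf accepts both # and ! as comment markers).
vrrp_setting() {
    local file=$1 key=$2
    awk -v key="$key" '
        { sub(/[#!].*/, "") }
        $1 == key { print $2; exit }
    ' "$file"
}

# compare_vrrp_configs FILE_A FILE_B (hypothetical helper): report settings
# that must match but do not, and priorities that must differ but do not.
compare_vrrp_configs() {
    local a=$1 b=$2 key status=0
    for key in virtual_router_id auth_type auth_pass advert_int; do
        if [ "$(vrrp_setting "$a" "$key")" != "$(vrrp_setting "$b" "$key")" ]; then
            echo "MISMATCH $key"
            status=1
        fi
    done
    if [ "$(vrrp_setting "$a" priority)" = "$(vrrp_setting "$b" priority)" ]; then
        echo "SAME priority"
        status=1
    fi
    return "$status"
}
```

After copying the backup's file locally (for example to `/tmp/backup.conf`), `compare_vrrp_configs /etc/keepalived/keepalived.conf /tmp/backup.conf` prints nothing and returns 0 when the two nodes are consistent.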

Step 6: Fix Configuration Mismatch

```conf
# Primary node configuration
global_defs {
    router_id LVS_PRIMARY
    vrrp_skip_check_adv_addr
}

vrrp_instance VI_1 {
    state MASTER                 # Primary is MASTER
    interface eth0
    virtual_router_id 51         # MUST match on all nodes
    priority 100                 # Primary has higher priority
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass mypassword123  # MUST match on all nodes (only the first 8 characters are used)
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}

# Backup node configuration
vrrp_instance VI_1 {
    state BACKUP                 # Backup is BACKUP
    interface eth0
    virtual_router_id 51         # SAME as primary
    priority 90                  # Lower than primary
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass mypassword123  # SAME as primary
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
}
```

Step 7: Fix Track Script Issues

```bash
# Check track scripts
grep -A5 track_script /etc/keepalived/keepalived.conf

# Check if script returns success
/etc/keepalived/check_script.sh
echo $?    # Should print 0 for success

# Run script manually with tracing
bash -x /etc/keepalived/check_haproxy.sh
```

```conf
# Configure track script correctly
vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight -20    # Decrease priority by 20 if script fails
    fall 3        # 3 failures before applying weight
    rise 2        # 2 successes before removing weight
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    track_script {
        check_haproxy    # Reference the script
    }

    virtual_ipaddress {
        192.168.1.100
    }
}
```
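Whether a failing check actually triggers failover comes down to arithmetic: the master's priority after the weight is applied must fall below the backup's. The sketch below models that rule as I understand it from the Keepalived documentation (a negative weight is applied while the script fails, a positive weight while it succeeds, and the result stays within 1 to 254). `effective_priority` is a hypothetical helper for reasoning about a config, not something Keepalived provides.

```bash
#!/usr/bin/env bash
# effective_priority BASE WEIGHT SCRIPT_OK (hypothetical helper)
#   BASE      configured priority
#   WEIGHT    vrrp_script weight (may be negative)
#   SCRIPT_OK 1 if the check currently succeeds, 0 if it fails
effective_priority() {
    local base=$1 weight=$2 script_ok=$3 prio=$1
    if [ "$weight" -lt 0 ] && [ "$script_ok" -eq 0 ]; then
        prio=$((base + weight))    # negative weight: applied while the check fails
    elif [ "$weight" -gt 0 ] && [ "$script_ok" -eq 1 ]; then
        prio=$((base + weight))    # positive weight: applied while the check succeeds
    fi
    # assumed bounds: keep the result within 1..254
    if [ "$prio" -lt 1 ]; then prio=1; fi
    if [ "$prio" -gt 254 ]; then prio=254; fi
    echo "$prio"
}
```

With the values used in this guide, `effective_priority 100 -20 0` gives 80, which is below the backup's 90, so the backup takes over. A weight of -5 would leave the master at 95 and the VIP would stay put even though the check is failing.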

Step 8: Force Failover for Testing

```bash
# Stop Keepalived on primary
systemctl stop keepalived

# Check VIP moved to backup (run on backup)
ip addr show eth0 | grep 192.168.1.100

# Start Keepalived on primary
systemctl start keepalived

# VIP should return to primary after startup (unless nopreempt is configured)
```
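When testing failover it helps to measure how long the VIP takes to appear on the backup rather than checking once by hand. The sketch below polls a command until it succeeds or a timeout expires. `wait_until` is a hypothetical helper, and the one-second polling interval is an arbitrary choice.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...] (hypothetical helper)
# Re-runs COMMAND once per second until it succeeds; prints the elapsed
# seconds and returns 0, or returns 1 once TIMEOUT_SECONDS have passed.
wait_until() {
    local timeout=$1 start now
    shift
    start=$(date +%s)
    until "$@"; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    now=$(date +%s)
    echo $((now - start))
}
```

On the backup, right after stopping Keepalived on the primary, something like `wait_until 10 sh -c 'ip -o -4 addr show dev eth0 | grep -qwF 192.168.1.100'` reports the takeover time in seconds, or fails if the VIP never arrives.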

Step 9: Verify the Fix

```bash
# Monitor Keepalived state changes
journalctl -u keepalived -f | grep -iE 'transition|entering'

# Watch VIP assignment
watch -n 1 'ip addr show eth0 | grep 192.168.1.100'

# Test from client
ping 192.168.1.100

# Check which MAC address answers ARP for the VIP
arping -I eth0 -c 3 192.168.1.100
```

Advanced Diagnosis

Debug VRRP State

```bash
# Enable detailed logging (-D); the file and variable name vary by distribution
sed -i 's/DAEMON_ARGS="-d"/DAEMON_ARGS="-d -D"/' /etc/default/keepalived
systemctl restart keepalived

# View detailed logs
journalctl -u keepalived -n 100

# Check for specific errors (-E is required for the | alternation)
grep -iE "fault|error|mismatch" /var/log/keepalived.log
```

Check Multicast Connectivity

```bash
# Test multicast reception
ping -c 3 224.0.0.18

# Check multicast routes
ip route show | grep 224

# Add multicast route if missing
ip route add 224.0.0.0/4 dev eth0
```

Verify Gratuitous ARP

```bash
# Capture ARP announcements for the VIP
# (a gratuitous ARP carries the VIP as both sender and target address)
tcpdump -n -l -i eth0 arp and host 192.168.1.100

# Send manual gratuitous ARP
arping -I eth0 -c 3 -U 192.168.1.100

# Check ARP cache on neighbors
ssh neighbor 'arp -n | grep 192.168.1.100'
```

Multiple VIPs Configuration

```conf
# Multiple VIPs in single instance
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    virtual_ipaddress {
        192.168.1.100/24 dev eth0
        192.168.1.101/24 dev eth0
    }

    virtual_ipaddress_excluded {
        192.168.1.200/32 dev lo    # Moved on failover but not included in VRRP adverts
    }
}
```

Unicast Mode (Non-Multicast)

```conf
# Use unicast if multicast is blocked
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    unicast_src_ip 192.168.1.10    # This node's own IP
    unicast_peer {
        192.168.1.11               # Backup node IP
    }

    virtual_ipaddress {
        192.168.1.100
    }
}
```
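In unicast mode every node needs its own address as the source and all other nodes as peers, which is easy to get wrong when copying a config between hosts. The sketch below prints the stanza for one node from a shared node list. `unicast_block_for` is a hypothetical helper; the keyword it emits is `unicast_peer` (singular), which is the spelling Keepalived's configuration syntax uses.

```bash
#!/usr/bin/env bash
# unicast_block_for SELF_IP NODE_IP... (hypothetical helper)
# Prints the unicast lines for the node SELF_IP: itself as the source and
# every other address in the list as a peer.
unicast_block_for() {
    local self=$1 node
    shift
    printf '    unicast_src_ip %s\n' "$self"
    printf '    unicast_peer {\n'
    for node in "$@"; do
        if [ "$node" != "$self" ]; then
            printf '        %s\n' "$node"
        fi
    done
    printf '    }\n'
}
```

Calling `unicast_block_for 192.168.1.11 192.168.1.10 192.168.1.11` on the backup produces the mirror image of the primary's stanza, with `192.168.1.10` listed as the peer.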

Common Pitfalls

- VRID different on nodes - Most common failover failure cause
- auth_pass mismatch - Authentication fails, adverts rejected
- Backup priority > master - Backup thinks it should be master
- Interface name wrong - VRRP on wrong interface
- Firewall blocking protocol 112 - VRRP blocked
- Track script weight wrong - A weight smaller than the priority gap never triggers failover; an oversized one just pins the priority at its lower bound
- Script returns non-zero - With no weight set, marks the instance as FAULT

Best Practices

```conf
# Production Keepalived configuration
global_defs {
    router_id LVS_PRIMARY
    enable_script_security
    script_user root
    vrrp_strict    # Follow VRRP RFC strictly (strict mode ignores authentication and disallows unicast peers)
    vrrp_garp_master_refresh 60
    vrrp_garp_master_refresh_repeat 2
}

vrrp_script check_service {
    script "/usr/local/bin/check_service.sh"
    interval 3
    weight -20
    fall 3
    rise 2
    user root
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass secret123
    }

    track_script {
        check_service weight -20
    }

    track_interface {
        eth0
    }

    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }

    notify_master "/usr/local/bin/vip_master.sh"
    notify_backup "/usr/local/bin/vip_backup.sh"
    notify_fault "/usr/local/bin/vip_fault.sh"
}
```

```bash
#!/bin/bash
# /usr/local/bin/check_service.sh
# Check script: exits 0 if haproxy is active, non-zero otherwise
systemctl is-active --quiet haproxy
```
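The configuration above references `notify_master`, `notify_backup`, and `notify_fault` scripts without showing them. One possible shape, offered only as a sketch, is a shared function that each script calls with its state so that every transition leaves a timestamped line. `log_vrrp_transition` and the default log path are hypothetical choices, not part of Keepalived.

```bash
#!/usr/bin/env bash
# log_vrrp_transition STATE [LOGFILE] (hypothetical helper)
# Appends "<timestamp> <hostname> entered <STATE>" to LOGFILE.
# Returns 2 without writing anything if STATE is not MASTER, BACKUP, or FAULT.
log_vrrp_transition() {
    local state=$1
    local logfile=${2:-/var/log/keepalived-transitions.log}   # assumed path
    case $state in
        MASTER|BACKUP|FAULT) ;;
        *) echo "unknown state: $state" >&2; return 2 ;;
    esac
    printf '%s %s entered %s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$(uname -n)" "$state" >> "$logfile"
}
```

Under that assumption, `/usr/local/bin/vip_master.sh` would source this function and run `log_vrrp_transition MASTER`, with the backup and fault scripts doing the same for their states. Comparing the resulting log on both nodes shows whether a transition happened at all and how far apart the two sides reacted.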
