Introduction

SQL Server Always On availability groups rely on Windows Server Failover Clustering (WSFC) for automatic failover. When the failover process times out because of health check failures, network latency, or resource contention, the availability group remains on the failed primary, causing application downtime.
Symptoms

- Automatic failover does not occur when the primary becomes unavailable
- WSFC logs show `Cluster resource 'SQL Server Availability Group' failed`
- `sys.dm_hadr_availability_group_states` shows `synchronization_health_desc` as `NOT_HEALTHY`
- The failover attempt hangs for minutes and then rolls back
- Applications whose connection strings point to the listener cannot connect
Common Causes

- Health check timeout (`HealthCheckTimeout`) too short for the workload
- WSFC quorum lost due to a network partition
- Secondary replica not synchronized (`SYNCHRONIZED` state not reached)
- DNS resolution failure for the availability group listener
- Resource DLL timeout preventing cluster resource movement
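
To narrow down the "secondary replica not synchronized" cause, a query along these lines may help. It is a minimal sketch that assumes it is run on the current primary replica (a secondary only reports its own databases) and uses only the standard `sys.dm_hadr_*` views; the column selection is illustrative rather than exhaustive.

```sql
-- Per-database failover readiness, as reported to the cluster.
-- Assumption: run on the primary replica so that rows for all replicas are returned.
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_failover_ready,           -- 1 = the database can fail over without data loss
       drs.synchronization_state_desc    -- expect SYNCHRONIZED on synchronous-commit secondaries
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drcs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
  ON drs.replica_id = drcs.replica_id
 AND drs.group_database_id = drcs.group_database_id
ORDER BY ar.replica_server_name, drcs.database_name;
```

Any database on the intended failover target that shows `is_failover_ready = 0` is a likely reason automatic failover was not attempted or did not complete.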
Step-by-Step Fix

1. **Check availability group health status**:

```sql
-- One row per replica: role, connection state, and synchronization health
SELECT
    ag.name AS ag_name,
    ar.replica_server_name,
    ars.role_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    ar.availability_mode_desc,
    ar.failover_mode_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON ar.group_id = ag.group_id;
```
2. **Check WSFC cluster status**:

```powershell
# PowerShell
Get-ClusterNode
Get-ClusterResource
Get-ClusterGroup

# Check quorum
(Get-Cluster).QuorumState
```
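
If PowerShell access to the cluster nodes is not available, roughly the same quorum and node information can be read from inside SQL Server. The following is a sketch using the `sys.dm_hadr_cluster` and `sys.dm_hadr_cluster_members` views; it assumes the login has `VIEW SERVER STATE` permission.

```sql
-- Cluster-level quorum type and state, as seen by this SQL Server instance
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc          -- expect NORMAL_QUORUM on a healthy cluster
FROM sys.dm_hadr_cluster;

-- Each cluster member (nodes and any witness), its state, and its quorum votes
SELECT member_name,
       member_type_desc,
       member_state_desc,         -- UP or DOWN
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```

A member reported as `DOWN`, or a quorum state other than `NORMAL_QUORUM`, points toward the quorum-loss cause rather than a SQL Server health check problem.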
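3. **Adjust health check timeout**:

```powershell
# Increase the health check timeout (default: 30000 ms).
# The cluster resource normally carries the availability group's name ("MyAG" here).
Get-ClusterResource -Name "MyAG" | Set-ClusterParameter -Name HealthCheckTimeout -Value 60000

# Or via T-SQL on the primary replica (SESSION_TIMEOUT is a separate, per-replica setting in seconds):
#   ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 60000);
#   ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'SecondaryServer' WITH (SESSION_TIMEOUT = 30);
```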
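
To confirm which timeout values are actually in effect after a change, the catalog views can be queried directly. This is a small sketch; note that `health_check_timeout` is reported in milliseconds while `session_timeout` is reported in seconds.

```sql
-- Current failover-related settings per availability group and replica
SELECT ag.name AS ag_name,
       ag.failure_condition_level,   -- 1-5; higher levels trigger failover on more conditions
       ag.health_check_timeout,      -- milliseconds
       ar.replica_server_name,
       ar.session_timeout,           -- seconds
       ar.failover_mode_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
  ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;
```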
4. **Perform manual failover if automatic fails**:

```sql
-- On the secondary replica
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- If the primary is completely down, force failover (data loss possible)
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```
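
After a forced failover, the databases on the remaining secondary replicas are left suspended and data movement has to be resumed manually. The sketch below shows one way to find and resume them; `MyDatabase` is a placeholder name, and the statements are assumed to be run on each secondary replica once you have accepted the possible data loss.

```sql
-- On each secondary replica: list local availability databases and whether they are suspended
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.is_suspended,
       drs.suspend_reason_desc,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;

-- Resume data movement for one suspended database (placeholder name)
ALTER DATABASE [MyDatabase] SET HADR RESUME;
```

Resuming makes the secondary roll back to match the new primary, so if you might need to salvage data from the old primary's copy, do that before running `SET HADR RESUME`.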
5. **Verify the listener is working after failover**:

```powershell
# Test listener connectivity
Test-NetConnection -ComputerName ag-listener.example.com -Port 1433

# Check DNS
Resolve-DnsName ag-listener.example.com
```
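
The network tests can be complemented with a look at how SQL Server itself sees the listener. The query below is a sketch based on the listener catalog views; it assumes it is run on the new primary replica.

```sql
-- Listener DNS name, port, and the state of each configured IP address
SELECT ag.name AS ag_name,
       l.dns_name,
       l.port,
       ip.ip_address,
       ip.state_desc              -- ONLINE for the address currently in use
FROM sys.availability_group_listeners AS l
JOIN sys.availability_groups AS ag
  ON ag.group_id = l.group_id
JOIN sys.availability_group_listener_ip_addresses AS ip
  ON ip.listener_id = l.listener_id
ORDER BY ag.name, ip.ip_address;
```

In a multi-subnet configuration it is normal for only the IP address in the current primary's subnet to be `ONLINE`; the addresses for other subnets stay `OFFLINE` until a failover moves the listener there.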