SQL Server AlwaysOn Availability Group Failover Timeout

How to diagnose and fix SQL Server AlwaysOn Availability Group failover timeouts that prevent automatic disaster recovery.

Introduction

SQL Server AlwaysOn Availability Groups rely on Windows Server Failover Clustering (WSFC) for automatic failover. When a failover attempt times out because of health check failures, network latency, or resource contention, the availability group stays on the failed primary and applications remain down.

Symptoms

- Automatic failover does not occur when the primary becomes unavailable
- WSFC logs show `Cluster resource 'SQL Server Availability Group' failed`
- `sys.dm_hadr_availability_group_states` shows `synchronization_health_desc` as `NOT_HEALTHY`
- A failover attempt hangs for minutes and then rolls back
- Applications whose connection strings point to the listener cannot connect
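The group-level health symptom can be checked directly on the primary replica. The query below is a minimal sketch that joins the DMV named above to `sys.availability_groups`; it returns a row only for groups that are not reporting `HEALTHY`, so an empty result means no group-level problem is visible from this instance.

```sql
-- List availability groups whose overall synchronization health is degraded.
-- Run on the primary replica; a secondary reports NULL for primary recovery health.
SELECT ag.name AS ag_name,
       ags.primary_replica,
       ags.primary_recovery_health_desc,
       ags.synchronization_health_desc
FROM sys.dm_hadr_availability_group_states AS ags
JOIN sys.availability_groups AS ag
     ON ags.group_id = ag.group_id
WHERE ags.synchronization_health_desc <> 'HEALTHY';
```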

Common Causes

- Health check timeout (`HealthCheckTimeout`) too short for the workload
- WSFC quorum lost due to network partition
- Secondary replica not synchronized (`SYNCHRONIZED` state not reached)
- DNS resolution failure for the availability group listener
- Resource DLL timeout preventing cluster resource movement
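To see whether an unsynchronized secondary is the likely cause, the per-database state can be inspected on the primary. This is a sketch rather than a complete diagnostic; it assumes the standard HADR DMVs are readable by the login running it.

```sql
-- Per-database synchronization state and failover readiness for each replica.
-- Run on the primary replica.
SELECT ar.replica_server_name,
       drcs.database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drcs.is_failover_ready
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.dm_hadr_database_replica_cluster_states AS drcs
     ON drs.replica_id = drcs.replica_id
    AND drs.group_database_id = drcs.group_database_id
JOIN sys.availability_replicas AS ar
     ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name, drcs.database_name;
```

A secondary is a valid automatic failover target only when its databases show `SYNCHRONIZED` and `is_failover_ready = 1`.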

Step-by-Step Fix

1. **Check availability group health status:**

```sql
-- Run on the primary; a secondary reports only its own replica state.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ar.availability_mode_desc,
       ar.failover_mode_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON ar.group_id = ag.group_id;
```

2. **Check WSFC cluster status:**

```powershell
# Node, resource, and group state
Get-ClusterNode
Get-ClusterResource
Get-ClusterGroup

# Quorum configuration and witness resource
Get-ClusterQuorum
```
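When PowerShell access to the cluster nodes is not available, SQL Server exposes its own view of the cluster. The following sketch reads quorum and member state from the HADR cluster DMVs; what it shows is the cluster as seen by the instance you are connected to.

```sql
-- Quorum type and state as seen by this SQL Server instance.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member (nodes and witness), its state, and its quorum vote.
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```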

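3. **Adjust the health check timeout:**

```powershell
# Increase the health check timeout (default: 30000 ms).
# The cluster resource is named after the availability group ("MyAG" here).
Get-ClusterResource -Name "MyAG" | Set-ClusterParameter -Name HealthCheckTimeout -Value 60000
```

The same setting can be changed in T-SQL with `ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 60000);`. Note that `ALTER AVAILABILITY GROUP [MyAG] MODIFY REPLICA ON N'SecondaryServer' WITH (SESSION_TIMEOUT = 30);` changes a different setting: `SESSION_TIMEOUT` is the per-replica connection timeout between replicas (in seconds, default 10), not the WSFC health check timeout.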
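Before and after changing either timeout, it helps to confirm the values currently in effect. A minimal sketch using the catalog views, assuming it is run on a replica that hosts the availability group:

```sql
-- Current failover policy and timeout settings per replica.
SELECT ag.name AS ag_name,
       ag.failure_condition_level,
       ag.health_check_timeout,   -- milliseconds
       ar.replica_server_name,
       ar.session_timeout         -- seconds
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
     ON ag.group_id = ar.group_id;
```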

4. **Perform a manual failover if automatic failover fails:**

```sql
-- Run on the target secondary replica (it must be SYNCHRONIZED)
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- If the primary is completely down, force failover (data loss possible)
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

5. **Verify the listener is working after failover:**

```powershell
# Test listener connectivity
Test-NetConnection -ComputerName ag-listener.example.com -Port 1433

# Check DNS
Resolve-DnsName ag-listener.example.com
```
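The network tests can be cross-checked against what SQL Server itself has registered for the listener. This sketch lists each listener with its port and IP address state; in a multi-subnet configuration it is normal for only one address to be `ONLINE` at a time.

```sql
-- Listener DNS name, port, and the state of each listener IP address.
SELECT ag.name AS ag_name,
       agl.dns_name,
       agl.port,
       ip.ip_address,
       ip.state_desc
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_groups AS ag
     ON agl.group_id = ag.group_id
JOIN sys.availability_group_listener_ip_addresses AS ip
     ON agl.listener_id = ip.listener_id;
```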

Prevention

- Configure automatic failover only between synchronous-commit replicas
- Monitor `synchronization_health_desc` continuously with alerting
- Set `HealthCheckTimeout` to suit the workload (for example 60000 ms in production, up from the 30000 ms default)
- Test failover monthly during maintenance windows
- Ensure WSFC has a proper quorum configuration (disk or cloud witness)
- Use separate networks for WSFC heartbeat and application traffic
- Monitor listener DNS resolution from application servers
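One way to implement the health monitoring item is a scheduled check that raises an error when any replica is unhealthy. The sketch below is intended as a hypothetical SQL Server Agent job step on the primary; the job, its schedule, and its notification settings are assumptions and are not shown.

```sql
-- Hypothetical Agent job step: fail the step (and so trigger the job's
-- notification) when any replica is not reporting HEALTHY.
DECLARE @unhealthy int;

SELECT @unhealthy = COUNT(*)
FROM sys.dm_hadr_availability_replica_states
WHERE synchronization_health_desc <> 'HEALTHY';

IF @unhealthy > 0
    RAISERROR('%d availability replica(s) are not HEALTHY.', 16, 1, @unhealthy);
```

Severity 16 is used so that the job step is marked as failed; a lower severity would only print an informational message.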