Introduction
DNS TTL (Time To Live) controls how long resolvers cache a DNS record. A high TTL (e.g., 86400 seconds / 24 hours) reduces DNS query load but means that when you change an IP address, clients continue using the old cached IP for up to the TTL duration. During a failover event, this can extend downtime from minutes to hours because a significant portion of your users are still directed to the failed server.
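To make the arithmetic concrete, here is a minimal sketch of the worst case for a resolver that cached the record just before the change. The function names and the 120-second detect-and-update time are assumptions chosen for illustration, not measurements.

```bash
# Worst-case time (seconds) a user can keep reaching the failed server:
# the time to detect the failure and update DNS, plus one full TTL for a
# resolver that cached the old record just before the update.
worst_case_stale() {
  local detect_and_update=$1 ttl=$2
  echo $(( detect_and_update + ttl ))
}

# Format a number of seconds as "Xh Ym Zs" for readability.
fmt_duration() {
  local s=$1
  printf '%dh %dm %ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

fmt_duration "$(worst_case_stale 120 86400)"   # prints: 24h 2m 0s
fmt_duration "$(worst_case_stale 120 300)"     # prints: 0h 7m 0s
```

With the same two-minute reaction time, the TTL alone accounts for the difference between roughly a day and roughly seven minutes of exposure.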
Symptoms
- After changing an A record, some users still reach the old IP
- `dig example.com` shows the new IP but users report the site is still down
- Failover completed but DNS propagation takes hours
- `dig example.com @8.8.8.8` shows old IP while `@1.1.1.1` shows new IP
- Monitoring shows the new server is healthy but user complaints continue
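The resolver-by-resolver comparison in the symptoms above can be scripted. This is a rough sketch, assuming the name has a single A record and no CNAME in front of it; the function names `flag_stale` and `check_resolvers` are made up for this example, and the resolver list is just three well-known public services.

```bash
# Read "resolver answer" lines on stdin and mark each one as OK or STALE
# depending on whether the answer matches the expected (new) address.
flag_stale() {
  local expected=$1 resolver answer
  while read -r resolver answer; do
    if [ "$answer" = "$expected" ]; then
      echo "$resolver OK $answer"
    else
      echo "$resolver STALE $answer"
    fi
  done
}

# Query several public resolvers for NAME and flag those that still return
# something other than the expected address. Resolvers that return no
# answer at all produce no output line.
check_resolvers() {
  local name=$1 expected=$2 resolver
  for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
    # +short prints only the answer data; prefix each line with the resolver
    dig +short "$name" A "@$resolver" | sed "s/^/$resolver /"
  done | flag_stale "$expected"
}
```

Running `check_resolvers example.com 5.6.7.8` after a change would list which of the three resolvers still serve the old address. Results vary between runs because large public resolvers are backed by many independent caches.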
Common Causes
- TTL set to 24 hours (86400) or more for A/CNAME records
- TTL not reduced before a planned maintenance or migration
- ISP resolvers ignoring TTL changes and caching for longer than specified
- DNS provider not honoring low TTL settings for the record
- Recursive resolvers implementing minimum TTL policies
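One way to tell these causes apart is to compare the TTL published by the authoritative server with the remaining TTL a recursive resolver reports. The sketch below is an illustration only: `answer_ttl` and `classify_ttl` are hypothetical helper names, and the commented usage assumes the first NS record returned is reachable.

```bash
# Extract the TTL (second field) from the first answer line on stdin.
# Expected input is the output of: dig NAME A +noall +answer
answer_ttl() {
  awk 'NF >= 5 { print $2; exit }'
}

# Compare the TTL configured at the authoritative server with the remaining
# TTL reported by a recursive resolver. A resolver normally counts down from
# the configured value, so a larger number means it is holding a record
# cached under an older, higher TTL or applying its own minimum.
classify_ttl() {
  local configured=$1 observed=$2
  if [ "$observed" -gt "$configured" ]; then
    echo "resolver TTL exceeds configured TTL: older cached record or minimum-TTL policy"
  else
    echo "resolver TTL within configured TTL: cache expires in ${observed}s"
  fi
}

# Usage (requires network access):
#   ns=$(dig +short NS example.com | head -n 1)
#   configured=$(dig example.com A +noall +answer +norecurse "@$ns" | answer_ttl)
#   observed=$(dig example.com A +noall +answer @8.8.8.8 | answer_ttl)
#   classify_ttl "$configured" "$observed"
```

If the resolver's value stays above the configured TTL long after the old TTL period has passed, a minimum-TTL policy on that resolver is the more likely explanation.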
Step-by-Step Fix
1. Check the current TTL for the record:

```bash
dig example.com A +noall +answer +ttlid
# Output: example.com. 86400 IN A 1.2.3.4
# The 86400 is the TTL in seconds (24 hours)
# A recursive resolver reports the remaining cache time, so the value
# may be lower than the TTL configured at your DNS provider
```

2. Reduce the TTL before making IP changes:

```bash
# Set TTL to 300 seconds (5 minutes) BEFORE the planned change
# This must be done at least one full old-TTL period before the change
# In your DNS management console, change TTL from 86400 to 300
# Wait 24 hours (the old TTL) for all caches to expire
```

3. For emergency failover, lower TTL and update IP simultaneously:

```bash
# Change both TTL and IP in your DNS provider
# Resolvers whose cached entry is about to expire pick up the change
# quickly; others can keep the old IP for up to the full old TTL
# Use a DNS provider that supports low TTLs (< 60 seconds)
```

4. Force clients to use updated DNS:

```bash
# On affected client machines (this clears only the local cache;
# upstream resolvers keep their entries until the TTL expires):
# Windows:
ipconfig /flushdns
# macOS:
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder
# Linux (systemd-resolved):
sudo resolvectl flush-caches
# Older systemd releases: sudo systemd-resolve --flush-caches
```

5. Use multiple A records for faster failover:

```bash
# Add multiple A records; resolvers return all of them, often in rotated order
example.com. 300 IN A 1.2.3.4
example.com. 300 IN A 5.6.7.8
# If the first IP fails, some clients will try the second
```

6. Implement health-check-based DNS with your provider. Many DNS providers (Cloudflare, Route53, DNSMadeEasy) offer health checks that automatically update DNS records when a server fails:

```bash
# AWS Route53 example with health check
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONEID \
  --change-batch file://failover-config.json
```
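The Route53 command above references a `failover-config.json` that is not shown. As a hedged sketch of what such a change batch might contain, the helper below writes a primary record tied to a health check and a secondary record used when the primary is unhealthy. The function name `write_failover_config`, the set identifiers, the 60-second TTL, and the health check ID are placeholders for illustration; the IPs are the example addresses used earlier.

```bash
# Write a Route 53 change batch with PRIMARY and SECONDARY failover records.
# Arguments: output path, health check ID (placeholder; substitute your own).
write_failover_config() {
  local path=$1 health_check_id=$2
  cat > "$path" <<EOF
{
  "Comment": "Failover pair (illustrative values)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "1.2.3.4" }],
        "HealthCheckId": "$health_check_id"
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "5.6.7.8" }]
      }
    }
  ]
}
EOF
}
```

Calling `write_failover_config failover-config.json YOUR_HEALTH_CHECK_ID` produces the file the `aws` command expects. The health check itself has to exist already; creating it is a separate step not covered here.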
Prevention
- Set TTL to 300 seconds (5 minutes) for production A/CNAME records
- Lower TTL to 60 seconds at least one full previous-TTL period before planned maintenance (24 hours if the record was at 86400)
- Use DNS providers that support low TTLs and health-check-based failover
- Implement global server load balancing (GSLB) for automatic geographic failover
- Document the TTL change procedure as part of your incident response plan
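For the documented procedure, the timing of a planned change reduces to simple epoch arithmetic. The following is a small sketch with hypothetical function names; it assumes resolvers honor the published TTL, which the causes listed earlier show is not always true.

```bash
# Given the planned change time (epoch seconds) and the TTL currently
# published, print the latest moment the TTL can be lowered so that every
# cache holding the record under the old TTL has expired before the change.
latest_ttl_lowering() {
  local change_epoch=$1 old_ttl=$2
  echo $(( change_epoch - old_ttl ))
}

# After the change, well-behaved caches converge within the new (low) TTL.
earliest_full_propagation() {
  local change_epoch=$1 new_ttl=$2
  echo $(( change_epoch + new_ttl ))
}

# Example: change planned at epoch 1700086400, old TTL 86400, new TTL 60
latest_ttl_lowering 1700086400 86400        # prints: 1700000000
earliest_full_propagation 1700086400 60     # prints: 1700086460
```

To turn an epoch value into a readable timestamp, `date -d @1700000000` works with GNU date and `date -r 1700000000` with BSD/macOS date.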