Introduction

DNS TTL (Time To Live) controls how long resolvers cache a DNS record. A high TTL (e.g., 86400 seconds / 24 hours) reduces DNS query load but means that when you change an IP address, clients continue using the old cached IP for up to the TTL duration. During a failover event, this can extend downtime from minutes to hours because a significant portion of your users are still directed to the failed server.

Symptoms

  • After changing an A record, some users still reach the old IP
  • dig example.com shows the new IP but users report the site is still down
  • Failover completed but DNS propagation takes hours
  • dig example.com @8.8.8.8 shows old IP while @1.1.1.1 shows new IP
  • Monitoring shows the new server is healthy but user complaints continue
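
The resolver-disagreement symptom above can be checked with a small script. This is a minimal sketch, not a definitive tool: the function names `query_resolver` and `answers_status` are hypothetical, and the two resolver addresses are simply the ones mentioned in the symptom list.

```bash
# Sketch: detect when public resolvers disagree about a record.
# Function names are illustrative assumptions, not part of any tool.

# Query one resolver ($2) for the A records of a name ($1) and print
# them sorted on a single line. Requires dig and network access.
query_resolver() {
  dig +short "$1" A "@$2" | sort | tr '\n' ' '
}

# Read "resolver answer..." lines on stdin. Print "consistent" when every
# resolver returned the same answer set, otherwise "split".
answers_status() {
  awk '{ $1 = ""; seen[$0] = 1 }
       END {
         n = 0
         for (a in seen) n++
         status = (n <= 1) ? "consistent" : "split"
         print status
       }'
}

# Example usage (performs live DNS queries):
#   for r in 8.8.8.8 1.1.1.1; do
#     echo "$r $(query_resolver example.com "$r")"
#   done | answers_status
```

A "split" result only shows that caches have not converged yet; it does not say which resolver holds the stale record.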

Common Causes

  • TTL set to 24 hours (86400 seconds) or more for A/CNAME records
  • TTL not reduced before a planned maintenance or migration
  • ISP resolvers ignoring the published TTL and caching records for longer than specified
  • DNS provider not honoring low TTL settings for the record
  • Recursive resolvers implementing minimum TTL policies

Step-by-Step Fix

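  1. Check the current TTL for the record:
     ```bash
     dig example.com A +noall +answer +ttlid
     # Output: example.com. 86400 IN A 1.2.3.4
     # The second field (86400) is the TTL in seconds (24 hours).
     # A caching resolver reports the remaining TTL, so query an
     # authoritative name server to see the configured value.
     ```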
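
When checking many records, it can help to pull the TTL out of the answer line and print it in readable units. The helpers below are a sketch; `ttl_from_answer` and `humanize_ttl` are hypothetical names, and the input is assumed to be in the `dig +noall +answer` format shown in the step above.

```bash
# Print the TTL (second field) of each dig-style answer line on stdin.
ttl_from_answer() {
  awk 'NF >= 5 && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Convert a number of seconds ($1) to an "Xh Ym Zs" string.
humanize_ttl() {
  printf '%dh %dm %ds\n' "$(( $1 / 3600 ))" "$(( ($1 % 3600) / 60 ))" "$(( $1 % 60 ))"
}

# Example usage (performs a live DNS query):
#   humanize_ttl "$(dig example.com A +noall +answer | ttl_from_answer | head -n 1)"
```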
  2. Reduce the TTL before making planned IP changes:
     ```bash
     # Set the TTL to 300 seconds (5 minutes) BEFORE the planned change.
     # Do this at least one full old-TTL period ahead of the change.
     # In your DNS management console, change the TTL from 86400 to 300.
     # Then wait 24 hours (the old TTL) so caches holding the old record expire.
     ```
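
The timing rule in this step is plain arithmetic on the old TTL. The sketch below works in Unix epoch seconds; the function names are hypothetical, and the result only holds for resolvers that honor the published TTL.

```bash
# Latest moment at which the TTL can be lowered so that caches holding
# the old record have expired by the planned change.
# $1: planned change time (epoch seconds), $2: old TTL (seconds)
latest_lower_epoch() {
  echo $(( $1 - $2 ))
}

# Earliest moment at which the IP change is safe, given when the TTL
# was actually lowered.
# $1: time the TTL was lowered (epoch seconds), $2: old TTL (seconds)
safe_change_epoch() {
  echo $(( $1 + $2 ))
}

# Example usage: TTL lowered right now, old TTL was 24 hours.
#   safe_change_epoch "$(date +%s)" 86400
```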
  3. For emergency failover, lower the TTL and update the IP at the same time:
     ```bash
     # Change both the TTL and the IP in your DNS provider.
     # Resolvers that already cached the old record keep serving it until their
     # copy expires. The remaining time varies, so some pick up the change sooner.
     # Use a DNS provider that supports low TTLs (< 60 seconds).
     ```
  4. Force clients to use updated DNS by flushing their local caches:
     ```bash
     # Run on affected client machines. This clears only the local cache;
     # the upstream resolver may still hold the old record.
     # Windows:
     ipconfig /flushdns
     # macOS:
     sudo dscacheutil -flushcache
     sudo killall -HUP mDNSResponder
     # Linux (systemd-resolved):
     sudo resolvectl flush-caches
     # On older systemd releases: sudo systemd-resolve --flush-caches
     ```
  5. Use multiple A records for faster failover:
     ```bash
     # Add multiple A records. Resolvers often rotate the order in which they return them.
     example.com. 300 IN A 1.2.3.4
     example.com. 300 IN A 5.6.7.8
     # If the first IP fails, some clients will try the next one.
     ```
  6. Implement health-check-based DNS with your provider. Many DNS providers (Cloudflare, Route53, DNSMadeEasy) offer health checks that automatically update DNS records when a server fails:
     ```bash
     # AWS Route53 example with health check
     aws route53 change-resource-record-sets \
       --hosted-zone-id ZONEID \
       --change-batch file://failover-config.json
     ```
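
The Route53 command reads `failover-config.json`, whose contents are not shown here. The following is a sketch of what such a file might contain for a primary/secondary failover pair. The IP addresses reuse the example values from this article, the 60-second TTL is an assumption, and `HEALTH-CHECK-ID-PLACEHOLDER` is a placeholder that must be replaced with the ID of a health check you have already created.

```bash
# Write an example change batch for Route53 failover routing.
# All values are illustrative; replace the health check ID placeholder.
cat > failover-config.json <<'EOF'
{
  "Comment": "Example primary/secondary failover records",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "1.2.3.4" }],
        "HealthCheckId": "HEALTH-CHECK-ID-PLACEHOLDER"
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "5.6.7.8" }]
      }
    }
  ]
}
EOF
```

With this layout, Route53 answers with the primary record while its health check passes and switches to the secondary record when it fails.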

Prevention

  • Set TTL to 300 seconds (5 minutes) for production A/CNAME records
  • Lower TTL to 60 seconds at least one full current-TTL period before planned maintenance (24 hours if the TTL is 86400)
  • Use DNS providers that support low TTLs and health-check-based failover
  • Implement global server load balancing (GSLB) for automatic geographic failover
  • Document the TTL change procedure as part of your incident response plan
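
The 300-second guideline above can be checked with a small filter over `dig`-style answer lines. This is a sketch under stated assumptions: `flag_high_ttl` is a hypothetical helper, and the input is expected in the `name TTL class type value` layout that `dig +noall +answer` prints.

```bash
# Print "name type ttl" for A/CNAME answer lines whose TTL exceeds the limit.
# $1: TTL limit in seconds (defaults to 300). Input: answer lines on stdin.
flag_high_ttl() {
  awk -v limit="${1:-300}" \
    '($4 == "A" || $4 == "CNAME") && $2 + 0 > limit + 0 { print $1, $4, $2 }'
}

# Example usage (performs live DNS queries):
#   for name in example.com www.example.com; do
#     dig "$name" +noall +answer
#   done | flag_high_ttl 300
```

Answers from a caching resolver show the remaining TTL rather than the configured one, so point the queries at an authoritative name server for an accurate audit.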