Introduction

DNS recursive resolution is the process in which a resolver follows the referral chain from the root servers through the TLD servers to a domain's authoritative servers. When that chain breaks, resolution fails with SERVFAIL, REFUSED, or a timeout. These failures are harder to diagnose than a simple timeout because several servers sit on the resolution path, and any one of them can be the cause.

Symptoms

  • dig returns SERVFAIL for specific domains
  • Resolution works for some domains but not others
  • dig +trace shows resolution stops mid-chain
  • Random resolution failures for external domains
  • Error messages like "lame delegation" or "refused"
  • Internal domains work but external domains fail
  • Resolution succeeds from some locations but not others

Common Causes

  • Recursive resolver misconfigured or overloaded
  • Root/TLD servers unreachable from resolver
  • Authoritative servers refusing queries
  • Lame delegation (NS points to non-authoritative server)
  • DNSSEC validation failures
  • Delegation inconsistencies between parent and child zones
  • Network ACL blocking resolver's queries
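
The causes above usually surface as a particular response code in the dig header. As a rough sketch, the helper below pulls that code out of dig output and prints a first place to look. The function names `dig_status` and `status_hint` are hypothetical, and the hints are starting points rather than a complete diagnosis.

```bash
# Extract the response code (e.g. NOERROR, SERVFAIL) from dig output on stdin.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Map a response code to a first place to look.
# An empty argument means dig printed no header at all (no reply received).
status_hint() {
  case "$1" in
    NOERROR)  echo "resolved: no failure in the chain" ;;
    NXDOMAIN) echo "name does not exist: check spelling and zone contents" ;;
    SERVFAIL) echo "resolver gave up: check DNSSEC, lame delegation, reachability" ;;
    REFUSED)  echo "server declined: check ACL and whether it serves the zone" ;;
    "")       echo "no response: check network path, firewall, timeouts" ;;
    *)        echo "unexpected status $1: inspect the full dig output" ;;
  esac
}

# Usage (needs network access):
#   status_hint "$(dig example.com | dig_status)"
```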

Step-by-Step Fix

  1. Trace the full resolution path to identify where it breaks.

```bash
# Use dig +trace to follow the complete resolution
dig example.com +trace

# Output shows each step:
# 1. Root servers (.)
# 2. TLD servers (.com)
# 3. Authoritative servers (example.com)

# Look for where it stops:
# - No output after root servers: TLD unreachable
# - Stops at TLD: delegation or authoritative issue
# - Gets to auth server but fails: auth server problem

# Include DNSSEC records in the trace
dig example.com +trace +dnssec

# Show every section of each response, including glue records
dig example.com +trace +all
```
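
Long traces are easier to read once reduced to the servers that actually answered. The sketch below assumes the `;; Received N bytes from IP#53(name) in T ms` lines that dig prints after each hop. The function names `trace_hops` and `trace_last_hop` are hypothetical.

```bash
# Print the server that answered each hop of a `dig +trace`, read from stdin.
trace_hops() {
  sed -n 's/^;; Received [0-9]* bytes from [^(]*(\([^)]*\)).*/\1/p'
}

# Print only the last server that answered. If this is a root or TLD server,
# the trace never reached the zone's own name servers.
trace_last_hop() {
  trace_hops | tail -n 1
}

# Usage (needs network access):
#   dig example.com +trace | trace_last_hop
```

The first hop is normally your local resolver, which dig asks for the list of root servers before it starts iterating.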

  2. Test each level of the resolution chain manually.

```bash
# Send non-recursive queries (+norecurse), as a resolver would

# Step 1: Test root servers (expect a referral to the .com servers)
dig @a.root-servers.net example.com NS +norecurse

# Step 2: Query TLD servers directly (expect a referral to the zone's NS)
dig @a.gtld-servers.net example.com NS +norecurse

# Step 3: Query authoritative servers
dig @ns1.example.com example.com A +norecurse

# Test each with DNSSEC
dig @a.root-servers.net . DNSKEY +dnssec
dig @a.gtld-servers.net com. DNSKEY +dnssec
dig @ns1.example.com example.com DNSKEY +dnssec

# If any step fails, that's where the problem is
```

  3. Check for lame delegation.

```bash
# Lame delegation: NS record points to server not authoritative for zone

# Get NS records from parent (TLD). The parent answers with a referral, so the
# NS records are in the authority section, which +short does not print.
parent_ns=$(dig @a.gtld-servers.net example.com NS +norecurse +noall +authority |
  awk '$4 == "NS" {print $5}')
echo "$parent_ns"

# Test each NS for authoritative response
for ns in $parent_ns; do
  echo "Testing ${ns%.}:"
  response=$(dig "@${ns%.}" example.com SOA +norecurse)

  # Check for authoritative answer (aa flag)
  if echo "$response" | grep -Eq 'flags:[^;]* aa[ ;]'; then
    echo "  Authoritative: YES"
  else
    echo "  Authoritative: NO - LAME DELEGATION"
  fi

  # Check response status
  echo "$response" | grep "status:"
done

# Lame delegation shows:
# - No aa flag
# - May show REFUSED or SERVFAIL
```

  4. Verify delegation matches between parent and child.

```bash
# Get NS from parent zone (referral data is in the authority section,
# which +short does not print)
echo "=== Parent Zone Delegation ==="
parent_ns=$(dig @a.gtld-servers.net example.com NS +norecurse +noall +authority |
  awk '$4 == "NS" {print $5}' | sort)
echo "$parent_ns"

# Get NS from child zone (first authoritative server)
echo -e "\n=== Child Zone NS Records ==="
first_ns=$(echo "$parent_ns" | head -1)
child_ns=$(dig "@${first_ns%.}" example.com NS +short | sort)
echo "$child_ns"

# Compare
echo -e "\n=== Comparison ==="
if [ "$parent_ns" = "$child_ns" ]; then
  echo "Delegation matches"
else
  echo "DELEGATION MISMATCH"
  echo "Parent has: $parent_ns"
  echo "Child has: $child_ns"
fi
```
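
When the two lists differ, it helps to see exactly which name servers appear on only one side. A minimal sketch, assuming each list is a newline-separated string of host names; the function name `ns_only_in` is hypothetical.

```bash
# Print the names in list $1 that are missing from list $2.
# Both arguments are newline-separated host names; matching ignores case.
ns_only_in() {
  printf '%s\n' "$1" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$2" | grep -Fxqi -- "$name" || printf '%s\n' "$name"
  done
}

# Usage, with two NS lists held in shell variables:
#   ns_only_in "$parent_ns" "$child_ns"   # delegated by the parent, absent in the child
#   ns_only_in "$child_ns" "$parent_ns"   # listed by the child, not delegated
```

Names present only at the parent are candidates for lame delegation. Names present only in the child zone never receive queries from resolvers that follow the parent's referral.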

  5. Check for DNSSEC validation failures.

```bash
# Test DNSSEC validation
dig example.com +dnssec

# Look for RRSIG records in answer
# Look for AD (Authenticated Data) flag

# Check DNSSEC chain
dig example.com DNSKEY +dnssec
dig example.com DS +dnssec

# Verify DS at parent
dig @a.gtld-servers.net example.com DS +dnssec

# Common DNSSEC issues:
# - DS record at parent doesn't match DNSKEY
# - Expired signatures
# - Missing RRSIG records
# - Incorrect key rollover

# Test with DNSSEC validation disabled
dig example.com +cd +dnssec
# +cd = checking disabled
# If this works but normal query fails, DNSSEC is the problem
```
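
The +cd comparison can be written down as a small decision rule. The sketch below takes the response code of the normal query and of the +cd query, both sent to the same validating resolver. The function name `dnssec_verdict` and its three labels are hypothetical.

```bash
# Compare the response code of a normal query ($1) with the response code of
# the same query sent with +cd ($2), and print a verdict.
dnssec_verdict() {
  if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
    # Fails only when the resolver validates.
    echo "validation failure"
  elif [ "$1" = "$2" ]; then
    # Same result with and without validation.
    echo "not DNSSEC-related"
  else
    echo "inconclusive"
  fi
}

# Usage (needs network access and a validating resolver):
#   normal=$(dig example.com     | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
#   nocheck=$(dig example.com +cd | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
#   dnssec_verdict "$normal" "$nocheck"
```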

  6. Test recursive resolver functionality.

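```bash
# Test if your recursive resolver is working
dig @localhost example.com

# Or specific resolver IP
dig @192.168.1.1 example.com

# Check resolver statistics
# BIND (the dump file location is set by statistics-file in named.conf):
rndc stats
cat /var/named/data/named_stats.txt

# Unbound:
unbound-control stats

# PowerDNS Recursor:
rec_control get-all

# Check for error counters (exact names vary by resolver and version):
# - BIND: "queries resulted in SERVFAIL", "lame delegations received"
# - Unbound: num.answer.rcode.SERVFAIL, total.requestlist.exceeded
# - PowerDNS Recursor: servfail-answers, outgoing-timeouts, unreachables
```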

  7. Check if authoritative servers respond correctly.

```bash
# Get authoritative servers
auth_servers=$(dig example.com NS +short)

# Test each authoritative server
for ns in $auth_servers; do
  echo "=== Testing ${ns%.} ==="

  # Test UDP (standard)
  echo "UDP test:"
  dig "@${ns%.}" example.com A +time=5

  # Test TCP (for large responses)
  echo "TCP test:"
  dig "@${ns%.}" example.com A +tcp +time=5

  # Check EDNS support
  echo "EDNS test:"
  dig "@${ns%.}" example.com A +dnssec +time=5

  # Check for REFUSED
  echo "SOA test:"
  dig "@${ns%.}" example.com SOA +time=5
done
```

  8. Diagnose REFUSED responses.

```bash
# REFUSED means server refuses to answer

# Test with specific query type
dig @ns1.example.com example.com A
dig @ns1.example.com example.com ANY

# Check if recursion is expected but refused:
# compare the query with and without the recursion desired (RD) bit
dig @ns1.example.com example.com +recurse
dig @ns1.example.com example.com +norecurse

# Common REFUSED causes:
# 1. Query sent to authoritative server with recursion desired
# 2. ACL blocking your IP
# 3. Server not configured for the zone

# Check server ACL (if you control it)
# BIND:
# allow-query { any; };
# allow-recursion { trusted; };

# Test from different IP if possible
# Use online tools like:
# - https://dnschecker.org
# - https://dnsviz.net
```

  9. Fix recursive resolver configuration.

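```bash
# For BIND recursive resolver
# Check named.conf options:

# Example ACL; list the client networks you actually serve
acl trusted { localhost; localnets; };

options {
    recursion yes;
    allow-recursion { trusted; };    # "any" here creates an open resolver
    allow-query { any; };
    allow-query-cache { trusted; };

    // Forward to specific servers if needed
    forwarders { 8.8.8.8; 1.1.1.1; };
    forward only;  # or first

    // DNSSEC validation
    dnssec-validation auto;
};

# Apply changes
rndc reload

# For Unbound (unbound.conf)
server:
    do-ip4: yes
    do-ip6: yes
    # Allow only known clients; 0.0.0.0/0 and ::0/0 would create an open resolver
    access-control: 127.0.0.0/8 allow
    access-control: ::1 allow
    access-control: 192.168.0.0/16 allow   # example client network

# Apply changes
unbound-control reload
```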

  10. Check for EDNS and packet size issues.

```bash
# EDNS (Extension Mechanisms for DNS) allows larger packets
# Some firewalls/devices drop large UDP packets

# Test without EDNS
dig example.com +noedns

# Test with a small EDNS buffer
dig example.com +bufsize=512

# Test with a large EDNS buffer
dig example.com +bufsize=4096

# If small buffer works but large fails, MTU/firewall issue

# Force TCP instead of UDP
dig example.com +tcp

# Ignore truncation instead of retrying over TCP
dig example.com +ignore

# Check if resolver supports EDNS
dig @8.8.8.8 example.com +dnssec +bufsize=4096
```
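
A truncated UDP reply carries the tc flag in the dig header, which is what +ignore lets you observe. The sketch below checks for that flag. The function name `is_truncated` is hypothetical, and it assumes the usual `;; flags: ...;` header line.

```bash
# Succeed (exit 0) if the dig output on stdin has the tc (truncated) flag set.
is_truncated() {
  grep -Eq '^;; flags:[^;]* tc[ ;]'
}

# Usage (needs network access):
#   if dig example.com DNSKEY +dnssec +ignore | is_truncated; then
#     echo "UDP reply truncated: confirm that the TCP retry succeeds"
#   fi
```

If replies are truncated over UDP and the +tcp query also fails, TCP port 53 is probably blocked somewhere on the path.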

Verification

After fixing recursive resolution issues:

```bash
# 1. Full resolution trace
echo "=== Full Resolution Trace ==="
dig example.com +trace

# 2. Verify DNSSEC validation
echo -e "\n=== DNSSEC Validation ==="
dig example.com +dnssec | grep -E "flags:|RRSIG"

# 3. Test multiple domains
echo -e "\n=== Multi-Domain Test ==="
for domain in google.com cloudflare.com amazon.com; do
  echo -n "$domain: "
  dig "$domain" +short | head -1
done

# 4. Check authoritative servers
echo -e "\n=== Authoritative Server Status ==="
for ns in $(dig example.com NS +short); do
  echo -n "${ns%.}: "
  dig "@${ns%.}" example.com SOA +short | head -1
done

# 5. Verify delegation consistency
# (the parent's referral is in the authority section, which +short omits)
echo -e "\n=== Delegation Consistency ==="
parent=$(dig @a.gtld-servers.net example.com NS +norecurse +noall +authority |
  awk '$4 == "NS" {print $5}' | sort)
first_ns=$(echo "$parent" | head -1)
child=$(dig "@${first_ns%.}" example.com NS +short | sort)
[ "$parent" = "$child" ] && echo "OK" || echo "MISMATCH"

# 6. Performance test
echo -e "\n=== Resolution Performance ==="
for i in 1 2 3; do
  dig example.com | grep "Query time"
done
```
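
To compare performance before and after a fix, the individual query times can be averaged. A minimal sketch that reads dig output on stdin and averages the `;; Query time: N msec` lines; the function name `avg_query_time` is hypothetical.

```bash
# Average the "Query time" values (milliseconds) found in dig output on stdin.
# The result is truncated to a whole number.
avg_query_time() {
  awk '/^;; Query time:/ { sum += $4; n++ }
       END { if (n) printf "%d\n", sum / n; else print "no samples" }'
}

# Usage (needs network access):
#   for i in 1 2 3; do dig example.com; done | avg_query_time
```

Repeated queries are usually answered from the resolver's cache, so the first sample is the one that reflects the full resolution path.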

Recursive Resolution Flow Reference

```
Client -> Recursive Resolver
         |
         v
    Root Servers (.)
         |
         v
    TLD Servers (.com, .net, etc.)
         |
         v
    Authoritative Servers (example.com)
         |
         v
    Answer returned to client
```

Each step can fail independently. Use +trace to identify which step is broken.