Introduction
When a popular cached object expires in Varnish, thousands of waiting requests may simultaneously attempt to fetch a fresh copy from the origin server. This "thundering herd" can overwhelm the origin, causing it to slow down or crash. If the origin crashes, Varnish cannot fill the cache, and subsequent requests also fail, creating a cascade. The origin may recover briefly, get hit again, and crash repeatedly in a fill loop.
Symptoms
- Origin server CPU and memory spike periodically
- Varnish logs show repeated cache misses for the same object:
```
- VCL_call MISS
- BackendOpen 32 boot.default 10.0.1.100 8080
- BackendClose 32 boot.default
```
- Origin server logs show a burst of identical requests:
```
GET /popular-page - 200 - 4500ms (normally 50ms)
```
- Varnish error log:
```
Backend connection failed (10.0.1.100:8080): Connection refused
```
- Site goes down every time a popular cache entry expires (e.g., every hour on the hour)
Common Causes
- Popular object expires without grace period
- Request coalescing defeated (e.g., hit-for-miss objects created by uncacheable responses)
- Origin server cannot handle the burst of uncached requests
- TTL too short for high-traffic content
- No stale content serving while revalidating
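Several of these causes come down to how three object-lifetime knobs interact. A minimal sketch of them in `vcl_backend_response` (the durations are illustrative, not recommendations for every site):

```vcl
sub vcl_backend_response {
    # Fresh for 5 minutes (tune to how often the content changes)
    set beresp.ttl = 5m;
    # After TTL expires: keep serving the stale copy for up to 1 hour
    # while a single background fetch revalidates it
    set beresp.grace = 1h;
    # After grace runs out: keep the object another day for conditional
    # revalidation (If-Modified-Since / If-None-Match)
    set beresp.keep = 1d;
}
```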
Step-by-Step Fix
1. Enable request coalescing in Varnish VCL:

```vcl
# /etc/varnish/default.vcl
sub vcl_recv {
    # Varnish coalesces identical requests automatically: if a fetch
    # for this object is already in flight, later requests wait for it
    # instead of hitting the origin. Returning (hash) for GET/HEAD
    # keeps requests on that cacheable path.
    if (req.method == "GET" || req.method == "HEAD") {
        return (hash);
    }
}

sub vcl_backend_response {
    # Serve stale content for up to 1 hour past TTL while a
    # background fetch revalidates the object
    set beresp.grace = 1h;
    # Keep the expired object for conditional revalidation
    # after the grace window runs out
    set beresp.keep = 24h;
}

sub vcl_deliver {
    # Add debugging headers
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }
    set resp.http.X-Cache-Hits = obj.hits;
}
```
2. Configure Varnish to serve stale content while fetching fresh:

```vcl
import std;

sub vcl_recv {
    if (std.healthy(req.backend_hint)) {
        # Healthy backend: allow up to 5 minutes of stale content
        # while a background fetch revalidates
        set req.grace = 300s;
        # Let clients explicitly force a refresh
        if (req.http.Cache-Control ~ "no-cache") {
            set req.hash_always_miss = true;
        }
    } else {
        # Unhealthy backend: serve stale content for up to 6 hours
        set req.grace = 6h;
    }
}

sub vcl_backend_response {
    # Objects stay usable for 6 hours past TTL; req.grace above
    # limits how much of that window each request may use
    set beresp.grace = 6h;
}
```
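Note that `std.healthy()` only reports real backend health if a probe is attached to the backend. A minimal probe sketch, assuming the origin exposes a `/health` endpoint (adapt the path and thresholds to your setup):

```vcl
backend default {
    .host = "10.0.1.100";
    .port = "8080";
    .probe = {
        .url = "/health";   # assumed health-check endpoint
        .interval = 5s;     # probe every 5 seconds
        .timeout = 2s;
        .window = 5;        # consider the last 5 probes
        .threshold = 3;     # at least 3 must succeed to count as healthy
    }
}
```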
3. Implement a shield (two-tier) setup for multi-Varnish deployments:

```vcl
# On edge Varnish nodes, set a backend that points to a shield Varnish.
# The shield handles all origin fetches, so the origin sees at most one
# fetch per object no matter how many edge nodes miss at once.

# Shield Varnish VCL
backend origin {
    .host = "10.0.1.100";
    .port = "8080";
    .connect_timeout = 5s;
    .first_byte_timeout = 30s;
    .between_bytes_timeout = 10s;
}
```
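The edge side of the same setup is simply a backend pointing at the shield instead of the origin. A sketch, assuming the shield listens on 10.0.1.50:80 (both values are placeholders):

```vcl
# Edge Varnish VCL: all misses go to the shield, never to the origin
backend shield {
    .host = "10.0.1.50";        # assumed shield address
    .port = "80";
    .connect_timeout = 2s;
    .first_byte_timeout = 60s;  # allow time for the shield's own origin fetch
}

sub vcl_recv {
    set req.backend_hint = shield;
}
```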
4. Monitor cache hit rate and detect thundering herd:

```bash
# Check Varnish statistics
varnishstat -1 | grep -E "MAIN.cache_hit|MAIN.cache_miss"

# Calculate hit rate
varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss | \
  awk '/cache_hit/{hit=$2} /cache_miss/{miss=$2} END{printf "Hit rate: %.2f%%\n", hit/(hit+miss)*100}'

# Watch for cache miss spikes
watch -n 1 'varnishstat -1 | grep cache_miss'
```
5. Add origin server protection by capping backend connections:

```vcl
# Varnish already coalesces per object: only one fetch per URL goes to
# the origin while other clients wait on the waiting list. To cap total
# origin load across all objects, limit concurrent backend connections:
backend origin {
    .host = "10.0.1.100";
    .port = "8080";
    .max_connections = 50;  # excess fetches fail fast instead of piling up
}
```
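One caveat: coalescing only applies to cacheable objects. When the origin marks a response uncacheable, Varnish records a short-lived hit-for-miss marker so later requests bypass the waiting list rather than queuing behind a single slow fetch. A sketch mirroring the built-in VCL behavior:

```vcl
sub vcl_backend_response {
    if (beresp.ttl <= 0s || beresp.http.Set-Cookie) {
        # Remember "don't cache this" for 2 minutes so subsequent
        # requests are passed individually instead of serialized
        set beresp.uncacheable = true;
        set beresp.ttl = 120s;
    }
}
```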
6. Retry failed origin fetches and fall back to stale content:

```vcl
sub vcl_backend_response {
    if (beresp.status >= 500 && beresp.status < 600) {
        # Retry the fetch a few times before giving up
        if (bereq.retries < 3) {
            return (retry);
        }
        # If this is a background (grace) fetch, abandon it so the
        # client keeps getting the stale object instead of the error
        if (bereq.is_bgfetch) {
            return (abandon);
        }
        # Negative-cache the error briefly so recovery is picked up fast
        set beresp.ttl = 10s;
    }
}
```
Prevention
- Always set `beresp.grace` to at least 1 hour for popular content
- Use stale-while-revalidate semantics (grace in Varnish) to serve stale content during revalidation
- Monitor cache hit rate and alert if it drops below 90%
- Use a Varnish shield architecture for high-traffic sites
- Set appropriate TTLs based on content update frequency
- Test cache expiration scenarios with load testing tools
- Configure the origin server with adequate capacity for cache miss bursts