Introduction
Kubernetes worker and scheduler incidents often look random until you line up ownership, retries, and startup order. When a dead-letter queue fills because a poison message is never quarantined, the queue or scheduler is usually fine on its own; it is the worker contract that is broken: acknowledgements never land, retries are too aggressive, or two nodes disagree about who owns the next run.
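The worker contract described above can be sketched in a few lines. This is a minimal illustration, not any broker's real API: the `delivery_count` field, the `process` callback, and the dead-letter list are all hypothetical stand-ins for whatever your queue client provides. The point is the shape of the contract: acknowledge only after success, and quarantine a message once its delivery count shows it is poison.

```python
MAX_DELIVERIES = 5  # illustrative threshold, not a recommendation

def handle(message, process, dead_letters):
    """Handle one delivery and report the outcome.

    `message` is a dict with a `delivery_count` that the broker is
    assumed to increment on each redelivery (hypothetical field name).
    Returns "quarantined", "retry", or "acked".
    """
    if message["delivery_count"] > MAX_DELIVERIES:
        # Poison message: move it aside so it stops blocking the queue,
        # then acknowledge it so it is never redelivered.
        dead_letters.append(message)
        return "quarantined"
    try:
        process(message)
    except Exception:
        # No acknowledgement: the broker redelivers with a higher count.
        return "retry"
    # Success: acknowledge only now, after the work is actually done.
    return "acked"
```

A worker that acks before processing inverts this contract and silently drops work on a crash; one that never quarantines lets a single bad message consume every retry budget.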
Symptoms
- Queue depth or scheduled lag grows while the service still appears up
- Messages repeat, disappear, or arrive much later than expected
- The incident got worse after recovery because retries piled onto fresh work
- A rollout or failover changed which node thinks it owns the job
Common Causes
- Workers crash or time out before the queue sees an acknowledgement
- Retry policy is faster than the system can drain the backlog
- Leader-election or scheduling windows are too short for the real environment
- Workers start consuming before their dependencies or config are actually ready
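The second cause, a retry policy faster than the drain rate, is usually fixed with exponential backoff plus jitter. The sketch below is a generic illustration with made-up parameter values; `base` and `cap` should be tuned to how fast your system actually drains a backlog.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay in seconds before retry number `attempt` (0-based).

    Exponential growth bounded by `cap`, with full jitter so a fleet
    of workers does not retry in lockstep. Values are illustrative.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Without the jitter, every worker that failed at the same moment retries at the same moment, which is exactly the "retries piled onto fresh work" symptom above.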
Step-by-Step Fix
1. Inspect the live state first. Capture the active runtime path before changing anything so you know whether the process is stale, partially rolled, or reading the wrong dependency.

date -u
printenv | sort | head -80
grep -R "error\|warn\|timeout\|retry\|version" logs . 2>/dev/null | tail -80

2. Compare the active configuration with the intended one. Look for drift between the live process and the deployment or configuration files it should be following.

grep -R "timeout\|retry\|path\|secret\|buffer\|cache\|lease\|schedule" config deploy . 2>/dev/null | head -120

3. Apply one explicit fix path. Prefer one clear configuration change over several partial tweaks so every instance converges on the same behavior.

retry:
  maxAttempts: 5
  backoff: exponential
  ackMode: explicit
leaderElection:
  leaseDuration: 30s
  renewDeadline: 20s

4. Verify the full request or worker path end to end. Retest the same path that was failing rather than assuming a green deployment log means the runtime has recovered.

grep -R "processed\|retry\|dead letter\|scheduled\|leader" logs . 2>/dev/null | tail -120
curl -s https://example.com/worker-status | head

Prevention
- Publish active version, config, and runtime identity in one observable place
- Verify the real traffic path after every rollout instead of relying on one green health log
- Treat caches, workers, and background consumers as part of the same production system
- Keep one source of truth for credentials, timeouts, routing, and cleanup rules
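The first prevention point, publishing version, config, and runtime identity in one observable place, can be as small as a status payload every instance serves. The field and environment-variable names below are assumptions for illustration; the idea is that one endpoint answers "which build, which config, since when" without anyone reading a deployment log.

```python
import json
import os
import time

def runtime_identity():
    """Snapshot of what this process actually is right now.

    Field names and the APP_VERSION / CONFIG_HASH environment
    variables are hypothetical; substitute whatever your build
    pipeline stamps into the image.
    """
    return {
        "version": os.environ.get("APP_VERSION", "unknown"),
        "config_hash": os.environ.get("CONFIG_HASH", "unknown"),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "pid": os.getpid(),
    }

# Serve or log this wherever your worker already reports health.
print(json.dumps(runtime_identity()))
```

Comparing this payload across instances is also the fastest way to spot the rollout drift described in step 2 above.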