Introduction
Email worker and scheduler incidents often look random until you line up ownership, retries, and startup order. When duplicate execution starts after failover, the queue or scheduler is usually fine on its own, but the worker contract is not: acknowledgements never land, retries are too aggressive, or two nodes disagree about who should own the next run.
Symptoms
- Queue depth or scheduled lag grows while the service still appears up
- Messages repeat, disappear, or arrive much later than expected
- The incident worsens after recovery because retries pile onto fresh work
- A rollout or failover changed which node thinks it owns the job
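Repeated messages are the easiest of these symptoms to confirm from logs: duplicated message IDs are the giveaway. A minimal sketch, assuming a hypothetical log format with the message ID in the second field (`/tmp/demo-worker.log` and the `m-*` IDs are illustrative):

```shell
# Sketch: surface message IDs processed more than once.
# Assumes a hypothetical log format: "<timestamp> <message-id> processed".
LOG=/tmp/demo-worker.log
printf '%s\n' "t1 m-1 processed" "t2 m-2 processed" "t3 m-1 processed" > "$LOG"
awk '{print $2}' "$LOG" | sort | uniq -d    # prints each duplicated ID once
```

Here `m-1` is reported because it appears twice; a real log format will need a different field index or a stricter pattern.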
Common Causes
- Workers crash or time out before the queue sees an acknowledgement
- Retry policy is faster than the system can drain the backlog
- Leader-election or scheduling windows are too short for the real environment
- Workers start consuming before their dependencies or config are actually ready
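The first cause, crashing before the acknowledgement, is easiest to reason about with a toy worker loop. A minimal sketch, assuming a hypothetical file-per-message queue under `/tmp/demo-queue` where deleting the message file stands in for the acknowledgement: the delete happens only after the handler succeeds, so a crash mid-processing leaves the message for redelivery instead of losing it.

```shell
#!/bin/sh
# At-least-once worker sketch over a hypothetical file-per-message queue.
# Deleting the message file stands in for the acknowledgement.
QUEUE=/tmp/demo-queue
mkdir -p "$QUEUE"
echo "hello" > "$QUEUE/msg-1"

handle() {                     # stand-in for real processing
  grep -q "hello" "$1"
}

for msg in "$QUEUE"/*; do
  [ -f "$msg" ] || continue
  if handle "$msg"; then
    rm -f "$msg"               # ack only after success
    echo "acked $(basename "$msg")"
  else
    echo "leaving $(basename "$msg") for redelivery"
  fi
done
```

The flip side is duplicates: a crash after the handler but before the ack means the message is processed again on redelivery, which is why at-least-once handlers need to be idempotent.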
Step-by-Step Fix
1. Inspect the live state first
Capture the active runtime path before changing anything so you know whether the process is stale, partially rolled, or reading the wrong dependency.
date -u
printenv | sort | head -80
grep -R "error\|warn\|timeout\|retry\|version" logs . 2>/dev/null | tail -80
2. Compare the active configuration with the intended one
Look for drift between the live process and the deployment or configuration files it should be following.
grep -R "timeout\|retry\|path\|secret\|buffer\|cache\|lease\|schedule" config deploy . 2>/dev/null | head -120
3. Apply one explicit fix path
Prefer one clear configuration change over several partial tweaks so every instance converges on the same behavior.
retry:
maxAttempts: 5
backoff: exponential
ackMode: explicit
leaderElection:
leaseDuration: 30s
renewDeadline: 20s
4. Verify the full request or worker path end to end
Retest the same path that was failing rather than assuming a green deployment log means the runtime has recovered.
grep -R "processed\|retry\|dead letter\|scheduled\|leader" logs . 2>/dev/null | tail -120
curl -s https://example.com/worker-status | head
Prevention
- Publish active version, config, and runtime identity in one observable place
- Verify the real traffic path after every rollout instead of relying on one green health log
- Treat caches, workers, and background consumers as part of the same production system
- Keep one source of truth for credentials, timeouts, routing, and cleanup rules
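The sample values in the step-3 configuration can also be sanity-checked against each other. A sketch, assuming a hypothetical 1s base delay for the exponential backoff (the fragment does not set one): five attempts wait 1+2+4+8+16 = 31s in the worst case, which already exceeds the 30s lease, so a node still working through its retries can lose leadership mid-backlog.

```shell
# Sketch: worst-case retry span for maxAttempts=5 with exponential backoff,
# assuming a hypothetical 1s base delay doubled on each attempt.
base=1; attempts=5
total=0; delay=$base; i=1
while [ "$i" -le "$attempts" ]; do
  total=$((total + delay))
  delay=$((delay * 2))
  i=$((i + 1))
done
echo "worst-case retry span: ${total}s"   # 1+2+4+8+16 = 31s, past a 30s lease
```

If the retry span outlives the lease, cap the backoff or lengthen the lease so one node can finish its retries before another is allowed to claim the schedule.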