Introduction
Message brokers guarantee ordering within a partition: messages are delivered in the order they were appended. However, during broker failover with unclean leader election, the new leader may lack messages that the old leader had already accepted, causing sequence gaps or reordering. This breaks consumer logic that depends on strict ordering, such as event sourcing or state machine transitions.
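The failure mode described above can be illustrated with a toy model. The sketch below is not broker code: `Replica` and `simulate_unclean_failover` are hypothetical names, and replication is reduced to copying sequence numbers between two in-memory lists. It only shows why promoting a lagging replica leaves a hole in the log.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica that stores an append-only list of sequence numbers."""
    name: str
    log: list = field(default_factory=list)


def simulate_unclean_failover(total, replicated_upto):
    """Append `total` messages to the leader, copy only the first
    `replicated_upto` of them to the follower, then promote the follower.

    Returns (new_leader_log, lost_sequence_numbers).
    """
    leader = Replica("old-leader")
    follower = Replica("lagging-follower")

    for seq in range(1, total + 1):
        leader.log.append(seq)
        # The follower lags: it has only copied the earliest messages.
        if seq <= replicated_upto:
            follower.log.append(seq)

    # The old leader crashes and the out-of-sync follower is elected anyway.
    new_leader = follower
    lost = [seq for seq in leader.log if seq not in new_leader.log]
    return new_leader.log, lost


if __name__ == "__main__":
    log, lost = simulate_unclean_failover(total=10, replicated_upto=7)
    print("new leader log:", log)
    print("lost after failover:", lost)
```

In this model, messages 8 to 10 were accepted by the old leader but never reached the follower, so a consumer reading from the new leader sees the log end at 7 and later appends continue from there.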
Symptoms
- Consumers receive messages out of sequence number order
- State machine transitions fail because prerequisite events are missing
- Event-sourced aggregates produce incorrect state due to missing intermediate events
- Consumer logs show sequence number gaps or backward jumps
- Error message:
Sequence number 4523 received after 4530, ordering violated
Common Causes
- Unclean leader election is enabled, allowing an out-of-sync replica to become leader
- Broker crash before in-flight writes are flushed to disk, losing recent messages
- A network partition causes a leader change while the ISR (in-sync replica set) is missing recent commits
- Producer sends messages with acks=1 instead of acks=all, not waiting for full replication
- Consumer processes messages asynchronously, reordering within the application layer
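The configuration-related causes can be checked mechanically. The following is a minimal sketch, assuming the settings have already been loaded into plain dictionaries. `find_ordering_risks` is a hypothetical helper, not part of any client library; the setting names are the ones quoted in this article. A missing `acks` value is treated as risky because defaults differ between client versions.

```python
def find_ordering_risks(producer_config, topic_config):
    """Return a list of human-readable warnings for risky settings.

    Both arguments are plain dicts of setting name -> value (hypothetical
    inputs; load them however your deployment stores configuration).
    """
    risks = []

    # acks=1 acknowledges after the leader write only, before replication.
    # "-1" is accepted as an equivalent spelling of "all".
    if str(producer_config.get("acks", "")).lower() not in ("all", "-1"):
        risks.append("producer acks is not 'all'")

    # Unclean election lets an out-of-sync replica become leader.
    unclean = topic_config.get("unclean.leader.election.enable", "false")
    if str(unclean).lower() == "true":
        risks.append("unclean.leader.election.enable is true")

    # With fewer than 2 required in-sync replicas, one broker may hold the only copy.
    if int(topic_config.get("min.insync.replicas", 1)) < 2:
        risks.append("min.insync.replicas is below 2")

    return risks


if __name__ == "__main__":
    producer = {"acks": "1"}
    topic = {"unclean.leader.election.enable": "true", "min.insync.replicas": "1"}
    for risk in find_ordering_risks(producer, topic):
        print("WARNING:", risk)
```

A check like this only covers configuration. Application-level reordering from asynchronous processing has to be found in the consumer code itself.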
Step-by-Step Fix
| jq -r '.sequenceNumber' |
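One way to confirm the problem is to extract the sequence numbers from consumed messages and scan them for gaps and backward jumps. The sketch below assumes each message is a JSON object on its own line with a numeric `sequenceNumber` field, the field name used in the jq fragment above. `find_ordering_violations` is a hypothetical helper name, and the one-message-per-line layout is an assumption about how the messages were dumped.

```python
import json


def find_ordering_violations(lines):
    """Scan JSON-encoded messages and report sequence anomalies.

    Returns a list of (kind, highest_seen, current) tuples, where kind is
    "gap", "backward", or "duplicate".
    """
    violations = []
    highest = None  # highest sequence number seen so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seq = int(json.loads(line)["sequenceNumber"])
        if highest is not None:
            if seq == highest:
                violations.append(("duplicate", highest, seq))
            elif seq < highest:
                violations.append(("backward", highest, seq))
            elif seq > highest + 1:
                violations.append(("gap", highest, seq))
        if highest is None or seq > highest:
            highest = seq
    return violations


if __name__ == "__main__":
    sample = [
        '{"sequenceNumber": 4529}',
        '{"sequenceNumber": 4530}',
        '{"sequenceNumber": 4523}',
        '{"sequenceNumber": 4533}',
    ]
    for kind, highest, current in find_ordering_violations(sample):
        print(f"{kind}: sequence number {current} received after {highest}")
```

Passing `sys.stdin` as `lines` lets the function read a message dump piped in from another command. The scan compares each message against the highest number seen so far, so a single late message is reported once and does not mask the messages that follow it.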
Prevention
- Always set acks=all on producers to ensure messages are replicated before acknowledgment
- Disable unclean.leader.election.enable in production to prevent out-of-sync leader election
- Use min.insync.replicas=2 or higher to guarantee message durability across multiple brokers
- Implement sequence number validation in consumers that rejects out-of-order messages
- For critical ordering requirements, use single-partition topics or partition by a strict ordering key
- Monitor ISR shrink events as an early warning indicator of potential ordering violations
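Partitioning by an ordering key works because every message with the same key lands in the same partition, where append order is preserved. The sketch below illustrates the idea with a deterministic hash. `partition_for_key` is a hypothetical function, and SHA-256 is used here only for illustration; real client libraries ship their own partitioners with different hash functions.

```python
import hashlib


def partition_for_key(key, num_partitions):
    """Map an ordering key to a partition index deterministically."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Hash the key bytes so the result is stable across processes and restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    # All events for one aggregate share a key, so they share a partition.
    for event in ("created", "paid", "shipped"):
        print(event, "->", partition_for_key("order-1001", 6))
```

Because the mapping depends on the partition count, changing the number of partitions can move a key to a different partition, which is worth planning for before relying on key-based ordering.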