Introduction

Message brokers guarantee ordering within a partition: messages are delivered in the order they were appended. However, during broker failover with unclean leader election, the new leader may lack messages that the old leader had already accepted, causing sequence gaps or reordering. This breaks consumer logic that depends on strict ordering, such as event sourcing or state machine transitions.
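
The failure mode described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: logs are plain Python lists, and the function name fail_over and the event names are hypothetical. It does not reproduce any broker's actual replication protocol.

```python
def fail_over(leader_log, replica_log, new_writes):
    """Toy model: an out-of-sync replica becomes leader and accepts new writes."""
    # Messages the old leader held that the replica never copied are lost.
    lost = leader_log[len(replica_log):]
    # The new leader appends at its own log end, so it reuses the positions
    # that the lost messages occupied on the old leader.
    new_log = replica_log + new_writes
    return new_log, lost


old_leader = ["e1", "e2", "e3", "e4", "e5"]
lagging_replica = ["e1", "e2", "e3"]  # out of sync: missing e4 and e5

new_log, lost = fail_over(old_leader, lagging_replica, ["e6", "e7"])
print(new_log)  # ['e1', 'e2', 'e3', 'e6', 'e7']
print(lost)     # ['e4', 'e5']
```

A consumer that had already read e4 and e5 from the old leader would now find different messages at the same positions, which is what appears downstream as a gap or a backward jump.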

Symptoms

  • Consumers receive messages out of sequence number order
  • State machine transitions fail because prerequisite events are missing
  • Event-sourced aggregates produce incorrect state due to missing intermediate events
  • Consumer logs show sequence number gaps or backward jumps
  • Error message: Sequence number 4523 received after 4530, ordering violated

Common Causes

  • Unclean leader election is enabled, allowing an out-of-sync replica to become leader
  • Broker crashes before in-flight writes are flushed to disk, losing recent messages
  • Network partition triggers a leader change while the in-sync replica set (ISR) is missing recent commits
  • Producer sends messages with acks=1 instead of acks=all, so writes are acknowledged before they are fully replicated
  • Consumer processes messages asynchronously, reordering them in the application layer
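
Two of the causes above are plain configuration values, so they can be checked mechanically. The following is a minimal sketch, assuming the settings have already been collected into one dictionary. The function name audit_settings is hypothetical, and in practice acks is a producer setting while unclean.leader.election.enable is set on the broker or topic.

```python
def audit_settings(settings):
    """Return a warning for each setting that matches a cause listed above."""
    warnings = []

    # Only flag acks when it is set explicitly; a missing key is left alone.
    acks = settings.get("acks")
    if acks is not None and str(acks).lower() not in ("all", "-1"):
        warnings.append(
            f"acks={acks}: writes are acknowledged before full replication"
        )

    unclean = settings.get("unclean.leader.election.enable", "false")
    if str(unclean).lower() == "true":
        warnings.append(
            "unclean.leader.election.enable=true: "
            "an out-of-sync replica can become leader"
        )

    return warnings


risky = {"acks": "1", "unclean.leader.election.enable": "true"}
for warning in audit_settings(risky):
    print(warning)
```

The other causes (crashes before flush, network partitions, asynchronous consumers) are runtime behavior and cannot be detected from configuration alone.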

Step-by-Step Fix

# Print the sequenceNumber field of each JSON message read from stdin, one per line
jq -r '.sequenceNumber'
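
Once the sequence numbers have been extracted with the command above, they can be scanned for gaps and backward jumps. This is a rough sketch rather than a definitive tool: it assumes sequence numbers are integers that increase by exactly 1 within a partition, and the function name find_violations is hypothetical.

```python
def find_violations(sequence_numbers):
    """Report gaps and backward jumps in a stream of sequence numbers."""
    violations = []
    highest = None  # highest sequence number seen so far

    for seq in sequence_numbers:
        if highest is not None:
            if seq <= highest:
                # Backward jump or duplicate.
                violations.append(
                    f"Sequence number {seq} received after {highest}, "
                    "ordering violated"
                )
            elif seq > highest + 1:
                # One or more messages were skipped.
                violations.append(f"Gap: expected {highest + 1}, got {seq}")
        if highest is None or seq > highest:
            highest = seq

    return violations


sample = [4521, 4522, 4530, 4523, 4531]
for line in find_violations(sample):
    print(line)
```

For the sample input this reports a gap at 4530 followed by the backward jump to 4523. To feed it real data, the jq output could be piped into a script that passes (int(line) for line in sys.stdin) to the function.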

Prevention

  • Always set acks=all on producers to ensure messages are replicated before acknowledgment
  • Set unclean.leader.election.enable=false in production so that an out-of-sync replica cannot be elected leader
  • Use min.insync.replicas=2 or higher together with acks=all, so that every acknowledged write exists on at least two brokers
  • Implement sequence number validation in consumers that rejects out-of-order messages
  • For critical ordering requirements, use single-partition topics or partition by a strict ordering key
  • Monitor ISR shrink events as an early warning indicator of potential ordering violations
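
The consumer-side sequence validation recommended above might look like the following. It is a minimal sketch, assuming each message carries an ordering key and an integer sequence number that increases by 1 per key. The class name SequenceValidator and the key "order-17" are hypothetical.

```python
class SequenceValidator:
    """Tracks the last accepted sequence number for each ordering key."""

    def __init__(self):
        self._last = {}

    def accept(self, key, seq):
        """Return True if seq is the next expected number for key."""
        last = self._last.get(key)
        if last is not None and seq != last + 1:
            # Gap, duplicate, or backward jump: reject without advancing.
            return False
        # The first message seen for a key sets the baseline.
        self._last[key] = seq
        return True


validator = SequenceValidator()
results = [validator.accept("order-17", s) for s in (1, 2, 4, 3)]
print(results)  # [True, True, False, True]
```

Because a rejected message does not advance the tracked position, message 3 is still accepted after 4 was rejected. What happens to rejected messages (retry, dead-letter queue, alert) depends on the application and is not covered here.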