Introduction

Kafka brokers roll log segments when they reach a configured size or time threshold. During rolling, the broker closes the current active segment file and opens a new one. Under peak write throughput, this file operation can cause brief but noticeable write latency spikes that cascade into producer timeouts, especially when segment size is misconfigured for the traffic volume.

Symptoms

  • Producer request latency spikes coincide with log segment rollover events
  • Broker logs show Rolling new log segment for partition during peak hours
  • Producer timeout rate increases at regular intervals matching segment roll frequency
  • p99 produce latency jumps from single-digit milliseconds to hundreds of milliseconds
  • Error message: Expiring record batch due to timeout

Common Causes

  • log.segment.bytes set too small, causing frequent segment rolls under high throughput
  • log.roll.hours triggering time-based rolls during peak traffic windows
  • Disk I/O bottleneck during segment file close and index flush operations
  • Too many partitions per broker amplifying concurrent segment roll operations
  • File system page cache pressure from simultaneous segment creation and index building

Step-by-Step Fix

  1. 1.Check current segment configuration and roll frequency: Identify how often segments are rolling.
  2. 2.```bash
  3. 3.grep "log.segment" /etc/kafka/server.properties
  4. 4.ls -lhS /var/lib/kafka/data/my-topic-0/ | head -20
  5. 5.`
  6. 6.Increase log segment size to reduce roll frequency during peak hours: Size segments for the workload.
  7. 7.```properties
  8. 8.log.segment.bytes=1073741824
  9. 9.log.roll.hours=168
  10. 10.log.roll.jitter.hours=24
  11. 11.`
  12. 12.Stagger segment roll timing with jitter: Prevent all partitions from rolling simultaneously.
  13. 13.```properties
  14. 14.log.roll.jitter.hours=24
  15. 15.`
  16. 16.Optimize disk I/O for segment operations: Ensure the broker disk can handle concurrent segment writes.
  17. 17.```bash
  18. 18.# Check disk I/O during segment roll
  19. 19.iostat -x 1 10 | grep -A1 "kafka-data"
  20. 20.`
  21. 21.Monitor produce latency around segment roll events: Verify the fix reduces latency spikes.
  22. 22.```bash
  23. 23.kafka-broker-api-versions.sh --bootstrap-server localhost:9092
  24. 24.# Monitor via JMX: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
  25. 25.`

Prevention

  • Set log.segment.bytes to 1GB or higher for high-throughput topics to reduce roll frequency
  • Configure log.roll.jitter.hours to stagger segment rolls across partitions
  • Monitor disk IOPS and throughput to ensure the storage tier can handle peak write load plus segment roll overhead
  • Use dedicated fast storage (NVMe SSD) for Kafka data directories to minimize segment roll latency
  • Set log.roll.hours to 7 days to rely primarily on size-based rolling rather than time-based