Introduction
Kafka brokers roll log segments when they reach a configured size or time threshold. During rolling, the broker closes the current active segment file and opens a new one. Under peak write throughput, this file operation can cause brief but noticeable write latency spikes that cascade into producer timeouts, especially when segment size is misconfigured for the traffic volume.
Symptoms
- Producer request latency spikes coincide with log segment rollover events
- Broker logs show `Rolling new log segment for partition` during peak hours
- Producer timeout rate increases at regular intervals matching segment roll frequency
- `p99` produce latency jumps from single-digit milliseconds to hundreds of milliseconds
- Error message: `Expiring record batch due to timeout`
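
One way to confirm that spikes line up with roll events is to bucket the roll messages in the broker log by minute and compare those buckets with the latency graph. The helper below is a minimal sketch, not an official tool: `matches_per_minute` is a hypothetical name, and it assumes every log line starts with a bracketed timestamp such as `[2024-01-01 10:05:00,123]`. Adjust the `cut` range if your logging pattern differs.

```bash
# Count log lines containing a fixed string, grouped by minute.
# $1 = broker log file, $2 = fixed-string pattern to search for
# Assumption: lines begin with "[YYYY-MM-DD HH:MM:SS,mmm]", so characters
# 2-17 hold the date plus hour and minute.
matches_per_minute() {
  grep -F -- "$2" "$1" | cut -c2-17 | sort | uniq -c
}
```

A call might look like `matches_per_minute /var/log/kafka/server.log "Rolling new log segment"`, where the log path is an assumption about your installation. Minutes with unusually high counts are the ones to compare against producer latency.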
Common Causes
- `log.segment.bytes` set too small, causing frequent segment rolls under high throughput
- `log.roll.hours` triggering time-based rolls during peak traffic windows
- Disk I/O bottleneck during segment file close and index flush operations
- Too many partitions per broker amplifying concurrent segment roll operations
- File system page cache pressure from simultaneous segment creation and index building
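
To see how segment size, throughput, and partition count interact, a rough estimate helps. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the function names are hypothetical, the write rate is the on-disk (post-compression) rate per partition, every partition receives the same traffic, and only size-based rolls are counted.

```bash
# Seconds between size-based rolls for a single partition.
# $1 = log.segment.bytes, $2 = per-partition write rate in bytes/second (> 0)
roll_interval_seconds() {
  echo $(( $1 / $2 ))
}

# Approximate size-based rolls per hour across one broker (integer division).
# $1 = log.segment.bytes, $2 = per-partition bytes/second, $3 = partitions on the broker
broker_rolls_per_hour() {
  echo $(( $3 * 3600 * $2 / $1 ))
}

# Illustrative numbers only: 1 MiB/s per partition, 200 partitions.
roll_interval_seconds 104857600 1048576       # 100 MiB segments -> 100
broker_rolls_per_hour 104857600 1048576 200   # -> 7200
broker_rolls_per_hour 1073741824 1048576 200  # 1 GiB segments -> 703
```

With these made-up figures, moving from 100 MiB to 1 GiB segments cuts broker-wide rolls roughly tenfold, which is why a small segment size combined with many partitions is a common cause.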
Step-by-Step Fix
1. Check current segment configuration and roll frequency: Identify how often segments are rolling.

   ```bash
   # Show the configured segment size settings
   grep "log.segment" /etc/kafka/server.properties
   # List segment files, largest first; compare .log timestamps to estimate the roll interval
   ls -lhS /var/lib/kafka/data/my-topic-0/ | head -20
   ```

2. Increase log segment size to reduce roll frequency during peak hours: Size segments for the workload.

   ```properties
   log.segment.bytes=1073741824
   log.roll.hours=168
   log.roll.jitter.hours=24
   ```

3. Stagger segment roll timing with jitter: Prevent all partitions from rolling simultaneously.

   ```properties
   log.roll.jitter.hours=24
   ```

4. Optimize disk I/O for segment operations: Ensure the broker disk can handle concurrent segment writes.

   ```bash
   # Check disk I/O during segment roll
   # "kafka-data" must match the name of the device backing the data directory
   iostat -x 1 10 | grep -A1 "kafka-data"
   ```

5. Monitor produce latency around segment roll events: Verify the fix reduces latency spikes.

   ```bash
   # Confirm the broker is reachable
   kafka-broker-api-versions.sh --bootstrap-server localhost:9092
   # Monitor via JMX: kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
   ```
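
The directory listing in the first step leaves roll frequency to be read off by eye. A small helper can make it explicit by grouping segment files by the hour they were last written. This is a minimal sketch under stated assumptions: `rolls_per_hour` is a hypothetical name, it relies on GNU `find` (for `-printf`), and it treats a closed segment's modification time as an approximation of when it was rolled, which only holds if nothing else has touched the files since.

```bash
# Count .log segment files in one partition directory, grouped by the hour
# of their last modification.
# $1 = partition directory, e.g. /var/lib/kafka/data/my-topic-0
# Assumption: mtime of a closed segment is close to its roll time.
rolls_per_hour() {
  find "$1" -maxdepth 1 -name '*.log' -printf '%TY-%Tm-%Td %TH:00\n' \
    | sort | uniq -c
}
```

Running `rolls_per_hour /var/lib/kafka/data/my-topic-0` before and after the configuration change gives a quick comparison. Several segments per hour during peak windows suggests the segment size is still too small for the traffic.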
Prevention
- Set `log.segment.bytes` to 1GB or higher for high-throughput topics to reduce roll frequency
- Configure `log.roll.jitter.hours` to stagger segment rolls across partitions
- Monitor disk IOPS and throughput to ensure the storage tier can handle peak write load plus segment roll overhead
- Use dedicated fast storage (NVMe SSD) for Kafka data directories to minimize segment roll latency
- Set `log.roll.hours` to 7 days to rely primarily on size-based rolling rather than time-based
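
The first two recommendations can be checked automatically, for example in a deployment pipeline. The sketch below is one possible approach, not a standard Kafka tool: `prop_value` and `check_segment_config` are hypothetical helpers, the 1 GiB threshold comes from the guidance above, and a missing `log.segment.bytes` is treated as acceptable on the assumption that the broker default (1 GiB) applies.

```bash
# Print the value of a key from a Java-style properties file (last match wins).
# $1 = properties file, $2 = exact key name
prop_value() {
  awk -F= -v key="$2" '
    { k = $1; gsub(/[[:space:]]/, "", k) }
    k == key { v = $2; gsub(/[[:space:]]/, "", v); val = v }
    END { print val }
  ' "$1"
}

# Warn when segment settings go against the prevention guidance.
# $1 = properties file; returns 1 if any warning was printed, else 0.
check_segment_config() {
  bytes=$(prop_value "$1" log.segment.bytes)
  jitter=$(prop_value "$1" log.roll.jitter.hours)
  status=0
  if [ -n "$bytes" ] && [ "$bytes" -lt 1073741824 ]; then
    echo "WARN log.segment.bytes=$bytes is below 1 GiB"
    status=1
  fi
  if [ -z "$jitter" ] || [ "$jitter" -eq 0 ]; then
    echo "WARN log.roll.jitter.hours is unset or 0; time-based rolls are not staggered"
    status=1
  fi
  return $status
}
```

A call such as `check_segment_config /etc/kafka/server.properties` prints any warnings and returns a non-zero status, so it can gate a rollout. Topic-level overrides are not covered by this sketch.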