Introduction
Message brokers monitor disk usage using low and high watermarks. When disk usage crosses the high watermark threshold, the broker stops accepting new messages to prevent disk exhaustion and potential data corruption. This protection mechanism causes immediate production failures across all topics and partitions stored on the affected disk.
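The watermark mechanism described above boils down to a two-threshold check. A minimal sketch (the function name and percentages are illustrative assumptions, not broker configuration):

```shell
#!/bin/sh
# Hypothetical sketch of watermark logic: given current disk usage (percent),
# a low watermark, and a high watermark, decide what the broker does.
watermark_state() {
  usage=$1; low=$2; high=$3
  if [ "$usage" -ge "$high" ]; then
    echo "block-writes"   # high watermark crossed: stop accepting messages
  elif [ "$usage" -ge "$low" ]; then
    echo "warn"           # low watermark crossed: log warnings, raise alerts
  else
    echo "ok"
  fi
}

watermark_state 92 80 90   # prints: block-writes
watermark_state 85 80 90   # prints: warn
```

Note that crossing the high watermark affects every topic and partition on the disk at once, which is why production fails across the board rather than for a single topic.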
Symptoms
- Producers receive `RESOURCE_ERROR` or disk full exceptions when attempting to publish
- Broker logs contain `disk watermark exceeded` or `free disk space below threshold` warnings
- Broker rejects connections or enters read-only mode for affected partitions
- Consumer processing stalls as no new messages arrive
- Monitoring dashboards show disk usage above 90% with flatlined production rates
Common Causes
- Log retention period set too high, accumulating more data than disk capacity
- Consumer lag causing message backlog that prevents log segment cleanup
- Unexpected traffic spike producing messages faster than retention policies can compact
- Disk not sized to handle peak message volume plus retention buffer
- Compaction or deletion policies failing silently, preventing old segment cleanup
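Consumer lag, the second cause above, shows up in the output of `kafka-consumer-groups.sh --describe`. The sketch below sums the LAG column from a captured sample of that output; in practice you would pipe the live command into the same `awk` filter. The group, topic, and offset numbers are made up for illustration:

```shell
#!/bin/sh
# Live usage would look like:
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --describe --group my-group | awk 'NR > 1 { sum += $5 } END { print sum }'
sample_output='TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-topic   0          1000            1500            500
my-topic   1          2000            2300            300'

# Skip the header row, sum the 5th (LAG) column.
total_lag=$(printf '%s\n' "$sample_output" | awk 'NR > 1 { sum += $5 } END { print sum }')
echo "$total_lag"   # prints: 800
```

A total lag that grows without bound means segments stay pinned past their useful life and retention cannot keep up.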
Step-by-Step Fix
1. Check current disk usage and watermark configuration: Identify how far over the threshold the disk has gone.

```bash
df -h /var/lib/kafka/data
# Check broker watermark settings in server.properties
grep "log.disk" /etc/kafka/server.properties
```

2. Reduce log retention period temporarily to free disk space: Lower retention to trigger immediate segment deletion.

```bash
# Reduce retention to 1 hour temporarily
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-topic \
  --add-config retention.ms=3600000
```

3. Force log cleanup to run immediately: Trigger the log cleaner to reclaim disk space.

```properties
log.cleaner.enable=true
log.cleaner.threads=4
log.cleanup.policy=delete
```

4. Delete or truncate non-critical topics: Remove temporary or test topics consuming disk space.

```bash
kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic test-topic
kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic staging-events
```

5. Expand disk capacity or add broker nodes: If the workload has permanently outgrown current capacity, add capacity and rebalance partitions across brokers.

```bash
# Check partition distribution across brokers
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --generate --topics-to-move-json-file topics.json --broker-list "0,1,2,3"
```
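After the cleanup steps above, confirm that usage has actually fallen back below the high watermark before restoring normal retention. A small sketch that parses the use% column of `df -P` output; the sample line and 90% threshold are illustrative assumptions:

```shell
#!/bin/sh
# Extract the use% column from `df -P` output; in practice feed it with
#   df -P /var/lib/kafka/data | usage_pct
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Sample df -P output with made-up numbers:
sample='Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/sda1        103081248  95837561   7243687      93% /var/lib/kafka/data'

usage=$(printf '%s\n' "$sample" | usage_pct)
if [ "$usage" -lt 90 ]; then
  echo "below watermark: ${usage}%"
else
  echo "still above watermark: ${usage}%"   # prints: still above watermark: 93%
fi
```

Once usage is comfortably below the watermark, the temporary override from step 2 can be removed with `kafka-configs.sh --alter --delete-config retention.ms` so the topic returns to its cluster-default retention.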
Prevention
- Set disk high watermark at 85% of total capacity to provide adequate buffer before disk exhaustion
- Configure automated alerts at 70% (warning) and 80% (critical) disk usage
- Size disk to handle at least 3x the expected daily message volume at peak retention
- Enable log compaction for topics where only the latest value per key matters
- Monitor consumer lag continuously, as persistent lag is the primary cause of retention backlog
- Implement tiered storage to offload older segments to cheaper object storage
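The sizing guideline above can be turned into quick arithmetic. One reading of "3x daily volume at peak retention" is daily volume times retention days times a 3x headroom factor; the figures below are placeholders, not a recommendation:

```shell
#!/bin/sh
# Rough disk sizing: daily ingest (GB) x retention (days) x replicas held on
# this disk x headroom multiplier. All inputs are hypothetical examples.
size_disk_gb() {
  daily_gb=$1; retention_days=$2; replicas=$3; headroom=$4
  echo $(( daily_gb * retention_days * replicas * headroom ))
}

size_disk_gb 50 7 1 3   # 50 GB/day, 7-day retention, 3x headroom: prints 1050
```

Re-run the arithmetic whenever traffic patterns or retention settings change, since a disk sized for yesterday's volume is a common way to rediscover the high watermark.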