Introduction

Message brokers monitor disk usage using low and high watermarks. When disk usage crosses the high watermark threshold, the broker stops accepting new messages to prevent disk exhaustion and potential data corruption. This protection causes immediate publish failures on every topic and partition stored on the affected disk.
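The two thresholds can be read as a small state machine. The sketch below is one possible interpretation, assuming the low watermark is the level that usage must fall back under before writes resume; that behavior varies by broker, and the 75/85 values and the function name are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical two-watermark state machine; thresholds are illustrative.
LOW_WATERMARK=75    # percent used: writes resume once usage falls below this (assumption)
HIGH_WATERMARK=85   # percent used: writes are blocked at or above this

# next_state CURRENT_STATE USED_PCT -> prints "accepting" or "blocked"
next_state() {
  local state=$1 used=$2
  if [ "$used" -ge "$HIGH_WATERMARK" ]; then
    echo blocked
  elif [ "$used" -lt "$LOW_WATERMARK" ]; then
    echo accepting
  else
    # Between the watermarks the previous state is kept (hysteresis).
    echo "$state"
  fi
}

# Walk through a rise above the high watermark and a gradual recovery.
state=accepting
for used in 70 86 80 74; do
  state=$(next_state "$state" "$used")
  echo "used=${used}% -> ${state}"
done
```

Keeping the previous state between the two thresholds avoids flapping between accepting and blocked while usage hovers near a single limit.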

Symptoms

  • Producers receive RESOURCE_ERROR or disk full exceptions when attempting to publish
  • Broker logs contain disk watermark exceeded or free disk space below threshold warnings
  • Broker rejects connections or enters read-only mode for affected partitions
  • Consumer processing stalls as no new messages arrive
  • Monitoring dashboards show disk usage above 90% with flatlined production rates
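A quick way to confirm the log symptom is to count matching lines in the broker log. This is a minimal sketch: the patterns simply mirror the wording of the symptoms above, the real messages depend on the broker and version, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# count_disk_warnings LOG_FILE -> prints the number of lines that mention disk pressure.
# The patterns are assumptions based on the symptom wording, not exact broker messages.
count_disk_warnings() {
  grep -ciE 'disk watermark exceeded|free disk space below threshold|disk full|RESOURCE_ERROR' "$1"
}
```

For example, `count_disk_warnings /var/log/kafka/server.log` prints the match count, where the log path is an assumption about the installation.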

Common Causes

  • Log retention period set too long, so retained data exceeds disk capacity
  • Consumer lag building a backlog on brokers that hold messages until they are consumed (Kafka applies retention regardless of consumer progress)
  • Unexpected traffic spike producing messages faster than retention can delete old segments
  • Disk not sized to handle peak message volume plus retention buffer
  • Compaction or deletion policies failing silently, preventing old segment cleanup
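To narrow down which of these causes applies, it helps to see which partitions hold the most data. A minimal sketch, assuming a Kafka-style layout with one directory per topic partition under the data directory; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# largest_partitions DATA_DIR [COUNT] -> prints "size_kb<TAB>path" for the largest
# subdirectories of DATA_DIR, biggest first (default: top 10).
largest_partitions() {
  local data_dir=$1 count=${2:-10}
  du -sk "$data_dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_partitions /var/lib/kafka/data 5` lists the five largest partition directories. A single topic dominating the output points to retention or traffic on that topic rather than an undersized disk.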

Step-by-Step Fix

  1. Check current disk usage and watermark configuration: Identify how far over the threshold the disk has gone.
     ```bash
     df -h /var/lib/kafka/data
     # Kafka has no built-in disk watermark setting, so list the data
     # directories and retention limits in server.properties instead
     grep -E "^log\.(dirs?|retention)" /etc/kafka/server.properties
     ```
  2. Reduce log retention period temporarily to free disk space: Lower retention so that expired segments are deleted on the broker's next retention check (log.retention.check.interval.ms, 5 minutes by default).
     ```bash
     # Reduce retention to 1 hour temporarily (3600000 ms)
     kafka-configs.sh --bootstrap-server localhost:9092 \
       --alter --entity-type topics --entity-name my-topic \
       --add-config retention.ms=3600000
     ```
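  3. Confirm log cleanup is enabled: Check that the broker's cleanup settings in server.properties allow old segments to be reclaimed.
     ```properties
     # The cleaner threads compact topics that use log compaction
     log.cleaner.enable=true
     log.cleaner.threads=4
     # Delete segments once they exceed the retention limits
     log.cleanup.policy=delete
     ```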
  4. Delete or truncate non-critical topics: Remove temporary or test topics consuming disk space.
     ```bash
     kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic test-topic
     kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic staging-events
     ```
  5. Expand disk capacity or add broker nodes: If the workload has permanently outgrown current capacity, add disk or brokers and spread partitions across them.
     ```bash
     # Generate a candidate reassignment plan across brokers 0-3.
     # topics.json lists the topics to move, e.g. {"topics":[{"topic":"my-topic"}],"version":1}
     kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
       --generate --topics-to-move-json-file topics.json --broker-list "0,1,2,3"
     ```
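
The temporary retention change from the steps above has to be reverted once disk usage recovers, or the topic keeps discarding data after one hour. The helpers below are a hedged sketch that only prints the commands for review; the function names and the BOOTSTRAP variable are hypothetical, while the kafka-configs.sh options are the same ones used in the fix plus `--delete-config`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers around the temporary retention change.
# They print commands instead of running them, so they can be reviewed first.
BOOTSTRAP=${BOOTSTRAP:-localhost:9092}

# hours_to_ms HOURS -> prints the equivalent number of milliseconds
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# lower_retention_cmd TOPIC HOURS -> prints the command that sets a temporary retention
lower_retention_cmd() {
  echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --alter --entity-type topics --entity-name $1 --add-config retention.ms=$(hours_to_ms "$2")"
}

# restore_retention_cmd TOPIC -> prints the command that removes the topic override,
# so the topic falls back to the broker default retention
restore_retention_cmd() {
  echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --alter --entity-type topics --entity-name $1 --delete-config retention.ms"
}

lower_retention_cmd my-topic 1
restore_retention_cmd my-topic
```

If the topic already had its own retention.ms before the incident, set that value back with `--add-config` rather than deleting the override.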

Prevention

  • Set disk high watermark at 85% of total capacity to provide adequate buffer before disk exhaustion
  • Configure automated alerts at 70% (warning) and 80% (critical) disk usage
  • Size disks to hold at least 3x the data retained at peak daily message volume over the full retention period
  • Enable log compaction for topics where only the latest value per key matters
  • Monitor consumer lag continuously; on brokers that hold messages until they are consumed, persistent lag is a common cause of backlog growth
  • Implement tiered storage to offload older segments to cheaper object storage
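
The alert thresholds above can be approximated with a small check suitable for cron or a monitoring agent. This is a minimal sketch: the 70% and 80% levels come from the list above, while the function names and the data directory path in the usage note are assumptions.

```bash
#!/usr/bin/env bash
# Alert thresholds from the prevention list, in percent of disk used.
WARN_PCT=70
CRIT_PCT=80

# alert_level USED_PCT -> prints ok | warning | critical
alert_level() {
  if [ "$1" -ge "$CRIT_PCT" ]; then
    echo critical
  elif [ "$1" -ge "$WARN_PCT" ]; then
    echo warning
  else
    echo ok
  fi
}

# used_pct PATH -> prints the integer percent used on the filesystem holding PATH
used_pct() {
  df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}
```

Running `alert_level "$(used_pct /var/lib/kafka/data)"` prints the current level, which leaves room to act before an 85% high watermark is reached.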