Introduction

Linux load average includes processes in both running (R) and uninterruptible sleep (D) states. When the load average is high but CPU utilization is low, the bottleneck is typically I/O wait - processes blocked waiting for disk, network filesystem, or storage subsystem responses. This is a critical distinction because adding more CPU will not help; the fix requires addressing the storage layer.
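The D-state contribution can be checked directly: counting processes in uninterruptible sleep gives a quick sense of how much of the load average is I/O-bound rather than CPU-bound. A minimal sketch using standard `ps`:

```bash
# Processes in uninterruptible sleep (state D) count toward load
# average but consume no CPU; a persistently non-zero count points at I/O.
ps -eo state= | awk '$1 ~ /^D/ {n++} END {print n+0}'
```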

Symptoms

  • uptime shows load average of 15+ on an 8-core system
  • top shows %wa (I/O wait) above 30% while %us and %sy stay low
  • vmstat 1 shows wa column consistently high
  • iostat -x 1 shows %util near 100% for one or more disks
  • Applications are slow but CPU usage graphs look normal
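The first symptom can be checked mechanically: a 1-minute load average above the core count is the usual "high load" threshold. A small sketch reading `/proc/loadavg` and `nproc`:

```bash
# Compare the 1-minute load average against the number of available cores.
load=$(cut -d' ' -f1 /proc/loadavg)
cores=$(nproc)
awk -v l="$load" -v c="$cores" \
    'BEGIN { print (l+0 > c+0 ? "load exceeds core count" : "load within core count") }'
```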

Common Causes

  • Failing disk with retry operations causing I/O latency
  • NFS mount to slow or unreachable server
  • RAID rebuild in progress consuming all I/O bandwidth
  • Log rotation or backup job creating massive write load
  • Swap thrashing causing excessive page I/O
  • Database performing full table scans on spinning disk
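Of these causes, swap thrashing is the easiest to confirm from userspace: the cumulative swap counters in `/proc/vmstat` should be nearly static on a healthy box. A one-second sample (Linux-specific):

```bash
# pswpout is the cumulative count of pages swapped out since boot;
# a large delta over one second indicates active swap thrashing.
a=$(awk '/^pswpout /{print $2}' /proc/vmstat)
sleep 1
b=$(awk '/^pswpout /{print $2}' /proc/vmstat)
echo "pages swapped out in 1s: $((b - a))"
```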

Step-by-Step Fix

  1. Confirm I/O wait is the bottleneck:

```bash
vmstat 1 10
# Check the 'wa' column - consistently above 20% confirms an I/O bottleneck

mpstat -P ALL 1 5
# Check %iowait per CPU core
```

  2. Identify which processes are generating I/O:

```bash
iotop -oP
# Shows real-time I/O usage per process

pidstat -d 1 5
# Shows I/O statistics per process

# Inspect cumulative I/O counters for a specific application's processes
for pid in $(pgrep -f "myapp"); do
    echo "PID $pid:"
    cat /proc/$pid/io 2>/dev/null
done
```

  3. Identify which disk is the bottleneck:

```bash
iostat -x 1 5
# Look for high %util and high await (average wait time)
# await > 50ms indicates a problem
# Note: svctm is unreliable and deprecated in recent sysstat; rely on await and %util
```
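If `iostat` is unavailable, await can be approximated from `/proc/diskstats` directly (fields 4 and 8 are completed reads and writes, fields 7 and 11 the milliseconds spent on them). A sketch; the device filter and one-second window are arbitrary choices:

```bash
# Approximate average I/O completion latency per device over 1 second.
snap() { awk '$3 !~ /^(loop|ram)/ {print $3, $4 + $8, $7 + $11}' /proc/diskstats; }
snap > /tmp/ds.1
sleep 1
snap > /tmp/ds.2
awk 'NR == FNR { ios[$1] = $2; ms[$1] = $3; next }
     { dio = $2 - ios[$1]; dms = $3 - ms[$1]
       if (dio > 0) printf "%s: avg await %.1f ms\n", $1, dms / dio }' /tmp/ds.1 /tmp/ds.2
```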
  4. Check for failing disk hardware:

```bash
dmesg | grep -iE "error|retry|reset|timeout|I/O" | tail -20
sudo smartctl -a /dev/sda | grep -E "Reallocated|Pending|Uncorrectable|UDMA"
# Non-zero Reallocated_Sector_Ct or Current_Pending_Sector values point to a failing drive
```
  5. Check for NFS-related I/O waits:

```bash
mount | grep nfs
nfsiostat
nfsstat -c
# A high retransmission count indicates NFS server or network issues
```
  6. Reduce I/O pressure immediately:

```bash
# Stop non-essential I/O-heavy services
sudo systemctl stop backup-service
sudo systemctl stop logrotate.timer

# Reduce I/O scheduler queue depth for latency-sensitive workloads
echo 32 | sudo tee /sys/block/sda/queue/nr_requests

# Set I/O scheduler to deadline or mq-deadline for better latency
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```
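When the offending process cannot simply be stopped, its I/O priority can be lowered instead. `ionice` (util-linux) moves a PID into the idle I/O class, so it only gets disk time when nothing else wants it. A sketch, demonstrated on the current shell's PID rather than a real workload:

```bash
# Put a process into the idle I/O scheduling class (class 3).
# $$ is the current shell; substitute the real offending PID.
ionice -c 3 -p $$
ionice -p $$   # report the class now in effect
```

Note that the idle class is only fully honored by schedulers that support I/O priorities (such as bfq); with mq-deadline the class is recorded but has limited effect.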

Prevention

  • Monitor I/O wait as a separate metric from CPU usage in your monitoring system
  • Use SSDs for I/O-intensive workloads; spinning disks should be limited to archival
  • Configure I/O scheduler appropriate for the workload (bfq for desktop, mq-deadline for servers)
  • Rate-limit backup and log rotation jobs using ionice or systemd IOWeight=
  • Use ioping for regular I/O latency benchmarking and alerting on degradation
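The `IOWeight=` approach above can be made permanent with a systemd drop-in. A sketch, assuming a unit named `backup-service` (hypothetical name) and cgroup v2, which `IOWeight=` requires:

```bash
# Drop-in limiting backup-service's share of disk bandwidth under contention.
sudo mkdir -p /etc/systemd/system/backup-service.service.d
sudo tee /etc/systemd/system/backup-service.service.d/io.conf <<'EOF'
[Service]
IOWeight=10
IOSchedulingClass=idle
EOF
sudo systemctl daemon-reload
```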