Introduction
Linux load average counts processes in both the running (R) and uninterruptible sleep (D) states. When the load average is high but CPU utilization is low, the bottleneck is typically I/O wait: processes blocked waiting on disk, network filesystem, or storage subsystem responses. The distinction matters because adding more CPU will not help; the fix requires addressing the storage layer.
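The R-versus-D distinction can be observed directly from `/proc` with standard tooling; a minimal sketch, assuming a Linux system with `ps` and `awk` (no extra packages):

```bash
# Current load averages (1, 5, 15 min) plus running/total task counts
cat /proc/loadavg

# List processes in uninterruptible sleep (state D) - these count
# toward load average even though they consume no CPU
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print "D-state:", $2, $3}'
```

A burst of D-state processes alongside low CPU usage is the classic signature of an I/O-bound system.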
Symptoms
- `uptime` shows a load average of 15+ on an 8-core system
- `top` shows low user/system CPU but `%wa` (I/O wait) above 30%
- `vmstat 1` shows the `wa` column consistently high
- `iostat -x 1` shows `%util` near 100% for one or more disks
- Applications are slow but CPU usage graphs look normal
Common Causes
- Failing disk with retry operations causing I/O latency
- NFS mount to slow or unreachable server
- RAID rebuild in progress consuming all I/O bandwidth
- Log rotation or backup job creating massive write load
- Swap thrashing causing excessive page I/O
- Database performing full table scans on spinning disk
Step-by-Step Fix
1. Confirm I/O wait is the bottleneck:

```bash
vmstat 1 10
# Check the 'wa' column - consistently above 20% confirms an I/O bottleneck
mpstat -P ALL 1 5  # Check %iowait per CPU core
```

2. Identify which processes are generating I/O:

```bash
iotop -oP
# Shows real-time I/O usage per process
pidstat -d 1 5  # Shows I/O statistics per process

# Find processes with files open on the busy disk
for pid in $(pgrep -f "myapp"); do
  echo "PID $pid:"
  cat /proc/$pid/io 2>/dev/null
done
```

3. Identify which disk is the bottleneck:

```bash
iostat -x 1 5
# Look for high %util and high await (average wait time)
# await > 50ms indicates a problem
# (svctm is deprecated in recent sysstat releases; rely on await and %util)
```

4. Check for failing disk hardware:

```bash
dmesg | grep -iE "error|retry|reset|timeout|I/O" | tail -20
smartctl -a /dev/sda | grep -E "Reallocated|Pending|Uncorrectable|UDMA"
```

5. Check for NFS-related I/O waits:

```bash
mount | grep nfs
nfsiostat
nfsstat -c
# A high retransmission count indicates NFS server issues
```

6. Reduce I/O pressure immediately:

```bash
# Stop non-essential I/O-heavy services
sudo systemctl stop backup-service
sudo systemctl stop logrotate.timer

# Reduce I/O scheduler queue depth for latency-sensitive workloads
echo 32 | sudo tee /sys/block/sda/queue/nr_requests

# Set the I/O scheduler to deadline (mq-deadline on multi-queue kernels)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```
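When a noisy process (a backup, say) must keep running, it can be demoted rather than stopped; a sketch using `ionice`, where `backup` is a placeholder process name and the `tar` command is illustrative:

```bash
# Move an already-running process into the idle I/O class: it only
# gets disk time when no other process wants it
sudo ionice -c 3 -p "$(pgrep -o backup)"

# Or start a heavy job in the idle class from the outset
sudo ionice -c 3 nice -n 19 tar czf /tmp/home-backup.tar.gz /home
```

Note that I/O priorities are only honored by schedulers that support them (BFQ, and the legacy CFQ); under mq-deadline the idle class has no effect.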
Prevention
- Monitor I/O wait as a separate metric from CPU usage in your monitoring system
- Use SSDs for I/O-intensive workloads; spinning disks should be limited to archival
- Configure I/O scheduler appropriate for the workload (bfq for desktop, mq-deadline for servers)
- Rate-limit backup and log rotation jobs using `ionice` or systemd `IOWeight=`
- Use `ioping` for regular I/O latency benchmarking and alert on degradation
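The `ionice`/`IOWeight=` advice can be made permanent with a systemd drop-in; a sketch assuming a unit named `backup-service`, cgroup v2, and illustrative device and limit values:

```ini
# /etc/systemd/system/backup-service.service.d/io-limit.conf
[Service]
IOSchedulingClass=idle
# Relative share of disk time under contention (default 100)
IOWeight=50
# Hard cap on reads from this device (cgroup v2 only)
IOReadBandwidthMax=/dev/sda 10M
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart backup-service`.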