Introduction

Linux high iowait occurs when the CPU spends significant time waiting for disk I/O operations to complete. The wa (iowait) metric in top or vmstat indicates the percentage of CPU time spent idle while waiting for outstanding I/O. Sustained high iowait (>20-30%) causes system sluggishness and slow application response, and can indicate disk failures, I/O scheduler misconfiguration, or processes saturating disk bandwidth. Unlike CPU-bound issues, adding more CPU cores does not help: the bottleneck is disk subsystem throughput.

Symptoms

  • top shows wa (iowait) > 20-30% consistently
  • vmstat shows high bi (blocks in) and bo (blocks out)
  • Application response times increase, especially for database workloads
  • iostat shows %util near 100% on disk devices
  • iotop shows specific processes consuming excessive I/O
  • System feels sluggish even with low CPU usage
  • Issue appears after deploy with increased logging, backup job start, or disk degradation

Common Causes

  • Excessive synchronous writes (fsync after every write)
  • I/O scheduler mismatched to workload (mq-deadline vs none vs bfq)
  • Dirty page writeback thresholds too high, causing write storms
  • Disk nearing end of life with retry errors
  • RAID rebuild or scrub operation in progress
  • Database checkpoint or WAL flush causing write spikes
  • Log rotation or backup compression saturating disk

Step-by-Step Fix

### 1. Measure iowait and identify pattern

Use system tools to quantify iowait:

```bash
# Check current iowait
top -bn1 | grep "Cpu(s)"
# Output: Cpu(s): 5.2%us, 2.1%sy, 0.0%ni, 70.3%id, 22.4%wa, ...

# vmstat shows iowait over time
vmstat 1 10
# Output columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
# bi = blocks read, bo = blocks written, wa = iowait %

# sar historical data (if sysstat is installed)
sar -u 1 10 | grep -E "Average|linux"
```

Interpretation:

  • wa < 10%: Normal
  • wa 10-30%: Monitoring recommended
  • wa > 30%: Action required
  • wa > 50%: Critical, immediate action needed
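The wa figure that top and vmstat print is derived from the jiffy counters in /proc/stat. A minimal sketch of the same calculation (two samples one second apart; POSIX shell plus awk):

```bash
#!/bin/sh
# Sketch: compute system-wide iowait % over a 1-second window from
# /proc/stat - the same counters top and vmstat read.
read_cpu() { awk '/^cpu /{print $2, $3, $4, $5, $6, $7, $8}' /proc/stat; }

s1=$(read_cpu)
sleep 1
s2=$(read_cpu)

# Fields per sample: user nice system idle iowait irq softirq
iowait_pct=$(echo "$s1 $s2" | awk '{
    t1 = $1 + $2 + $3 + $4 + $5 + $6 + $7
    t2 = $8 + $9 + $10 + $11 + $12 + $13 + $14
    d  = t2 - t1
    printf "%.1f", (d > 0) ? 100 * ($12 - $5) / d : 0
}')
echo "iowait: ${iowait_pct}%"
```

This only counts time the CPU was otherwise idle while I/O was pending, which is why iowait can drop when a CPU-hungry process starts even though the disk is just as busy.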

### 2. Identify disk I/O saturation with iostat

Use iostat to see which disks are saturated:

```bash
# Install sysstat if not present
# Ubuntu/Debian
sudo apt-get install sysstat

# RHEL/CentOS
sudo yum install sysstat

# Run iostat
iostat -xz 1 5

# Output interpretation:
# Device r/s  w/s   rkB/s wkB/s   rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
# sda    0.00 50.00 0.00  2000.00 0.00   100.00 0.00  66.67 0.00    10.00   0.50   0.00     40.00    20.00 100.00
```

Key columns:

  • %util: Disk utilization (100% = saturated)
  • r_await / w_await: Average wait time in ms (>10ms is concerning)
  • aqu-sz: Average queue length (>1 indicates a backlog)
  • svctm: Average service time (deprecated in recent sysstat releases; do not rely on it)
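Alerting on top of iostat output is straightforward to script. A minimal sketch that flags saturated devices; the 90% threshold is an arbitrary choice, and it assumes %util is the last column (true for recent sysstat):

```bash
#!/bin/sh
# Sketch: print any device whose %util exceeds 90 in `iostat -x` output.
flag_saturated() {
    awk '$1 == "avg-cpu:" { in_dev = 0 }
         $1 == "Device"   { in_dev = 1; next }
         in_dev && NF && $NF + 0 > 90 { printf "%s %%util=%s\n", $1, $NF }'
}

# Feed it live data when iostat is available
if command -v iostat > /dev/null; then
    iostat -xz 1 2 | flag_saturated
fi
```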

For NVMe devices:

```bash
# Extended stats for a specific NVMe device
iostat -xz 1 5 /dev/nvme0n1

# Or use nvme-cli to check drive health
sudo nvme smart-log /dev/nvme0
```

### 3. Identify processes causing I/O with iotop

Find which processes are consuming disk I/O:

```bash
# Install iotop
sudo apt-get install iotop   # Ubuntu/Debian
sudo yum install iotop       # RHEL/CentOS

# Run iotop (requires root)
sudo iotop -oP

# Options:
# -o: Only show processes doing I/O
# -P: Show processes only (not threads)
# -b: Batch mode (for logging)
# -n N: Number of iterations

# Single snapshot for logging
sudo iotop -obPn5 > /tmp/iotop.log
```

Output columns:

  • DISK READ / DISK WRITE: I/O rate per process
  • SWAPIN: Percentage of time spent swapping in
  • IO>: Percentage of time spent waiting on I/O
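iotop's per-process numbers come from /proc/&lt;PID&gt;/io. When iotop isn't installed, a rough sketch of the same idea (cumulative byte counters rather than rates; only your own processes are readable unless you run it as root):

```bash
#!/bin/sh
# Sketch: top 5 processes by cumulative write_bytes, read straight
# from /proc/<pid>/io.
top_writers=$(
    for p in /proc/[0-9]*; do
        [ -r "$p/io" ] || continue
        wb=$(awk '/^write_bytes:/ { print $2 }' "$p/io" 2>/dev/null)
        [ -n "$wb" ] && printf '%s %s\n' "$wb" "${p##*/}"
    done | sort -rn | head -5
)
printf '%s\n' "$top_writers"   # one "<bytes> <pid>" per line
```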

### 4. Check for disk errors and SMART status

Disk failures cause retries and high latency:

```bash
# Check kernel ring buffer for disk errors
dmesg | grep -iE "error|fail|sector|I/O"

# Check SMART status
sudo smartctl -H /dev/sda

# Full SMART report
sudo smartctl -a /dev/sda

# Key SMART attributes to check:
# - Reallocated_Sector_Ct (>0 is bad)
# - Current_Pending_Sector (>0 is bad)
# - Offline_Uncorrectable (>0 is bad)
# - Wear_Leveling_Count (low on SSDs is bad)
# - Reallocated_Event_Count (increasing is bad)
```
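Checking those attributes by eye gets old across a fleet; a sketch that flags the failure-predicting ones with a non-zero raw value (the sample rows below are hypothetical, in smartctl's attribute-table layout):

```bash
#!/bin/sh
# Sketch: flag failure-predicting SMART attributes with a non-zero raw value.
# In practice: sudo smartctl -a /dev/sda | flag_smart
flag_smart() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
         $NF + 0 > 0 { printf "%s=%s\n", $2, $NF }'
}

# Hypothetical sample rows:
flagged=$(flag_smart << 'EOF'
  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 12
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 0
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 3
EOF
)
printf '%s\n' "$flagged"
```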

For RAID arrays:

```bash
# Check software RAID status
cat /proc/mdstat

# Check for a degraded array or rebuild in progress
sudo mdadm --detail /dev/md0

# Check hardware RAID (LSI)
sudo MegaCli -LDGetProp -LAll -aAll
```

### 5. Tune I/O scheduler for workload

Match I/O scheduler to workload type:

```bash
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] kyber bfq none
# Brackets show the current scheduler

# Available schedulers:
# - mq-deadline: Good for mixed read/write (default for HDD)
# - kyber: Low-latency focused (good for SSD)
# - bfq: Fair bandwidth allocation (good for desktop)
# - none: No scheduling (best for NVMe)

# Change scheduler temporarily (sysfs writes need root)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Change scheduler permanently: create a udev rule
sudo tee /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
EOF

# Note: the old elevator= kernel parameter (GRUB_CMDLINE_LINUX="elevator=...")
# is ignored by blk-mq kernels (5.0+); use the udev rule instead
```

Scheduler recommendations:

  • NVMe SSD: none (hardware handles queuing)
  • SATA SSD: kyber or mq-deadline
  • HDD: mq-deadline or bfq
  • Database (random I/O): none or kyber
  • Streaming (sequential I/O): mq-deadline
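To audit what every block device is actually using, a small sketch that extracts the bracketed (active) entry from each queue/scheduler file:

```bash
#!/bin/sh
# The active scheduler is the bracketed entry in queue/scheduler,
# e.g. "[mq-deadline] kyber bfq none" -> mq-deadline.
active_sched() { tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'; }

for q in /sys/block/*/queue/scheduler; do
    [ -e "$q" ] || continue
    dev=${q#/sys/block/}
    printf '%-12s %s\n' "${dev%%/*}" "$(active_sched < "$q")"
done
```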

### 6. Tune dirty page writeback

Linux buffers writes in memory before flushing to disk. Aggressive settings can cause write storms:

```bash
# Check current settings
sysctl vm.dirty_ratio
sysctl vm.dirty_background_ratio
sysctl vm.dirty_expire_centisecs
sysctl vm.dirty_writeback_centisecs

# Output explanation:
# dirty_ratio = 20 (max % of RAM for dirty pages before writers block)
# dirty_background_ratio = 10 (start background writeback at this %)
# dirty_expire_centisecs = 3000 (consider dirty pages old after 30s)
# dirty_writeback_centisecs = 500 (writeback daemon wakes every 5s)
```
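Percentages hide the absolute numbers. A quick sketch that translates the live ratios into KiB on this machine; note the kernel actually applies the ratio to available (free plus reclaimable) memory, so using MemTotal is a rough upper bound:

```bash
#!/bin/sh
# Sketch: how much dirty data can accumulate before writers block?
# dirty_kib RAM_KIB RATIO -> threshold in KiB
dirty_kib() { awk -v m="$1" -v r="$2" 'BEGIN { printf "%d", m * r / 100 }'; }

mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
ratio=$(cat /proc/sys/vm/dirty_ratio)
bg=$(cat /proc/sys/vm/dirty_background_ratio)
echo "writers block after:        $(dirty_kib "$mem_kb" "$ratio") KiB dirty"
echo "background flush starts at: $(dirty_kib "$mem_kb" "$bg") KiB dirty"
```

On a 64 GiB box with dirty_ratio=20, that is roughly 13 GiB of dirty pages that may need flushing at once, which is exactly the kind of write storm this step tunes away.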

Tune for lower latency:

```bash
# For interactive/low-latency workloads
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -w vm.dirty_expire_centisecs=1500
sudo sysctl -w vm.dirty_writeback_centisecs=500

# Make permanent (writing /etc/sysctl.conf needs root)
cat << 'EOF' | sudo tee -a /etc/sysctl.conf
# Reduce dirty pages for lower I/O latency
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
EOF
```

For high-throughput write workloads:

```bash
# For batch processing / data ingestion
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=20
# Allows more buffering and fewer, larger writeback bursts
```

### 7. Check for I/O throttling from cgroups

Container or cgroup limits can cause I/O bottlenecks:

```bash
# Check if a process is in a cgroup
cat /proc/<PID>/cgroup

# Check cgroup I/O limits (cgroup v2)
cat /sys/fs/cgroup/<cgroup_path>/io.max
cat /sys/fs/cgroup/<cgroup_path>/io.pressure

# Check PSI (Pressure Stall Information)
cat /proc/pressure/io
# Output: some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
# some = time when at least one task was stalled on I/O
# full = time when all non-idle tasks were stalled (severe congestion)
```
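PSI is easy to alert on. A sketch that extracts the 10-second "some" average and warns above a threshold; the 5% cutoff is an arbitrary choice, and it falls back to 0 on kernels without PSI:

```bash
#!/bin/sh
# Sketch: warn when the 10s "some" I/O pressure average crosses a threshold.
# Parses a line like: some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
psi_avg10() { sed -n 's/^some avg10=\([0-9.]*\).*/\1/p' "$1"; }

threshold=5
v=$(psi_avg10 /proc/pressure/io 2>/dev/null)
v=${v:-0}   # PSI may be absent on older kernels
if awk -v v="$v" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
    echo "WARN: io pressure some avg10=$v"
fi
```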

For Docker containers:

```bash
# Check container I/O limits
docker inspect <container-id> | grep -i blki

# Or check the cgroup directly (cgroup v1 layout)
cat /sys/fs/cgroup/blkio/docker/<container-id>/blkio.throttle.io_service_bytes
```

### 8. Profile I/O with blktrace

For deep I/O analysis, use blktrace:

```bash
# Install blktrace
sudo apt-get install blktrace   # Ubuntu/Debian
sudo yum install blktrace       # RHEL/CentOS

# Capture I/O traces for 30 seconds
sudo blktrace -d /dev/sda -o /tmp/trace &
sleep 30
sudo killall blktrace

# Analyze traces
blkparse /tmp/trace | head -100

# Or use btrace for a live combined trace
sudo btrace /dev/sda > /tmp/full-trace.txt

# Summarize latencies and queue depths with btt (ships with blktrace)
blkparse -i /tmp/trace -d /tmp/trace.bin > /dev/null
btt -i /tmp/trace.bin
```

### 9. Check filesystem mount options

Filesystem options impact I/O performance:

```bash
# Check mount options
mount | grep -E "ext4|xfs"

# Or
findmnt -lo SOURCE,TARGET,FSTYPE,OPTIONS
```

Recommended mount options:

```bash
# For ext4 (add to /etc/fstab)
/dev/sda1 /data ext4 noatime,nodiratime,commit=60 0 2

# For XFS
/dev/sda1 /data xfs noatime,nodiratime,logbufs=8,logbsize=256k 0 0
```

Option explanation:

  • noatime: Don't update access time on reads (reduces writes; implies nodiratime)
  • nodiratime: Don't update directory access time
  • commit=60: Commit data every 60 seconds (default 5; higher = better performance but more data at risk)
  • logbufs=8: Number of in-memory XFS log buffers (valid range 2-8; 8 is the default and maximum)
  • logbsize=256k: XFS log buffer size (default varies)
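A quick audit to spot data filesystems still mounted without noatime (the ext4/xfs filter matches the check above; adjust for other filesystems):

```bash
#!/bin/sh
# Sketch: list mounts lacking noatime, from "TARGET OPTIONS" lines.
missing_noatime() { awk '$2 !~ /(^|,)noatime(,|$)/ { print $1 }'; }

if command -v findmnt > /dev/null; then
    findmnt -rn -t ext4,xfs -o TARGET,OPTIONS | missing_noatime
fi
```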

### 10. Identify runaway I/O processes and limit

Kill or throttle processes causing excessive I/O:

```bash
# Find top I/O consumers (per-process disk stats; needs sysstat)
pidstat -d 1 5
# Note: ps sorts by CPU or memory, not I/O, so it won't reveal I/O hogs

# Use ionice to deprioritize I/O
sudo ionice -c3 -p <PID>       # Idle class (lowest)
sudo ionice -c2 -n7 -p <PID>   # Best-effort, lowest priority

# ionice classes:
# -c1: Real-time (highest priority, use carefully)
# -c2: Best-effort (default, -n0 to -n7 priority)
# -c3: Idle (only gets I/O when no one else wants it)

# Example: Deprioritize backup process
pgrep backup-script | xargs -I{} sudo ionice -c3 -p {}
```

Set I/O limits with cgroups:

```bash
# Create a cgroup with an I/O limit (cgroup v1 blkio controller)
sudo cgcreate -g blkio:/iolimited

# Set read/write limits (in bytes per second)
echo "8:0 104857600" | sudo tee /sys/fs/cgroup/blkio/iolimited/blkio.throttle.read_bps_device
echo "8:0 104857600" | sudo tee /sys/fs/cgroup/blkio/iolimited/blkio.throttle.write_bps_device
# 8:0 = major:minor for /dev/sda
# 104857600 = 100 MB/s limit

# Move a process into the cgroup
echo <PID> | sudo tee /sys/fs/cgroup/blkio/iolimited/cgroup.procs

# On cgroup v2 systems, set the limit via io.max instead:
#   echo "8:0 rbps=104857600 wbps=104857600" > /sys/fs/cgroup/iolimited/io.max
```
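The "8:0" key is the device's major:minor pair. stat reports the numbers in hex, so a small conversion sketch; it is demonstrated on /dev/null (which is 1:3 on every Linux system) since /dev/sda may not exist on the machine at hand:

```bash
#!/bin/sh
# Print a device node's major:minor in decimal - the key format the
# blkio/io.max throttle files expect. stat -c '%t %T' emits hex.
dev_numbers() {
    set -- $(stat -c '%t %T' "$1")
    printf '%d:%d\n' "0x$1" "0x$2"
}

dev_numbers /dev/null   # char device; block devices work the same way
# For a real throttle rule: dev_numbers /dev/sda
```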

Prevention

  • Monitor iowait and set alerts at 20% threshold
  • Use ionice for backup and batch jobs
  • Schedule scrub/rebuild operations during maintenance windows
  • Monitor SMART attributes for early disk failure detection
  • Use appropriate I/O scheduler for storage type
  • Implement log rate limiting for verbose applications
  • Configure dirty page writeback for workload type
  • Use SSD for write-intensive workloads

Related Error Messages

  • **task blocked for more than 120 seconds**: I/O hung, check dmesg
  • **No space left on device**: Disk full, not an I/O bottleneck
  • **Input/output error**: Disk failure or filesystem corruption