Introduction

Linux high iowait occurs when the CPU spends significant time waiting for disk I/O operations to complete. The wa (iowait) metric in top or vmstat indicates the percentage of CPU time spent idle while waiting for outstanding I/O. Sustained high iowait (>20-30%) causes system sluggishness and slow application response, and can indicate disk failures, I/O scheduler misconfiguration, or processes saturating disk bandwidth. Unlike CPU-bound issues, adding more CPU cores does not help: the bottleneck is disk subsystem throughput.

Symptoms

  • top shows wa (iowait) > 20-30% consistently
  • vmstat shows high bi (blocks in) and bo (blocks out)
  • Application response times increase, especially for database workloads
  • iostat shows %util near 100% on disk devices
  • iotop shows specific processes consuming excessive I/O
  • System feels sluggish even with low CPU usage
  • Issue appears after deploy with increased logging, backup job start, or disk degradation

Common Causes

  • Excessive synchronous writes (fsync after every write)
  • I/O scheduler mismatched to workload (mq-deadline vs none vs bfq)
  • Dirty page writeback thresholds too high, causing write storms
  • Disk nearing end of life with retry errors
  • RAID rebuild or scrub operation in progress
  • Database checkpoint or WAL flush causing write spikes
  • Log rotation or backup compression saturating disk

Step-by-Step Fix

### 1. Measure iowait and identify pattern

Use system tools to quantify iowait:

```bash
# Check current iowait
top -bn1 | grep "Cpu(s)"
# Output: Cpu(s): 5.2%us, 2.1%sy, 0.0%ni, 70.3%id, 22.4%wa, ...

# vmstat shows iowait over time
vmstat 1 10
# Output columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
# bi = blocks read, bo = blocks written, wa = iowait %

# sar historical data (if sysstat is installed)
sar -u 1 10 | grep -E "Average|linux"
```

Interpretation:

  • wa < 10%: Normal
  • wa 10-30%: Monitoring recommended
  • wa > 30%: Action required
  • wa > 50%: Critical, immediate action needed
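The wa figure that top and vmstat print is derived from the jiffy counters in /proc/stat. A minimal sketch of the same calculation (two samples one second apart; POSIX shell plus awk):

```bash
#!/bin/sh
# Sketch: compute system-wide iowait % over a 1-second window from
# /proc/stat - the same counters top and vmstat read.
read_cpu() { awk '/^cpu /{print $2, $3, $4, $5, $6, $7, $8}' /proc/stat; }

s1=$(read_cpu)
sleep 1
s2=$(read_cpu)

# Fields per sample: user nice system idle iowait irq softirq
iowait_pct=$(echo "$s1 $s2" | awk '{
    t1 = $1 + $2 + $3 + $4 + $5 + $6 + $7
    t2 = $8 + $9 + $10 + $11 + $12 + $13 + $14
    d  = t2 - t1
    printf "%.1f", (d > 0) ? 100 * ($12 - $5) / d : 0
}')
echo "iowait: ${iowait_pct}%"
```

This only counts time the CPU was otherwise idle while I/O was pending, which is why iowait can drop when a CPU-hungry process starts even though the disk is just as busy.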

### 2. Identify disk I/O saturation with iostat

Use iostat to see which disks are saturated:

```bash
# Install sysstat if not present
# Ubuntu/Debian
sudo apt-get install sysstat

# RHEL/CentOS
sudo yum install sysstat

# Run iostat
iostat -xz 1 5

# Output interpretation:
# Device r/s  w/s   rkB/s wkB/s   rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
# sda    0.00 50.00 0.00  2000.00 0.00   100.00 0.00  66.67 0.00    10.00   0.50   0.00     40.00    20.00 100.00
```

Key columns:

  • %util: Disk utilization (100% = saturated)
  • r_await / w_await: Average wait time in ms (>10ms is concerning)
  • aqu-sz: Average queue length (>1 indicates a backlog)
  • svctm: Average service time (deprecated in recent sysstat releases; do not rely on it)
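Alerting on top of iostat output is straightforward to script. A minimal sketch that flags saturated devices; the 90% threshold is an arbitrary choice, and it assumes %util is the last column (true for recent sysstat):

```bash
#!/bin/sh
# Sketch: print any device whose %util exceeds 90 in `iostat -x` output.
flag_saturated() {
    awk '$1 == "avg-cpu:" { in_dev = 0 }
         $1 == "Device"   { in_dev = 1; next }
         in_dev && NF && $NF + 0 > 90 { printf "%s %%util=%s\n", $1, $NF }'
}

# Feed it live data when iostat is available
if command -v iostat > /dev/null; then
    iostat -xz 1 2 | flag_saturated
fi
```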

For NVMe devices:

```bash
# Extended stats for a specific NVMe device
iostat -xz 1 5 /dev/nvme0n1

# Or use nvme-cli to check drive health
sudo nvme smart-log /dev/nvme0
```

### 3. Identify processes causing I/O with iotop

Find which processes are consuming disk I/O:

```bash
# Install iotop
sudo apt-get install iotop   # Ubuntu/Debian
sudo yum install iotop       # RHEL/CentOS

# Run iotop (requires root)
sudo iotop -oP

# Options:
# -o: Only show processes doing I/O
# -P: Show processes only (not threads)
# -b: Batch mode (for logging)
# -n N: Number of iterations

# Single snapshot for logging
sudo iotop -obPn5 > /tmp/iotop.log
```

Output columns:

  • DISK READ / DISK WRITE: I/O rate per process
  • SWAPIN: Percentage of time spent swapping in
  • IO>: Percentage of time spent waiting on I/O
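iotop's per-process numbers come from /proc/&lt;PID&gt;/io. When iotop isn't installed, a rough sketch of the same idea (cumulative byte counters rather than rates; only your own processes are readable unless you run it as root):

```bash
#!/bin/sh
# Sketch: top 5 processes by cumulative write_bytes, read straight
# from /proc/<pid>/io.
top_writers=$(
    for p in /proc/[0-9]*; do
        [ -r "$p/io" ] || continue
        wb=$(awk '/^write_bytes:/ { print $2 }' "$p/io" 2>/dev/null)
        [ -n "$wb" ] && printf '%s %s\n' "$wb" "${p##*/}"
    done | sort -rn | head -5
)
printf '%s\n' "$top_writers"   # one "<bytes> <pid>" per line
```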

### 4. Check for disk errors and SMART status

Disk failures cause retries and high latency:

```bash
# Check kernel ring buffer for disk errors
dmesg | grep -iE "error|fail|sector|I/O"

# Check SMART status
sudo smartctl -H /dev/sda

# Full SMART report
sudo smartctl -a /dev/sda

# Key SMART attributes to check:
# - Reallocated_Sector_Ct (>0 is bad)
# - Current_Pending_Sector (>0 is bad)
# - Offline_Uncorrectable (>0 is bad)
# - Wear_Leveling_Count (low on SSDs is bad)
# - Reallocated_Event_Count (increasing is bad)
```
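Checking those attributes by eye gets old across a fleet; a sketch that flags the failure-predicting ones with a non-zero raw value (the sample rows below are hypothetical, in smartctl's attribute-table layout):

```bash
#!/bin/sh
# Sketch: flag failure-predicting SMART attributes with a non-zero raw value.
# In practice: sudo smartctl -a /dev/sda | flag_smart
flag_smart() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
         $NF + 0 > 0 { printf "%s=%s\n", $2, $NF }'
}

# Hypothetical sample rows:
flagged=$(flag_smart << 'EOF'
  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 12
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 0
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 3
EOF
)
printf '%s\n' "$flagged"
```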

For RAID arrays:

```bash
# Check software RAID status
cat /proc/mdstat

# Check for a degraded array or rebuild in progress
sudo mdadm --detail /dev/md0

# Check hardware RAID (LSI)
sudo MegaCli -LDGetProp -LAll -aAll
```

### 5. Tune I/O scheduler for workload

Match I/O scheduler to workload type:

```bash
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] kyber bfq none
# Brackets show the current scheduler

# Available schedulers:
# - mq-deadline: Good for mixed read/write (default for HDD)
# - kyber: Low-latency focused (good for SSD)
# - bfq: Fair bandwidth allocation (good for desktop)
# - none: No scheduling (best for NVMe)

# Change scheduler temporarily (sysfs writes need root)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Change scheduler permanently: create a udev rule
sudo tee /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
EOF

# Note: the old elevator= kernel parameter (GRUB_CMDLINE_LINUX="elevator=...")
# is ignored by blk-mq kernels (5.0+); use the udev rule instead
```

Scheduler recommendations:

  • NVMe SSD: none (hardware handles queuing)
  • SATA SSD: kyber or mq-deadline
  • HDD: mq-deadline or bfq
  • Database (random I/O): none or kyber
  • Streaming (sequential I/O): mq-deadline
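To audit what every block device is actually using, a small sketch that extracts the bracketed (active) entry from each queue/scheduler file:

```bash
#!/bin/sh
# The active scheduler is the bracketed entry in queue/scheduler,
# e.g. "[mq-deadline] kyber bfq none" -> mq-deadline.
active_sched() { tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'; }

for q in /sys/block/*/queue/scheduler; do
    [ -e "$q" ] || continue
    dev=${q#/sys/block/}
    printf '%-12s %s\n' "${dev%%/*}" "$(active_sched < "$q")"
done
```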

### 6. Tune dirty page writeback

Linux buffers writes in memory before flushing to disk. Aggressive settings can cause write storms:

```bash
# Check current settings
sysctl vm.dirty_ratio
sysctl vm.dirty_background_ratio
sysctl vm.dirty_expire_centisecs
sysctl vm.dirty_writeback_centisecs

# Output explanation:
# dirty_ratio = 20 (max % of RAM for dirty pages before writers block)
# dirty_background_ratio = 10 (start background writeback at this %)
# dirty_expire_centisecs = 3000 (consider dirty pages old after 30s)
# dirty_writeback_centisecs = 500 (writeback daemon wakes every 5s)
```
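Percentages hide the absolute numbers. A quick sketch that translates the live ratios into KiB on this machine; note the kernel actually applies the ratio to available (free plus reclaimable) memory, so using MemTotal is a rough upper bound:

```bash
#!/bin/sh
# Sketch: how much dirty data can accumulate before writers block?
# dirty_kib RAM_KIB RATIO -> threshold in KiB
dirty_kib() { awk -v m="$1" -v r="$2" 'BEGIN { printf "%d", m * r / 100 }'; }

mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
ratio=$(cat /proc/sys/vm/dirty_ratio)
bg=$(cat /proc/sys/vm/dirty_background_ratio)
echo "writers block after:        $(dirty_kib "$mem_kb" "$ratio") KiB dirty"
echo "background flush starts at: $(dirty_kib "$mem_kb" "$bg") KiB dirty"
```

On a 64 GiB box with dirty_ratio=20, that is roughly 13 GiB of dirty pages that may need flushing at once, which is exactly the kind of write storm this step tunes away.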

Tune for lower latency:

```bash
# For interactive/low-latency workloads
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -w vm.dirty_expire_centisecs=1500
sudo sysctl -w vm.dirty_writeback_centisecs=500

# Make permanent (writing /etc/sysctl.conf needs root)
cat << 'EOF' | sudo tee -a /etc/sysctl.conf
# Reduce dirty pages for lower I/O latency
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
EOF
```

For high-throughput write workloads:

```bash
# For batch processing / data ingestion
sudo sysctl -w vm.dirty_ratio=40
sudo sysctl -w vm.dirty_background_ratio=20
# Allows more buffering and fewer, larger writeback bursts
```

### 7. Check for I/O throttling from cgroups

Container or cgroup limits can cause I/O bottlenecks:

```bash
# Check if a process is in a cgroup
cat /proc/<PID>/cgroup

# Check cgroup I/O limits (cgroup v2)
cat /sys/fs/cgroup/<cgroup_path>/io.max
cat /sys/fs/cgroup/<cgroup_path>/io.pressure

# Check PSI (Pressure Stall Information)
cat /proc/pressure/io
# Output: some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
# some = time when at least one task was stalled on I/O
# full = time when all non-idle tasks were stalled (severe congestion)
```
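PSI is easy to alert on. A sketch that extracts the 10-second "some" average and warns above a threshold; the 5% cutoff is an arbitrary choice, and it falls back to 0 on kernels without PSI:

```bash
#!/bin/sh
# Sketch: warn when the 10s "some" I/O pressure average crosses a threshold.
# Parses a line like: some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
psi_avg10() { sed -n 's/^some avg10=\([0-9.]*\).*/\1/p' "$1"; }

threshold=5
v=$(psi_avg10 /proc/pressure/io 2>/dev/null)
v=${v:-0}   # PSI may be absent on older kernels
if awk -v v="$v" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
    echo "WARN: io pressure some avg10=$v"
fi
```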

For Docker containers:

```bash
# Check container I/O limits
docker inspect <container-id> | grep -i blki

# Or check the cgroup directly (cgroup v1 layout)
cat /sys/fs/cgroup/blkio/docker/<container-id>/blkio.throttle.io_service_bytes
```

### 8. Profile I/O with blktrace

For deep I/O analysis, use blktrace:

```bash
# Install blktrace
sudo apt-get install blktrace   # Ubuntu/Debian
sudo yum install blktrace       # RHEL/CentOS

# Capture I/O traces for 30 seconds
sudo blktrace -d /dev/sda -o /tmp/trace &
sleep 30
sudo killall blktrace

# Analyze traces
blkparse /tmp/trace | head -100

# Or use btrace for a live combined trace
sudo btrace /dev/sda > /tmp/full-trace.txt

# Summarize latencies and queue depths with btt (ships with blktrace)
blkparse -i /tmp/trace -d /tmp/trace.bin > /dev/null
btt -i /tmp/trace.bin
```

### 9. Check filesystem mount options

Filesystem options impact I/O performance:

```bash
# Check mount options
mount | grep -E "ext4|xfs"

# Or
findmnt -lo SOURCE,TARGET,FSTYPE,OPTIONS
```

Recommended mount options:

```bash
# For ext4 (add to /etc/fstab)
/dev/sda1 /data ext4 noatime,nodiratime,commit=60 0 2

# For XFS
/dev/sda1 /data xfs noatime,nodiratime,logbufs=8,logbsize=256k 0 0
```

Option explanation:

  • noatime: Don't update access time on reads (reduces writes; implies nodiratime)
  • nodiratime: Don't update directory access time
  • commit=60: Commit data every 60 seconds (default 5; higher = better performance but more data at risk)
  • logbufs=8: Number of in-memory XFS log buffers (valid range 2-8; 8 is the default and maximum)
  • logbsize=256k: XFS log buffer size (default varies)
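A quick audit to spot data filesystems still mounted without noatime (the ext4/xfs filter matches the check above; adjust for other filesystems):

```bash
#!/bin/sh
# Sketch: list mounts lacking noatime, from "TARGET OPTIONS" lines.
missing_noatime() { awk '$2 !~ /(^|,)noatime(,|$)/ { print $1 }'; }

if command -v findmnt > /dev/null; then
    findmnt -rn -t ext4,xfs -o TARGET,OPTIONS | missing_noatime
fi
```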

### 10. Identify runaway I/O processes and limit

Kill or throttle processes causing excessive I/O:

```bash
# Find top I/O consumers (per-process disk stats; needs sysstat)
pidstat -d 1 5
# Note: ps sorts by CPU or memory, not I/O, so it won't reveal I/O hogs

# Use ionice to deprioritize I/O
sudo ionice -c3 -p <PID>       # Idle class (lowest)
sudo ionice -c2 -n7 -p <PID>   # Best-effort, lowest priority

# ionice classes:
# -c1: Real-time (highest priority, use carefully)
# -c2: Best-effort (default, -n0 to -n7 priority)
# -c3: Idle (only gets I/O when no one else wants it)

# Example: Deprioritize backup process
pgrep backup-script | xargs -I{} sudo ionice -c3 -p {}
```

Set I/O limits with cgroups:

```bash
# Create a cgroup with an I/O limit (cgroup v1 blkio controller)
sudo cgcreate -g blkio:/iolimited

# Set read/write limits (in bytes per second)
echo "8:0 104857600" | sudo tee /sys/fs/cgroup/blkio/iolimited/blkio.throttle.read_bps_device
echo "8:0 104857600" | sudo tee /sys/fs/cgroup/blkio/iolimited/blkio.throttle.write_bps_device
# 8:0 = major:minor for /dev/sda
# 104857600 = 100 MB/s limit

# Move a process into the cgroup
echo <PID> | sudo tee /sys/fs/cgroup/blkio/iolimited/cgroup.procs

# On cgroup v2 systems, set the limit via io.max instead:
#   echo "8:0 rbps=104857600 wbps=104857600" > /sys/fs/cgroup/iolimited/io.max
```
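The "8:0" key is the device's major:minor pair. stat reports the numbers in hex, so a small conversion sketch; it is demonstrated on /dev/null (which is 1:3 on every Linux system) since /dev/sda may not exist on the machine at hand:

```bash
#!/bin/sh
# Print a device node's major:minor in decimal - the key format the
# blkio/io.max throttle files expect. stat -c '%t %T' emits hex.
dev_numbers() {
    set -- $(stat -c '%t %T' "$1")
    printf '%d:%d\n' "0x$1" "0x$2"
}

dev_numbers /dev/null   # char device; block devices work the same way
# For a real throttle rule: dev_numbers /dev/sda
```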

Prevention

  • Monitor iowait and set alerts at 20% threshold
  • Use ionice for backup and batch jobs
  • Schedule scrub/rebuild operations during maintenance windows
  • Monitor SMART attributes for early disk failure detection
  • Use appropriate I/O scheduler for storage type
  • Implement log rate limiting for verbose applications
  • Configure dirty page writeback for workload type
  • Use SSD for write-intensive workloads

Related Error Messages

  • **task blocked for more than 120 seconds**: I/O hung, check dmesg
  • **No space left on device**: Disk full, not an I/O bottleneck
  • **Input/output error**: Disk failure or filesystem corruption