## Introduction
Linux disk I/O saturation occurs when storage devices cannot keep up with read/write demand, causing system-wide performance degradation: high I/O wait times, blocked processes, and unresponsive systems. The bottleneck shows up as high %util in iostat, elevated wa (I/O wait) in top/vmstat, processes stuck in uninterruptible sleep (D state), increased latency for disk operations, and cascading slowdowns in network services and databases.

Common causes include database checkpoint writes overwhelming the disk, log writes exceeding disk bandwidth, backup jobs competing with the production workload, swap thrashing under memory pressure, VM image files producing random I/O patterns, RAID rebuilds consuming disk bandwidth, failing disks retrying bad sectors, an I/O scheduler mismatched to the workload type, many processes competing for limited IOPS, and container workloads without I/O limits.

The fix requires identifying the I/O-bound processes, understanding the I/O pattern (sequential vs random, read vs write), tuning the I/O scheduler and queue depths, applying I/O limits via cgroups/ionice, and potentially upgrading storage hardware. This guide provides production-proven troubleshooting for disk I/O saturation across physical servers, VMs, and container workloads.
## Symptoms
- `iostat` shows `%util` near 100% for extended periods
- `top` shows high `wa` (I/O wait) percentage (>30%)
- `vmstat` shows high `bi` (blocks in) and `bo` (blocks out)
- Processes stuck in D state (uninterruptible sleep)
- `iotop` shows processes with high I/O bandwidth usage
- System feels sluggish, commands respond slowly
- Database queries timing out waiting for disk
- Network services slow due to log write blocking
- `dmesg` shows I/O errors or timeout messages
- SSD wearing out prematurely (high write amplification)
## Common Causes
- Database checkpoint or WAL writes overwhelming disk
- Application log writes at high volume
- Backup software reading entire filesystem
- Swap activity due to insufficient RAM
- Memory pressure causing page cache thrashing
- Multiple VMs on same storage competing for IOPS
- Container workloads without I/O limits
- RAID rebuild or scrub operation
- Failing disk with retry loops
- I/O scheduler inappropriate for workload (bfq vs mq-deadline vs none on modern kernels; CFQ vs deadline vs noop on legacy kernels)
- Queue depth too low or too high
- Filesystem fragmentation (HDD)
- Small random I/O pattern on SSD without proper alignment
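Several of these causes come down to the sequential-vs-random distinction. As a rough triage aid, the average request size reported by iostat (`rareq-sz`/`wareq-sz`, in KB) hints at the pattern — a minimal sketch, where the 64KB cutoff is an assumption, not a standard:

```shell
# Rough heuristic: classify an I/O pattern from iostat's average request
# size in whole KB. Small requests usually mean random I/O; large ones
# usually mean sequential I/O. The 64KB threshold is an assumption.
io_pattern() {
  local avg_req_kb=$1
  if (( avg_req_kb >= 64 )); then
    echo "sequential"
  else
    echo "random"
  fi
}

io_pattern 4     # small requests, typical of OLTP databases
io_pattern 512   # large requests, typical of backups/scans
```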
## Step-by-Step Fix
### 1. Diagnose disk I/O saturation
Check overall I/O statistics:
```bash
# Install required tools
# Debian/Ubuntu
apt-get install sysstat iotop hdparm

# RHEL/CentOS
yum install sysstat iotop hdparm

# View real-time I/O stats
iostat -xhz 1   # Extended stats, human-readable, 1-second interval

# Output interpretation:
# Device  r/s   w/s  rkB/s   wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
# sda    1.25 85.32  50.12 4521.33   0.12  12.45  9.23 12.71    0.82   15.23   1.30    40.10    52.99  1.15  9.95
#
# Key metrics:
# - %util: Percentage of time device was busy (100% = saturated)
# - aqu-sz: Average queue length (>4 indicates queuing)
# - r_await/w_await: Average latency in ms (>10ms HDD, >1ms SSD is concerning)
# - rkB/s, wkB/s: Throughput in KB/s
# - svctm: Service time per I/O (deprecated; use r_await/w_await)

# Compare read vs write load
iostat -xhz 1 | grep -E "Device|sda"

# Check historical I/O (requires sar)
sar -d -p                          # Daily I/O stats from sysstat
sar -d -p -s 09:00:00 -e 10:00:00  # Specific time range
```
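To act on iostat output in scripts, the per-device `%util` column can be extracted and compared against a threshold. A small sketch — the "device util" input format here is simplified for illustration, not raw iostat output:

```shell
# Flag devices whose %util exceeds a threshold, given lines of
# "device util" on stdin (e.g. pre-extracted from iostat -x output).
flag_saturated() {
  local threshold=${1:-80}
  awk -v t="$threshold" '$2 + 0 > t { print $1 " saturated at " $2 "%" }'
}

# Sample input; in production feed this from iostat -x
printf 'sda 95.3\nsdb 12.1\nnvme0n1 88.0\n' | flag_saturated 80
```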
Check system-wide I/O wait:
```bash
# top shows I/O wait percentage
top -bn1 | grep "Cpu(s)"

# Output:
# Cpu(s): 12.5%us, 8.3%sy, 0.0%ni, 65.2%id, 14.0%wa, 0.0%hi, 0.0%si, 0.0%st
#
# Key fields:
# - id (idle): Should be >20% for a healthy system
# - wa (I/O wait): >30% indicates an I/O bottleneck
# - us (user) + sy (system): CPU-bound if high with low wa

# vmstat shows I/O activity
vmstat 1 5

# Output:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  1  3      0 102400  51200 512000    0    0   150  1200 1500 3000 12  8 65 15  0
#
# Key fields:
# - b (blocked): Processes in uninterruptible sleep (D state)
# - bi (blocks in): Blocks read from disk
# - bo (blocks out): Blocks written to disk
# - wa (wait): CPU I/O wait percentage

# Check processes in D state (I/O blocked)
ps aux | awk '$8 ~ /D/ {print}'

# Or
ps -eo pid,stat,comm | awk '$2 ~ /D/ {print}'
```
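For alerting it helps to reduce the D-state check to a single number. The helper below is plain text processing over `ps -eo stat` style input; a sustained nonzero count points at an I/O bottleneck:

```shell
# Count processes in uninterruptible sleep from `ps -eo stat` output
# (header on line 1, state field first on each following line).
count_dstate() {
  awk 'NR > 1 && $1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Sample input for illustration; live usage: ps -eo stat | count_dstate
printf 'STAT\nD\nS\nD+\nR\n' | count_dstate   # → 2
```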
Identify I/O-heavy processes:
```bash
# iotop shows per-process I/O (requires root)
iotop -oPa   # -o: only active I/O, -P: processes (not threads), -a: accumulated

# Output:
#   PID PRIO USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
#  1234 be/4 root        1.50 G    500.23 M  0.00 %  95.23 %  /usr/bin/mysqld
#  2345 be/4 mysql     500.12 M      1.20 G  0.00 %  85.12 %  /usr/bin/java
#
# Change the sort column interactively with the left/right arrow keys,
# or log in batch mode: iotop -boPa -n 5

# Alternative: pidstat for per-process I/O
pidstat -d 1

# Output:
# Linux 5.4.0 (hostname)  01/15/2024  _x86_64_  (4 CPU)
#
# 02:30:00 PM   UID   PID  kB_rd/s  kB_wr/s kB_ccwr/s iodelay  Command
# 02:30:01 PM   999  1234   150.12    50.23      0.00     850  mysqld
# 02:30:01 PM     0  2345    50.23   120.45      0.00     420  java
#
# kB_ccwr/s: Cancelled writes (overwritten before flush)
# iodelay: I/O delay in clock ticks

# Check which files a process has open (identifies what is being accessed)
lsof -p 1234 | grep -E "\.(ibd|log|db)$"

# Or list by process name
lsof -c mysqld | head -50
```
### 2. Fix I/O scheduler configuration
Choose appropriate I/O scheduler:
```bash
# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler

# Output: [mq-deadline] kyber bfq none
# Brackets show the current scheduler

# Available schedulers:
# - mq-deadline: Good for mixed read/write, latency-sensitive (default for many SSDs)
# - kyber: Token-based, good for SSDs, balances latency and throughput
# - bfq: Budget Fair Queueing, good for interactive desktops, reduces latency variance
# - none: No scheduler (NVMe default), best for fast storage

# Check device type (HDD vs SSD vs NVMe)
lsblk -d -o name,rota,type,model
# rota=1: Rotational (HDD)
# rota=0: Non-rotational (SSD/NVMe)

# Change I/O scheduler (temporary, lost on reboot)
echo mq-deadline > /sys/block/sda/queue/scheduler

# Change I/O scheduler (persistent)
# Method 1: Kernel boot parameter
# Note: elevator= only affects legacy single-queue kernels; on modern
# blk-mq kernels prefer the udev rule below
# Edit /etc/default/grub:
#   GRUB_CMDLINE_LINUX="elevator=mq-deadline"
update-grub                              # Debian/Ubuntu
grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL/CentOS

# Method 2: udev rule
cat > /etc/udev/rules.d/60-scheduler.rules << 'EOF'
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
EOF

# Method 3: systemd oneshot service
# /etc/systemd/system/io-scheduler.service:
#   [Unit]
#   Description=Set I/O scheduler
#   After=local-fs.target
#
#   [Service]
#   Type=oneshot
#   ExecStart=/bin/bash -c 'echo mq-deadline > /sys/block/sda/queue/scheduler'
#   RemainAfterExit=yes
#
#   [Install]
#   WantedBy=multi-user.target

systemctl enable io-scheduler
```
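The scheduler choice above reduces to a function of the rotational flag. A sketch of that mapping as a starting point, not a universal rule (the bfq-for-HDD choice is an assumption; some setups prefer mq-deadline everywhere):

```shell
# Pick an I/O scheduler from /sys/block/<dev>/queue/rotational:
# 1 = HDD (fairness helps on slow seeks), 0 = SSD/NVMe (low overhead).
pick_scheduler() {
  local rotational=$1
  if [ "$rotational" = "1" ]; then
    echo "bfq"          # rotational media: fair queueing
  else
    echo "mq-deadline"  # flash; "none" for fast NVMe
  fi
}

# Report the suggestion for each SATA/SAS disk present
for dev in /sys/block/sd*; do
  [ -e "$dev/queue/rotational" ] || continue
  echo "$(basename "$dev"): $(pick_scheduler "$(cat "$dev/queue/rotational")")"
done
```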
Tune scheduler parameters:
```bash
# Check current queue settings (one value per file)
grep . /sys/block/sda/queue/* 2>/dev/null

# Key parameters:
# - nr_requests: Max requests in queue (default 256)
# - read_ahead_kb: Read-ahead buffer size (default 128KB)
# - add_random: Whether device I/O contributes to the entropy pool

# Increase queue depth for high-throughput workloads
echo 1024 > /sys/block/sda/queue/nr_requests

# Increase read-ahead for sequential read workloads (databases, backups)
echo 512 > /sys/block/sda/queue/read_ahead_kb   # 512KB

# Decrease read-ahead for random I/O workloads
echo 32 > /sys/block/sda/queue/read_ahead_kb    # 32KB

# For NVMe drives
echo none > /sys/block/nvme0n1/queue/scheduler
echo 2048 > /sys/block/nvme0n1/queue/nr_requests
```
### 3. Limit process I/O impact
Use ionice for process prioritization:
```bash
# Check current I/O priority
ionice -p 1234

# Output: best-effort: prio 4

# I/O priority classes:
# - 0 (none): No priority class set (kernel derives it from nice level)
# - 1 (realtime): Highest priority; use sparingly (root only)
# - 2 (best-effort): Default class, priority 0-7 (0 = highest)
# - 3 (idle): Lowest priority; I/O runs only when the disk is otherwise idle

# Set I/O priority for a running process
ionice -c 2 -n 0 -p 1234   # Best-effort, highest priority
ionice -c 2 -n 7 -p 2345   # Best-effort, lowest priority
ionice -c 3 -p 3456        # Idle (backup jobs, log rotation)

# Start a command with an I/O priority
ionice -c 3 nice -n 19 rsync -av /source /dest
ionice -c 2 -n 7 mysqld    # MySQL with low I/O priority

# Combined CPU and I/O priority
nice -n 10 ionice -c 2 -n 5 backup-script.sh
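Teams often standardize these combinations behind named tiers. A dry-run sketch that just builds the command line — the tier names and the nice/ionice values chosen for each are assumptions for illustration:

```shell
# Map a named tier to a nice/ionice command prefix (printed, not executed,
# so the mapping can be reviewed before use).
io_tier_cmd() {
  local tier=$1; shift
  case "$tier" in
    critical)   echo "ionice -c 2 -n 0 $*" ;;
    normal)     echo "ionice -c 2 -n 4 $*" ;;
    background) echo "nice -n 19 ionice -c 3 $*" ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
}

io_tier_cmd background rsync -av /src /dst
# → nice -n 19 ionice -c 3 rsync -av /src /dst
```

To actually run the command, wrap the output with `eval "$(io_tier_cmd background ...)"` or call the binaries directly.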
Use cgroups for I/O limits:
```bash
# cgroups v2 (modern systems)
# Create a cgroup for limiting I/O
mkdir -p /sys/fs/cgroup/limited-io

# Limit writes on sda (device 8:0) to 100MB/s
# Format: "MAJOR:MINOR rbps=<bytes> wbps=<bytes> riops=<n> wiops=<n>" ("max" = unlimited)
# Find MAJOR:MINOR with: lsblk -o NAME,MAJ:MIN
echo "8:0 wbps=104857600" > /sys/fs/cgroup/limited-io/io.max

# Proportional weight (requires the bfq scheduler; range 1-10000, default 100)
echo "1000" > /sys/fs/cgroup/limited-io/io.bfq.weight

# Add process to cgroup
echo 1234 > /sys/fs/cgroup/limited-io/cgroup.procs

# cgroups v1 (older systems)
# Create a blkio cgroup
mkdir -p /sys/fs/cgroup/blkio/limited

# Set weight (100-1000, default 500)
echo 100 > /sys/fs/cgroup/blkio/limited/blkio.weight

# Set throttle limits
# Format: major:minor bytes_per_second
echo "8:0 104857600" > /sys/fs/cgroup/blkio/limited/blkio.throttle.write_bps_device

# Add process
echo 1234 > /sys/fs/cgroup/blkio/limited/cgroup.procs

# systemd slice (modern, persistent)
cat > /etc/systemd/system/low-priority.slice << 'EOF'
[Slice]
IOWeight=100
IOReadBandwidthMax=/dev/sda 50M
IOWriteBandwidthMax=/dev/sda 50M
EOF

systemctl daemon-reload

# Run a service in the slice
systemctl set-property myservice.service Slice=low-priority.slice
```
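Hand-converting MB/s limits to the byte counts io.max expects is error-prone, so it is worth scripting. A small helper, assuming cgroups v2 and limits given in whole MB/s:

```shell
# Build a cgroup v2 io.max line from major/minor numbers and read/write
# limits in MB/s; "max" leaves a direction unlimited.
io_max_line() {
  local major=$1 minor=$2 rmb=$3 wmb=$4
  local r=max w=max
  [ "$rmb" != "max" ] && r=$(( rmb * 1024 * 1024 ))
  [ "$wmb" != "max" ] && w=$(( wmb * 1024 * 1024 ))
  echo "$major:$minor rbps=$r wbps=$w"
}

io_max_line 8 0 max 100   # → 8:0 rbps=max wbps=104857600
# Apply with: io_max_line 8 0 max 100 > /sys/fs/cgroup/limited-io/io.max
```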
### 4. Fix swap-induced I/O
Check swap activity:
```bash
# Check swap usage
free -h
# Or
swapon --show

# Check swap activity
vmstat 1 5
# si: Swap in (pages read from disk into memory)
# so: Swap out (pages written from memory to disk)
# If si/so are consistently > 0, the system is swapping

# Check which processes are using swap
for file in /proc/[0-9]*/status; do
  awk '/^Name|^VmSwap/ {printf "%s ", $2} END {print ""}' "$file"
done | sort -k2 -n -r | head -20

# Or use smem (if installed)
smem -s swap -r
```
Reduce swap tendency:
```bash
# Check current swappiness (0-100)
cat /proc/sys/vm/swappiness

# Default: 60
# Lower = less aggressive swapping
# Higher = more aggressive swapping

# Temporarily reduce swappiness
sysctl vm.swappiness=10

# Persistently reduce swappiness
echo "vm.swappiness=10" >> /etc/sysctl.conf
sysctl -p

# For database servers, go even lower
echo "vm.swappiness=1" >> /etc/sysctl.conf

# Check dirty page writeback settings
cat /proc/sys/vm/dirty_ratio              # Default: 20 (% of RAM)
cat /proc/sys/vm/dirty_background_ratio   # Default: 10 (% of RAM)

# Reduce to force earlier, smaller writebacks (smoother I/O)
sysctl vm.dirty_ratio=10
sysctl vm.dirty_background_ratio=5

# Persist
cat >> /etc/sysctl.conf << EOF
vm.dirty_ratio=10
vm.dirty_background_ratio=5
EOF
sysctl -p
```
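The ratio settings translate into absolute buffer sizes, which makes their effect concrete: on a large-RAM box even a small percentage is gigabytes of dirty data waiting to hit disk at once. A quick calculator (integer MB, rounding down):

```shell
# How many MB of dirty pages can accumulate before writeback triggers,
# given total RAM in GB and a dirty ratio in percent.
dirty_mb() {
  local ram_gb=$1 ratio=$2
  echo $(( ram_gb * 1024 * ratio / 100 ))
}

echo "background flush starts after $(dirty_mb 64 5) MB of dirty pages"
echo "writes block after $(dirty_mb 64 10) MB of dirty pages"
```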
Add swap if needed:
```bash
# If the system has no swap and is running out of memory, add a swap file
# Note: this is a workaround for memory pressure, not a fix for I/O issues

# Create swap file
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Add to fstab for persistence
echo "/swapfile none swap sw 0 0" >> /etc/fstab

# Check swap is active
swapon --show

# Better solution: add more RAM or reduce memory usage
```
### 5. Identify and fix I/O-heavy workloads
Database I/O optimization:
```bash
# MySQL/MariaDB - check I/O threads
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A5 "I/O thread"

# MySQL - optimize for I/O, in my.cnf:
# [mysqld]
# innodb_io_capacity=2000            # Adjust to storage (SSD: 2000+, HDD: 200)
# innodb_io_capacity_max=4000
# innodb_flush_method=O_DIRECT       # Bypass double buffering
# innodb_flush_log_at_trx_commit=2   # Less frequent flush (trades durability)
# innodb_log_file_size=1G            # Larger logs = fewer checkpoints

# PostgreSQL - check checkpoint and background writer stats
psql -c "SELECT * FROM pg_stat_bgwriter;"

# PostgreSQL - optimize, in postgresql.conf:
# effective_io_concurrency = 200       # For SSDs
# maintenance_io_concurrency = 200
# checkpoint_completion_target = 0.9   # Spread checkpoint writes
# wal_buffers = 64MB                   # Larger WAL buffer

# MongoDB - check I/O
mongostat 1   # Real-time stats; high dirty/used cache % indicates I/O pressure

# MongoDB - WiredTiger cache tuning, in mongod.conf:
# storage:
#   wiredTiger:
#     engineConfig:
#       cacheSizeGB: 4   # Set to ~50-60% of RAM
```
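The innodb_io_capacity guidance above can be captured as a lookup so provisioning scripts stay consistent. The numbers mirror the comments in the block and are rule-of-thumb starting points, to be validated against measured IOPS:

```shell
# Suggest a starting innodb_io_capacity value by storage class.
# These are rough defaults, not tuned values.
suggest_io_capacity() {
  case "$1" in
    hdd)  echo 200 ;;
    ssd)  echo 2000 ;;
    nvme) echo 10000 ;;
    *)    echo "usage: suggest_io_capacity hdd|ssd|nvme" >&2; return 1 ;;
  esac
}

echo "innodb_io_capacity=$(suggest_io_capacity ssd)"
```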
Log file I/O optimization:
```bash
# Check log write volume
ls -lah /var/log/
du -sh /var/log/*

# Rotate logs more frequently
# Note: "hourly" only works if logrotate itself runs hourly (cron/systemd timer)
cat > /etc/logrotate.d/app << 'EOF'
/var/log/app/*.log {
    hourly
    rotate 24
    compress
    delaycompress
    missingok
    notifempty
    create 0640 app app
}
EOF

# Use tmpfs for high-volume logs (only if losing them on reboot is acceptable)
mount -t tmpfs -o size=500M tmpfs /var/log/app

# Or discard high-volume logs entirely (development only)
# /etc/rsyslog.d/99-ignore.conf:
#   :programname, isequal, "noisy-service" stop

# Use async logging (application-level)
# Python: QueueHandler + QueueListener
# Java: AsyncAppender in Log4j/Logback
# Node.js: pino with an asynchronous destination
```
Backup job I/O management:
```bash
# Schedule backups during off-peak hours (crontab -e)
# Run backup at 3 AM on Sunday
0 3 * * 0 /usr/local/bin/backup.sh

# Limit backup I/O impact in the backup script:
ionice -c 3 nice -n 19 rsync -av /source /backup
ionice -c 3 nice -n 19 tar czf /backup/backup.tar.gz /data

# Use rsync with a bandwidth limit (KB/s)
rsync -av --bwlimit=50000 /source /backup   # ~50MB/s limit

# Pause the backup while I/O is saturated
while true; do
  # %util is the last column of extended iostat output; sample twice and
  # take the second reading (the first is the since-boot average)
  util=$(iostat -dx sda 1 2 | awk '$1 == "sda" {u = $NF} END {print u}')
  if (( $(echo "$util > 80" | bc -l) )); then
    echo "I/O saturated, pausing backup"
    sleep 60
  else
    break   # Continue with the backup
  fi
done
```
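The pause loop generalizes into a reusable helper that polls any command printing an integer %util. In this sketch `fake_util` is a stub for demonstration; in production the first argument would be an iostat-based one-liner, and `THROTTLE_SLEEP` (an assumed knob) controls the poll interval:

```shell
# Block until the reported utilization drops below a threshold, or give
# up after max_tries polls. util_cmd must print an integer percentage.
throttle_until_idle() {
  local util_cmd=$1 threshold=$2 max_tries=${3:-60}
  local i util
  for (( i = 0; i < max_tries; i++ )); do
    util=$($util_cmd)
    (( util < threshold )) && return 0
    sleep "${THROTTLE_SLEEP:-60}"
  done
  return 1
}

# Demo with a stub that always reports 20% utilization
fake_util() { echo 20; }
THROTTLE_SLEEP=0
throttle_until_idle fake_util 80 3 && echo "I/O idle, starting backup"
```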
### 6. Monitor and alert on I/O saturation
Set up I/O monitoring:
```bash
# Simple I/O monitoring script
cat > /usr/local/bin/io-monitor.sh << 'EOF'
#!/bin/bash

THRESHOLD_UTIL=80
THRESHOLD_WAIT=30

# Get I/O stats: %util from the second (1-second) iostat sample,
# wa from the last vmstat sample
UTIL=$(iostat -dx sda 1 2 | awk '$1 == "sda" {u = $NF} END {print int(u)}')
WAIT=$(vmstat 1 2 | tail -1 | awk '{print $16}')

if [ "$UTIL" -gt "$THRESHOLD_UTIL" ] || [ "$WAIT" -gt "$THRESHOLD_WAIT" ]; then
    echo "$(date): I/O saturation detected - util=${UTIL}%, wait=${WAIT}%" >> /var/log/io-alerts.log

    # Log top I/O processes
    echo "Top I/O processes:" >> /var/log/io-alerts.log
    iotop -b -n 5 -o >> /var/log/io-alerts.log

    # Send alert (integrate with your monitoring)
    # curl -X POST https://alerting.example.com/webhook -d "I/O saturation on $(hostname)"
fi
EOF

chmod +x /usr/local/bin/io-monitor.sh

# Run every 5 minutes
# Warning: "crontab -" replaces the entire crontab; append via "crontab -l" if entries exist
echo "*/5 * * * * /usr/local/bin/io-monitor.sh" | crontab -
```
Prometheus node_exporter metrics:
```yaml
# node_exporter provides I/O metrics for Prometheus
# Key metrics:
# - node_disk_io_time_seconds_total: Time spent doing I/O
# - node_disk_reads_completed_total: Total reads
# - node_disk_writes_completed_total: Total writes
# - node_disk_io_time_weighted_seconds_total: Queue time

# Prometheus alert rules
# /etc/prometheus/rules/io-alerts.yml

groups:
  - name: disk-io
    rules:
      - alert: DiskIOSaturation
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O saturated on {{ $labels.device }}"
          description: "Device {{ $labels.device }} is at {{ $value }}% utilization"

      - alert: DiskIOWaitHigh
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"
          description: "I/O wait is {{ $value }}%"
```
## Prevention
- Monitor I/O utilization with alerting at 70%, 80% thresholds
- Use SSDs for I/O-intensive workloads (databases, logs)
- Implement cgroup I/O limits for multi-tenant systems
- Schedule batch jobs (backups, reports) during off-peak hours
- Use ionice for non-critical background tasks
- Configure appropriate I/O scheduler for workload type
- Tune database I/O settings (innodb_io_capacity, effective_io_concurrency)
- Implement async logging where possible
- Use tmpfs for high-volume temporary files
- Document I/O tuning runbook for common scenarios
## Related Errors
- **Linux out of memory**: Memory exhaustion triggering OOM killer
- **Linux load average high**: CPU or I/O bottleneck
- **Disk full no space left**: Filesystem capacity exhausted
- **Too many open files**: File descriptor limit reached
- **Kernel panic unable to mount root fs**: Boot disk failure