Introduction

Linux disk and filesystem errors occur when storage subsystems experience hardware failures, filesystem corruption, space exhaustion, I/O bottlenecks, or logical volume management issues. They manifest as read/write failures, slow disk performance, filesystem mount failures, inode exhaustion preventing file creation, LVM volume activation failures, and RAID degradation. Common causes include disk hardware failures (bad sectors, failing controllers), filesystem metadata corruption from unclean shutdowns, disk space or inode exhaustion, I/O scheduler misconfiguration, LVM metadata corruption, RAID array degradation, disk queue saturation causing high latency, and kernel I/O errors from failing hardware. Fixing them requires understanding the Linux storage stack (block devices, filesystems, LVM, RAID), the diagnostic tools (smartmontools, iostat, fsck), and the recovery procedures. This guide provides production-proven troubleshooting steps for disk and filesystem issues across physical servers, VMs, and cloud instances.

Symptoms

  • No space left on device errors when writing files
  • No space left on device even though df -h shows free space (inode exhaustion)
  • I/O error in dmesg or system logs
  • EXT4-fs error: unable to read inode bitmap
  • XFS: Internal error xfs_trans_cancel at line xxx
  • read-only file system errors (filesystem remounted read-only)
  • sd X: [sda] Sense Key : Medium Error
  • SMART warnings: Reallocated_Sector_Ct, Current_Pending_Sector
  • lvm> prompt instead of normal LVM commands (LVM metadata issue)
  • mdadm: /dev/md0 has been shut down (RAID failure)
  • High I/O wait (> 50% in top/iostat)
  • Buffer I/O error on device X
  • Filesystem mount fails: mount: unknown filesystem type
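
Note that block and inode exhaustion both surface to applications as ENOSPC, so they are easy to confuse. A minimal sketch to tell them apart (assumes GNU coreutils `df` with `--output` support; `check_space` is a hypothetical helper name):

```bash
#!/bin/bash
# Distinguish "disk full" from "inode exhaustion" for a mount point.
# Both return ENOSPC to applications, but df -h vs df -i tells them apart.
check_space() {
    local mount="$1"
    local blocks inodes
    blocks=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
    inodes=$(df --output=ipcent "$mount" | tail -1 | tr -dc '0-9')
    if [ "${blocks:-0}" -ge 100 ]; then
        echo "blocks exhausted on $mount"
    elif [ "${inodes:-0}" -ge 100 ]; then
        echo "inodes exhausted on $mount"
    else
        echo "ok: ${blocks}% blocks, ${inodes}% inodes used on $mount"
    fi
}

check_space /
```

On filesystems that do not track fixed inode counts, `df -i` reports `-`, which the sketch treats as 0.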

Common Causes

  • Disk space exhausted (100% usage)
  • Inode exhausted (all inodes used, often from many small files)
  • Disk hardware failure (bad sectors, head crash)
  • Filesystem corruption from power loss or crash
  • Journal corruption preventing filesystem replay
  • LVM metadata corruption or missing PV/VG
  • RAID array degraded or failed
  • I/O scheduler causing latency spikes
  • Disk queue depth exceeded under load
  • NFS mount stale file handles
  • Kernel bug or driver issue causing I/O errors
  • Cloud volume detached or IOPS limit reached

Step-by-Step Fix

### 1. Diagnose disk issues

Check disk space and inodes:

```bash
# Check disk space
df -h
# Output shows filesystem, size, used, available, use%

# Check inode usage
df -i
# Output shows inode total, used, free, use%

# If inode usage > 90%, finding and removing small files helps
# Common culprits: session files, cache files, mail queues

# Find directories with most files
for i in /*; do echo "$i $(find "$i" 2>/dev/null | wc -l)"; done | sort -k2 -rn | head -20

# Or more targeted
find /var -type f | cut -d/ -f2-3 | sort | uniq -c | sort -rn | head -20

# Check specific directory sizes
du -sh /* 2>/dev/null | sort -hr | head -20

# Find large files
find / -type f -size +1G -exec ls -lh {} \; 2>/dev/null

# Find recently modified large files
find / -type f -mtime -1 -size +100M -exec ls -lh {} \; 2>/dev/null
```

Check disk health (SMART):

```bash
# Install smartmontools
apt install smartmontools    # Debian/Ubuntu
yum install smartmontools    # RHEL/CentOS

# Check disk health
smartctl -H /dev/sda
# Output:
# SMART overall-health self-assessment test result: PASSED
# Or: FAILED (imminent drive failure)

# Detailed SMART info
smartctl -A /dev/sda
# Key attributes to watch:
# Reallocated_Sector_Ct  - Should be 0 (or low and stable)
# Current_Pending_Sector - Should be 0 (sectors waiting for remap)
# Offline_Uncorrectable  - Should be 0
# UDMA_CRC_Error_Count   - Increasing = bad cable
# Power_On_Hours         - Drive age
# Wear_Leveling_Count    - SSD wear level

# Run SMART self-test
smartctl -t short /dev/sda   # 2-5 minutes
smartctl -t long /dev/sda    # Several hours

# Check test results
smartctl -l selftest /dev/sda

# Enable SMART monitoring
# /etc/smartmontools/smartd.conf (Debian/Ubuntu; /etc/smartd.conf on RHEL):
# /dev/sda -a -m admin@example.com

# Start the service (named smartd on RHEL)
systemctl enable smartmontools
systemctl start smartmontools
```

Check I/O performance:

```bash
# Real-time I/O stats
iostat -xz 1
# Output interpretation:
# Device  r/s   w/s   rkB/s  wkB/s   rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
# sda     10.5  25.3  512.2  1024.5  0.02   5.21   0.19  17.04 0.52    1.23    0.05   48.78    40.50    0.38  1.36

# Key metrics:
# %util           - 100% = saturated
# r_await/w_await - > 10ms indicates a problem
# aqu-sz          - Average queue length, > 1 indicates backlog
# svctm           - Service time, should be < 5ms for HDD, < 1ms for SSD
#                   (deprecated in newer iostat versions)

# Check I/O wait in top
top
# Look at %id (idle) vs %wa (iowait)
# iowait > 20% indicates disk bottleneck

# Per-process I/O
iotop
iotop -o   # Show only processes doing I/O

# Per-process disk read/write rates
pidstat -d 1

# Block device queue depth
cat /sys/block/sda/queue/nr_requests
```
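
As a rough complement to top, the iowait share can also be computed directly from the cumulative per-state CPU counters the kernel keeps in /proc/stat (a sketch; this is the average since boot, not the current rate, and assumes the standard field order where field 6 of the aggregate `cpu` line is iowait):

```bash
# Percentage of total CPU time spent in iowait since boot
# (/proc/stat "cpu" line: user nice system idle iowait irq softirq ...)
awk '/^cpu / {
    total = 0
    for (f = 2; f <= NF; f++) total += $f
    printf "%.1f%% iowait since boot\n", 100 * $6 / total
}' /proc/stat
```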

### 2. Fix disk space exhaustion

Free disk space:

```bash
# Clear package manager cache
# Debian/Ubuntu
apt-get clean
apt-get autoremove

# RHEL/CentOS
yum clean all
dnf clean all

# Clear journal logs
journalctl --vacuum-time=7d
# Or limit by size
journalctl --vacuum-size=500M

# Remove old kernels (Debian/Ubuntu)
apt-get autoremove --purge

# Find and remove large log files
find /var/log -type f -size +100M -exec ls -lh {} \;
# Truncate instead of delete (keeps the file handle valid for processes still writing)
> /var/log/large-log-file.log

# Clear temporary files
rm -rf /tmp/*
rm -rf /var/tmp/*

# Clear user caches
rm -rf ~/.cache/*

# Find and clear core dumps
find / -name "core.*" -type f -size +10M -exec rm {} \; 2>/dev/null

# Clear old Docker images/containers
docker system prune -a
docker builder prune
```

Prevent future exhaustion:

```bash
# Configure log rotation
# /etc/logrotate.conf
/var/log/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0640 root root
}

# Set up disk space monitoring
# /etc/cron.daily/disk-check
#!/bin/bash
THRESHOLD=85
USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage at ${USAGE}%" | mail -s "Disk Alert" admin@example.com
fi

# Use LVM snapshots for safe cleanup
# Create snapshot before major cleanup
lvcreate -L 10G -s -n root-snap /dev/vg0/root
# If something breaks, restore the snapshot
lvconvert --merge /dev/vg0/root-snap
```

### 3. Fix inode exhaustion

Diagnose inode usage:

```bash
# Check inode usage per filesystem
df -i

# Find directories with most inodes (files)
for i in /*; do echo "$i $(find "$i" 2>/dev/null | wc -l)"; done | sort -k2 -rn | head -20

# Common inode hogs:
# /var/spool/postfix - Mail queue
# /var/lib/docker    - Docker layers
# /tmp               - Session files
# /var/cache         - Package cache

# Count files per directory (one level deep)
find /var -xdev -type d 2>/dev/null | while read -r d; do
    echo "$(find "$d" -maxdepth 1 -type f | wc -l) $d"
done | sort -rn | head -20

# Find and clear session files
find /var/lib/php/sessions -type f -mtime +1 -delete
find /tmp -name "sess_*" -type f -mtime +1 -delete

# Inspect the mail queue
mailq | tail -1            # Shows queue size summary
# Delete all queued mail (destructive!)
postsuper -d ALL 2>/dev/null
```

Fix inode exhaustion:

```bash
# Option 1: Delete unnecessary small files
# Clear old session files
find /var/lib/php/sessions -type f -mtime +7 -delete

# Clear old cache files
find /var/cache -type f -atime +30 -delete

# Clear old logs
find /var/log -type f -name "*.log.*" -mtime +30 -delete

# Option 2: Reformat with more inodes
# Backup data first!
# mkfs.ext4 -i 4096 /dev/sda1   # 1 inode per 4KB
# The default is 1 inode per 16KB; lowering -i to 4096 or 2048 creates more inodes

# Option 3: Move small files to separate filesystem
# Create new filesystem with more inodes
mkfs.ext4 -i 4096 /dev/sdb1
mount /dev/sdb1 /var/small-files

# Option 4: Use XFS instead of ext4
# XFS allocates inodes dynamically, so it handles many small files better
# mkfs.xfs /dev/sda1
```
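
When picking the `-i` (bytes-per-inode) value for Option 2 or 3, the filesystem's current usage ratio is a reasonable guide. A sketch with hypothetical numbers (`bytes_per_inode` is an illustrative helper; in practice feed it used bytes from `df` and used inodes from `df -i`):

```bash
# Estimate a mkfs.ext4 -i value: observed bytes per inode, halved for headroom
bytes_per_inode() {
    local used_bytes=$1 used_inodes=$2
    echo $(( used_bytes / used_inodes / 2 ))
}

# Example: 1 GiB of data spread across 262144 files
bytes_per_inode 1073741824 262144   # -> 2048
```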

### 4. Fix filesystem corruption

Check filesystem status:

```bash
# Check if filesystem is mounted
mount | grep /dev/sda1

# Check filesystem type
df -T | grep /dev/sda1

# Check for errors (read-only check)
# ext4
fsck -n /dev/sda1
# xfs (filesystem must be unmounted)
xfs_repair -n /dev/sda1

# Check kernel messages for filesystem errors
dmesg | grep -E "EXT4|XFS|error"
```

Repair ext4 filesystem:

```bash
# WARNING: Always backup before fsck!
# Unmount filesystem first
umount /dev/sda1

# If busy, find what's using it
lsof +f -- /mount/point
fuser -vm /mount/point

# Lazy unmount if needed
umount -l /dev/sda1

# Run fsck
fsck -y /dev/sda1      # -y automatically fixes errors

# For severe corruption
fsck -f -y /dev/sda1   # -f forces full check

# Check result
echo $?
# 0 = no errors
# 1 = errors corrected
# 2 = errors corrected, reboot recommended
# 4 = uncorrected errors remain

# Remount
mount /dev/sda1

# If filesystem won't mount, check superblock
# List backup superblocks
mke2fs -n /dev/sda1 2>&1 | grep -i superblock

# Use a backup superblock
fsck -b 32768 /dev/sda1
```
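
The fsck exit status is actually a bit mask, so the values above can combine (e.g. 3 = errors corrected + reboot recommended). A small decoder for maintenance scripts (`decode_fsck` is a hypothetical helper name; the bit meanings follow fsck(8)):

```bash
# Decode the fsck(8) exit-status bit mask
decode_fsck() {
    local rc=$1 msg=""
    [ $(( rc & 1 )) -ne 0 ] && msg="$msg errors-corrected"
    [ $(( rc & 2 )) -ne 0 ] && msg="$msg reboot-recommended"
    [ $(( rc & 4 )) -ne 0 ] && msg="$msg errors-left-uncorrected"
    [ $(( rc & 8 )) -ne 0 ] && msg="$msg operational-error"
    [ "$rc" -eq 0 ] && msg=" clean"
    echo "fsck rc=$rc:$msg"
}

decode_fsck 3   # -> fsck rc=3: errors-corrected reboot-recommended
```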

Repair XFS filesystem:

```bash
# XFS repair (must be unmounted for full repair)
umount /dev/sda1

# Dry run first
xfs_repair -n /dev/sda1

# Actual repair
xfs_repair /dev/sda1

# If the log is corrupted, it may need to be zeroed
xfs_repair -L /dev/sda1   # -L zeroes the log (potential data loss!)

# XFS replays its journal automatically on mount,
# so try a normal mount before resorting to xfs_repair

# Check XFS geometry and superblock
xfs_info /dev/sda1
xfs_db -r -c "sb 0" -c "print" /dev/sda1
```

Filesystem remounted read-only:

```bash
# EXT4-fs error: Remounting filesystem read-only
# This is protective - the filesystem detected corruption

# Check errors
dmesg | tail -50 | grep -i ext4

# Try to remount read-write (may fail if serious)
mount -o remount,rw /dev/sda1

# If that fails, unmount and fsck
umount /dev/sda1
fsck -y /dev/sda1
mount /dev/sda1

# Error behavior options (changing the default masks problems; not recommended)
# /etc/fstab
# /dev/sda1 /data ext4 errors=continue 0 2
# errors=continue:   Don't remount RO on error
# errors=remount-ro: Default, safest
# errors=panic:      Panic kernel on error
```

### 5. Fix LVM issues

Check LVM status:

```bash
# Check physical volumes
pvs
pvdisplay

# Check volume groups
vgs
vgdisplay

# Check logical volumes
lvs
lvdisplay

# Scan for LVM metadata
vgscan
pvscan

# If a VG is not found, its metadata may need to be restored
# LVM keeps backups in /etc/lvm/backup/ and /etc/lvm/archive/
```

Activate LVM volumes:

```bash
# If VG is inactive
vgchange -ay   # -ay = activate all

# Activate specific VG
vgchange -ay vg0

# If a PV is missing
pvs --all   # Shows missing PVs

# Rescan for the missing PV
pvscan --cache

# If the PV is truly lost, remove it from the VG
vgreduce --removemissing vg0
# WARNING: Data on the missing PV is lost!

# Restore LVM metadata from backup
# List backups
ls -la /etc/lvm/backup/
ls -la /etc/lvm/archive/

# Restore
vgcfgrestore -f /etc/lvm/backup/vg0 vg0

# Reactivate
vgchange -ay vg0
```

Extend LVM volumes:

```bash
# Check available space in VG
vgs   # Look for VFree

# Extend LV
lvextend -L +10G /dev/vg0/lv_root
# Or extend into all free space
lvextend -l +100%FREE /dev/vg0/lv_root

# Resize the filesystem
# ext4 (takes the device)
resize2fs /dev/vg0/lv_root
# xfs (takes the mount point)
xfs_growfs /mount/point

# One-liner (-r makes lvextend resize the filesystem automatically)
lvextend -r -l +100%FREE /dev/vg0/lv_root
```
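
The ext4/xfs split is a common source of mistakes, since resize2fs takes the device while xfs_growfs takes the mount point. A sketch that picks the right tool from the filesystem type (`grow_cmd` is a hypothetical helper; in practice feed it `blkid -o value -s TYPE <device>`):

```bash
# Map a filesystem type to its grow command
grow_cmd() {
    case "$1" in
        ext2|ext3|ext4) echo "resize2fs <device>" ;;
        xfs)            echo "xfs_growfs <mountpoint>" ;;
        btrfs)          echo "btrfs filesystem resize max <mountpoint>" ;;
        *)              echo "unsupported: $1" >&2; return 1 ;;
    esac
}

grow_cmd ext4   # -> resize2fs <device>
grow_cmd xfs    # -> xfs_growfs <mountpoint>
```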

### 6. Fix RAID array issues

Check RAID status:

```bash
# Check mdadm status
cat /proc/mdstat

# Detailed status
mdadm --detail /dev/md0

# Check for failed drives
mdadm --detail /dev/md0 | grep -i fail

# Example mdadm --detail output:
# /dev/md0:
#            Version : 1.2
#      Creation Time : ...
#         Raid Level : raid1
#         Array Size : 1048576 (1.00 GiB)
#      Used Dev Size : 1048576 (1.00 GiB)
#       Raid Devices : 2
#      Total Devices : 2
#        Persistence : Superblock is persistent
#              State : clean, degraded    <-- DEGRADED = problem!
#     Active Devices : 1
#    Working Devices : 1
#     Failed Devices : 1
#      Spare Devices : 0
#               Name : server:0
#               UUID : ...
#             Events : 12345
#
#     Number   Major   Minor   RaidDevice State
#        0       8        1        0      active sync    /dev/sda1
#        1       8       17        1      faulty removed /dev/sdb1
```

Replace failed RAID drive:

```bash
# Mark drive as failed (if not auto-detected)
mdadm /dev/md0 --fail /dev/sdb1

# Remove failed drive
mdadm /dev/md0 --remove /dev/sdb1

# Physically replace the drive, then partition the new one
# Copy partition table from the surviving drive
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add new drive to array
mdadm /dev/md0 --add /dev/sdb1

# Monitor rebuild
watch cat /proc/mdstat
# Shows: [=>...........] 10% rebuild

# Update mdadm config
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# Or on Debian/Ubuntu:
dpkg-reconfigure mdadm
```

### 7. Monitor disk health

Prometheus node_exporter metrics:

```yaml
# Key disk metrics from node_exporter

# Disk space
#   node_filesystem_avail_bytes
#   node_filesystem_size_bytes
#   node_filesystem_free_bytes

# Inodes
#   node_filesystem_files_free
#   node_filesystem_files

# I/O stats
#   node_disk_reads_completed_total
#   node_disk_writes_completed_total
#   node_disk_read_bytes_total
#   node_disk_written_bytes_total
#   node_disk_io_time_seconds_total
#   node_disk_io_now

# SMART metrics (requires smartctl_exporter)
#   smartctl_disk_smart_status
#   smartctl_disk_reallocated_sector_ct
#   smartctl_disk_current_pending_sector

# Prometheus alert rules
groups:
  - name: disk_health
    rules:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10%"
          description: "{{ $labels.mountpoint }} at {{ $value | humanizePercentage }}"

      - alert: DiskInodesLow
        expr: (node_filesystem_files_free / node_filesystem_files) < 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Inode usage above 90%"

      - alert: DiskIOUtilHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O utilization above 90%"

      - alert: DiskSMARTFailure
        expr: smartctl_disk_smart_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk SMART status failed"
          description: "Disk {{ $labels.device }} may be failing"
```

Daily disk health check script:

```bash
#!/bin/bash
# /usr/local/bin/disk-health-check.sh

THRESHOLD=85

# Check disk space
while read -r line; do
    usage=$(echo "$line" | awk '{print $5}' | tr -d '%')
    mount=$(echo "$line" | awk '{print $6}')
    if [ "$usage" -gt "$THRESHOLD" ] 2>/dev/null; then
        echo "CRITICAL: $mount at ${usage}%"
    fi
done < <(df -h | tail -n +2)

# Check inode usage
while read -r line; do
    usage=$(echo "$line" | awk '{print $5}' | tr -d '%')
    mount=$(echo "$line" | awk '{print $6}')
    if [ "$usage" -gt "$THRESHOLD" ] 2>/dev/null; then
        echo "WARNING: $mount inodes at ${usage}%"
    fi
done < <(df -i | tail -n +2)

# Check SMART status
for disk in /dev/sd[a-z]; do
    if smartctl -H "$disk" 2>/dev/null | grep -q "FAILED"; then
        echo "CRITICAL: SMART failure on $disk"
    fi
done

# Check RAID status
if [ -f /proc/mdstat ]; then
    if grep -q "inactive\|_U\|U_" /proc/mdstat; then
        echo "CRITICAL: RAID array degraded"
    fi
fi

# Check for I/O errors in the kernel log
# (grep -c always prints a count, so no fallback echo is needed)
errors=$(dmesg 2>/dev/null | grep -c "I/O error")
if [ "$errors" -gt 0 ]; then
    echo "WARNING: $errors I/O errors in dmesg"
fi
```

Prevention

  • Monitor disk space and inodes with alerting (threshold: 85%)
  • Enable SMART monitoring with email alerts
  • Use LVM for flexible volume management
  • Implement log rotation with size limits
  • Regular filesystem checks during maintenance windows
  • Use RAID for data redundancy (not backup!)
  • Keep spare drives available for RAID replacement
  • Document disk layout and recovery procedures
  • Test backup restoration regularly
  • Use UPS to prevent corruption from power loss

Related Error Codes

  • **ENOSPC**: No space left on device
  • **EMFILE**: Too many open files (per-process limit)
  • **ENFILE**: Too many open files in system (system-wide limit)
  • **EROFS**: Read-only filesystem
  • **EIO**: I/O error
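
These names map to kernel errno values; a quick way to print the exact message each one produces (a sketch assuming python3 is available; on Linux/glibc the strings match the symptoms listed earlier):

```bash
# Print the strerror text for each errno named above
python3 - <<'EOF'
import errno, os
for name in ("ENOSPC", "EMFILE", "ENFILE", "EROFS", "EIO"):
    print(f"{name:7} ({getattr(errno, name):3}) {os.strerror(getattr(errno, name))}")
EOF
```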