Introduction

When Write-Ahead Log (WAL) or redo log archiving falls behind, log files accumulate on disk until the filesystem reaches 100% capacity. At this point, the database stops accepting writes, and in severe cases even reads fail. This is a critical production incident.

Symptoms

- PostgreSQL reports `ERROR: could not write to file "pg_wal/xlog": No space left on device`
- Oracle reports `ORA-00257: archiver error. Connect internal only, until freed`
- The database becomes read-only or completely unresponsive
- Monitoring alerts show disk usage at 100% on the data partition
- PostgreSQL logs show repeated `archive_command` failures
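To confirm the symptoms from a log file rather than from an alert, a small helper like the one below can count the matching lines. This is a minimal sketch: the function name `count_disk_full_errors` is hypothetical, and the log file location depends on your installation, so it is passed in as an argument.

```bash
# Count log lines that match the disk-full symptoms listed above.
# Usage: count_disk_full_errors /path/to/database.log
count_disk_full_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the function usable under "set -e".
    grep -c -E 'No space left on device|ORA-00257' "$1" || true
}
```

A result that keeps growing between two runs suggests the incident is still in progress.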

Common Causes

- The archive destination (S3, NFS, backup server) is unreachable or full
- `archive_timeout` is set too low, forcing frequent segment switches and generating excessive WAL files
- A network outage prevents WAL shipping to the standby or archive location
- Log rotation is not configured, so archived WAL accumulates on the same disk
- Archiving or backup has stalled, so PostgreSQL keeps every unarchived segment in `pg_wal` (a segment is not removed until it has been archived successfully, regardless of `wal_keep_size`)
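The effect of a low `archive_timeout` can be estimated with simple arithmetic: each forced switch archives a full-size segment, so the upper bound per day is the number of timeout intervals multiplied by the segment size. The sketch below assumes the default 16 MB segment size unless told otherwise; the function name `max_daily_wal_mb` is hypothetical, and the result is a worst case because PostgreSQL only forces a switch when there has been WAL activity.

```bash
# Worst-case archived WAL volume per day (in MB) caused by archive_timeout alone.
# $1 = archive_timeout in seconds, $2 = WAL segment size in MB (default 16).
max_daily_wal_mb() {
    timeout_s=$1
    segment_mb=${2:-16}
    # archive_timeout = 0 disables forced segment switches.
    if [ "$timeout_s" -le 0 ]; then
        echo 0
        return 0
    fi
    # One forced segment switch per timeout interval, 86400 seconds per day.
    echo $(( 86400 / timeout_s * segment_mb ))
}
```

For example, `max_daily_wal_mb 60` prints `23040` (about 22.5 GB per day), while `max_daily_wal_mb 300` prints `4608`.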

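Step-by-Step Fix

1. **Immediately identify disk usage to find the largest consumers**:

   ```bash
   # Overall usage of the filesystem that holds the data directory
   df -h /var/lib/postgresql
   # Largest files in pg_wal (Debian/Ubuntu layout: <version>/<cluster>/pg_wal)
   du -sh /var/lib/postgresql/*/*/pg_wal/* | sort -rh | head -20
   ```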

2. **Check current WAL usage and the oldest WAL that is still required**:

   ```sql
   -- Check current WAL usage (WAL written since the last checkpoint's redo point)
   SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_segment,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)) AS used_wal
   FROM pg_control_checkpoint();

   -- Check oldest required WAL
   SELECT slot_name, restart_lsn, active FROM pg_replication_slots;
   ```

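3. **Free space by moving archived WAL to alternative storage**:

   ```bash
   # Move already-archived WAL copies to a temp location on a different disk.
   # ARCHIVE_DIR is a placeholder: set it to the local directory your archive_command writes to.
   # Do not move or delete files inside pg_wal/ or pg_wal/archive_status/ by hand:
   # PostgreSQL needs them for crash recovery and for tracking archiver state.
   ARCHIVE_DIR=/path/to/local/wal_archive
   mkdir -p /mnt/backup/pg_wal_archive
   mv "$ARCHIVE_DIR"/* /mnt/backup/pg_wal_archive/
   ```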
4. **Remove inactive replication slots that are holding WAL**:

   ```sql
   -- List slots with the amount of WAL each one forces the server to keep
   SELECT slot_name, active, restart_lsn,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
   FROM pg_replication_slots;

   -- If a slot is inactive and holding WAL:
   SELECT pg_drop_replication_slot('orphaned_standby_slot');
   ```
5. **Fix the archive command and restart archiving**:

   ```sql
   -- Check archive status
   SELECT * FROM pg_stat_archiver;

   -- Verify archive_command is correct
   SHOW archive_command;

   -- Fix and reload
   ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';
   SELECT pg_reload_conf();
   ```

6. **For Oracle, manually archive the current log and delete old archived redo logs**:

   ```sql
   -- Check archive destination status
   SELECT dest_id, status, error FROM v$archive_dest;

   -- Archive current log
   ALTER SYSTEM ARCHIVE LOG CURRENT;

   -- Delete archived logs older than 2 days using RMAN
   -- rman target /
   -- DELETE ARCHIVELOG UNTIL TIME 'SYSDATE-2';
   ```
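When fixing the PostgreSQL `archive_command`, it can help to see exactly what the server would run for a given segment and try that command by hand. The helper below is a rough sketch, not PostgreSQL's own logic: the name `expand_archive_command` is hypothetical, and it only substitutes `%p` (path of the segment) and `%f` (file name only), without special handling for a literal `%%`.

```bash
# Expand the %p and %f placeholders of an archive_command for a manual dry run.
# $1 = archive_command string
# $2 = path to a WAL segment, relative to the data directory
expand_archive_command() {
    cmd=$1
    wal_path=$2
    wal_file=${wal_path##*/}    # %f is the file name without directories
    # Escape characters that are special in a sed replacement: \, & and the | delimiter.
    esc_path=$(printf '%s' "$wal_path" | sed 's/[\\&|]/\\&/g')
    esc_file=$(printf '%s' "$wal_file" | sed 's/[\\&|]/\\&/g')
    printf '%s\n' "$cmd" | sed "s|%p|$esc_path|g; s|%f|$esc_file|g"
}

# Example (prints the command only, does not run it):
#   expand_archive_command 'wal-g wal-push %p' pg_wal/000000010000000000000001
```

Running the printed command as the database OS user from the data directory, and checking its exit status, shows whether archiving would succeed: PostgreSQL treats the segment as archived only when the command returns zero.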

Prevention

- Monitor disk usage with alerting at 70%, 85%, and 95% thresholds
- Use a separate disk or mount point for WAL/archive logs
- Keep `wal_keep_size` at a reasonable value (e.g., 10GB) and, on PostgreSQL 13 or later, cap the WAL that replication slots can retain with `max_slot_wal_keep_size`
- Implement automated archive cleanup, for example with `archive_cleanup_command` (typically calling `pg_archivecleanup`) on standbys
- Use streaming replication slots carefully and monitor their lag
- Test disaster recovery procedures, including disk-full scenarios, quarterly
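The three alert thresholds can be sketched as a small check suitable for a cron job or a monitoring wrapper. The function names and the level names (`warning`, `critical`, `emergency`) are illustrative assumptions, not part of any particular monitoring tool.

```bash
# Map a disk-usage percentage to an alert level using the 70/85/95 thresholds.
classify_disk_usage() {
    pct=$1
    if   [ "$pct" -ge 95 ]; then echo emergency
    elif [ "$pct" -ge 85 ]; then echo critical
    elif [ "$pct" -ge 70 ]; then echo warning
    else echo ok
    fi
}

# Current usage (integer percent) of the filesystem that holds a path.
disk_usage_percent() {
    # df -P prints one POSIX-format line per filesystem; column 5 is "NN%".
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Example:
#   classify_disk_usage "$(disk_usage_percent /var/lib/postgresql)"
```

Keeping the classification separate from the measurement makes the thresholds easy to test and to adjust per mount point.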