# PostgreSQL WAL Error - Write-Ahead Log Troubleshooting

Write-Ahead Logging is PostgreSQL's mechanism for ensuring data integrity. When WAL operations fail, you'll see errors ranging from archive failures to corruption messages. Let's work through the most common WAL-related problems.

## Identifying WAL Errors

WAL errors typically appear in the logs with specific messages:

```bash
# Check for WAL-related errors (-E is needed for the | alternation to work)
sudo grep -iE "wal|xlog|archive|segment" /var/log/postgresql/postgresql-*-main.log | tail -100

# Common error patterns to look for:
# - "could not archive WAL file"
# - "no space left on device"
# - "invalid WAL record"
# - "WAL segment is already being archived"
# - "requested WAL segment has already been removed"
```
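To see which of these patterns dominates a log, the matches can be counted per pattern. The following is a minimal sketch: the function name `summarize_wal_errors` is hypothetical, and the pattern list is copied from the comments above rather than from an exhaustive catalogue of PostgreSQL messages.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count how often each known WAL error pattern occurs in a log file.
# Usage: summarize_wal_errors /path/to/postgresql.log
summarize_wal_errors() {
    local logfile="$1"
    local patterns=(
        "could not archive WAL file"
        "no space left on device"
        "invalid WAL record"
        "requested WAL segment has already been removed"
    )
    local pattern count
    for pattern in "${patterns[@]}"; do
        # -c counts matching lines, -i ignores case, -F treats the pattern as a literal string.
        # grep exits non-zero when nothing matches, so "|| true" keeps "set -e" scripts alive.
        count=$(grep -ciF -- "$pattern" "$logfile" || true)
        printf '%6d  %s\n' "$count" "$pattern"
    done
}
```

Called as `summarize_wal_errors /var/log/postgresql/postgresql-16-main.log` (the path is an assumption based on the examples in this article), it prints one count per pattern, which helps decide which of the sections below to read first.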

## WAL Archive Command Failure

The most common WAL error is an archive_command failure. The server log shows messages such as "archive command failed with exit code 1", and PostgreSQL keeps the segment in pg_wal and retries until the command succeeds.

```bash
# Check current archive configuration
psql -U postgres -c "SHOW archive_command;"
psql -U postgres -c "SHOW archive_mode;"

# Check archive status
psql -U postgres -c "
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'archive%';
"

# View failed archive attempts (failed_count is cumulative since the last stats reset)
psql -U postgres -c "
SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
"
```

## Fixing Archive Command Issues

```bash
# Test your archive command manually
# Example archive_command:
archive_command = 'cp %p /backup/wal_archive/%f'

# Test with actual file
sudo -u postgres cp /var/lib/postgresql/16/main/pg_wal/000000010000000000000001 /backup/wal_archive/test

# Check permissions
ls -la /backup/wal_archive/
# Should be writable by postgres user

# Fix permissions
sudo chown -R postgres:postgres /backup/wal_archive/
sudo chmod 755 /backup/wal_archive/
```

## Robust Archive Command Configuration

```bash
# Edit postgresql.conf
sudo nano /etc/postgresql/16/main/postgresql.conf

# Better archive command with error handling
archive_command = 'test ! -f /backup/wal_archive/%f && cp %p /backup/wal_archive/%f'

# Or with rsync for remote archives
archive_command = 'rsync -a %p backup-server:/wal_archive/%f'

# Or using pg_probackup
archive_command = 'pg_probackup archive-push -B /backup --instance main --wal-file-path=%p'

# Reload configuration
sudo systemctl reload postgresql
```
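A plain `cp` can leave a truncated file in the archive if it is interrupted, and the `test ! -f` guard will then refuse every retry. One way around this, sketched below, is to copy under a temporary name and rename the file once the copy is complete. The function name `archive_wal` and the wrapper script path are hypothetical, and the sketch does not fsync the archived file, which a production script may want to do.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: archive_wal <source path (%p)> <file name (%f)> <archive dir>
# Assumed wiring in postgresql.conf (script path is illustrative):
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f /backup/wal_archive'
archive_wal() {
    local src="$1" name="$2" archive_dir="$3"
    local dest="$archive_dir/$name"
    local tmp="$dest.tmp.$$"

    # Never overwrite an existing archived segment.
    # A non-zero exit status tells PostgreSQL the attempt failed, so it will retry later.
    if [ -e "$dest" ]; then
        echo "archive_wal: $dest already exists" >&2
        return 1
    fi

    # Copy under a temporary name first; clean up and fail if the copy breaks.
    cp -- "$src" "$tmp" || { rm -f -- "$tmp"; return 1; }

    # Rename is atomic within one filesystem, so readers never see a partial segment.
    mv -- "$tmp" "$dest"
}
```

The archive command must return zero only when the file is safely stored, which is why every failure path above returns non-zero.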

## WAL Disk Space Issues

When the filesystem holding pg_wal fills up, PostgreSQL can no longer write WAL. It shuts down with a PANIC and stays offline until space is freed:

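```bash
# Check pg_wal size (the directory is readable only by the postgres user)
sudo du -sh /var/lib/postgresql/16/main/pg_wal/

# Count WAL segment files (24 hex digits), skipping archive_status and history files
sudo ls -1 /var/lib/postgresql/16/main/pg_wal/ | grep -cE '^[0-9A-F]{24}$'

# Check WAL disk usage details
# pg_ls_waldir() returns (name, size, modification), so no column alias list is needed
psql -U postgres -c "
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_written,
       pg_size_pretty(sum(size)) AS total_wal_size
FROM pg_ls_waldir();
"
```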
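Since a full pg_wal filesystem takes the server down, it can help to alert well before that point. The sketch below reads the usage percentage from `df -P`; the function names and the 80% default threshold are illustrative assumptions, not PostgreSQL recommendations.

```bash
#!/usr/bin/env bash
# wal_disk_usage_pct <dir>: print the used percentage of the filesystem holding <dir>.
wal_disk_usage_pct() {
    # df -P prints one POSIX-format line per filesystem; column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_wal_disk <dir> [threshold_pct]: return 0 if below the threshold, 1 if at or above it.
check_wal_disk() {
    local dir="$1" threshold="${2:-80}" pct
    pct=$(wal_disk_usage_pct "$dir")
    case "$pct" in
        ''|*[!0-9]*)
            echo "UNKNOWN: could not read disk usage for $dir" >&2
            return 2
            ;;
    esac
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: filesystem of $dir is ${pct}% full (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: filesystem of $dir is ${pct}% full"
}
```

A cron job could run `check_wal_disk /var/lib/postgresql/16/main/pg_wal 80` and send a notification on a non-zero exit status.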

## Clearing WAL Files Safely

```bash
# Check which WAL files are safe to remove
psql -U postgres -c "
SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
       pg_walfile_name_offset(pg_current_wal_lsn()) AS offset;
"

# Check replication slots (these prevent WAL removal)
psql -U postgres -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# If no replication slots and archiving is working, WAL should auto-remove
# Force a checkpoint to trigger cleanup
psql -U postgres -c "CHECKPOINT;"

# Check if archive is keeping up
psql -U postgres -c "
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_failed_wal,
       EXTRACT(EPOCH FROM (now() - last_archived_time)) / 60 AS minutes_since_last_archive
FROM pg_stat_archiver;
"
```

## When WAL Files Must Be Manually Removed

Warning: This should only be done if archiving is hopelessly behind and you have a recent backup.

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql

# Identify files to keep (keep at least 3 recent segments)
ls -lt /var/lib/postgresql/16/main/pg_wal/ | head -5

# Move (don't delete) old files
sudo mkdir -p /tmp/wal_backup
sudo find /var/lib/postgresql/16/main/pg_wal/ -name "0000000*" -mtime +1 -exec mv {} /tmp/wal_backup/ \;

# Start PostgreSQL
sudo systemctl start postgresql

# If PostgreSQL starts successfully, you can later delete the moved files
```

## WAL Corruption

Corrupt WAL segments cause errors like PANIC: invalid WAL record or FATAL: incorrect resource manager ID in checkpoint record.

```bash
# Identify the problematic segment
sudo grep "invalid WAL" /var/log/postgresql/postgresql-*-main.log

# Example: "invalid WAL record at 0/1532B48"
```
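The log reports an LSN, not a file name, so the position has to be mapped to a segment before the file can be inspected. On a running server `pg_walfile_name()` does this; when the server is down, the same arithmetic can be sketched in the shell. The function name `lsn_to_walfile` is hypothetical, and the sketch assumes the default 16 MB wal_segment_size and takes the timeline as a decimal argument (default 1).

```bash
#!/usr/bin/env bash
# lsn_to_walfile <lsn> [timeline]: print the WAL segment file name containing <lsn>.
# Assumes the default 16 MB segment size (0x1000000 bytes per segment).
lsn_to_walfile() {
    local lsn="$1" timeline="${2:-1}"
    local hi="${lsn%%/*}"   # part before the slash: the high 32 bits of the position
    local lo="${lsn##*/}"   # part after the slash: the low 32 bits of the position
    # The low half divided by the segment size gives the segment number within that range.
    local seg=$(( 16#$lo / 16#1000000 ))
    # File name layout: 8 hex digits each for timeline, high half, and segment number.
    printf '%08X%08X%08X\n' "$timeline" "$(( 16#$hi ))" "$seg"
}

lsn_to_walfile 0/1532B48
```

For the example LSN this prints `000000010000000000000001`, the segment used in the commands elsewhere in this article.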

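## Recovery from WAL Corruption

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql

# Option 1: Restore from backup (recommended)
# A base backup is restored by unpacking it into an empty data directory;
# pg_restore only reads pg_dump archives and needs a running server.
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.corrupt
sudo -u postgres mkdir -m 700 /var/lib/postgresql/16/main
sudo -u postgres tar -xf /backup/base_backup.tar -C /var/lib/postgresql/16/main

# Option 2: Point-in-time recovery to before corruption
# Restore base backup, then recover to timestamp before corruption

# Option 3: Last resort - pg_resetwal (DATA LOSS POSSIBLE)
# This should only be used if no backups exist
sudo -u postgres pg_resetwal -f /var/lib/postgresql/16/main

# After pg_resetwal, PostgreSQL will start with:
# - Reset WAL position
# - Potential data inconsistency
# - Broken replication

# Reinitialize replication if you use it (run on each standby, with the standby stopped)
# pg_basebackup needs an empty target directory, so clear the whole data directory first
sudo -u postgres find /var/lib/postgresql/16/main -mindepth 1 -delete
sudo -u postgres pg_basebackup -h primary -U replication -D /var/lib/postgresql/16/main -Fp -Xs -P -R
```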

## Replication Slot Preventing WAL Cleanup

Replication slots ensure WAL is retained for standbys, but orphaned slots can fill the disk:

```bash
# List all replication slots
psql -U postgres -c "
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
"

# Check if slot holder is still active
psql -U postgres -c "
SELECT pid, usename, application_name, client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
"
```

## Removing Orphaned Replication Slots

```bash
# Identify inactive slots with high lag
psql -U postgres -c "
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
"

# Drop inactive slot (this frees retained WAL)
psql -U postgres -c "SELECT pg_drop_replication_slot('inactive_slot_name');"

# Verify slot is gone
psql -U postgres -c "SELECT * FROM pg_replication_slots;"
```
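An inactive slot is not necessarily orphaned: a standby may simply be restarting. Before dropping anything, it can be useful to list only the inactive slots that retain more WAL than some limit. The filter below is a sketch with a hypothetical function name; it expects rows of the form `slot_name|active|lag_bytes` on standard input, which is what psql emits in unaligned, tuples-only mode.

```bash
#!/usr/bin/env bash
# flag_stale_slots <threshold_bytes>: read "slot_name|active|lag_bytes" rows from stdin
# and print the inactive slots whose retained WAL exceeds the threshold.
flag_stale_slots() {
    local threshold="$1"
    awk -F'|' -v limit="$threshold" '
        # psql prints booleans as t/f in unaligned mode; "+ 0" forces a numeric comparison.
        $2 == "f" && $3 + 0 > limit + 0 { print $1 " retains " $3 " bytes" }
    '
}

# Assumed invocation (1 GB threshold); -A is unaligned output, -t suppresses headers:
# psql -U postgres -At -c "SELECT slot_name, active,
#     pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#     FROM pg_replication_slots" | flag_stale_slots $((1024 * 1024 * 1024))
```

The output is only a candidate list. Whether a slot is truly abandoned still has to be confirmed against the standby or logical consumer that created it.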

## WAL Configuration Tuning

Prevent WAL issues with proper configuration:

```bash
# Check current WAL settings
psql -U postgres -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name IN (
    'wal_level', 'wal_keep_size', 'max_wal_size', 'min_wal_size',
    'checkpoint_timeout', 'checkpoint_completion_target',
    'archive_mode', 'archive_timeout'
);
"

# Recommended settings for production
sudo nano /etc/postgresql/16/main/postgresql.conf

# WAL configuration
wal_level = replica                  # minimal, replica, or logical
wal_keep_size = 2GB                  # Keep enough for replication
max_wal_size = 4GB                   # Soft limit on WAL size between checkpoints
min_wal_size = 1GB                   # Min WAL to keep
checkpoint_timeout = 15min           # Time between checkpoints
checkpoint_completion_target = 0.9   # Spread checkpoint over time

# Archive configuration
archive_mode = on
archive_timeout = 300                # Force a segment switch every 5 minutes

# Reload to apply most settings
sudo systemctl reload postgresql

# wal_level and archive_mode only take effect after a restart
sudo systemctl restart postgresql
```
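The archive_timeout value has a storage cost that is easy to overlook: a segment that is switched early is still archived at full size. A rough upper bound for the archive volume caused by timeouts alone can be worked out as follows, assuming the default 16 MB segment size and some WAL activity in every interval (an idle server does not switch segments).

```bash
#!/usr/bin/env bash
# Worked arithmetic: worst-case archive volume caused by forced segment switches.
archive_timeout=300   # seconds, as in the settings above
segment_mb=16         # default wal_segment_size (assumption)

# One forced switch per timeout interval, at most.
segments_per_day=$(( 86400 / archive_timeout ))
archive_mb_per_day=$(( segments_per_day * segment_mb ))

echo "up to ${segments_per_day} forced segments per day"
echo "up to ${archive_mb_per_day} MB of archive per day from timeouts alone"
```

With these values the bound is 288 segments, or 4608 MB per day. A shorter timeout reduces the amount of unarchived data at risk but raises this figure proportionally.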

## Monitoring WAL Health

Set up monitoring to catch WAL issues before they become critical:

```sql
-- Create monitoring view
-- pg_ls_waldir() returns (name, size, modification), so "size" can be summed directly
CREATE OR REPLACE VIEW wal_health AS
SELECT
    pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
    pg_size_pretty(sum(size)) AS total_wal_size,
    (SELECT count(*) FROM pg_replication_slots WHERE NOT active) AS inactive_slots,
    (SELECT archived_count FROM pg_stat_archiver) AS total_archived,
    (SELECT failed_count FROM pg_stat_archiver) AS failed_archives,
    (SELECT EXTRACT(EPOCH FROM (now() - last_archived_time)) / 60
       FROM pg_stat_archiver) AS minutes_since_archive
FROM pg_ls_waldir();

-- Query for health check
SELECT * FROM wal_health;
```
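A view only helps if something reads it. The sketch below turns three of the view's columns into a status line and an exit code in the style of common monitoring plugins (0 OK, 1 warning, 2 critical). The function name, the `MAX_ARCHIVE_AGE_MIN` variable, and the 15-minute default are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# wal_health_status <failed_archives> <minutes_since_archive> <inactive_slots>
# Thresholds are illustrative defaults, not PostgreSQL recommendations.
wal_health_status() {
    local failed="${1:-0}" minutes="${2%%.*}" inactive="${3:-0}"
    local max_minutes="${MAX_ARCHIVE_AGE_MIN:-15}"
    # A NULL value (nothing archived yet) arrives as an empty string; treat it as 0.
    minutes="${minutes:-0}"

    if [ "$minutes" -gt "$max_minutes" ]; then
        echo "CRITICAL: last successful archive was ${minutes} min ago"
        return 2
    fi
    if [ "$failed" -gt 0 ] || [ "$inactive" -gt 0 ]; then
        echo "WARNING: failed_archives=${failed} inactive_slots=${inactive}"
        return 1
    fi
    echo "OK"
}

# Assumed invocation, splitting one unaligned psql row into three arguments:
# IFS='|' read -r failed minutes inactive < <(psql -U postgres -At -c \
#     "SELECT failed_archives, minutes_since_archive, inactive_slots FROM wal_health")
# wal_health_status "$failed" "$minutes" "$inactive"
```

Because failed_count is cumulative, the warning persists after a single past failure until the archiver statistics are reset; comparing the counter against its previous reading would be a natural refinement.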

## WAL Verification Tools

```bash
# Verify a base backup, including the WAL needed to restore it (PostgreSQL 13+)
# Note: the -n (--no-parse-wal) option would skip the WAL check
sudo -u postgres pg_verifybackup /backup/base_backup/

# Inspect the records in a WAL segment
sudo -u postgres pg_waldump /var/lib/postgresql/16/main/pg_wal/000000010000000000000001 2>&1 | head -20

# List segment files in order to spot missing names in the sequence
sudo ls /var/lib/postgresql/16/main/pg_wal/ | grep "^00" | sort
```
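Spotting a missing segment by eye is error-prone in a directory with hundreds of files. The check can be automated by converting each file name to a running segment index and comparing neighbours. This is a sketch under stated assumptions: the function name `find_wal_gaps` is hypothetical, all files are assumed to belong to one timeline, and the default 16 MB segment size is assumed (256 segments per value of the middle 8 hex digits).

```bash
#!/usr/bin/env bash
# find_wal_gaps: read WAL file names on stdin and report missing segments.
# Assumes a single timeline and the default 16 MB segment size.
find_wal_gaps() {
    # Keep only 24-hex-digit segment names (drops .history, .backup, .partial, archive_status).
    grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort | {
        local name log seg idx prev=-1
        while read -r name; do
            log=$(( 16#${name:8:8} ))    # middle 8 digits: high half of the WAL position
            seg=$(( 16#${name:16:8} ))   # last 8 digits: segment number, 00 to FF
            idx=$(( log * 256 + seg ))   # running index across the whole sequence
            if [ "$prev" -ge 0 ] && [ "$idx" -gt $(( prev + 1 )) ]; then
                echo "gap: $(( idx - prev - 1 )) segment(s) missing before $name"
            fi
            prev=$idx
        done
    }
}

# Assumed invocation against an archive directory:
# ls /backup/wal_archive/ | find_wal_gaps
```

The check is most meaningful on the archive directory, where every segment is expected to be present. In pg_wal itself, old segments are routinely recycled or removed, so the listing there normally starts somewhere in the middle of the sequence.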

When WAL errors occur, the key is understanding whether it's a space issue, configuration problem, or corruption. Most archive failures are fixable by correcting permissions or the archive command itself. Corruption requires recovery from backup, making regular backups essential for any production PostgreSQL deployment.