## Introduction

PostgreSQL WAL archiving ships completed WAL segments to an archive location for point-in-time recovery and standby replication. When the `archive_command` fails, for example because of network issues, PostgreSQL keeps the unarchived segments on disk. They accumulate until the `pg_wal` directory fills up, at which point the database stops accepting writes.

## Symptoms

- `pg_stat_archiver` shows `failed_count` increasing and `last_failed_time` advancing while `last_archived_wal` stays the same
- PostgreSQL logs show `archive command failed with exit code 1`
- `pg_wal` directory growing beyond its normal size
- Standby servers falling behind or disconnecting
- Disk space alerts on the WAL partition
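
The backlog can also be confirmed directly on disk: PostgreSQL keeps a `.ready` marker file in `pg_wal/archive_status` for every completed segment that is still waiting to be archived. The sketch below counts those markers. `count_pending_wal` is a hypothetical helper name, not a PostgreSQL tool, and the directory you pass in is assumed to be the `pg_wal` directory of your own installation.

```bash
# Count WAL segments that are still waiting to be archived.
# PostgreSQL creates <segment>.ready in pg_wal/archive_status for each
# completed segment and renames it to <segment>.done once archiving succeeds.
# count_pending_wal is a hypothetical helper name used for illustration.
count_pending_wal() {
    local status_dir="$1/archive_status"
    if [ ! -d "$status_dir" ]; then
        echo "no archive_status directory under $1" >&2
        return 1
    fi
    # Print the number of .ready markers (tr strips padding that some wc builds add)
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}
```

For example, `count_pending_wal /var/lib/postgresql/16/main/pg_wal` prints the number of segments queued for archiving. A value that keeps rising between checks suggests the archiver is not making progress.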

## Common Causes

- Network outage to the archive destination (S3, NFS, backup server)
- Archive server disk full, rejecting new WAL segments
- DNS resolution failure for the archive hostname
- Archive tool (wal-g, barman, pgBackRest) process crashed
- Firewall rules blocking the archive destination port
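
Some of these causes can be ruled in or out quickly from the database host. As one illustration, the sketch below covers the "disk full" case for an archive that is mounted locally, such as an NFS mount. `fs_usage_pct` and `archive_fs_ok` are hypothetical helper names, and the default 90% limit is an arbitrary assumption rather than a recommended value.

```bash
# Print the used-space percentage (integer, without the % sign) of the
# filesystem that holds the given path.
fs_usage_pct() {
    # df -P prints one line per filesystem in the POSIX layout;
    # column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the limit (default 90), 1 otherwise.
archive_fs_ok() {
    local path="$1" limit="${2:-90}" used
    used=$(fs_usage_pct "$path")
    # An empty result means df could not report on the path
    [ -n "$used" ] || return 1
    if [ "$used" -ge "$limit" ]; then
        echo "archive filesystem for $path is ${used}% full (limit ${limit}%)" >&2
        return 1
    fi
}
```

Calling `archive_fs_ok /mnt/archive` before digging into network diagnostics helps separate a full archive volume from a connectivity problem.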

## Step-by-Step Fix

1. **Check archiver status:**
   ```sql
   SELECT * FROM pg_stat_archiver;
   -- Key columns:
   --   archived_count, last_archived_wal, last_archived_time
   --   failed_count, last_failed_wal, last_failed_time
   ```

2. **Test the archive command manually:**
   ```bash
   # Check the configured archive_command
   psql -c "SHOW archive_command;"

   # Test it manually
   touch /tmp/test_wal_segment
   su - postgres -c "wal-g wal-push /tmp/test_wal_segment"
   # Or for NFS:
   su - postgres -c "cp /tmp/test_wal_segment /mnt/archive/test_wal_segment"
   ```

3. **Fix the archive destination and resume archiving:**
   ```bash
   # For S3-based archiving with wal-g
   # Check connectivity
   aws s3 ls s3://my-wal-archive/

   # Check wal-g configuration
   cat /etc/wal-g/config.json

   # Test wal-g
   wal-g wal-push /var/lib/postgresql/16/main/pg_wal/000000010000000100000001
   ```

4. **Force archiving of accumulated WAL segments:**
   ```sql
   -- Force a WAL segment switch
   SELECT pg_switch_wal();

   -- Check if archiving resumes
   SELECT * FROM pg_stat_archiver;
   ```

5. **Implement retry logic in `archive_command`:**
   ```ini
   # Instead of a simple cp, use a retry wrapper
   archive_command = '/usr/local/bin/archive_wal.sh %f %p'
   ```
   ```bash
   #!/bin/bash
   # /usr/local/bin/archive_wal.sh
   # Called by PostgreSQL as: archive_wal.sh %f %p
   WAL_FILE=$1      # %f: segment file name
   WAL_PATH=$2      # %p: path to the segment file
   MAX_RETRIES=5
   RETRY_DELAY=10   # seconds; doubled after each failed attempt

   for i in $(seq 1 "$MAX_RETRIES"); do
       wal-g wal-push "$WAL_PATH" && exit 0
       echo "Archive attempt $i failed for $WAL_FILE, retrying in ${RETRY_DELAY}s..." >&2
       sleep "$RETRY_DELAY"
       RETRY_DELAY=$((RETRY_DELAY * 2))
   done

   # A non-zero exit tells PostgreSQL the segment was not archived
   exit 1
   ```
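
For the NFS-style `cp` variant used in the manual test, PostgreSQL expects the archive command to exit with zero only when the segment has been stored, and its documentation advises against overwriting a file that already exists in the archive. A minimal sketch of a copy helper under those constraints follows. `archive_copy` is a hypothetical name, and copying to a temporary name before renaming is an added design choice, not something the steps above prescribe.

```bash
# archive_copy <wal_path> <archive_dir> <wal_file>
# Copy one WAL segment into a directory-based archive.
# Hypothetical helper for illustration; adapt paths and error handling.
archive_copy() {
    local src="$1" dest_dir="$2" name="$3"
    local dest="$dest_dir/$name"
    local tmp="$dest_dir/$name.tmp.$$"

    # Refuse to overwrite a segment that is already in the archive
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing archive file $dest" >&2
        return 1
    fi

    # Copy under a temporary name so a partial copy never appears
    # under the final segment name
    if ! cp "$src" "$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    # Rename into place; the function returns mv's exit status
    mv "$tmp" "$dest"
}
```

Inside a wrapper script such as the retry example, the `wal-g wal-push "$WAL_PATH"` call could be swapped for `archive_copy "$WAL_PATH" /mnt/archive "$WAL_FILE"` when the archive is a mounted directory.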

## Prevention

- Monitor `pg_stat_archiver.failed_count` and alert on any increase
- Test archive connectivity as part of health checks
- Use a local WAL archive buffer before shipping to remote storage
- Set up alerting on `pg_wal` directory size
- Use resilient archive tools like wal-g with built-in retry logic
- Test PITR recovery quarterly to verify archive integrity
- Configure `archive_timeout = 60` to ensure frequent archival even during low-write periods
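
Alerting on any increase in `failed_count` amounts to comparing the current counter with the value seen at the previous check. A rough sketch of that comparison is shown here, assuming a hypothetical helper named `failed_count_increased` and a state file at a location of your choosing. Fetching the current value (for example with `psql -Atc "SELECT failed_count FROM pg_stat_archiver"`) is left to the caller.

```bash
# failed_count_increased <state_file> <current_count>
# Returns 0 and prints a message if the count rose since the last call,
# returns 1 otherwise. The current count is always saved for the next check.
# On the first run (no state file yet) the previous value is taken as 0.
# Hypothetical helper for illustration only.
failed_count_increased() {
    local state_file="$1" current="$2" previous=0
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    echo "$current" > "$state_file"
    if [ "$current" -gt "$previous" ]; then
        echo "archiver failed_count rose from $previous to $current"
        return 0
    fi
    return 1
}
```

A monitoring job could run this every minute and forward the printed message to whatever alerting channel is in use. If the statistics are reset, the counter drops, no alert fires, and the lower value becomes the new baseline.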