# PostgreSQL Stuck in Recovery Mode - Diagnosis and Fix
PostgreSQL enters recovery mode after a crash, during replication initialization, or when recovering from a backup. Normally this process completes automatically, but sometimes it gets stuck or fails. Understanding what's happening under the hood helps you resolve these situations correctly.
## Understanding Recovery Mode
A PostgreSQL instance in recovery mode is replaying WAL (Write-Ahead Log) records to bring the database to a consistent state. A hot standby accepts read-only queries and rejects writes while it replays; during crash recovery the server accepts no connections until replay finishes.
```bash
# Check if PostgreSQL is in recovery mode
psql -U postgres -c "SELECT pg_is_in_recovery();"

# Result: t = in recovery, f = normal operation
```
## Diagnosing Recovery Issues
First, identify what type of recovery is happening and where it's stuck:
```bash
# Check recovery status and progress
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
psql -U postgres -c "SELECT * FROM pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Check for recovery-related log messages (-E is needed for the | alternation)
sudo grep -iE "recovery|restore|archive" /var/log/postgresql/postgresql-*-main.log | tail -50
```
The `pg_stat_wal_receiver` view shows streaming replication status. If `pg_last_wal_replay_lsn()` is lagging behind `pg_last_wal_receive_lsn()`, the replay process is the bottleneck.
## Archive Recovery Stuck
When recovering from a base backup using WAL archives, the most common issue is missing or inaccessible archive files.
```bash
# Check recovery settings (ALTER SYSTEM writes to postgresql.auto.conf in the data directory)
sudo grep restore /var/lib/postgresql/16/main/postgresql.auto.conf
sudo grep -E "restore_command|recovery_target" /etc/postgresql/16/main/postgresql.conf
```
### Missing WAL Files
If you see errors like `ERROR: could not open file "pg_wal/000000010000000000000003"`:
```bash
# Check available WAL files in archive
ls -la /path/to/wal_archive/

# Check which WAL file the primary is currently writing
# (run this on the primary; these functions raise an error during recovery)
psql -U postgres -c "SELECT pg_walfile_name(pg_current_wal_lsn());"

# Verify restore_command is working by running it manually.
# Given: restore_command = 'cp /path/to/wal_archive/%f %p'
# substitute %f (file name) and %p (destination path) by hand:
cp /path/to/wal_archive/000000010000000000000003 /tmp/test_wal
```
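The file name in that error encodes a position in the WAL stream: 24 hex digits made of an 8-digit timeline ID followed by two 8-digit fields that locate the segment. As a rough sketch, the function below derives the segment file name that contains a given LSN, which helps when a log message gives you an LSN and you need to know which archive file to look for. It assumes the default 16 MB `wal_segment_size`; `walfile_for_lsn` is an illustrative helper name, not a PostgreSQL tool. On a primary, `pg_walfile_name()` does the same job.

```bash
#!/bin/sh
# walfile_for_lsn TIMELINE LSN
# Prints the name of the WAL segment file that contains LSN.
# TIMELINE is a decimal integer, LSN has the form HI/LO (hexadecimal).
# Assumption: default 16 MB segments. Hypothetical helper name.
walfile_for_lsn() {
    tli=$1
    hi=${2%%/*}   # part before the slash: high 32 bits of the position
    lo=${2##*/}   # part after the slash: low 32 bits of the position
    # With 16 MB (16777216-byte) segments, the low word divided by the
    # segment size gives the segment number within this high word.
    seg=$(( 0x$lo / 16777216 ))
    printf '%08X%08X%08X\n' "$tli" "$(( 0x$hi ))" "$seg"
}

# LSN 0/3000288 on timeline 1
walfile_for_lsn 1 0/3000288
# prints 000000010000000000000003
```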
Resolution for missing WAL files:
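```bash
# Option 1: If the standby is waiting for a segment the primary has not archived yet,
# force the primary to close and archive its current segment.
# On primary server:
psql -U postgres -c "SELECT pg_switch_wal();"

# Option 2: Re-initialize from a fresh base backup
# On standby:
sudo systemctl stop postgresql
# Run the glob inside a root shell; your own user usually cannot list the data directory
sudo sh -c 'rm -rf /var/lib/postgresql/16/main/*'
# Perform pg_basebackup again, as the postgres user so file ownership is correct
sudo -u postgres pg_basebackup -h primary_host -U replication_user -D /var/lib/postgresql/16/main -Fp -Xs -P -R
sudo systemctl start postgresql
```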
### Incorrect Recovery Target
If recovery pauses at a specific point, check the `recovery_target_*` settings:
```bash
# Check current recovery target settings
psql -U postgres -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'recovery_target%';"

# Possible targets:
# recovery_target_time = '2024-01-15 14:30:00'
# recovery_target_xid = '12345'
# recovery_target_lsn = '0/3000288'
# recovery_target_name = 'my_restore_point'   # a name created with pg_create_restore_point()
```
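In PostgreSQL 12 and later, specifying more than one of `recovery_target`, `recovery_target_time`, `recovery_target_xid`, `recovery_target_lsn`, and `recovery_target_name` raises an error, which is easy to trigger by appending a new target to `postgresql.auto.conf` without removing the old one. The following is a rough pre-flight check, not a full config parser: it assumes settings are written one per line as `name = value`, does not follow `include` directives, and counts a target set to an empty string as set. The helper name `count_recovery_targets` is made up for this example.

```bash
#!/bin/sh
# count_recovery_targets FILE...
# Prints how many distinct recovery target parameters are set (uncommented)
# across the given files. recovery_target_action, recovery_target_timeline
# and recovery_target_inclusive are deliberately not counted.
# Hypothetical helper; a simplification of PostgreSQL's real config parsing.
count_recovery_targets() {
    sed -n -E 's/^[[:space:]]*(recovery_target(_time|_xid|_lsn|_name)?)[[:space:]]*=.*/\1/p' "$@" \
        | sort -u | wc -l | tr -d ' '
}

# Demo with a throwaway file containing two conflicting targets
conf=$(mktemp)
cat > "$conf" <<'EOF'
restore_command = 'cp /path/to/wal_archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_lsn = '0/3000288'
#recovery_target_xid = '12345'
recovery_target_action = 'promote'
EOF
count_recovery_targets "$conf"   # prints 2, so one target must be removed
rm -f "$conf"
```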
To check whether replay is paused, resume it, or promote:
```bash
# Check if recovery paused
psql -U postgres -c "SELECT pg_get_wal_replay_pause_state();"

# Resume paused recovery
psql -U postgres -c "SELECT pg_wal_replay_resume();"

# Cancel recovery and promote to primary
psql -U postgres -c "SELECT pg_promote();"
```
## Standby Server Won't Catch Up
A standby stuck replaying WAL while the primary keeps generating it is a common problem.
```bash
# On primary, check replication lag for each standby
psql -U postgres -c "
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;
"

# On standby, check how far behind
psql -U postgres -c "
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes;
"
```
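A single lag figure does not tell you whether replay is slow or stopped. Sampling `pg_last_wal_replay_lsn()` twice and dividing the distance by the interval gives a replay rate: zero means replay is truly stuck, while a positive rate below the primary's WAL generation rate means the standby keeps falling further behind. The sketch below shows that arithmetic in plain shell. An LSN is two hexadecimal numbers, the high and low 32 bits of a byte position. The helper names `lsn_to_bytes` and `replay_rate` are illustrative, and the two hard-coded LSNs stand in for real readings (for example from `psql -At -c "SELECT pg_last_wal_replay_lsn();"`).

```bash
#!/bin/sh
# lsn_to_bytes LSN -> absolute byte position for an LSN of the form HI/LO (hex).
# Hypothetical helper name.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# replay_rate FIRST_LSN SECOND_LSN SECONDS -> bytes replayed per second (integer).
# Hypothetical helper name.
replay_rate() {
    echo $(( ( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ) / $3 ))
}

# Two made-up readings taken 10 seconds apart
replay_rate 0/3000000 0/8000000 10
# prints 8388608 (8 MiB per second)
```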
### Performance Tuning for Faster Replay
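```bash
# Edit postgresql.conf on standby
sudo nano /etc/postgresql/16/main/postgresql.conf

# Settings to review (add or adjust these lines in postgresql.conf):
#   hot_standby = on
#   hot_standby_feedback = on            # fewer query conflicts, at the cost of bloat on the primary
#   max_standby_streaming_delay = 30s    # how long replay waits for conflicting standby queries
#   wal_receiver_status_interval = 1s
# These two mainly matter on the primary (the sending side), so the standby
# can stay connected and still find the WAL it needs:
#   max_wal_senders = 10
#   wal_keep_size = 2GB

# Restart standby
sudo systemctl restart postgresql
```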
## Recovery After Improper Shutdown
If PostgreSQL crashed and won't recover:
```bash
# Check for crash recovery messages
sudo grep "database system was interrupted" /var/log/postgresql/postgresql-*-main.log

# Crash recovery runs automatically at startup. To replay archived WAL as well,
# request archive recovery by creating recovery.signal
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal

# Ensure restore_command is set if using archive
echo "restore_command = 'cp /path/to/archive/%f %p'" | sudo tee -a /var/lib/postgresql/16/main/postgresql.auto.conf

sudo systemctl start postgresql
```
## Promoting Standby to Primary
Sometimes you need to break out of recovery mode intentionally:
```bash
# Method 1: Clean promotion
psql -U postgres -c "SELECT pg_promote(true, 60);"
# Parameters: wait (true/false), wait_seconds

# Method 2: Using pg_ctl
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl promote -D /var/lib/postgresql/16/main

# Method 3: Trigger file (PostgreSQL 15 and older only; the setting was removed in 16)
# In postgresql.conf (named trigger_file before version 12):
#   promote_trigger_file = '/tmp/postgresql.trigger.5432'
# Then create the file:
sudo touch /tmp/postgresql.trigger.5432
```
## Dealing with Corrupted WAL
If WAL files themselves are corrupted, recovery may fail with `PANIC: invalid WAL record`:
```bash
# Stop PostgreSQL immediately
sudo systemctl stop postgresql

# Option 1: Restore from valid backup
# This is the safest approach

# Option 2: Try pg_resetwal (DATA LOSS POSSIBLE)
# Only use this if no backup exists and you accept potential data loss
sudo -u postgres /usr/lib/postgresql/16/bin/pg_resetwal -f /var/lib/postgresql/16/main

# This resets WAL and allows PostgreSQL to start, but:
# - Some data may be lost
# - Database consistency is not guaranteed
# - Replication will need to be reinitialized

# After pg_resetwal, start PostgreSQL
sudo systemctl start postgresql

# Immediately take a full dump, then initdb a fresh cluster and restore into it
pg_dumpall -U postgres > /tmp/full_backup_$(date +%Y%m%d).sql
```
## Recovery from Time-Based Point
For point-in-time recovery (PITR):
```bash
# Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal

# Configure recovery parameters
cat << 'EOF' | sudo tee -a /var/lib/postgresql/16/main/postgresql.auto.conf
restore_command = 'cp /path/to/wal_archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00+00'
recovery_target_action = 'promote'
EOF

# Start PostgreSQL
sudo systemctl start postgresql

# Monitor recovery progress
tail -f /var/log/postgresql/postgresql-16-main.log | grep -i recovery
```
## Verifying Recovery Completion
```bash
# Check that recovery is complete
psql -U postgres -c "SELECT pg_is_in_recovery();"
# Should return 'f'

# Sanity-check database sizes and per-database connection counts
psql -U postgres -c "
SELECT d.datname,
       pg_database_size(d.datname) AS size,
       (SELECT count(*) FROM pg_stat_activity a WHERE a.datname = d.datname) AS connections
FROM pg_database d
WHERE d.datistemplate = false;
"

# Confirm the statistics views are readable (a smoke test, not a full integrity check)
psql -U postgres -c "SET statement_timeout = 0; SELECT * FROM pg_stat_all_tables;"

# Check for any replication slots that might cause issues
psql -U postgres -c "SELECT * FROM pg_replication_slots;"
```
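Replaying a large backlog can take a while, so scripts often need to wait for recovery to end rather than check once. This is a minimal polling sketch, assuming `psql` can connect as `postgres` without a password prompt. `wait_for_recovery_end` is an illustrative name, and a call such as `wait_for_recovery_end 300 && echo "recovery finished"` would wait roughly five minutes before giving up.

```bash
#!/bin/sh
# wait_for_recovery_end TIMEOUT_SECONDS
# Polls pg_is_in_recovery() about once per second. Returns 0 as soon as the
# server reports 'f'; returns 1 if it is still in recovery, or unreachable,
# when the timeout runs out. Hypothetical helper name.
wait_for_recovery_end() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        # -A (unaligned) and -t (tuples only) reduce the output to a bare t or f
        state=$(psql -U postgres -At -c "SELECT pg_is_in_recovery();" 2>/dev/null)
        if [ "$state" = "f" ]; then
            return 0
        fi
        sleep 1
        remaining=$(( remaining - 1 ))
    done
    return 1
}
```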
## Preventing Recovery Issues
1. Monitor WAL archive: Ensure `archive_command` is working and archives are accessible
2. Regular backups: Frequent base backups reduce recovery time
3. Test recovery: Periodically test your recovery procedure
4. Monitor replication lag: Alert before standby falls too far behind
5. Keep sufficient WAL: Configure `wal_keep_size` appropriately
6. Monitor disk space: Recovery needs room for temporary files

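
Monitoring the WAL archive can be partly automated by checking it for holes. The sketch below walks the segment files in a directory and prints every segment missing between the oldest and newest file on each timeline. It assumes the default 16 MB segment size (256 segments per 8-digit log number) and uncompressed files that use the plain 24-hex-digit naming; `.history` and `.backup` files are skipped. `find_wal_gaps` is an illustrative name, not a PostgreSQL tool.

```bash
#!/bin/sh
# find_wal_gaps DIR
# Prints the name of each WAL segment missing from DIR between the first and
# last segment present on a timeline. Assumes 16 MB segments.
# The body runs in a subshell so the locale change stays local.
find_wal_gaps() (
    export LC_ALL=C        # byte-order glob sorting and strict [A-F] ranges
    prev_tli=""
    prev_no=0
    for path in "$1"/*; do
        name=${path##*/}
        # Keep only plain 24-hex-digit segment names
        case $name in
            *[!0-9A-F]*) continue ;;
        esac
        [ "${#name}" -eq 24 ] || continue
        tli=$(printf '%s' "$name" | cut -c1-8)
        log=$(printf '%s' "$name" | cut -c9-16)
        seg=$(printf '%s' "$name" | cut -c17-24)
        no=$(( 0x$log * 256 + 0x$seg ))   # running segment number
        if [ "$tli" = "$prev_tli" ]; then
            missing=$(( prev_no + 1 ))
            while [ "$missing" -lt "$no" ]; do
                printf '%s%08X%08X\n' "$tli" "$(( missing / 256 ))" "$(( missing % 256 ))"
                missing=$(( missing + 1 ))
            done
        fi
        prev_tli=$tli
        prev_no=$no
    done
)

# Demo: an archive that has segments 1 and 3 but not 2
demo_dir=$(mktemp -d)
touch "$demo_dir/000000010000000000000001" "$demo_dir/000000010000000000000003"
find_wal_gaps "$demo_dir"
# prints 000000010000000000000002
rm -rf "$demo_dir"
```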
When recovery goes wrong, resist the urge to force promotion immediately. Understanding why recovery is stuck helps you choose the right solution and avoid data loss.