# PostgreSQL Physical Replication Error - Troubleshooting Guide
Physical (streaming) replication copies the entire PostgreSQL cluster to standby servers. When replication breaks, diagnosing the issue requires checking both primary and standby servers. Let's cover the most common replication errors.
## Understanding Physical Replication Components
Physical replication requires:
- Primary server: Accepts writes, streams WAL to standbys
- Standby server: Receives WAL, applies changes
- Replication user: Special user with REPLICATION privilege
- Replication slot: Ensures WAL retention for standby
- pg_hba.conf: Allows replication connections
```bash
# Check replication prerequisites on primary
psql -U postgres -c "SHOW wal_level;"             # Should be 'replica' or 'logical'
psql -U postgres -c "SHOW max_wal_senders;"       # Should be >= number of standbys
psql -U postgres -c "SHOW max_replication_slots;" # Should be >= number of standbys
```
## Primary Server Issues
### Error: "no pg_hba.conf entry for replication connection"
The standby cannot connect because pg_hba.conf doesn't allow replication:
```bash
# Edit pg_hba.conf
sudo nano /etc/postgresql/16/main/pg_hba.conf
```

Add a replication entry:

```
# TYPE  DATABASE     USER        ADDRESS          METHOD
host    replication  replicator  192.168.1.0/24   scram-sha-256

# Or for a specific standby IP:
host    replication  replicator  192.168.1.10/32  scram-sha-256
```

Then reload PostgreSQL:

```bash
sudo systemctl reload postgresql
```

### Error: "replication user not allowed"
Create or configure the replication user:
```sql
-- Create replication user
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'secure_password';

-- Verify replication privilege
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'replicator';
```
### Error: "too many replication slots"
```sql
-- Check replication slots
SELECT slot_name, slot_type, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

-- Drop unused slots
SELECT pg_drop_replication_slot('unused_slot_name');

-- Or increase max_replication_slots
-- (takes effect only after a server restart, not a reload)
ALTER SYSTEM SET max_replication_slots = 10;
-- then: sudo systemctl restart postgresql
```
### Error: "no free WAL sender"
```sql
-- Check current WAL senders
SELECT count(*) FROM pg_stat_replication;

-- Increase max_wal_senders
-- (takes effect only after a server restart, not a reload)
ALTER SYSTEM SET max_wal_senders = 10;
-- then: sudo systemctl restart postgresql

-- Verify after the restart
SHOW max_wal_senders;
```
## Standby Server Issues
### Error: "could not connect to primary server"
Test connectivity from standby:
```bash
# Test basic connectivity
ping primary-server-host

# Test PostgreSQL connection
psql -h primary-server-host -U replicator -d postgres -c "SELECT 1;"

# Check firewall
sudo iptables -L -n | grep 5432
# Or
sudo firewall-cmd --list-ports

# Test a replication connection specifically
# (replication=1 opens a physical replication connection,
#  which is what the walreceiver uses)
psql "host=primary-server-host user=replicator replication=1" -c "IDENTIFY_SYSTEM;"
```
### Error: "standby server not following primary"
The standby might still be configured to follow an old primary, for example after a failover:
```bash
# Check recovery configuration
grep primary /var/lib/postgresql/16/main/postgresql.auto.conf

# For PostgreSQL 12+, check primary_conninfo
psql -U postgres -c "SHOW primary_conninfo;"
```
### Error: "WAL segment already removed"
The standby fell too far behind, and the WAL it needs has already been removed on the primary:
```bash
# Check WAL status on primary
psql -U postgres -c "
  SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
         pg_walfile_name(restart_lsn) AS standby_needs
  FROM pg_replication_slots;"

# Check what WAL files are available
ls -la /var/lib/postgresql/16/main/pg_wal/
```
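When comparing a slot's `restart_lsn` against the filenames in `pg_wal/`, it helps to know how the two relate. A WAL segment filename is the timeline ID, the high 32 bits of the LSN, and the segment number within that 4GB range, each as 8 hex digits. A minimal sketch of that mapping, assuming the default 16MB segment size (adjust `seg` if your cluster was initialized with `--wal-segsize`):

```shell
# Map an LSN (e.g. a slot's restart_lsn) to the WAL segment file
# that contains it, without needing pg_walfile_name() on a live server.
wal_file_for_lsn() {
  tli=$1                  # timeline ID, e.g. 1
  lsn=$2                  # LSN, e.g. 0/3000060
  seg=$((0x1000000))      # default 16MB segment size (assumption)
  hi=$((0x${lsn%/*}))     # high 32 bits of the LSN
  lo=$((0x${lsn#*/}))     # low 32 bits of the LSN
  printf '%08X%08X%08X\n' "$tli" "$hi" $((lo / seg))
}

wal_file_for_lsn 1 0/3000060   # -> 000000010000000000000003
```

If that filename is missing from the `pg_wal/` listing above, the standby cannot catch up by streaming and must be restored from archive or reinitialized.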
Resolution options:
```bash
# Option 1: Restore missing WAL from archive (if you have one)
# Configure restore_command in the standby's postgresql.conf:
#   restore_command = 'cp /archive/wal/%f %p'

# Option 2: Take a new base backup and reinitialize the standby
# On standby:
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*

# Take base backup from primary
sudo -u postgres pg_basebackup \
    -h primary-server \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -Fp -Xs -P -R

sudo systemctl start postgresql
```
## Replication Slot and WAL Issues
### Slot Inactive But WAL Retained
Orphaned slots cause WAL to accumulate:
```sql
-- Find inactive slots
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots
WHERE NOT active;

-- Check if the standby is still running
SELECT * FROM pg_stat_replication WHERE application_name = 'standby_name';

-- If the standby is gone, drop the slot
SELECT pg_drop_replication_slot('orphaned_slot');
```
### WAL Disk Space Exhausted
```bash
# Check WAL directory size
du -sh /var/lib/postgresql/16/main/pg_wal/

# Find old WAL files
ls -lt /var/lib/postgresql/16/main/pg_wal/ | tail -20
```
```sql
-- Force a checkpoint to trigger WAL cleanup
-- (note: WAL still needed by a replication slot is NOT removed
--  until the slot advances or is dropped)
CHECKPOINT;

-- Check if archiving is working
SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
FROM pg_stat_archiver;
```
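To catch WAL growth before the disk fills, the `du` check above can be wired into a simple cron job. A minimal sketch; the parsing helper is pure so it can be tested offline, and the path and the 4GB budget are assumptions to adjust for your install:

```shell
# Warn when pg_wal exceeds a size budget (both values in kilobytes,
# matching `du -sk` output).
warn_if_over() {
  kbytes=$1
  budget_kb=$2
  if [ "$kbytes" -gt "$budget_kb" ]; then
    echo "WARN: pg_wal at ${kbytes}kB exceeds budget ${budget_kb}kB"
  else
    echo "OK: pg_wal at ${kbytes}kB"
  fi
}

# Example wiring against a live directory (hypothetical path/budget):
# warn_if_over "$(du -sk /var/lib/postgresql/16/main/pg_wal/ | cut -f1)" $((4 * 1024 * 1024))
warn_if_over 5242880 4194304   # -> WARN: pg_wal at 5242880kB exceeds budget 4194304kB
```

If this fires repeatedly, look for an orphaned slot or failing archiver (see above) rather than deleting WAL files by hand.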
## Replication Lag Diagnosis
### Check Lag from Primary
```sql
-- View replication lag
SELECT
    client_addr,
    usename,
    application_name,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_size,
    EXTRACT(EPOCH FROM (now() - reply_time)) AS seconds_since_reply
FROM pg_stat_replication;
```

### Check Lag from Standby
```sql
-- On standby
SELECT pg_is_in_recovery(),
       pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS apply_lag_bytes;

-- If the standby can't connect to the primary, this still shows how far behind replay is
```
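`pg_wal_lsn_diff()` is just 64-bit arithmetic on the LSN pair: an LSN `X/Y` is two hex numbers, and the absolute byte position is `X * 2^32 + Y`. A small sketch of the same calculation in shell, useful for comparing LSNs copied out of logs or monitoring output when no server connection is handy:

```shell
# Convert an LSN of the form "X/Y" (hex) to an absolute byte position.
lsn_to_bytes() {
  echo $(( (0x${1%/*} << 32) + 0x${1#*/} ))
}

# Bytes of WAL between $2 (older LSN) and $1 (newer LSN),
# mirroring pg_wal_lsn_diff(newer, older).
lsn_diff() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

lsn_diff 0/3000060 0/3000000   # -> 96
lsn_diff 1/0 0/FF000000        # -> 16777216
```

The second example shows why string comparison of LSNs is wrong: `1/0` is numerically larger than `0/FF000000` even though it sorts earlier as text.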
### Reduce Replication Lag
On primary:
Edit `/etc/postgresql/16/main/postgresql.conf` (e.g. with `sudo nano`):

```ini
# WAL sender tuning
max_wal_senders = 10
wal_keep_size = 2GB
wal_sender_timeout = 60s
wal_level = replica

# Checkpoint tuning (affects WAL volume)
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
```
On standby:
Edit `/etc/postgresql/16/main/postgresql.conf` (e.g. with `sudo nano`):

```ini
# WAL receiver tuning
wal_receiver_status_interval = 1s
hot_standby = on
hot_standby_feedback = on
max_standby_streaming_delay = 30s
```

## Synchronous Replication Issues
### Error: "synchronous standby not found"
```sql
-- Check synchronous_standby_names on primary
SHOW synchronous_standby_names;

-- Check if the standby's application_name matches
SELECT application_name FROM pg_stat_replication;
```
```bash
# Fix: Ensure the standby has a matching application_name
# In the standby's postgresql.conf:
#   primary_conninfo = 'host=primary user=replicator password=secret application_name=standby1'

# In the primary's postgresql.conf:
#   synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# Or for any standby:
#   synchronous_standby_names = 'ANY 1 (standby1, standby2)'
```
### Commits Hang Waiting for Sync Standby
```sql
-- Check waiting transactions
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';

-- If the standby is down, transactions will hang.
-- Temporarily disable synchronous replication:
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```
## Rebuilding Replication
When replication is broken beyond repair:
### Step 1: Stop and Clean Standby
```bash
# On standby
sudo systemctl stop postgresql

# Backup standby config
sudo cp /var/lib/postgresql/16/main/postgresql.auto.conf /tmp/

# Clean data directory
sudo rm -rf /var/lib/postgresql/16/main/*

# Or remove just the data, keeping config files
sudo find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 \
    ! -name 'postgresql.conf' ! -name 'postgresql.auto.conf' \
    ! -name 'pg_hba.conf' ! -name 'pg_ident.conf' ! -name 'postmaster.opts' \
    -exec rm -rf {} +
```
### Step 2: Create Replication Slot on Primary
```sql
-- On primary
SELECT pg_create_physical_replication_slot('standby1_slot');
```

### Step 3: Take Base Backup
```bash
# On standby, using pg_basebackup
sudo -u postgres pg_basebackup \
    -h primary-server \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -Fp -Xs -P -R \
    -S standby1_slot

# Options explained:
# -Fp: Plain format
# -Xs: Stream WAL during the backup
# -P:  Show progress
# -R:  Create standby.signal and configure primary_conninfo
# -S:  Use the named replication slot
```
### Step 4: Configure Standby
```bash
# The -R flag creates standby.signal and sets primary_conninfo
# Verify:
cat /var/lib/postgresql/16/main/postgresql.auto.conf

# Should contain:
# primary_conninfo = 'user=replicator password=... host=primary port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable'
# primary_slot_name = 'standby1_slot'
```
### Step 5: Start Standby

```bash
sudo systemctl start postgresql
```

### Step 6: Verify Replication
```sql
-- On primary
SELECT * FROM pg_stat_replication;

-- On standby
SELECT pg_is_in_recovery();  -- Should return 't'
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
```
## Monitoring Physical Replication
```sql
-- Create monitoring view on primary
CREATE OR REPLACE VIEW replication_monitor AS
SELECT client_addr,
       usename,
       application_name,
       state,
       sync_state,
       sent_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_size,
       reply_time,
       EXTRACT(EPOCH FROM now() - reply_time) AS seconds_since_reply
FROM pg_stat_replication;

-- Alert if lag > threshold (here 100MB)
SELECT * FROM replication_monitor
WHERE lag_bytes > 100000000;
```
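The threshold query above can be driven from cron. A minimal sketch of an alerting wrapper; the parsing function works on any pipe-separated `name|lag_bytes` input, while the commented psql invocation is an assumption (local passwordless access as the `postgres` user):

```shell
# Read "application_name|lag_bytes" lines on stdin and print an
# alert for every standby whose lag exceeds the threshold (bytes).
alert_on_lag() {
  threshold=$1
  while IFS='|' read -r name lag; do
    [ -z "$name" ] && continue
    if [ "$lag" -gt "$threshold" ]; then
      echo "ALERT: $name lagging by $lag bytes"
    fi
  done
}

# Example wiring (hypothetical connection settings):
# psql -U postgres -Atc \
#   "SELECT application_name, pg_wal_lsn_diff(sent_lsn, replay_lsn) FROM pg_stat_replication;" \
#   | alert_on_lag 100000000

# Offline demo with sample data:
printf 'standby1|5242880\nstandby2|209715200\n' | alert_on_lag 100000000
```

Piping the output to your paging or chat tooling is left to the environment; the point is that `-Atc` gives unaligned, tuple-only output that is trivial to parse.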
## Common Error Messages and Solutions
| Error | Cause | Solution |
|---|---|---|
| "no pg_hba.conf entry" | Missing replication rule | Add host replication entry |
| "too many WAL senders" | max_wal_senders exceeded | Increase max_wal_senders |
| "WAL segment removed" | Standby too far behind | Reinitialize from base backup |
| "slot not found" | Slot dropped on primary | Recreate the slot, then restart the standby |
| "connection refused" | Network/firewall issue | Check connectivity, pg_hba.conf |
| "authentication failed" | Wrong credentials | Verify user/password |
Physical replication is reliable when properly configured. The most common issues are network connectivity, authentication, and WAL retention. Monitor lag closely and reinitialize standbys promptly when they fall too far behind.