# PostgreSQL Physical Replication Error - Troubleshooting Guide
Physical (streaming) replication copies the entire PostgreSQL cluster to standby servers. When replication breaks, diagnosing the issue requires checking both primary and standby servers. Let's cover the most common replication errors.
## Understanding Physical Replication Components
Physical replication requires:
- Primary server: Accepts writes, streams WAL to standbys
- Standby server: Receives WAL, applies changes
- Replication user: Special user with REPLICATION privilege
- Replication slot: Ensures WAL retention for standby
- pg_hba.conf: Allows replication connections
```bash
# Check replication prerequisites on primary
psql -U postgres -c "SHOW wal_level;"             # Should be 'replica' or 'logical'
psql -U postgres -c "SHOW max_wal_senders;"       # Should be >= number of standbys
psql -U postgres -c "SHOW max_replication_slots;" # Should be >= number of standbys
```
## Primary Server Issues
### Error: "no pg_hba.conf entry for replication connection"
The standby cannot connect because pg_hba.conf doesn't allow replication:
```bash
# Edit pg_hba.conf
sudo nano /etc/postgresql/16/main/pg_hba.conf
```

Add a replication entry:

```
# TYPE  DATABASE     USER        ADDRESS          METHOD
host    replication  replicator  192.168.1.0/24   scram-sha-256

# Or for a specific standby IP:
host    replication  replicator  192.168.1.10/32  scram-sha-256
```

Then reload PostgreSQL:

```bash
sudo systemctl reload postgresql
```

### Error: "replication user not allowed"
Create or configure the replication user:
```sql
-- Create replication user
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'secure_password';

-- Verify replication privilege
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'replicator';
```
### Error: "too many replication slots"
```sql
-- Check replication slots
SELECT slot_name, slot_type, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

-- Drop unused slots
SELECT pg_drop_replication_slot('unused_slot_name');

-- Or increase max_replication_slots
-- (takes effect only after a server restart, not a reload)
ALTER SYSTEM SET max_replication_slots = 10;
-- then: sudo systemctl restart postgresql
```
### Error: "no free WAL sender"
```sql
-- Check current WAL senders
SELECT count(*) FROM pg_stat_replication;

-- Increase max_wal_senders
-- (takes effect only after a server restart, not a reload)
ALTER SYSTEM SET max_wal_senders = 10;
-- then: sudo systemctl restart postgresql

-- Verify after the restart
SHOW max_wal_senders;
```
## Standby Server Issues
### Error: "could not connect to primary server"
Test connectivity from standby:
```bash
# Test basic connectivity
ping primary-server-host

# Test PostgreSQL connection
psql -h primary-server-host -U replicator -d postgres -c "SELECT 1;"

# Check firewall
sudo iptables -L -n | grep 5432
# Or
sudo firewall-cmd --list-ports

# Test a replication connection specifically
# (replication=1 opens a physical replication connection,
#  which is what the walreceiver uses)
psql "host=primary-server-host user=replicator replication=1" -c "IDENTIFY_SYSTEM;"
```
### Error: "standby server not following primary"
The standby might still be configured to follow an old primary, for example after a failover:
```bash
# Check recovery configuration
grep primary /var/lib/postgresql/16/main/postgresql.auto.conf

# For PostgreSQL 12+, check primary_conninfo
psql -U postgres -c "SHOW primary_conninfo;"
```
### Error: "WAL segment already removed"
The standby fell too far behind, and the WAL it needs has already been removed on the primary:
```bash
# Check WAL status on primary
psql -U postgres -c "
  SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
         pg_walfile_name(restart_lsn) AS standby_needs
  FROM pg_replication_slots;"

# Check what WAL files are available
ls -la /var/lib/postgresql/16/main/pg_wal/
```
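When comparing a slot's `restart_lsn` against the filenames in `pg_wal/`, it helps to know how the two relate. A WAL segment filename is the timeline ID, the high 32 bits of the LSN, and the segment number within that 4GB range, each as 8 hex digits. A minimal sketch of that mapping, assuming the default 16MB segment size (adjust `seg` if your cluster was initialized with `--wal-segsize`):

```shell
# Map an LSN (e.g. a slot's restart_lsn) to the WAL segment file
# that contains it, without needing pg_walfile_name() on a live server.
wal_file_for_lsn() {
  tli=$1                  # timeline ID, e.g. 1
  lsn=$2                  # LSN, e.g. 0/3000060
  seg=$((0x1000000))      # default 16MB segment size (assumption)
  hi=$((0x${lsn%/*}))     # high 32 bits of the LSN
  lo=$((0x${lsn#*/}))     # low 32 bits of the LSN
  printf '%08X%08X%08X\n' "$tli" "$hi" $((lo / seg))
}

wal_file_for_lsn 1 0/3000060   # -> 000000010000000000000003
```

If that filename is missing from the `pg_wal/` listing above, the standby cannot catch up by streaming and must be restored from archive or reinitialized.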
Resolution options:
```bash
# Option 1: Restore missing WAL from archive (if you have one)
# Configure restore_command in the standby's postgresql.conf:
#   restore_command = 'cp /archive/wal/%f %p'

# Option 2: Take a new base backup and reinitialize the standby
# On standby:
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*

# Take base backup from primary
sudo -u postgres pg_basebackup \
    -h primary-server \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -Fp -Xs -P -R

sudo systemctl start postgresql
```
## Replication Slot and WAL Issues
### Slot Inactive But WAL Retained
Orphaned slots cause WAL to accumulate:
```sql
-- Find inactive slots
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots
WHERE NOT active;

-- Check if the standby is still running
SELECT * FROM pg_stat_replication WHERE application_name = 'standby_name';

-- If the standby is gone, drop the slot
SELECT pg_drop_replication_slot('orphaned_slot');
```
### WAL Disk Space Exhausted
```bash
# Check WAL directory size
du -sh /var/lib/postgresql/16/main/pg_wal/

# Find old WAL files
ls -lt /var/lib/postgresql/16/main/pg_wal/ | tail -20
```
```sql
-- Force a checkpoint to trigger WAL cleanup
-- (note: WAL still needed by a replication slot is NOT removed
--  until the slot advances or is dropped)
CHECKPOINT;

-- Check if archiving is working
SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
FROM pg_stat_archiver;
```
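To catch WAL growth before the disk fills, the `du` check above can be wired into a simple cron job. A minimal sketch; the parsing helper is pure so it can be tested offline, and the path and the 4GB budget are assumptions to adjust for your install:

```shell
# Warn when pg_wal exceeds a size budget (both values in kilobytes,
# matching `du -sk` output).
warn_if_over() {
  kbytes=$1
  budget_kb=$2
  if [ "$kbytes" -gt "$budget_kb" ]; then
    echo "WARN: pg_wal at ${kbytes}kB exceeds budget ${budget_kb}kB"
  else
    echo "OK: pg_wal at ${kbytes}kB"
  fi
}

# Example wiring against a live directory (hypothetical path/budget):
# warn_if_over "$(du -sk /var/lib/postgresql/16/main/pg_wal/ | cut -f1)" $((4 * 1024 * 1024))
warn_if_over 5242880 4194304   # -> WARN: pg_wal at 5242880kB exceeds budget 4194304kB
```

If this fires repeatedly, look for an orphaned slot or failing archiver (see above) rather than deleting WAL files by hand.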
## Replication Lag Diagnosis
### Check Lag from Primary
```sql
-- View replication lag
SELECT
    client_addr,
    usename,
    application_name,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_size,
    EXTRACT(EPOCH FROM (now() - reply_time)) AS seconds_since_reply
FROM pg_stat_replication;
```

### Check Lag from Standby
```sql
-- On standby
SELECT pg_is_in_recovery(),
       pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS apply_lag_bytes;

-- If the standby can't connect to the primary, this still shows how far behind replay is
```
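`pg_wal_lsn_diff()` is just 64-bit arithmetic on the LSN pair: an LSN `X/Y` is two hex numbers, and the absolute byte position is `X * 2^32 + Y`. A small sketch of the same calculation in shell, useful for comparing LSNs copied out of logs or monitoring output when no server connection is handy:

```shell
# Convert an LSN of the form "X/Y" (hex) to an absolute byte position.
lsn_to_bytes() {
  echo $(( (0x${1%/*} << 32) + 0x${1#*/} ))
}

# Bytes of WAL between $2 (older LSN) and $1 (newer LSN),
# mirroring pg_wal_lsn_diff(newer, older).
lsn_diff() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

lsn_diff 0/3000060 0/3000000   # -> 96
lsn_diff 1/0 0/FF000000        # -> 16777216
```

The second example shows why string comparison of LSNs is wrong: `1/0` is numerically larger than `0/FF000000` even though it sorts earlier as text.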
### Reduce Replication Lag
On primary:
Edit `/etc/postgresql/16/main/postgresql.conf` (e.g. with `sudo nano`):

```ini
# WAL sender tuning
max_wal_senders = 10
wal_keep_size = 2GB
wal_sender_timeout = 60s
wal_level = replica

# Checkpoint tuning (affects WAL volume)
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
```
On standby:
Edit `/etc/postgresql/16/main/postgresql.conf` (e.g. with `sudo nano`):

```ini
# WAL receiver tuning
wal_receiver_status_interval = 1s
hot_standby = on
hot_standby_feedback = on
max_standby_streaming_delay = 30s
```

## Synchronous Replication Issues
### Error: "synchronous standby not found"
```sql
-- Check synchronous_standby_names on primary
SHOW synchronous_standby_names;

-- Check if the standby's application_name matches
SELECT application_name FROM pg_stat_replication;
```
```bash
# Fix: Ensure the standby has a matching application_name
# In the standby's postgresql.conf:
#   primary_conninfo = 'host=primary user=replicator password=secret application_name=standby1'

# In the primary's postgresql.conf:
#   synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# Or for any standby:
#   synchronous_standby_names = 'ANY 1 (standby1, standby2)'
```
### Commits Hang Waiting for Sync Standby
```sql
-- Check waiting transactions
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';

-- If the standby is down, transactions will hang.
-- Temporarily disable synchronous replication:
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```
## Rebuilding Replication
When replication is broken beyond repair:
### Step 1: Stop and Clean Standby
```bash
# On standby
sudo systemctl stop postgresql

# Backup standby config
sudo cp /var/lib/postgresql/16/main/postgresql.auto.conf /tmp/

# Clean data directory
sudo rm -rf /var/lib/postgresql/16/main/*

# Or remove just the data, keeping config files
sudo find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 \
    ! -name 'postgresql.conf' ! -name 'postgresql.auto.conf' \
    ! -name 'pg_hba.conf' ! -name 'pg_ident.conf' ! -name 'postmaster.opts' \
    -exec rm -rf {} +
```
### Step 2: Create Replication Slot on Primary
```sql
-- On primary
SELECT pg_create_physical_replication_slot('standby1_slot');
```

### Step 3: Take Base Backup
```bash
# On standby, using pg_basebackup
sudo -u postgres pg_basebackup \
    -h primary-server \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -Fp -Xs -P -R \
    -S standby1_slot

# Options explained:
# -Fp: Plain format
# -Xs: Stream WAL during the backup
# -P:  Show progress
# -R:  Create standby.signal and configure primary_conninfo
# -S:  Use the named replication slot
```
### Step 4: Configure Standby
```bash
# The -R flag creates standby.signal and sets primary_conninfo
# Verify:
cat /var/lib/postgresql/16/main/postgresql.auto.conf

# Should contain:
# primary_conninfo = 'user=replicator password=... host=primary port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable'
# primary_slot_name = 'standby1_slot'
```
### Step 5: Start Standby

```bash
sudo systemctl start postgresql
```

### Step 6: Verify Replication
```sql
-- On primary
SELECT * FROM pg_stat_replication;

-- On standby
SELECT pg_is_in_recovery();  -- Should return 't'
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
```
## Monitoring Physical Replication
```sql
-- Create monitoring view on primary
CREATE OR REPLACE VIEW replication_monitor AS
SELECT client_addr,
       usename,
       application_name,
       state,
       sync_state,
       sent_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_size,
       reply_time,
       EXTRACT(EPOCH FROM now() - reply_time) AS seconds_since_reply
FROM pg_stat_replication;

-- Alert if lag > threshold (here 100MB)
SELECT * FROM replication_monitor
WHERE lag_bytes > 100000000;
```
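The threshold query above can be driven from cron. A minimal sketch of an alerting wrapper; the parsing function works on any pipe-separated `name|lag_bytes` input, while the commented psql invocation is an assumption (local passwordless access as the `postgres` user):

```shell
# Read "application_name|lag_bytes" lines on stdin and print an
# alert for every standby whose lag exceeds the threshold (bytes).
alert_on_lag() {
  threshold=$1
  while IFS='|' read -r name lag; do
    [ -z "$name" ] && continue
    if [ "$lag" -gt "$threshold" ]; then
      echo "ALERT: $name lagging by $lag bytes"
    fi
  done
}

# Example wiring (hypothetical connection settings):
# psql -U postgres -Atc \
#   "SELECT application_name, pg_wal_lsn_diff(sent_lsn, replay_lsn) FROM pg_stat_replication;" \
#   | alert_on_lag 100000000

# Offline demo with sample data:
printf 'standby1|5242880\nstandby2|209715200\n' | alert_on_lag 100000000
```

Piping the output to your paging or chat tooling is left to the environment; the point is that `-Atc` gives unaligned, tuple-only output that is trivial to parse.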
## Common Error Messages and Solutions
| Error | Cause | Solution |
|---|---|---|
| "no pg_hba.conf entry" | Missing replication rule | Add host replication entry |
| "too many WAL senders" | max_wal_senders exceeded | Increase max_wal_senders |
| "WAL segment removed" | Standby too far behind | Reinitialize from base backup |
| "slot not found" | Slot dropped on primary | Recreate the slot, then restart the standby |
| "connection refused" | Network/firewall issue | Check connectivity, pg_hba.conf |
| "authentication failed" | Wrong credentials | Verify user/password |
Physical replication is reliable when properly configured. The most common issues are network connectivity, authentication, and WAL retention. Monitor lag closely and reinitialize standbys promptly when they fall too far behind.