# PostgreSQL Checkpoint Error - Diagnosis and Resolution
Checkpoints are PostgreSQL's mechanism for ensuring that modified data is written from shared memory to disk. When checkpoint operations fail or run long, you'll see errors in the logs, crash-recovery times grow, and in the worst case (such as a full WAL disk) the server can stop accepting writes. Understanding checkpoint behavior is crucial for database reliability.
## Understanding Checkpoints
A checkpoint writes all dirty (modified) buffers from shared memory to disk. PostgreSQL triggers checkpoints:
- When `checkpoint_timeout` elapses (default 5 minutes)
- When `max_wal_size` is reached
- On an explicit `CHECKPOINT` command
- During database shutdown
- Before starting a backup
```bash
# Check current checkpoint settings
psql -U postgres -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'checkpoint%' OR name IN ('wal_level', 'max_wal_size', 'min_wal_size');
"
```

## Identifying Checkpoint Errors
Checkpoint issues manifest in several ways:
```bash
# Check PostgreSQL logs for checkpoint messages
sudo grep -i "checkpoint" /var/log/postgresql/postgresql-*-main.log | tail -50

# Common messages to look for:
# - "checkpoint request failed"
# - "checkpoint starting"
# - "checkpoint complete"
# - "checkpoints are occurring too frequently"
# - "WAL writer sleep between cleanups"
```
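Rather than eyeballing the log, the durations that `checkpoint complete` lines report can be parsed directly. A minimal sketch in Python; the regex and the sample line assume the `write=... s, sync=... s, total=... s` format used by recent PostgreSQL versions, which varies slightly across releases:

```python
import re

# "checkpoint complete" lines report their own durations, e.g.:
#   checkpoint complete: wrote 8722 buffers (53.2%); ...
#   write=269.2 s, sync=0.4 s, total=269.8 s; ...
DURATIONS = re.compile(
    r"checkpoint complete:.*?"
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, total=(?P<total>[\d.]+) s"
)

def checkpoint_durations(log_text: str) -> list[dict]:
    """Extract write/sync/total seconds from 'checkpoint complete' log lines."""
    return [
        {k: float(v) for k, v in m.groupdict().items()}
        for m in DURATIONS.finditer(log_text)
    ]

# Illustrative log line, not from a real server
sample = (
    "LOG:  checkpoint complete: wrote 8722 buffers (53.2%); "
    "0 WAL file(s) added, 0 removed, 3 recycled; "
    "write=269.2 s, sync=0.4 s, total=269.8 s; "
    "sync files=58, longest=0.2 s, average=0.1 s\n"
)
print(checkpoint_durations(sample))
# → [{'write': 269.2, 'sync': 0.4, 'total': 269.8}]
```

Feeding the whole log through this gives a time series of checkpoint durations you can graph or alert on.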
### Checkpoint Statistics
```sql
-- View checkpoint statistics
SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoints_timed::float / NULLIF(checkpoints_timed + checkpoints_req, 0) * 100 AS timed_pct,
    checkpoint_write_time,
    checkpoint_sync_time,
    pg_size_pretty(buffers_checkpoint * 8192) AS checkpoint_write_size,
    pg_size_pretty(buffers_clean * 8192) AS bgwriter_write_size
FROM pg_stat_bgwriter;
```

A high `checkpoints_req` count relative to `checkpoints_timed` indicates checkpoints are being forced by WAL volume rather than by timeout.
## Checkpoint Timeout Error
If checkpoints take longer than expected, you might see warnings or timeouts:
```bash
# Check if checkpoints are completing
sudo grep -E "checkpoint starting|checkpoint complete" /var/log/postgresql/postgresql-*-main.log | tail -20

# Look for long-running checkpoints: "checkpoint complete" lines report
# their own write, sync, and total durations
sudo grep "checkpoint complete" /var/log/postgresql/postgresql-*-main.log | tail -20
```
### Tuning Checkpoint Duration
```bash
# Edit postgresql.conf
sudo nano /etc/postgresql/16/main/postgresql.conf

# Adjust checkpoint settings
checkpoint_timeout = 15min              # Increase to spread checkpoints
max_wal_size = 4GB                      # Allow more WAL before a forced checkpoint
min_wal_size = 1GB                      # Minimum WAL to retain
checkpoint_completion_target = 0.9      # Spread checkpoint work over 90% of interval
checkpoint_flush_after = 256kB          # Flush after this much written

# checkpoint_completion_target is crucial:
# - 0.9 means checkpoint writes are spread over 90% of the checkpoint interval
# - Prevents I/O spikes
# - Allows smoother disk write patterns

# Reload configuration
sudo systemctl reload postgresql
```
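The interaction between `checkpoint_timeout` and `checkpoint_completion_target` is simple arithmetic: the write phase is paced to finish within roughly timeout × target. A quick sketch of that calculation:

```python
def checkpoint_write_window(timeout_s: float, completion_target: float) -> float:
    """Seconds over which checkpoint writes are spread."""
    return timeout_s * completion_target

# checkpoint_timeout = 15min, checkpoint_completion_target = 0.9:
# dirty-buffer writes are paced over ~13.5 minutes of each 15-minute interval
print(checkpoint_write_window(15 * 60, 0.9) / 60)   # → 13.5
```

The remaining ~10% of the interval leaves slack for the final fsync phase before the next checkpoint is due.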
## I/O Bottlenecks During Checkpoints
Heavy I/O during checkpoints can cause query timeouts and slow performance:
```bash
# Monitor checkpoint I/O impact
iostat -x 5 10

# While running a checkpoint manually in another session
psql -U postgres -c "CHECKPOINT;"
```
### Reducing Checkpoint I/O Impact
```bash
# Configure spread checkpoints and I/O throttling
checkpoint_completion_target = 0.9   # Spread over 90% of interval
checkpoint_flush_after = 256kB       # Force flush after writing this much
checkpoint_warning = 30s             # Warn if checkpoints occur within 30s

# Background writer settings to reduce checkpoint burden
bgwriter_delay = 200ms               # Run every 200ms
bgwriter_lru_maxpages = 100          # Max pages written per round
bgwriter_lru_multiplier = 2.0        # Aggressiveness
bgwriter_flush_after = 512kB         # Flush after this much written

# Apply changes
sudo systemctl reload postgresql
```
## Too Frequent Checkpoints
The log warning "checkpoints are occurring too frequently" indicates that `max_wal_size` is too small for your WAL generation rate:
```bash
# Check checkpoint frequency in logs
sudo grep "checkpoints are occurring too frequently" /var/log/postgresql/postgresql-*-main.log

# Check current WAL position and total WAL directory size
psql -U postgres -c "
SELECT pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
       pg_size_pretty(sum(size)) AS wal_dir_size
FROM pg_ls_waldir();
"

# Monitor over time
watch -n 5 'psql -U postgres -c "SELECT pg_walfile_name(pg_current_wal_lsn()), pg_current_wal_lsn();"'
```
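Once you know the WAL generation rate, a starting value for `max_wal_size` can be estimated. The sketch below uses the classic `(2 + checkpoint_completion_target)` headroom factor (the relationship PostgreSQL historically documented between checkpoint distance and total WAL kept on disk); treat the result as a rule-of-thumb starting point, not an exact formula:

```python
import math

def suggested_max_wal_size_mb(wal_mb_per_min: float,
                              checkpoint_timeout_min: float,
                              completion_target: float = 0.9) -> int:
    """Rule-of-thumb max_wal_size (MB) so that timed checkpoints, not WAL
    volume, drive checkpointing: WAL produced per checkpoint interval,
    scaled by the (2 + checkpoint_completion_target) headroom factor."""
    wal_per_cycle_mb = wal_mb_per_min * checkpoint_timeout_min
    return math.ceil(wal_per_cycle_mb * (2 + completion_target))

# e.g. ~50 MB/min of WAL with checkpoint_timeout = 15min
print(suggested_max_wal_size_mb(50, 15))   # → 2175  (~2.2 GB)
```

Round the result up to a convenient value (here, `max_wal_size = 4GB` would leave comfortable headroom) and re-check `checkpoints_req` after the change.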
### Increasing WAL Capacity
```bash
# Increase max_wal_size to reduce checkpoint frequency
sudo nano /etc/postgresql/16/main/postgresql.conf

# Before (example)
max_wal_size = 1GB

# After (example)
max_wal_size = 4GB

# Reload
sudo systemctl reload postgresql

# Monitor checkpoint behavior after the change
psql -U postgres -c "
SELECT checkpoints_timed,
       checkpoints_req,
       current_setting('max_wal_size') AS max_wal
FROM pg_stat_bgwriter;
"
```
## Checkpoint Sync Failures
When `checkpoint_sync_time` is high, the fsync phase at the end of each checkpoint is taking too long:
```bash
# Check sync times
psql -U postgres -c "
SELECT
    checkpoint_write_time / 1000.0 AS write_seconds,
    checkpoint_sync_time / 1000.0 AS sync_seconds,
    buffers_checkpoint,
    buffers_clean,
    buffers_backend
FROM pg_stat_bgwriter;
"
```

High sync times indicate storage performance issues:
```bash
# Sync the data directory to test disk sync performance (server stopped)
sudo -u postgres /usr/lib/postgresql/16/bin/initdb --sync-only -D /var/lib/postgresql/16/main

# Or use fio for storage benchmarking
sudo fio --name=sync-test --ioengine=sync --rw=write --size=1G --numjobs=1 --fsync=1 --filename=/var/lib/postgresql/test_sync
```
### Storage Optimization
```bash
# If using Linux, check the disk I/O scheduler
cat /sys/block/sda/queue/scheduler
# For SSDs/NVMe, 'none' or 'mq-deadline' is usually preferred
# For HDDs, 'mq-deadline' or 'bfq' is a better fit
# (older non-multiqueue kernels use 'noop', 'deadline', and 'cfq' instead)

# Change scheduler (example for sda)
echo 'mq-deadline' | sudo tee /sys/block/sda/queue/scheduler

# Check mount options for the data directory
# (write barriers should stay enabled for data safety; avoid 'nobarrier')
grep postgresql /proc/mounts

# Ensure proper mount options in /etc/fstab for the data directory
# /dev/sdb1 /var/lib/postgresql ext4 defaults,noatime,nodiratime,data=ordered 0 2
```
## Checkpoint During Backup
Base backups trigger a checkpoint at the start, which can cause I/O spikes:
```bash
# The pg_buffercache extension is needed to inspect dirty buffers
psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache;"

# Before taking a backup, check how much dirty data a checkpoint would write
psql -U postgres -c "
SELECT count(*) AS dirty_buffers,
       pg_size_pretty(count(*) * 8192) AS dirty_size
FROM pg_buffercache
WHERE isdirty;
"
```
### Using Non-Blocking Backups
```bash
# Use pg_basebackup with --checkpoint=spread (the default)
pg_basebackup -h localhost -U backup_user -D /backup/base -Fp -Xs -P -R --checkpoint=spread

# For large databases, consider incremental backups
# or use WAL archiving with PITR capability
```
## Manual Checkpoint Failures
When a manual `CHECKPOINT` command fails:
```sql
-- Error: "ERROR: could not fsync file: No space left on device"
-- Check database sizes
SELECT pg_size_pretty(pg_database_size(datname)) AS db_size, datname
FROM pg_database;

-- Check filesystem space
\! df -h /var/lib/postgresql
```
```bash
# Clear space or add storage
# Check for large temporary files
sudo find /var/lib/postgresql \( -name "*.tmp" -o -name "*.temp" \) -ls

# Check pg_stat_tmp directory
ls -la /var/lib/postgresql/16/main/pg_stat_tmp/

# Clear old logs if needed
sudo find /var/log/postgresql -name "*.log" -mtime +30 -delete
```
## Checkpoint and Standby Servers
Standby servers perform restartpoints, the recovery-time equivalent of checkpoints:
```bash
# On a standby, confirm it is in recovery
psql -U postgres -c "SELECT pg_is_in_recovery();"

# Check standby lag
psql -U postgres -c "
SELECT pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes;
"
```
## Monitoring Checkpoint Health
```sql
-- Create a comprehensive checkpoint monitoring view
CREATE OR REPLACE VIEW checkpoint_health AS
SELECT
    now() AS check_time,
    checkpoints_timed,
    checkpoints_req,
    round(checkpoints_timed::numeric / NULLIF(checkpoints_timed + checkpoints_req, 0) * 100, 2) AS timed_pct,
    round((checkpoint_write_time / 1000.0)::numeric, 2) AS write_sec,
    round((checkpoint_sync_time / 1000.0)::numeric, 2) AS sync_sec,
    pg_size_pretty(buffers_checkpoint * current_setting('block_size')::bigint) AS checkpoint_written,
    pg_size_pretty(buffers_clean * current_setting('block_size')::bigint) AS bgwriter_written,
    pg_size_pretty(buffers_backend * current_setting('block_size')::bigint) AS backend_written
FROM pg_stat_bgwriter;

-- Schedule regular monitoring
-- SELECT * FROM checkpoint_health;
```
## Best Practices
1. Set an appropriate timeout: 10-15 minutes for most workloads
2. Tune the completion target: 0.9 to spread I/O load
3. Size max_wal_size correctly: based on the WAL generation rate
4. Monitor the bgwriter: ensure the background writer is cleaning buffers
5. Storage matters: checkpoint performance is I/O bound
6. Test failover recovery: ensure checkpoints enable fast recovery
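Several of these practices can be folded into one automated check over `pg_stat_bgwriter` counters. A sketch; the function name and thresholds are illustrative starting points, not official limits:

```python
def checkpoint_warnings(checkpoints_timed: int, checkpoints_req: int,
                        write_time_ms: float, sync_time_ms: float) -> list[str]:
    """Flag the common checkpoint problems discussed above.
    Thresholds here are illustrative, not official limits."""
    warnings = []
    total = checkpoints_timed + checkpoints_req
    # Most checkpoints should be timed, not forced by WAL volume
    if total and checkpoints_timed / total < 0.9:
        warnings.append("many requested checkpoints: consider raising max_wal_size")
    # Sync time should be small relative to write time
    if write_time_ms and sync_time_ms > 0.2 * write_time_ms:
        warnings.append("high sync time relative to write time: check storage fsync latency")
    return warnings

# An unhealthy example: half the checkpoints forced, slow fsyncs
for w in checkpoint_warnings(50, 50, 100_000, 40_000):
    print(w)
```

Run this from whatever scrapes your `checkpoint_health` view and alert on a non-empty result.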
When checkpoint errors occur, the root cause is usually either storage performance or configuration mismatch with workload. Proper tuning prevents most checkpoint-related issues and ensures smooth database operation.