## Introduction
MySQL replication and Galera cluster errors occur when data synchronization between primary and replica servers fails, when Galera cluster nodes lose quorum, or when configuration mismatches prevent proper cluster operation. MySQL supports multiple replication formats (statement-based, row-based, mixed) and topologies (primary-replica, primary-primary, multi-source, group replication). Galera cluster provides synchronous multi-master replication with automatic membership control and failover. Common causes include GTID mode mismatches, binary log corruption, network partitions causing split-brain, replication SQL thread errors from constraint violations, disk space exhaustion on replicas, Galera gcache exhaustion, state snapshot transfer (SST) failures, and version incompatibility during upgrades. The fix requires understanding MySQL replication architecture, GTID mechanics, the Galera consensus protocol, and recovery procedures. This guide provides production-proven troubleshooting for MySQL replication and Galera issues across traditional replication and cluster deployments.
## Symptoms
- `Last_IO_Error: Got fatal error 1236 from master`
- `Last_SQL_Error: Relay log init failure`
- `Last_SQL_Error: Error executing row event`
- `Slave SQL thread is stopped because it encountered an error`
- `Seconds_Behind_Master` increasing continuously
- `WSREP: failed to open gcomm backend connection: 111`
- `WSREP: Refusing to be unsafely bootstrapped`
- `WSREP: Cluster status changed to non-primary`
- Galera node shows `Disconnected` state
- `Too many connections` during SST
- Binary log position mismatch between master and replica
- `Executed_Gtid_Set` on the replica not matching `Retrieved_Gtid_Set`
## Common Causes
- Network interruption between master and replica
- Binary log file deleted or corrupted on master
- Replica disk full, cannot write relay logs
- SQL error on replica (duplicate key, constraint violation)
- GTID mode enabled but not consistent across servers
- `server_uuid` conflict between servers
- Galera cluster lost quorum (majority of nodes down)
- SST method (mysqldump, xtrabackup) failed during join
- Firewall blocking replication ports (3306, 4444, 4567, 4568)
- `binlog_format` mismatch between servers
- Statement-based replication with non-deterministic statements
- Galera gcache size too small for large writesets
- Primary component timeout during network partition
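Several of the causes above (firewalls, network partitions) can be ruled out quickly from the shell. A minimal sketch, assuming bash with `/dev/tcp` support; `check_ports` is a helper name chosen here and the target host is a placeholder:

```shell
#!/usr/bin/env bash
# Sketch: verify the ports MySQL replication and Galera need are reachable
# from this node. Pass a peer node's address as the first argument.
check_ports() {
  local host="$1"
  local port
  for port in 3306 4444 4567 4568; do
    # /dev/tcp/<host>/<port> is a bash redirection target; timeout guards
    # against silently dropped (filtered) ports
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "port ${port}: open"
    else
      echo "port ${port}: closed or filtered"
    fi
  done
}

check_ports "${1:-127.0.0.1}"
```

Run it from each node against every other node; a "closed or filtered" line for 4567 or 4568 usually explains gcomm connection failures.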
## Step-by-Step Fix
### 1. Diagnose replication status
Check replication status:
```sql
-- Check slave status
SHOW SLAVE STATUS\G

-- Key fields to examine:
-- Slave_IO_Running: Yes/No (I/O thread status)
-- Slave_SQL_Running: Yes/No (SQL thread status)
-- Seconds_Behind_Master: NULL if stopped, number if lagging
-- Last_IO_Error: Error from I/O thread
-- Last_SQL_Error: Error from SQL thread
-- Relay_Log_Space: Current relay log size
-- Master_Log_File: Current binlog file on master
-- Read_Master_Log_Pos: Position read from master
-- Relay_Master_Log_File: Binlog file being executed
-- Exec_Master_Log_Pos: Position executed on replica
-- Retrieved_Gtid_Set: GTIDs fetched from master
-- Executed_Gtid_Set: GTIDs executed on replica

-- Check master status
SHOW MASTER STATUS\G

-- Output:
-- File: mysql-bin.000123
-- Position: 4567890
-- Binlog_Do_DB:
-- Binlog_Ignore_DB:
-- Executed_Gtid_Set: abc123:1-12345

-- Check all replication connections (multi-source; MariaDB syntax --
-- MySQL uses SHOW SLAVE STATUS FOR CHANNEL 'channel_name')
SHOW ALL SLAVES STATUS\G
```
Check Galera cluster status:
```sql
-- Check Galera status
SHOW STATUS LIKE 'wsrep%';

-- Key fields:
-- wsrep_ready: ON if node is ready
-- wsrep_connected: ON if connected to cluster
-- wsrep_local_state_comment: Synced/Donor/Joiner/Desync
-- wsrep_cluster_status: Primary/Non-Primary
-- wsrep_cluster_size: Number of nodes in cluster
-- wsrep_local_state_uuid: Node state UUID
-- wsrep_incoming_addresses: All node addresses

-- Check cluster configuration
SHOW VARIABLES LIKE 'wsrep%';

-- wsrep_cluster_address should list all nodes:
-- gcomm://192.168.1.10,192.168.1.11,192.168.1.12
```
Check binary logs:
```sql
-- List binary logs
SHOW BINARY LOGS;

-- Check current binlog position
SHOW MASTER STATUS;

-- View binlog events
SHOW BINLOG EVENTS IN 'mysql-bin.000123' LIMIT 20;

-- Check relay logs on replica
SHOW RELAYLOG EVENTS LIMIT 20;

-- Purge old binary logs (frees disk space)
PURGE BINARY LOGS BEFORE '2026-03-01 00:00:00';
-- Or:
PURGE BINARY LOGS TO 'mysql-bin.000100';
```
### 2. Fix replication thread errors
I/O thread errors:
```sql
-- Error 1236: Client requested master to start replication from impossible position
-- Cause: binary log deleted on master, replica requests an old position

-- Solution 1: Reset replica to current master position
STOP SLAVE;
RESET SLAVE ALL;  -- Caution: clears all replication config

-- Get current master position (on master):
SHOW MASTER STATUS;

-- On replica, point to the current position
CHANGE MASTER TO
  MASTER_HOST='master-host',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='repl_password',
  MASTER_LOG_FILE='mysql-bin.000123',
  MASTER_LOG_POS=4567890;

START SLAVE;

-- Solution 2: Using GTID (if GTID mode enabled)
STOP SLAVE;
RESET SLAVE ALL;

CHANGE MASTER TO
  MASTER_HOST='master-host',
  MASTER_USER='repl_user',
  MASTER_PASSWORD='repl_password',
  MASTER_AUTO_POSITION=1;  -- Uses GTID auto-positioning

START SLAVE;

-- Error: Got fatal error 1236: Could not find first log file name
-- Fix: reset relay logs
STOP SLAVE;
RESET SLAVE;
START SLAVE;
```
SQL thread errors:
```sql
-- Error 1062: Duplicate entry for key 'PRIMARY'
-- Error 1032: Can't find record in table
-- Error 1451: Cannot add or update a child row

-- Check the error
SHOW SLAVE STATUS\G
-- Last_SQL_Error shows the failing statement

-- Solution 1: Skip the failing transaction (causes data divergence!)
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

-- For GTID mode, skipping works differently:
STOP SLAVE;
SET gtid_next = 'aaa-bbb-ccc:12345';  -- GTID of the failing transaction
BEGIN; COMMIT;                        -- Empty transaction
SET gtid_next = 'AUTOMATIC';
START SLAVE;

-- Solution 2: Fix the data inconsistency
-- Manually apply the failing statement on the replica,
-- then restart replication

-- Solution 3: Rebuild the replica from the master
STOP SLAVE;

-- Take a backup from the master (shell):
--   mysqldump --master-data=2 --single-transaction -u root -p > backup.sql

-- Restore on the replica (shell):
--   mysql -u root -p < backup.sql

-- Reset and configure replication
RESET SLAVE ALL;
CHANGE MASTER TO ...;  -- Use the coordinates from --master-data
START SLAVE;
```
### 3. Fix GTID issues
GTID configuration:
```sql
-- Check GTID mode
SHOW VARIABLES LIKE 'gtid_mode';
-- Should be: ON

-- Check GTID consistency enforcement
SHOW VARIABLES LIKE 'enforce_gtid_consistency';
-- Should be: ON

-- Check GTID sets
SELECT @@GLOBAL.GTID_EXECUTED;  -- GTIDs executed on this server
SELECT @@GLOBAL.GTID_PURGED;    -- GTIDs purged from the binlog
SELECT @@GLOBAL.GTID_OWNED;     -- GTIDs currently owned by this server

-- On the replica, check GTID synchronization
SHOW SLAVE STATUS\G
-- Retrieved_Gtid_Set: GTIDs fetched from master
-- Executed_Gtid_Set: GTIDs executed on replica
-- These should eventually match
```
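For a quick feel of how far apart the two sets are, the interval end can be compared in the shell. This is a rough, string-level sketch that assumes a single-source, single-UUID set with one interval; the sample values stand in for live `SHOW SLAVE STATUS` output. The authoritative check is server-side: `SELECT GTID_SUBSET('<Retrieved_Gtid_Set>', @@GLOBAL.GTID_EXECUTED);` returns 1 when everything fetched has been applied.

```shell
#!/usr/bin/env bash
# Sketch: compare the interval ends of Retrieved_Gtid_Set vs
# Executed_Gtid_Set (single UUID, single interval assumed).
retrieved="3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5000"
executed="3E11FA47-71CA-11E1-9E33-C80AA9429562:1-4800"

# ${var##*-} strips everything through the last '-', leaving the
# end of the interval (the UUID's own dashes are consumed too)
behind=$(( ${retrieved##*-} - ${executed##*-} ))

if [ "$behind" -gt 0 ]; then
  echo "replica is ${behind} transactions behind"
else
  echo "replica has applied everything it fetched"
fi
```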
Fix GTID mismatches:
```sql
-- Problem: GTID_EXECUTED on the replica exceeds the master
-- This happens when the replica has executed extra transactions

-- Solution 1: Reset GTID state on the replica (only if safe)
STOP SLAVE;
RESET MASTER;  -- Clears GTID_EXECUTED on the replica

-- Reconfigure replication
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;

-- Solution 2: Inject missing GTIDs on the master
-- If the master is missing GTIDs that the replica needs,
-- inject empty transactions for them on the master:
SET gtid_next = 'missing-gtid-here';
BEGIN; COMMIT;
SET gtid_next = 'AUTOMATIC';

-- Solution 3: Full rebuild with GTID
STOP SLAVE;
RESET MASTER;

-- Take a backup from the master with GTID info (shell):
--   mysqldump --single-transaction --master-data=2 -u root -p > backup.sql
-- The backup contains SET @@GLOBAL.GTID_PURGED='...'

-- Restore on the replica (shell):
--   mysql -u root -p < backup.sql

-- Configure GTID-based replication
CHANGE MASTER TO
  MASTER_HOST='master',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_AUTO_POSITION=1;

START SLAVE;
```
GTID failover:
```sql
-- For clean failover, ensure GTID consistency

-- On the new master (promoted replica):
-- Verify all GTIDs are present
SELECT @@GLOBAL.GTID_EXECUTED;

-- Ensure it is ready to accept writes
STOP SLAVE;
RESET SLAVE ALL;
-- Caution: RESET MASTER here would clear GTID_EXECUTED and break
-- auto-positioning for replicas -- only use it if you intend to
-- rebuild the replicas anyway

-- On the other replicas, point to the new master
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-master',
  MASTER_AUTO_POSITION=1;
START SLAVE;
```
### 4. Fix Galera cluster issues
Cluster lost quorum:
```sql
-- WSREP: Cluster status changed to non-primary
-- The cluster lost quorum (majority of nodes)

-- Check the current state on each node
SHOW STATUS LIKE 'wsrep_cluster_status';
-- Primary = healthy, Non-Primary = lost quorum

-- Check cluster size
SHOW STATUS LIKE 'wsrep_cluster_size';
-- If 3 nodes are expected but this shows 1 or 2, nodes are partitioned

-- Check node state
SHOW STATUS LIKE 'wsrep_local_state_comment';
-- Synced = healthy
-- Donor  = donating SST
-- Joiner = joining cluster
-- Desync = temporarily desynchronized

-- Solution: Bootstrap the cluster from a surviving node
-- Only do this if the majority cannot be recovered!

-- On one node (choose the most up-to-date):
-- Stop MySQL (shell):
--   systemctl stop mysql

-- Start with the bootstrap option (shell):
--   mysqld_safe --wsrep-new-cluster &
--   # Or with systemd:
--   # systemctl start mysql@bootstrap.service

-- Verify the cluster formed
SHOW STATUS LIKE 'wsrep_cluster_size';
-- Should show 1

-- Start the other nodes normally (shell):
--   systemctl start mysql

-- Verify all nodes joined
SHOW STATUS LIKE 'wsrep_cluster_size';
-- Should show all nodes
```
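When the whole cluster is down, `wsrep_last_committed` is unavailable; Galera instead records each node's last committed seqno in `grastate.dat` (normally under `/var/lib/mysql/`). A minimal sketch of reading it to pick the bootstrap node -- `parse_grastate` and the sample file content are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: read seqno and safe_to_bootstrap from a grastate.dat file.
# Compare seqno across all nodes and bootstrap from the highest;
# seqno = -1 means the node shut down uncleanly.
parse_grastate() {
  awk '
    /^seqno:/             { seqno = $2 }
    /^safe_to_bootstrap:/ { stb = $2 }
    END { print "seqno=" seqno " safe_to_bootstrap=" stb }
  ' "$1"
}

# Sample grastate.dat content, for illustration:
cat > /tmp/grastate.sample <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6b2d9a3e-0000-0000-0000-000000000000
seqno:   158721
safe_to_bootstrap: 1
EOF

parse_grastate /tmp/grastate.sample
```

On recent Galera versions, starting with `--wsrep-new-cluster` is refused unless `safe_to_bootstrap: 1`; setting it by hand on the chosen node is how the "Refusing to be unsafely bootstrapped" error is cleared deliberately.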
Node fails to join cluster:
```sql
-- Error: failed to open gcomm backend connection: 111
-- Error: Refusing to be unsafely bootstrapped

-- Check wsrep_cluster_address
SHOW VARIABLES LIKE 'wsrep_cluster_address';
-- Should be: gcomm://node1,node2,node3

-- Verify all nodes are reachable (shell):
--   ping node1; ping node2; ping node3

-- Check that the firewall allows the Galera ports:
--   3306 - MySQL client connections
--   4444 - SST (State Snapshot Transfer)
--   4567 - Cluster replication traffic
--   4568 - IST (Incremental State Transfer)

-- Check the Galera logs (shell):
--   tail -f /var/log/mysql/error.log | grep WSREP

-- Find the most advanced node (for bootstrap):
-- On each node, check the last committed transaction
SHOW STATUS LIKE 'wsrep_last_committed';
-- Bootstrap from the node with the highest value

-- SST failures (xtrabackup)
-- Check the SST method
SHOW VARIABLES LIKE 'wsrep_sst_method';
-- Options: rsync, mysqldump, xtrabackup, xtrabackup-v2

-- If xtrabackup fails, check:
--   1. xtrabackup is installed on all nodes
--   2. The SST user has proper privileges
--   3. There is enough disk space for the data transfer

-- Configure the SST user in my.cnf:
--   [mysqld]
--   wsrep_sst_auth=sstuser:sstpassword

-- Grant the SST user privileges
GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT
  ON *.* TO 'sstuser'@'localhost' IDENTIFIED BY 'sstpassword';
FLUSH PRIVILEGES;
```
Split-brain scenario:
```sql
-- Split-brain: a network partition creates two clusters,
-- both of which think they are the primary

-- Prevention: Galera requires a majority for quorum
-- With 3 nodes, 2+ are needed for quorum
-- With 2 nodes, both are needed (no fault tolerance)

-- Detection: check the cluster from each node
-- Node 1: wsrep_cluster_size=2, status=Primary
-- Node 3: wsrep_cluster_size=1, status=Non-Primary

-- The Non-Primary partition cannot accept writes
-- (pc.bootstrap can force Primary status, but it is dangerous!)

-- Recovery: merge the partitions or choose one as primary
-- Stop MySQL on the Non-Primary nodes (shell):
--   systemctl stop mysql

-- Verify the Primary partition has quorum
SHOW STATUS LIKE 'wsrep%';

-- Restart the Non-Primary nodes to rejoin (shell):
--   systemctl start mysql

-- If both partitions think they are Primary (misconfiguration):
-- Choose one partition as authoritative (usually the larger one),
-- stop the other partition, and reconfigure it to join the
-- authoritative cluster
```
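The quorum arithmetic behind these rules is simple enough to make concrete. A minimal sketch (`quorum` is a helper name chosen here): a partition keeps Primary status only with a strict majority, which is why even node counts buy no extra fault tolerance.

```shell
#!/usr/bin/env bash
# Sketch: the strict-majority rule Galera uses for quorum.
quorum() {
  local total="$1"
  # Integer division: floor(total / 2) + 1
  echo $(( total / 2 + 1 ))
}

for n in 2 3 4 5; do
  echo "${n}-node cluster: need $(quorum "$n") nodes for quorum"
done
```

Note that a 4-node cluster needs 3 nodes, the same tolerance as 3 nodes needing 2 -- hence the advice to run odd-sized clusters.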
Galera performance issues:
```sql
-- Check flow control (backpressure)
SHOW STATUS LIKE 'wsrep_flow_control%';

-- wsrep_flow_control_paused > 0.5 indicates overload
-- wsrep_flow_control_sent > 0 means this node is throttling the cluster

-- Check certification failures
SHOW STATUS LIKE 'wsrep_local_cert_failures';
-- Increasing = conflicts with other nodes' transactions

-- Check the dependency queue
SHOW STATUS LIKE 'wsrep_cert_deps_distance';
-- High value = long dependency queue

-- Solutions:
-- 1. Increase the gcache size (for larger writesets), in my.cnf:
--      [mysqld]
--      wsrep_provider_options="gcache.size=2G"
-- 2. Reduce the write rate or split very large transactions
-- 3. Add more nodes to distribute load
-- 4. Check network latency between nodes (shell):
--      ping -c 10 node2
--    Should be < 1 ms for optimal performance
```
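The flow-control threshold above lends itself to a scripted check. A sketch; `check_flow_control` is a helper name chosen here, and the `printf` sample stands in for piping in `mysql -Ne "SHOW STATUS LIKE 'wsrep_flow_control_paused'"`:

```shell
#!/usr/bin/env bash
# Sketch: flag flow-control pressure from tab-separated wsrep status
# output (variable name, then value).
check_flow_control() {
  awk '$1 == "wsrep_flow_control_paused" {
    # $2 + 0 forces a numeric comparison
    if ($2 + 0 > 0.5) print "OVERLOADED (paused fraction " $2 ")"
    else              print "ok (paused fraction " $2 ")"
  }'
}

printf 'wsrep_flow_control_paused\t0.72\n' | check_flow_control
```

Wired into a timer unit, this gives early warning before flow control stalls the whole cluster's commit path.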
### 5. Fix replication lag
Diagnose lag:
```sql
-- Check lag
SHOW SLAVE STATUS\G
-- Seconds_Behind_Master: 3600 (1 hour behind!)

-- Check what is being executed
SHOW PROCESSLIST;
-- Look for long-running queries on the replica

-- Check the relay log contents
SHOW RELAYLOG EVENTS LIMIT 10;

-- Check whether the I/O thread is keeping up:
-- compare Master_Log_File vs Relay_Master_Log_File
-- If they match but lag keeps increasing, the SQL thread is slow
```
Reduce lag:
```sql
-- Solution 1: Parallel replication (MySQL 5.7+)
-- On the replica:
SET GLOBAL slave_parallel_workers = 4;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_preserve_commit_order = ON;

-- In my.cnf for persistence:
--   [mysqld]
--   slave_parallel_workers = 4
--   slave_parallel_type = LOGICAL_CLOCK
--   slave_preserve_commit_order = ON

-- Solution 2: Optimize replica configuration, in my.cnf:
--   [mysqld]
--   # Relaxed durability (acceptable on a rebuildable replica)
--   innodb_flush_log_at_trx_commit = 2
--   sync_binlog = 0
--   # Larger buffer pool
--   innodb_buffer_pool_size = 4G
--   # Disable slow query log on the replica (if not needed for debugging)
--   slow_query_log = 0

-- Solution 3: Replicate selectively
-- For analytics replicas, use --replicate-do-db or --replicate-ignore-table
-- to skip non-critical data

-- Solution 4: Use row-based replication
-- On the master:
SET GLOBAL binlog_format = 'ROW';
-- More reliable for replication, though it can produce larger binlogs

-- Solution 5: Spread reads across replicas
-- Do not overload a single replica with all read traffic
```
### 6. Monitor replication and cluster health
Monitoring queries:
```sql
-- Replication health check (performance_schema, MySQL 5.7+)
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;  -- I/O thread

SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_coordinator;  -- SQL thread

-- Galera cluster health
-- (information_schema.GLOBAL_STATUS on MariaDB;
--  performance_schema.global_status on MySQL 5.7+)
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME IN (
  'wsrep_ready',
  'wsrep_connected',
  'wsrep_local_state_comment',
  'wsrep_cluster_status',
  'wsrep_cluster_size'
);

-- Replication lag trend
-- Seconds_Behind_Master is not a status variable; read it from:
SHOW SLAVE STATUS\G
```
Prometheus metrics:
```yaml
# mysqld_exporter for Prometheus
# https://github.com/prometheus/mysqld_exporter

# Key metrics:
#   mysql_slave_status_seconds_behind_master
#   mysql_slave_status_slave_sql_running
#   mysql_slave_status_slave_io_running
#   mysql_global_status_wsrep_ready
#   mysql_global_status_wsrep_connected
#   mysql_global_status_wsrep_cluster_status
#   mysql_global_status_wsrep_cluster_size
#   mysql_global_status_wsrep_local_state_comment

# Prometheus alert rules
groups:
  - name: mysql_replication
    rules:
      - alert: MySQLReplicationDown
        expr: mysql_slave_status_slave_io_running == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication I/O thread stopped"

      - alert: MySQLReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag above 5 minutes"

      - alert: MySQLGaleraClusterSizeSmall
        expr: mysql_global_status_wsrep_cluster_size < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Galera cluster has fewer than expected nodes"

      - alert: MySQLGaleraNodeNotReady
        expr: mysql_global_status_wsrep_ready == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Galera node is not ready"
```
### 7. Backup and recovery
Backup with replication info:
```bash
# mysqldump with replication position
mysqldump --master-data=2 --single-transaction \
  --all-databases -u root -p > backup.sql

# --master-data=2 includes the CHANGE MASTER statement as a comment
# --single-transaction ensures a consistent backup (InnoDB)

# Check the backup file for the replication coordinates
grep -i "change master" backup.sql

# Percona XtraBackup for hot backups
xtrabackup --backup --target-dir=/backup/full \
  --user=backup --password=secret

# Prepare the backup for restore
xtrabackup --prepare --target-dir=/backup/full
```
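When scripting replica rebuilds, the coordinates in the `--master-data=2` comment can be parsed out rather than copied by hand. A sketch; `extract_coords` is a helper name chosen here, and the sample file mirrors the comment mysqldump writes near the top of the dump:

```shell
#!/usr/bin/env bash
# Sketch: pull the binlog file and position out of a --master-data=2 dump.
extract_coords() {
  sed -n "s/^-- CHANGE MASTER TO MASTER_LOG_FILE='\([^']*\)', MASTER_LOG_POS=\([0-9]*\);/file=\1 pos=\2/p" "$1"
}

# Sample of the commented coordinates line, for illustration:
cat > /tmp/backup-head.sample <<'EOF'
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4567890;
EOF

extract_coords /tmp/backup-head.sample
```

The extracted `file=`/`pos=` pair can then be substituted into a templated `CHANGE MASTER TO` statement by the rebuild script.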
Recovery procedure:
```sql
-- Full cluster recovery (Galera)

-- Step 1: Find the most advanced node
-- On each node:
SHOW STATUS LIKE 'wsrep_last_committed';
SHOW STATUS LIKE 'wsrep_local_state_comment';

-- Step 2: Bootstrap from the most advanced node (shell):
--   systemctl stop mysql
--   mysqld_safe --wsrep-new-cluster &

-- Step 3: Verify the bootstrap succeeded
SHOW STATUS LIKE 'wsrep_cluster_size';

-- Step 4: Start the other nodes (shell):
--   systemctl start mysql

-- Step 5: Verify all nodes are synced
SHOW STATUS LIKE 'wsrep_local_state_comment';
-- All should show 'Synced'

-- Replica recovery (traditional replication)

-- Step 1: Take a backup from the master (shell):
--   mysqldump --master-data=2 --single-transaction \
--     --all-databases -u root -p > replica-recovery.sql

-- Step 2: Restore on the replica (shell):
--   mysql -u root -p < replica-recovery.sql

-- Step 3: Configure replication (use the coordinates from the backup)
CHANGE MASTER TO
  MASTER_HOST='master',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000123',
  MASTER_LOG_POS=4567890;

-- Step 4: Start replication
START SLAVE;

-- Step 5: Verify
SHOW SLAVE STATUS\G
```
## Prevention
- Use GTID-based replication for easier failover
- Configure parallel replication on replicas
- Monitor replication lag with alerting
- Use odd number of Galera nodes (3, 5) for quorum
- Place Galera nodes in low-latency network (< 1ms)
- Size gcache appropriately for workload
- Regular backup with replication position documented
- Test failover procedures in staging
- Use connection pooling to handle replica failover
- Implement read/write splitting with proper error handling
## Related Errors
- **ER_NET_PACKET_TOO_LARGE**: Query exceeds max_allowed_packet
- **ER_LOCK_DEADLOCK**: Deadlock found when trying to get lock
- **ER_LOCK_WAIT_TIMEOUT**: Lock wait timeout exceeded
- **ER_SERVER_SHUTDOWN**: Server shutdown in progress
- **CR_SERVER_GONE_ERROR**: MySQL server has gone away