What's Actually Happening

The MySQL replication SQL (applier) thread stops when it hits an error while applying a replicated transaction. The I/O thread usually keeps receiving events, so the replica falls further behind the source and may not be usable for failover.

The Error You'll See

SQL thread stopped:

```sql
mysql> SHOW REPLICA STATUS\G

Replica_IO_Running: Yes
Replica_SQL_Running: No
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '123' for key 'PRIMARY'' on query.
```

Duplicate key error:

```sql
Last_SQL_Error: Could not execute Write_rows event on table mydb.users;
Duplicate entry '123' for key 'users.PRIMARY'
```

Constraint violation:

```sql
Last_SQL_Error: Cannot add or update a child row: a foreign key constraint fails
```

Data type error:

```sql
Last_SQL_Error: Incorrect datetime value: '2024-13-01' for column 'created_at'
```

Why This Happens

  1. Duplicate writes - Data already exists on replica
  2. Constraint violations - Foreign key or check constraint failure
  3. Data inconsistency - Replica data differs from source
  4. Configuration mismatch - Different sql_mode on source/replica
  5. Schema drift - Table structures differ
  6. Direct writes to replica - Non-replicated changes
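Direct writes to the replica (cause 6) can be ruled in or out quickly by checking whether the replica accepts client writes at all. The query below is a minimal sketch that reads two standard server variables; treating "both enabled" as the desired state is an assumption that the replica is never meant to be written to directly.

```sql
-- Run on the replica.
-- read_only = 1 blocks writes from ordinary accounts.
-- super_read_only = 1 also blocks accounts with SUPER / CONNECTION_ADMIN.
-- Replication applier threads are not restricted by either setting.
SELECT @@global.read_only       AS read_only,
       @@global.super_read_only AS super_read_only;
```

If both columns return 0, any application or operator connected to the replica can change data there, which is a common origin of later 1062 and 1452 errors.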

Step 1: Check Replica Status

```sql
-- Check replica status:
SHOW REPLICA STATUS\G

-- Key fields:
-- Replica_IO_Running: Should be Yes
-- Replica_SQL_Running: Should be Yes
-- Last_SQL_Errno: Error number (0 = no error)
-- Last_SQL_Error: Error message
-- Relay_Source_Log_File: Source binary log file of the last applied event
-- Exec_Source_Log_Pos: Position in that file up to which events have been applied

-- Check process list:
SHOW PROCESSLIST;

-- Check relay log events (use the file name shown in Relay_Log_File):
SHOW RELAYLOG EVENTS IN 'relay-bin.000001';

-- Check GTID status (if using GTID):
SHOW REPLICA STATUS\G
-- Retrieved_Gtid_Set
-- Executed_Gtid_Set
```
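The same status is also exposed as tables in performance_schema, which makes it easier to tell a receiver (I/O) problem from an applier (SQL) problem. The following is a sketch, assuming MySQL 8.0 where these tables and columns exist; the coordinator table is expected to be empty on a single-threaded replica.

```sql
-- Receiver (I/O) thread: state and last error, one row per channel.
SELECT CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

-- Applier coordinator (multi-threaded replicas): state and last error.
-- If this returns no rows, check replication_applier_status_by_worker instead.
SELECT CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_coordinator;
```

A receiver row with SERVICE_STATE = 'ON' combined with an applier error matches the situation described here: events still arrive but are no longer applied.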

Step 2: Identify the Error

```sql
-- Get error details:
SHOW REPLICA STATUS\G

-- Common error codes:
-- 1062: Duplicate entry
-- 1452: Foreign key constraint
-- 1264: Out of range value
-- 1366: Incorrect string value
-- 1292: Incorrect datetime value

-- Check last error in detail:
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- For GTID replication:
SHOW REPLICA STATUS\G
-- Note the Retrieved_Gtid_Set and Executed_Gtid_Set
-- The gap indicates the failing transaction
```
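The "gap" between the retrieved and executed GTID sets can be computed rather than read by eye. This is a sketch assuming GTID replication on MySQL 8.0 with a single source; APPLYING_TRANSACTION may be empty once the worker has stopped, in which case the GTID is usually quoted in LAST_ERROR_MESSAGE.

```sql
-- Workers that recorded an error, with the transactions they last handled.
SELECT CHANNEL_NAME, WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE,
       LAST_ERROR_TIMESTAMP, LAST_APPLIED_TRANSACTION, APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER <> 0;

-- GTIDs that were received but have not been executed yet.
-- With one source UUID, the lowest sequence number is the transaction
-- the applier is stuck on.
SELECT GTID_SUBTRACT(RECEIVED_TRANSACTION_SET, @@global.gtid_executed) AS pending_gtids
FROM performance_schema.replication_connection_status;
```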

Step 3: Handle Duplicate Key Errors

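```sql
-- Error 1062: Duplicate entry

-- Option 1: Skip the failing event group (may cause data inconsistency).
-- Intended for file-position replication; with GTIDs use the empty
-- transaction shown next. The variable is named sql_slave_skip_counter
-- before MySQL 8.0.26.
STOP REPLICA;
SET GLOBAL sql_replica_skip_counter = 1;
START REPLICA;

-- For GTID, skip the specific transaction by committing an empty one
-- under its GTID (replace the value with the failing GTID):
STOP REPLICA;
SET GTID_NEXT = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:123';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;

-- Option 2: Delete the duplicate on the replica so the replicated insert can apply:
DELETE FROM users WHERE id = 123;
START REPLICA;

-- Option 3: Skip these error codes automatically. This is a server option,
-- not a replication filter, and it requires a restart.
-- In my.cnf (the option is named slave_skip_errors before MySQL 8.0.26):
replica_skip_errors = 1062

-- Or for multiple errors:
replica_skip_errors = 1062,1452

-- Caution: Skipping errors can cause data drift!
```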
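Deleting the duplicate row is itself a direct write to the replica. On a replica that has binary logging and GTIDs enabled, that DELETE would normally be logged under the replica's own server UUID, leaving a transaction the source never saw. A minimal sketch of one way to avoid this, assuming the account is allowed to change sql_log_bin for its session:

```sql
-- Run on the replica, in the session that performs the manual fix.
SELECT * FROM users WHERE id = 123;   -- inspect the conflicting row first

SET SESSION sql_log_bin = 0;          -- do not write this session's changes to the binary log
DELETE FROM users WHERE id = 123;
SET SESSION sql_log_bin = 1;          -- restore normal logging for the session

START REPLICA;
```

Comparing the inspected row with the same row on the source before deleting it shows whether the two copies had already diverged, which is worth knowing before Step 5.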

Step 4: Handle Constraint Violations

```sql
-- Error 1452: Foreign key constraint

-- Check the constraint:
SELECT TABLE_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'orders'
  AND REFERENCED_TABLE_NAME IS NOT NULL;

-- Option 1: Add missing parent row:
INSERT INTO customers (id, name) VALUES (123, 'Missing Customer');
START REPLICA;

-- Option 2: Temporarily disable foreign key checks.
-- Note: replicated events carry the source's foreign_key_checks value,
-- so this may not change how the applier behaves; verify before relying on it.
SET GLOBAL foreign_key_checks = 0;
START REPLICA;
-- After replication catches up:
SET GLOBAL foreign_key_checks = 1;

-- Option 3: Skip the failing transaction (file-position replication;
-- sql_slave_skip_counter before MySQL 8.0.26):
STOP REPLICA;
SET GLOBAL sql_replica_skip_counter = 1;
START REPLICA;

-- For GTID (replace with the failing GTID):
STOP REPLICA;
SET GTID_NEXT = 'uuid:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
```
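After running with foreign key checks disabled, or after skipping a transaction, child rows may exist on the replica without a parent. The query below is a sketch for finding them; `orders.customer_id` is an assumed column name (the article only names the `orders` and `customers` tables and `customers.id`), so adjust it to the real schema.

```sql
-- Child rows on the replica whose parent row is missing.
-- orders.customer_id is a hypothetical column name.
SELECT o.customer_id, COUNT(*) AS orphan_rows
FROM orders AS o
LEFT JOIN customers AS c ON c.id = o.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.id IS NULL
GROUP BY o.customer_id;
```

An empty result means every non-NULL reference resolves; any rows returned identify the parent ids that need to be restored from the source.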

Step 5: Check Data Consistency

```sql
-- Compare row counts:
-- On source:
SELECT COUNT(*) FROM users;
-- On replica:
SELECT COUNT(*) FROM users;

-- Compare checksum:
-- On source:
CHECKSUM TABLE users;
-- On replica:
CHECKSUM TABLE users;

-- Use pt-table-checksum for full verification (shell, run against the source):
pt-table-checksum --host=source --databases=mydb

-- Fix inconsistencies with pt-table-sync (shell), using the checksum results
-- stored in pt-table-checksum's default table percona.checksums.
-- Run with --print instead of --execute first to review the changes:
pt-table-sync --execute --replicate percona.checksums h=source

-- Check specific row:
-- On source:
SELECT * FROM users WHERE id = 123;
-- On replica:
SELECT * FROM users WHERE id = 123;
```
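CHECKSUM TABLE only reports whether a whole table differs. To narrow a mismatch down to a range of rows, a range-limited checksum can be computed by hand. This is a rough sketch, not what pt-table-checksum does internally; it uses the `customers (id, name)` columns that appear in Step 4, and the id range is arbitrary.

```sql
-- Run the same statement on source and replica and compare both columns.
-- CONCAT_WS skips NULL values, so NULL and a missing value hash the same.
SELECT COUNT(*) AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#', id, name))) AS chunk_checksum
FROM customers
WHERE id BETWEEN 1 AND 10000;
```

If a range differs, halving it and repeating the query on both servers locates the divergent rows in a few rounds.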

Step 6: Check Configuration Mismatch

```sql
-- Check sql_mode on both servers:
SELECT @@global.sql_mode;

-- Must be identical for consistent behavior

-- On source (SET GLOBAL applies to new connections only):
SET GLOBAL sql_mode = 'STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION';
-- On replica:
SET GLOBAL sql_mode = 'STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION';

-- Check character set:
SHOW VARIABLES LIKE 'character_set%';

-- Check collation:
SHOW VARIABLES LIKE 'collation%';

-- Check timezone:
SELECT @@global.time_zone;

-- Check binlog format:
SHOW VARIABLES LIKE 'binlog_format';
-- Should be ROW for safest replication

-- Sync configuration in my.cnf so the settings survive a restart:
[mysqld]
sql_mode = STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION
binlog_format = ROW
character_set_server = utf8mb4
collation_server = utf8mb4_unicode_ci
```
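To compare the two servers side by side, it can help to collect the relevant settings in a single row and diff the output. A small sketch using standard global variables; which settings matter most depends on the workload.

```sql
-- Run on source and replica, then compare the two outputs.
SELECT @@global.sql_mode             AS sql_mode,
       @@global.time_zone            AS time_zone,
       @@global.system_time_zone     AS system_time_zone,
       @@global.character_set_server AS character_set_server,
       @@global.collation_server     AS collation_server,
       @@global.binlog_format        AS binlog_format\G
```

When time_zone is SYSTEM, system_time_zone shows what the server actually resolved it to, which is the value that needs to match.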

Step 7: Check Schema Consistency

```sql
-- Compare table structures (run on both source and replica):
SHOW CREATE TABLE users;

-- Check for differences in:
-- - Column types
-- - Indexes
-- - Constraints
-- - Auto_increment

-- Use mysqldump to compare (shell):
mysqldump --no-data --host=source mydb users > source_schema.sql
mysqldump --no-data --host=replica mydb users > replica_schema.sql
diff source_schema.sql replica_schema.sql

-- Fix schema differences:
-- On replica:
ALTER TABLE users ADD COLUMN new_column VARCHAR(100);
-- Or recreate table if major differences
```
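For a column-by-column comparison that is easier to diff than SHOW CREATE TABLE output, the column definitions can be read from information_schema. A sketch using the `mydb.users` table named earlier in the article:

```sql
-- Run on source and replica; the result sets should be identical.
SELECT COLUMN_NAME, ORDINAL_POSITION, COLUMN_TYPE, IS_NULLABLE,
       COLUMN_DEFAULT, CHARACTER_SET_NAME, COLLATION_NAME, EXTRA
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'users'
ORDER BY ORDINAL_POSITION;
```

Differences in COLUMN_TYPE or COLLATION_NAME are the ones most likely to surface as 1264, 1292, or 1366 errors on the applier.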

Step 8: Restart Replication Properly

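```sql
-- After fixing the error, restart replication
-- (the same statement for file-position and GTID replication):
START REPLICA;

-- Check status:
SHOW REPLICA STATUS\G

-- Monitor SQL thread:
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- Check lag:
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source should decrease

-- Monitor progress: wait up to 60 seconds for the replica to reach a source position.
-- Returns -1 on timeout and NULL if the SQL thread is not running.
-- SOURCE_POS_WAIT() is named MASTER_POS_WAIT() before MySQL 8.0.26.
SELECT SOURCE_POS_WAIT('mysql-bin.000001', 12345, 60) AS position_caught_up;
```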
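With GTID replication there is no file and position to wait for; the equivalent is to wait for a GTID set. A minimal sketch, where the GTID set is a placeholder that would normally be copied from `SELECT @@global.gtid_executed` on the source:

```sql
-- Run on the replica. Blocks until the given GTID set has been applied
-- or 60 seconds pass. Returns 0 when the set was applied, 1 on timeout.
SELECT WAIT_FOR_EXECUTED_GTID_SET(
         'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-123', 60
       ) AS timed_out;
```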

Step 9: Rebuild Replica if Necessary

```sql
-- If errors are extensive, rebuild replica:

-- On replica:
STOP REPLICA;
RESET REPLICA ALL;

-- Take backup from source (shell; --source-data is named --master-data before MySQL 8.0.26):
mysqldump --host=source --all-databases --source-data=2 \
  --single-transaction --routines --triggers --events > backup.sql

-- Extract binlog position (shell; the statement name depends on the server version):
head -n 50 backup.sql | grep -E "CHANGE (MASTER|REPLICATION SOURCE) TO"

-- Restore on replica (shell):
mysql < backup.sql

-- Configure replication with the file and position found in the dump:
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source-host',
  SOURCE_USER = 'replicator',
  SOURCE_PASSWORD = 'password',
  SOURCE_LOG_FILE = 'mysql-bin.000001',
  SOURCE_LOG_POS = 12345;

START REPLICA;

-- Or, for GTID:
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source-host',
  SOURCE_USER = 'replicator',
  SOURCE_PASSWORD = 'password',
  SOURCE_AUTO_POSITION = 1;

START REPLICA;
```
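Before pointing a rebuilt replica at the source with SOURCE_AUTO_POSITION = 1, it is worth confirming that both servers meet the basic GTID prerequisites. A sketch of that check using standard system variables:

```sql
-- Run on source and replica.
-- server_id and server_uuid must differ between the two servers;
-- gtid_mode must be ON on both for SOURCE_AUTO_POSITION = 1.
SELECT @@global.server_id                AS server_id,
       @@global.server_uuid              AS server_uuid,
       @@global.gtid_mode                AS gtid_mode,
       @@global.enforce_gtid_consistency AS enforce_gtid_consistency;
```

A duplicated server_uuid typically appears when the replica's data directory was cloned from the source together with its auto.cnf file.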

Step 10: Monitor Replication Health

```sql
-- Create monitoring query (thread state per replication channel):
SELECT c.CHANNEL_NAME,
       c.SERVICE_STATE AS io_thread_state,
       a.SERVICE_STATE AS sql_thread_state
FROM performance_schema.replication_connection_status AS c
JOIN performance_schema.replication_applier_status AS a USING (CHANNEL_NAME);

-- Create monitoring script:
-- #!/bin/bash
-- mysql -e "SHOW REPLICA STATUS\G" | grep -E "Running|Error|Seconds_Behind"

-- Prometheus mysqld_exporter metrics:
-- mysql_up
-- mysql_slave_status_slave_io_running
-- mysql_slave_status_slave_sql_running
-- mysql_slave_status_seconds_behind_master

-- Alerts:
-- - alert: MySQLReplicationSQLStopped
--   expr: mysql_slave_status_slave_sql_running == 0
--   for: 1m
--   labels:
--     severity: critical
--   annotations:
--     summary: "MySQL replication SQL thread stopped"

-- - alert: MySQLReplicationLag
--   expr: mysql_slave_status_seconds_behind_master > 60
--   for: 5m
--   labels:
--     severity: warning
--   annotations:
--     summary: "MySQL replication lag > 60 seconds"
```
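Seconds_Behind_Source is NULL while the SQL thread is stopped, so a lag alert alone will not catch the failure this article is about. As a complementary signal, the applier tables record when each transaction was committed on the source and when it finished applying on the replica. The query below is a sketch assuming MySQL 8.0 and reasonably synchronized clocks on both hosts.

```sql
-- Delay between the original commit on the source and the end of apply
-- on the replica, for the last transaction each worker applied.
SELECT CHANNEL_NAME, WORKER_ID, LAST_APPLIED_TRANSACTION,
       TIMESTAMPDIFF(MICROSECOND,
                     LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
                     LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) / 1000000
         AS apply_delay_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_APPLIED_TRANSACTION <> '';
```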

MySQL Replication SQL Thread Checklist

| Check | Command | Expected |
|---|---|---|
| SQL Running | SHOW REPLICA STATUS | Yes |
| Error message | SHOW REPLICA STATUS | None |
| Data consistent | CHECKSUM TABLE | Match |
| sql_mode | SELECT @@sql_mode | Same on both |
| Schema | SHOW CREATE TABLE | Identical |
| Lag | Seconds_Behind_Source | Low |

Verify the Fix

```sql
-- After resolving SQL thread error

-- 1. Check both threads running
SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes

-- 2. Check no errors
SHOW REPLICA STATUS\G
-- Last_SQL_Error: (empty)

-- 3. Check lag decreasing
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: decreasing

-- 4. Verify data consistency
CHECKSUM TABLE users;
-- Matches source

-- 5. Test write replication
-- On source:
INSERT INTO test_table VALUES (1, 'test');
-- On replica:
SELECT * FROM test_table;
-- Data present

-- 6. Monitor ongoing
SELECT * FROM performance_schema.replication_applier_status_by_worker;
-- No errors
```
