图 arctic_moon_by ellysiumn

在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现

本文案例采用sandbox测试,GTID的一些关键测试参数如下

master-info-repository=tablerelay-log-info-repository=tablegtid_mode=ONlog-slave-updatesenforce-gtid-consistencyrelay_log_recovery=on

MySQL 版本 5.7.19

Master的UUID是-1111-1111-111111111111,

Slave的UUID是 -3333-3333-333333333333,

我们简称 Master的UUID为1111,Slave的UUID为3333。

我们发现让从库 SQL THREAD 跳过一个GTID事务,然后在t11 时刻,我们想让从库重新执行这个跳过的GTID,SQL thread却始终报错,我们把从库的线程的关键信息提供一下

$ mysqlbinlog mysql-relay./*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#190308 10:56:36 server id 102 end_log_pos 123 CRC32 0x73fa3393 Start: binlog v 4, server v 5.7.25-28-log created 190308 10:56:36# This Format_description_event appears in a relay log and was generated by the slave thread.# at 123#190308 10:56:36 server id 102 end_log_pos 154 CRC32 0x4c964c64 Previous-GTIDs# [empty]# at 154#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4# at 201#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10# at 320#190308 10:56:36 server id 0 end_log_pos 367 CRC32 0x72b7d6b4 Rotate to mysql-bin.000009 pos: 194# at 367#190308 10:56:36 server id 0 end_log_pos 414 CRC32 0x73628a00 Rotate to mysql-bin.000009 pos: 2426# at 414#190308 10:57:44 server id 1 end_log_pos 2491 CRC32 0x18c5896c GTID last_committed=13 sequence_number=14 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1857'/*!*/;# at 479#190308 10:57:44 server id 1 end_log_pos 2606 CRC32 0x7c575de0 Query thread_id=4 exec_time=0 error_code=0use `test`/*!*/;DROP TABLE `t1` /* generated by server */ 删除表的操作/*!*/;# at 594#190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;# at 659#190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0SET TIMESTAMP=1552013868/*!*/;create table t1(id int) 创建表的操作/*!*/;SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;DELIMITER ;# End of log file/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;我们可以看到 GTID 1111:1857 和 GTID 1111:1858 都在relay log里面,但是SQL thread 始终没有执行,为什么重新设置gtid_purged ='1111:1-1856', SLAVE 未应用 11111:1857 的drop 表事务?

Relay_Log_Pos = 594 表示的是SQL thread 执行的位点在relay log 的位置

我们看看594 在relay log 是什么event

# at 594# 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;# at 659# 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0SET TIMESTAMP=1552013868/*!*/;create table t1(id int)但是DROP TABLE 的位点是在 414 的位点,而SQL thread 位点却在594,所以始终没办法执行到 414 位点的DROP TABLE 操作,无法修复,那是为啥呢?

因为414 的DROP TABLE 操作的GTID 是 1111:1857 ,被我们人为的设置为从库已经执行,当从库线程发现1111:1857已经执行过,则跳过该事务,继续往下执行回放。当执行到594位点,即GTID为1111:1858。这时候,当我们在t11时刻,设置gtid_purged=1111:1-1856',让从库可以重新执行GTID 为1111:1157 的事务,SQL thread 还是从relay log 的594位点开始执行,无法恢复。那怎么修复呢?

STOP SLAVE;CHANGE MASTER TO RELAY_LOG_FILE='mysql-relay.' ,RELAY_LOG_POS=414 ,MASTER_AUTO_POSITION=0;CHANGE MASTER TO MASTER_AUTO_POSITION=1;START SLAVE;SHOW SLAVE STATUS\G我们再观察下slave的状态

Master_Log_File: mysql-bin. Read_Master_Log_Pos: 2768 Relay_Log_File: mysql-relay.000002 Relay_Log_Pos: 756 Relay_Master_Log_File: mysql-bin. Slave_IO_Running: Yes Slave_SQL_Running: Yes Exec_Master_Log_Pos: 2768 Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858我们看到1111:1157 已经被加入Executed_Gtid_Set。

因为relay_log_recovery为on的状态,老的relay log 会被删除,重新根据gtid_executed 去主库拉取binlog,这时候就会顺利往下执行了。

这次和案例一不一样的地方是本次停止的是整个从库线程,包括IO线程和SQL线程。在启动前IO线程是没有拿到DROP的GTID为1111:1159的事务的,启动SLAVE线程后MySQL会怎么处理呢?

官方文档上面的解释说:

In the initial handshake, the slave sends a GTID set containing the transactions that it has already received, committed, or both. This GTID set is equal to the union of the set of GTIDs in the gtid_executed system variable (@@GLOBAL.gtid_executed), and the set of GTIDs recorded in the Performance Schema replication_connection_status table as received transactions (the result of the statement SELECT RECEIVED_TRANSACTION_SET FROM PERFORMANCE_SCHEMA.replication_connection_status).

也就是说,当从库连上主库的时候,从库会发送一个GTID集合,可能是已经执行的GTID集合,也可能是已经收到的GTID集合,或者是两者。这个GTID是已经执行的GTID集合和已经收到的GTID集合的并集。

UNION(@@global.gtid_executed,Retrieved_gtid_set - last_received_GTID)

当t7 时刻启动slave的时候,已经执行的GTID集合为1111:1-1859,已经收到的GTID集合为1111:1-1858,那么发送给主库的GTID集合为1111:1-1859,那么主库下一个发送的GTID为1111:1860,也就是CREATE TABLE语句,我们去relay log里面验证下

当前slave 的状态

Master_Log_File: mysql-bin. Read_Master_Log_Pos: 3110 Relay_Log_File: mysql-relay.000005 Relay_Log_Pos: 414 Relay_Master_Log_File: mysql-bin. Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 2948 Last_SQL_Errno: 1050 Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)' Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858我们看下 mysql-relay.000005的内容

$mysqlbinlog mysql-relay./*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#190308 12:15:33 server id 102 end_log_pos 123 CRC32 0x1f2e3e22 Start: binlog v 4, server v 5.7.25-28-log created 190308 12:15:33# This Format_description_event appears in a relay log and was generated by the slave thread.# at 123#190308 12:15:33 server id 102 end_log_pos 154 CRC32 0x5422dcce Previous-GTIDs# [empty]# at 154#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4# at 201#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10# at 320#190308 12:15:33 server id 0 end_log_pos 367 CRC32 0x182ec41f Rotate to mysql-bin.000009 pos: 194# at 367#190308 12:15:33 server id 0 end_log_pos 414 CRC32 0x7c85ab34 Rotate to mysql-bin.000009 pos: 2948# at 414#190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;# at 479#190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0use `test`/*!*/;create table t1(id int) 建表的操作/*!*/;# at 576#190308 12:26:37 server id 102 end_log_pos 625 CRC32 0x532b5fe7 Rotate to mysql-relay.000006 pos: 4SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;DELIMITER ;# End of log file/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;我们发现 Relay_Log_Pos: 414 是CREATE TABLE的语句,GTID为1111:1860,在relay log mysql-relay.未发现GTID 为1111:1859 的DROP语句,我们发现后面还要一个relay log为mysql-relay.000006,我们看看里面的内容

$mysqlbinlog mysql-relay./*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#190308 12:26:37 server id 102 end_log_pos 123 CRC32 0x4357741b Start: binlog v 4, server v 5.7.25-28-log created 190308 12:26:37# This Format_description_event appears in a relay log and was generated by the slave thread.# at 123#190308 12:26:37 server id 102 end_log_pos 194 CRC32 0xa087d1c5 Previous-GTIDs# 00021800-1111-1111-1111-111111111111:1860# at 194#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4# at 241#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10# at 360#190308 12:26:37 server id 0 end_log_pos 407 CRC32 0xaea58e04 Rotate to mysql-bin.000009 pos: 194# at 407#190308 12:26:37 server id 0 end_log_pos 454 CRC32 0xb19784ba Rotate to mysql-bin.000009 pos: 2768# at 454#190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;# at 519#190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0use `test`/*!*/;DROP TABLE `t1` /* generated by server */ 删表的操作/*!*/;# at 634#190308 12:26:37 server id 0 end_log_pos 681 CRC32 0xd7fd7810 Rotate to mysql-bin.000009 pos: 3110SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;DELIMITER ;# End of log file/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;我们发现GTID为1111:1859的事务在1111:1860的后面,这也是符合预期的,因为在t7时刻启动从库线程的时候,主库需要发给从库的GTID只有1111:1860,因此mysql-relay.000005里面只有1111:1860,当我们在t11时刻执行SET GLOBAL GTID_PURGED='1111:1-1858'的时候,从库发送给主库的GTID集合为 1111:1-1858和 1111:1-1858:1860合集,也就是1111:1-1158:1860,这时候主库需要发送给从库的GTID 为1111:1859。

但是由于1111:1859 在 1111:1860 的后面,因此始终无法执行1111:1859的DROP事务,导致一直报错。

在t12操作之后,直接重启从库,此时,mysql-relay. 和 mysql-relay.000006 都被删除,这时候由于gtid_executed 变量为1111:1-1858,那么从主库重新拉取的GTID为1111:1859-1860,我们从relay log mysql-relay.000007 和 mysql-relay.000008 里面验证一下

由于mysql-relay. 里面没有包含实际的事务信息,我们直接看mysql-relay.000008

$mysqlbinlog mysql-relay./*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#190308 13:22:11 server id 102 end_log_pos 123 CRC32 0xd3d8ddd6 Start: binlog v 4, server v 5.7.25-28-log created 190308 13:22:11# This Format_description_event appears in a relay log and was generated by the slave thread.# at 123#190308 13:22:11 server id 102 end_log_pos 154 CRC32 0x5c6327ce Previous-GTIDs# [empty]# at 154#700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4# at 201#190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10BINLOG 'osWBXA8BAAAAdwAAAHsAAAAAAAQANS43LjI1LTI4LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQAAWnfbVM='/*!*/;# at 320#190308 13:22:11 server id 0 end_log_pos 367 CRC32 0x08b48d1c Rotate to mysql-bin.000009 pos: 194# at 367#190308 13:22:11 server id 0 end_log_pos 414 CRC32 0xc7dd1334 Rotate to mysql-bin.000009 pos: 2768# at 414#190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;# at 479#190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0use `test`/*!*/;DROP TABLE `t1` /* generated by server *//*!*/;# at 594#190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=noSET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;# at 659#190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0SET TIMESTAMP=1552018527/*!*/;create table t1(id int)/*!*/;SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;DELIMITER ;# End of log file/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;我们看到mysql-relay. 包含了1111:1859-1860,从库SQL线程也恢复正常状态

Master_Log_File: mysql-bin. Read_Master_Log_Pos: 3110 Relay_Log_File: mysql-relay.000008 Relay_Log_Pos: 756 Relay_Master_Log_File: mysql-bin. Slave_IO_Running: Yes Slave_SQL_Running: Yes Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1860结语案例一  说明当我们重置 gtid_purged,但是sql_thread 记录的位点并不会随着gtid_purged 而向后移动,必须手动change master 指定或者重启数据库实例来解决。

案例二  人为操作导致relay log里面的事务顺序发生变化,导致报错,解决方式其实利用了slave crash safe原理,重启实例会删除relay log 重新拉取需要日志并记录到relay log里面。

GTID在上线生产的时候,需要多做测试,明白其底层的原理,并不是所有未执行的GTID事务一定会被执行,也需要注意GTID的事务的执行顺序。