Recently i worked on a production issue for one of our client under support .They have a architecture of a three node Galera cluster with one asynchronous slave .
- Node1 – 22.214.171.124
- Node2 – 126.96.36.199
- Node3 – 188.8.131.52
- Replica – 184.108.40.206
The slave(replica) was configured with node3 as replica master. Unfortunately the node 3 was crashed with an OOM killer ,also server has a low gcache size, so when i am trying to start the node 3 , it went to SST . Here the data size was around 2.6 TB , in general for completion of whole SST and joining the node back to cluster will take around approximately 12 hours.
As i told earlier, the replication slave was under node3 and all reporting applications were pointed to async slave only .So, I can’t wait upto 12 hours as it will affect my entire client reporting environment.
To overcome this scenario, i had planned to switch my async slave under node 2 ( 220.127.116.11 ) . By this blog post, i am going to explain the steps how i was able to achieve this .
Stop the slave server for getting persistent log file and position in async slave.
mysql> STOP SLAVE SQL_THREAD; Query OK, 0 rows affected (0.11 sec)
From the command SHOW SLAVE STATUS\G get the Relay_Log_File and Read_Master_Log_Pos.
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Reconnecting after a failed master event read Master_Host: 18.104.22.168 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.003099 Read_Master_Log_Pos: 677232126 Relay_Log_File: replica-relay-bin.009093 Relay_Log_Pos: 677232323 Relay_Master_Log_File: mysql-bin.003099
Here Relay_Log_File is replica-relay-bin.009093 and Read_Master_Log_Pos is 677232126
Decode the respective Relay_Log_File and get the Xid value using the position of Read_Master_Log_Pos .
mysqlbinlog --no-defaults --base64-output=decode-rows -vv /data/mysql/replica-relay-bin.009093 > /home/mydbops/replica-relay-bin.009093.txt [mydbops@replica ~]$ less replica-relay-bin.009093.txt | grep "677232126"#181202 3:27:13 server id 12 end_log_pos 677232126 CRC32 0xc818fec0 Xid = 572228464
Extended view of replica-relay-bin.009093
# at 677232001 #181202 3:27:13 server id 12 end_log_pos 677231872 CRC32 0x19bffe41 Query thread_id=7239194 exec_time=1 error_code=0 SET TIMESTAMP=1543701433/*!*/; SET @@session.foreign_key_checks=1/*!*/; /*!\C utf8 *//*!*/; SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/; BEGIN /*!*/; # at 677232069 #181202 3:27:13 server id 12 end_log_pos 677231983 CRC32 0x8c9b2a51 Rows_query # insert into mydbops.wsrep_paused(tag,counter) values ( NAME_CONST('node',2),@new_value) # at 677232180 #181202 3:27:13 server id 12 end_log_pos 677232046 CRC32 0xa0e15b87 Table_map: `mydbops`.`wsrep_paused` mapped to number 297 # at 677232243 #181202 3:27:13 server id 12 end_log_pos 677232095 CRC32 0xe5c9c37e Write_rows: table id 297 flags: STMT_END_F ### INSERT INTO `mydbops`.`wsrep_paused` ### SET ### @1=1517733 /* INT meta=0 nullable=0 is_null=0 */ ### @2=2 /* TINYINT meta=0 nullable=0 is_null=0 */ ### @3=7555.01 /* FLOAT meta=4 nullable=1 is_null=0 */ ### @4=1543701433 /* TIMESTAMP(0) meta=0 nullable=1 is_null=0 */ # at 677232292 #181202 3:27:13 server id 12 end_log_pos 677232126 CRC32 0xc818fec0 Xid=572228464 COMMIT/*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ;
Using Xid number get end_log_pos from node 2 to which we are going configure replication.
[mydbops@db02 ~]$ sudo mysqlbinlog /data/mysql/mysql-bin.002279 --base64-output=DECODE-ROWS --verbose | grep "Xid = 572228464"#181202 3:27:13 server id 12 end_log_pos 815896364 CRC32 0x519db5da Xid = 572228464
from Xid ‘572228464’ we got the binlog position ‘815896364‘ in mysql-bin.002279
Pointing the replication to node 2 by running change master in slave server.
mysql> reset slave; Query OK, 0 rows affected (0.11 sec) mysql> reset slave all; Query OK, 0 rows affected (0.01 sec) mysql> change master to master_host='172.20.4.12',master_port=3306,master_log_file='mysql-bin.002279',master_log_pos=815896364; Query OK, 0 rows affected (0.02 sec) mysql> start slave; Query OK, 0 rows affected (0.00 sec) mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Queueing master event to the relay log Master_Host: 172.20.4.12 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.002280 Read_Master_Log_Pos: 100575837 Relay_Log_File: replica-relay-bin.000002 Relay_Log_Pos: 317 Relay_Master_Log_File: mysql-bin.002279 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB:
After switch over current master for replica is node 2 (172.20.4.12):
- From the four simple steps i was able to get back the slave and make it production ready, which reduces the impact of downtime of a switch with out GTID.