PXC (Percona XtraDB Cluster), aka Galera Cluster, is one of my favorite clustering technologies for MySQL, offering true multi-master capability, write-set based near-real-time replication, and many more features.
I can proudly say that Mydbops is one of the very few companies that have been supporting mission-critical clusters and advocating PXC, i.e., Galera Cluster, for the last 4+ years. Here are a few of our talks, presentations, and blogs on PXC:
- Percona XtraDB Cluster (Presentation)
- Handling long duration SST(timeout) in PXC with systemd
- New GTID functions in Galera 4
- Percona Cluster/MariaDB Cluster (Galera) Read-write Split
- Fulfilled Tablespace Encryption (TDE) in Percona Cluster
- Maxscale for PXC
In this blog, I am going to share my recent experience troubleshooting Percona XtraDB Cluster (Galera).
We had a request from a client to do a rolling shutdown of the cluster for hardware maintenance. This is a large and busy four-node cluster with a dataset of 3.3 TB running PXC 5.6.46, with ProxySQL in single-writer mode as the load balancer, handling more than 60K QPS.
As requested, we brought MySQL down cleanly, and the node was then taken for maintenance, which lasted ~1:30 hrs (90 minutes).
```
2020-04-14 04:37:45 3300 [Note] Shutting down plugin 'binlog'
2020-04-14 04:37:45 3300 [Note] /usr/sbin/mysqld: Shutdown complete
```
An IST (Incremental State Transfer) after maintenance is the easiest and fastest way to rejoin the node to the cluster, rather than an SST (State Snapshot Transfer), which is a full data copy.
In order to have an IST, we had calculated and set the Gcache (aka ring-buffer file) to 16 GB, enough to hold write sets for at least 3 hours.
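The sizing math behind that 16 GB figure can be sketched as below. This is a back-of-the-envelope estimate, not from the original incident: sample the node's replication byte counters twice, derive a bytes-per-second write rate, and multiply by the desired retention window. The sampled values here are hypothetical.

```python
# Rough sketch of Gcache sizing: write-set rate (bytes/sec) times the
# retention window you want to survive (here, a 3-hour maintenance).
# In practice the rate comes from sampling wsrep_replicated_bytes and
# wsrep_received_bytes on a busy node; the numbers below are made up.

def gcache_size_bytes(bytes_start, bytes_end, sample_seconds, hold_seconds):
    """Estimate the Gcache size needed to retain write sets for hold_seconds."""
    rate = (bytes_end - bytes_start) / sample_seconds  # bytes of write sets per second
    return int(rate * hold_seconds)

# Hypothetical sample: 90 MB of write sets observed over a 60-second window
size = gcache_size_bytes(0, 90_000_000, 60, 3 * 3600)
print(size / 1024**3)  # ~15 GiB, in line with the 16 GB setting above
```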
When the node came back online, as expected it was going for an IST, as shown below:
```
2020-04-14 05:05:04 4506 [Note] WSREP: IST receiver addr using tcp://184.108.40.206:4568
2020-04-14 05:05:04 4506 [Note] WSREP: GCache history reset: eaad8819-e6a3-11e6-84d2-3e87f6331c69:77346382323 -> eaad8819-e6a3-11e6-84d2-3e87f6331c69:77347108222
2020-04-14 05:07:21 4506 [Note] WSREP: SST received: eaad8819-e6a3-11e6-84d2-3e87f6331c69:77346382323
2020-04-14 05:07:21 4506 [Note] WSREP: Receiving IST: 725899 writesets, seqnos 77346382323-77347108222
2020-04-14 05:07:21 4506 [Note] WSREP: Receiving IST... 0.0% ( 0/725899 events) complete.
```
As you can see, the local state is 77346382323 (the state of this node) and the group state is 77347108222 (the current cluster state). The missing 725,899 write sets (77347108222 − 77346382323) had to be received via IST. The node should have joined the cluster automatically after receiving those write sets, but it did not.
While checking the error log, we found the note below stating that the node was rejected from joining the cluster because the group state had changed compared with the node's local state. This can happen in edge-case scenarios under a high concurrency of writes.
```
2020-04-14 05:08:42 4506 [Note] WSREP: State transfer required:
	Group state: eaad8819-e6a3-11e6-84d2-3e87f6331c69:77347116858
	Local state: eaad8819-e6a3-11e6-84d2-3e87f6331c69:77347109141
2020-04-14 05:08:42 4506 [Note] WSREP: New cluster view: global state: eaad8819-e6a3-11e6-84d2-3e87f6331c69:77347116858, view# 562: Primary, number of nodes: 4, my index: 1, protocol version 3
2020-04-14 05:08:56 4311 [Warning] WSREP: Rejecting JOIN message from 0.0 (node3): new State Transfer required.
2020-04-14 05:08:56 4506 [Note] WSREP: Setting wsrep_ready to false
```
Do not attempt to restart/start MySQL at this point, since it will trigger a time-consuming SST to sync the data.
Reason for the SST trigger:

Since the grastate.dat file was reset (shown below) after the failed initial IST, the cluster would trigger an SST:
```
# GALERA saved state
version: 2.1
uuid: 00000-000000000000-00000000
seqno: -1
safe_to_bootstrap: 0
```
What is the grastate.dat file?
This file, located under the data directory of each node in the cluster, maintains the cluster group UUID and the local state seqno. From Galera 3.19, an additional bootstrap protection flag, safe_to_bootstrap, was added.
The grastate.dat file also helps identify the most advanced node in the cluster to bootstrap from after a complete outage of all nodes, e.g., in case of a power outage.
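Since the file is plain key-value text, that "most advanced node" check is easy to script. Below is a minimal sketch (not part of the original workflow) that parses the contents of a grastate.dat file into a UUID and seqno; the sample content uses the values from this incident.

```python
# Sketch: parse grastate.dat text into (uuid, seqno). Comparing the
# seqno across all nodes tells you which one to bootstrap from.

def parse_grastate(text):
    """Return (uuid, seqno) parsed from grastate.dat contents."""
    state = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state.get("uuid"), int(state.get("seqno", -1))

sample = """# GALERA saved state
version: 2.1
uuid: ea5d8819-e8a3-11e6-84d2-3e85f6331c89
seqno: 77347109141
safe_to_bootstrap: 0
"""
uuid, seqno = parse_grastate(sample)
print(uuid, seqno)
```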
How to join this node with IST?
Again, as you can see, the local state seqno is 77347109141, i.e., the node is consistent up to this point, while the group state has an advanced seqno of 77347116858, i.e., the cluster has moved forward with new write sets.
Local State : 77347109141
Group State : 77347116858
Now we can induce an IST by manually updating the local state seqno in grastate.dat. Let's update the cluster UUID (if it was overwritten) and the seqno to 77347109141, then start the node. It will read the seqno from the file and proceed with an IST, as shown below.
I have edited the grastate.dat file as below:
```
# GALERA saved state
version: 2.1
uuid: ea5d8819-e8a3-11e6-84d2-3e85f6331c89
seqno: 77347109141
safe_to_bootstrap: 0
```
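A manual edit works fine for a one-off fix; for repeated use the rewrite could be scripted. A minimal sketch (not from the original post; the helper name and formatting are my own, and you should back up the file first):

```python
# Sketch: rewrite the uuid and seqno lines of a grastate.dat file in
# place, leaving the other lines (version, safe_to_bootstrap) untouched.

from pathlib import Path

def set_grastate(path, uuid, seqno):
    """Rewrite the uuid and seqno entries of a grastate.dat file."""
    out = []
    for line in Path(path).read_text().splitlines():
        if line.startswith("uuid:"):
            out.append(f"uuid:    {uuid}")
        elif line.startswith("seqno:"):
            out.append(f"seqno:   {seqno}")
        else:
            out.append(line)
    Path(path).write_text("\n".join(out) + "\n")
```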
Now proceed to start the cluster node:

```
# service mysql start
```
What happens during the cluster start?
Step 1: The grastate.dat file is read

When the node joins the cluster, it reads its state from the grastate.dat file:
```
2020-04-14 05:21:00 12030 [Note] WSREP: Found saved state: ea5d8819-e8a3-11e6-84d2-3e85f6331c89:77347109141, safe_to_bootstrap: 0
2020-04-14 05:21:00 12030 [Note] WSREP: Skipped GCache ring buffer recovery: could not determine history UUID.
```
Step 2: Assign and update the UUID and seqno
```
2020-04-14 05:21:00 12030 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> ea5d8819-e8a3-11e6-84d2-3e85f6331c89:77347109141
2020-04-14 05:21:00 12030 [Note] WSREP: Assign initial position for certification: 77347109141, protocol version: -1
```
Step 3: Calculate the write sets to be applied

The node calculates the write-set difference between its local state and the current group state, and receives the missing write sets from the donor's Gcache via IST.
```
2020-04-14 05:21:08 12030 [Note] WSREP: SST received: ea5d8819-e8a3-11e6-84d2-3e85f6331c89:77347109141
2020-04-14 05:21:08 12030 [Note] WSREP: Receiving IST: 452991 writesets, seqnos 77347109141-77347562132
2020-04-14 05:21:08 12030 [Note] WSREP: Receiving IST... 0.0% ( 0/452991 events) complete.
```
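The write-set count in that log line is exactly the seqno gap; a quick check with the numbers from this incident:

```python
# The IST size is simply group seqno minus local seqno, using the
# values reported in the log above.

local_seqno = 77347109141   # point up to which this node is consistent
group_seqno = 77347562132   # point the cluster has advanced to

missing_writesets = group_seqno - local_seqno
print(missing_writesets)  # 452991, matching "Receiving IST: 452991 writesets"
```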
Step 4: Apply the IST and join the node

The received write sets are applied in transaction order.
```
2020-04-14 05:21:08 12030 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.46-86.2-56-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel86.2, Revision 5c5f58d, WSREP version 28.38, wsrep_28.38
2020-04-14 05:21:19 12030 [Note] WSREP: Receiving IST... 19.7% ( 89440/452991 events) complete.
2020-04-14 05:21:29 12030 [Note] WSREP: Receiving IST... 36.7% (166064/452991 events) complete.
2020-04-14 05:21:39 12030 [Note] WSREP: Receiving IST... 60.8% (275392/452991 events) complete.
2020-04-14 05:21:49 12030 [Note] WSREP: Receiving IST... 83.2% (376688/452991 events) complete.
2020-04-14 05:21:57 12030 [Note] WSREP: Receiving IST...100.0% (452991/452991 events) complete.
2020-04-14 05:21:57 12030 [Note] WSREP: IST received: ea5d8819-e8a3-11e6-84d2-3e85f6331c89:77347562132
2020-04-14 05:21:57 12030 [Note] WSREP: 2.0 (node1): State transfer from 3.0 (node2) complete.
2020-04-14 05:21:57 12030 [Note] WSREP: Shifting JOINER -> JOINED (TO: 77347588293)
2020-04-14 05:21:58 12030 [Note] WSREP: Member 2.0 (node1) synced with group.
2020-04-14 05:21:58 12030 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 77347588752)
2020-04-14 05:21:59 12030 [Note] WSREP: Synchronized with group, ready for connections
```
The node joined back to the cluster as expected by re-initiating the IST, and it is now ready for connections. This saved my day from a full SST (rebuild). I hope this blog provides a solution for those who face a similar issue.