Impact of “tmpdir” change in MySQL replication

Recently, we encountered a strange issue involving replication and a temporary directory (tmpdir) change while working with a major client.

All the servers in this setup were running Percona Server for MySQL 5.6.38 on Debian 8 (Jessie).

The MySQL architecture is as follows: one master with five direct slaves under it.

[Image: replication topology with one master and five slaves]

In this blog, we will walk you through the issue we faced and how we ultimately fixed it.

Client Request:

There was a simple request from our client to add a column and an index to a 16GB production table. Since the table had foreign keys, we decided to go with online DDL instead of pt-online-schema-change to avoid complications.
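For context, the size of a table can be estimated from information_schema before attempting such an ALTER. A minimal sketch, using the payment.transactions table from the ALTER below:

-- rough on-disk size (data + indexes) of the table we are about to rebuild
select table_name,
       round((data_length + index_length) / 1024 / 1024 / 1024, 2) as size_gb
from information_schema.tables
where table_schema = 'payment'
  and table_name = 'transactions';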

When we started the ALTER, it failed due to insufficient space in the “tmpdir”. By default, MySQL uses “/tmp” for temporary table creation, sorting, and other temporary operations. Since we had only 5.7G available on “/” on the master, the ALTER failed as below.

mysql> alter table payment.transactions add column wallet_aggregate_id bigint(20) default null,add index idx_wallet_aggregate_id(wallet_aggregate_id),algorithm=inplace,lock=none;
ERROR 1878 (HY000): Temporary file write failure.

root@payment-mydbops:/home/waseem.mydbops# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.8G  3.6G  5.7G  39% /

mysql> select @@tmpdir;
+----------+
| @@tmpdir |
+----------+
| /tmp     |
+----------+
1 row in set (0.00 sec)

Change of MySQL tmpdir:

We explained the situation to the stakeholders. With the client’s approval, we changed tmpdir from “/tmp” to the “/var/tmp/mysql” partition. Since tmpdir is not a dynamic variable, this required a MySQL restart, after which the ALTER completed successfully on the master.
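The change itself is small. A minimal sketch; the config file path is the one shown later in this post, and the service command assumes Debian-style init:

# prepare the new temp location and hand it to the mysql user
mkdir -p /var/tmp/mysql
chown mysql:mysql /var/tmp/mysql

# in /etc/mysql/conf.d/mysql.cnf, under [mysqld]; tmpdir is not dynamic, hence the restart
tmpdir = /var/tmp/mysql

# apply the change
service mysql restart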

root@payment-mydbops:/home/waseem.mydbops# df -h /var/tmp/mysql
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/vgmysql-mysqltmp   25G   33M   25G   1% /var/tmp/mysql

mysql> select @@tmpdir;
+----------------+
| @@tmpdir       |
+----------------+
| /var/tmp/mysql |
+----------------+
1 row in set (0.00 sec)

mysql> alter table transactions add column aggregate_id bigint(20) default null,add index idx_aggregate_id(aggregate_id),algorithm=inplace,lock=none;
Query OK, 0 rows affected (48 min 1.57 sec)
Records: 0 Duplicates: 0 Warnings: 0

Rise of the Replication Issue:

Replication then went down on all five slaves connected to the master because of the ALTER: every slave was still using “/tmp” as its tmpdir, backed by the same small 9.8G root partition. Below is a snapshot of the replication error.

mysql> show slave status\G
*************************** 1. row ***************************
Master_Log_File: 118120-mysql-bin.000237
Read_Master_Log_Pos: 171370534
Relay_Log_File: mysqld-relay-bin.000704
Relay_Log_Pos: 99663098
Relay_Master_Log_File: 118120-mysql-bin.000237
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1878
Last_Error: Error 'Temporary file write failure.' on query. Default database: ''. Query: 'alter table transactions add column aggregate_id bigint(20) default null,add index idx_aggregate_id(aggregate_id),algorithm=inplace,lock=none'
Exec_Master_Log_Pos: 99662928
Relay_Log_Space: 171371048
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

Since we had already faced the same issue on the master, the fix seemed straightforward: change tmpdir to “/var/tmp/mysql” on the slave and restart MySQL to resolve the replication issue as well.
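On each slave, the expected recovery is the same recipe we used on the master. A minimal sketch, assuming the same config change:

-- stop replication cleanly before touching the configuration
stop slave;
-- edit the config file: tmpdir = /var/tmp/mysql (same change as on the master), then restart mysqld
-- after the restart, the SQL thread should retry the failed ALTER from its saved position
start slave;
show slave status\G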

But the real surprise was that, after the restart, the slave had dropped out of replication entirely; the slave status was as below.

mysql> show slave status\G
Empty set (0.01 sec)

Below is the error found in the error log:

2019-02-20 18:04:29 547 [ERROR] Error in checking /var/lib/mysql/master.info repository info type of FILE.
2019-02-20 18:04:29 547 [ERROR] Error creating master info: Error checking repositories.
2019-02-20 18:04:29 547 [ERROR] Failed to create or recover replication info repository.
2019-02-20 18:04:29 547 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2019-02-20 18:04:29 547 [Note] Event Scheduler: Loaded 0 events

Since our master info was stored in a table, as shown below, we initially ignored the error on the first line.

mysql> show global variables like '%repository%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| master_info_repository    | TABLE |
| relay_log_info_repository | TABLE |
+---------------------------+-------+
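With TABLE repositories, the replication coordinates normally live in two system tables that can be inspected directly. A quick sanity check, as a sketch (these same tables come up again in the comments at the end of this post):

-- master connection coordinates recorded by the IO thread
select Master_log_name, Master_log_pos, Host from mysql.slave_master_info\G
-- applied position recorded by the SQL thread
select Relay_log_name, Relay_log_pos, Master_log_name, Master_log_pos from mysql.slave_relay_log_info\G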

Since we had the binlog coordinates handy, we decided to run the CHANGE MASTER command again to resume replication. It failed again, this time with the error below about server-id.

change master to master_host='10.200.xx.xxx',master_user='replication',master_password='dessert',master_log_file='118120-mysql-bin.000237',master_log_pos=99662928;
ERROR 1794 (HY000): Slave is not configured or failed to initialize properly. You must at least set --server-id to enable either a master or a slave. Additional error messages can be found in the MySQL error log. 

This error made us check the config file for server-id, which was untouched.
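The usual checks here are quick. A sketch, using the config path shown later in this post:

-- the slave's server_id should still be non-zero and unique within the topology
select @@global.server_id;
-- and in the shell: grep -i 'server.id' /etc/mysql/conf.d/mysql.cnf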

Again we went back to the error log, as below:

2019-02-20 18:04:29 547 [ERROR] Error in checking /var/lib/mysql/master.info repository info type of FILE.
2019-02-20 18:04:29 547 [ERROR] Error creating master info: Error checking repositories.
2019-02-20 18:04:29 547 [ERROR] Failed to create or recover replication info repository.

Next, we checked the file “/var/lib/mysql/master.info”, even though our repository info was supposed to be in the table, and found the file to be empty. We decided to remove the file and restart MySQL, but the issue was still not fixed.
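The check itself is simple. A sketch (the file turned out to be empty and, as noted later in the comments, owned by root:root):

# inspect the stray file: size and ownership
ls -l /var/lib/mysql/master.info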

Finally, we decided to change master_info_repository and relay_log_info_repository from “TABLE” to “FILE”, removed the empty master.info file from the datadir once more, and restarted MySQL. Below are the changes made on the slave.

mysql> show global variables like '%repository%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| master_info_repository    | FILE  |
| relay_log_info_repository | FILE  |
+---------------------------+-------+

root@payments-mydbops:/home/waseem.mydbops# less /etc/mysql/conf.d/mysql.cnf | grep -i repository
master_info_repository = FILE
relay_log_info_repository = FILE
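With FILE repositories, the coordinates are written to plain files in the datadir instead of the system tables. A quick way to confirm, assuming the default datadir:

# master connection coordinates used by the IO thread
cat /var/lib/mysql/master.info
# applied position used by the SQL thread
cat /var/lib/mysql/relay-log.info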

Even though we had our CHANGE MASTER TO query ready for re-configuring, surprisingly there was no need for it: after the above change, the slave resumed without any manual intervention, as below.

mysql> show slave status\G
*************************** 1. row ***************************
Master_Log_File: 118120-mysql-bin.000237
Read_Master_Log_Pos: 214316053
Relay_Log_File: mysqld-relay-bin.000702
Relay_Log_Pos: 290
Relay_Master_Log_File: 118120-mysql-bin.000237
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 42840
1 row in set (0.00 sec)

Conclusion:

The above issue was quite strange. We are still trying to reproduce it on other versions of MySQL, such as 5.6, 5.7, and 8.0.

We have resolved the issue, but we are still trying to understand more about it. We will keep everyone posted.

For now, we hope this blog helps someone facing a similar replication issue.

Thanks for your time and continued support!!

Image courtesy: Photo by Christian Regg on Unsplash

10 thoughts on “Impact of “tmpdir” change in MySQL replication”

  1. I would be very curious to see the content of the mysql.slave_master_info and mysql.slave_relay_log_info tables before running CHANGE MASTER. You can change master_info_repository and relay_log_info_repository from FILE to TABLE (and vice versa) when the slave is stopped, and it looks like the tables got somehow corrupted and starting with TABLE is trying to copy FILE from TABLE. But this is just a guess, and I have no idea how a full tmpdir could corrupt this, unless tmpdir is used when stopping a slave. I might run some tests to know more…
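    (A minimal sketch of that switch, for reference; both variables are dynamic in 5.6, but the slave threads must be stopped first:)

    stop slave;
    set global master_info_repository = 'TABLE';
    set global relay_log_info_repository = 'TABLE';
    start slave;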

    1. I can confirm that MySQL 5.6.42 corrupts its mysql.slave_relay_log_info table on not being able to write to its tmpdir while running an ALTER TABLE […] ADD INDEX […]; I will create a bug.

      1. If I add below before stopping MySQL in the How to repeat of Bug#94452…

        $ touch node1/data/master.info

        I get this in the error log:

        2019-02-23 09:30:03 12950 [ERROR] Error creating master info: Multiple replication metadata repository instances found with data in them. Unable to decide which is the correct one to choose.
        2019-02-23 09:30:03 12950 [ERROR] Failed to create or recover replication info repository.
        2019-02-23 09:30:03 12950 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

        So having a master.info file and data in the mysql.slave_master_info table leads to problems. When you deleted that file and went to FILE for your replication repositories, you probably copied the data from mysql.slave_master_info to the file, which explains why replication resumed (MySQL probably thought at this point that you wanted to move from TABLE to FILE, and it “did the right thing”).

        I still do not understand how deleting the file did not fix things for you (it did for me), so I am wondering if you were using relay_log_recovery (which defaults to OFF).

  2. Hi Jean-François Gagné,

    Thanks for the detailed analysis,

    When we checked the tables mysql.slave_master_info and mysql.slave_relay_log_info by query, we got an empty result set.

    We checked the master.info file, which had been created with root:root ownership. We tried changing the ownership to mysql:mysql, which did not help, and we also tried removing the master.info file, which did not work either, hence we decided to change from TABLE to FILE.

    And also all our slaves are running with relay_log_recovery=ON.

    Thanks

    1. > When we checked the table mysql.slave_master_info and mysql.slave_relay_log_info, by query we got a empty result set.

      Super weird ! If the data was not in those tables, and not in the files, I do not know where it is coming from: for your slave to magically start after switching to file, it needs to come from somewhere.

      > master.info file was created under root:root permission

      Super interesting ! I guess MySQL is running under mysql, so it is not MySQL that created this file.

  3. Unrelated note (because you are on 5.6 and not on 5.7)

    For 5.7, MySQL has a better solution with the “innodb_tmpdir” session variable. You can choose another tmpdir just for the ALTER.
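    (For illustration, a minimal sketch of that approach on 5.7, reusing the ALTER from this post; innodb_tmpdir only redirects the temporary sort files of online ALTERs, and setting the session value may require elevated privileges:)

    set session innodb_tmpdir = '/var/tmp/mysql';
    alter table transactions
      add column aggregate_id bigint(20) default null,
      add index idx_aggregate_id (aggregate_id),
      algorithm=inplace, lock=none;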
