MySQL Group Replication and its Memory consumption (troubleshooting).

This blog is about one of the issues encountered by our Remote DBA Team in one of the production servers. We have a setup of MySQL 5.7 Single Primary (Writer) GR with cluster size of  3 . Due to OOM, the MySQL process in the primary node got killed, this repeated over the course of the time.

We all know about the OOM (out of memory), theoretically, it is a process that the Linux kernel employs when the system is critically low on memory.

In a dedicated DB server, when the OOM triggers the direct impact will be on mysqld process since it will be the most memory consuming one.

Going forward will look into the detailed analysis made to tackle the issue of OOM.

DB Environment:-

  • Service – Group Replication Cluster
  • Cluster Nodes – 3
  • GR mode – Single Primary
  • MySQL Version – 5.7.24
  • CPU Core – 4
  • Memory – 11G
  • innodb_buffer_pool_size – 8G (72% of RAM)

Analysis:-

We had faced this issue only in the GR Primary node and not in any of the Secondary nodes (ProxySQL takes care of load balancing). The memory allocation is same across the cluster. So we had started to dig deep on memory usage in the primary node. I am splitting this memory issue analysis into 3 levels.

  1. MySQL Engine(Innodb) Memory Utilisation.
  2. MySQL Global Memory Utilisation.
  3. MySQL Group Replication Memory Utilisation.

1).MySQL Engine(Innodb) Memory Utilisation:-

Group Replication supports only the InnoDB engine. So in the case of InnoDB transaction-based engine, The major buffer is innodb_buffer_pool_size. As a rule of thumb, we can set it to 70-80% of the system memory. Depending on the server configuration, So, here we have set it around 70% of RAM to the InnoDB Buffer Pool. ( every rule has exceptions ).

Below I have shared the memory usage by InnoDB at Linux Kernel.

$ sudo pmap -x $(pidof mysqld) | grep -i anon
.
.
.
00007f9d7dd3a000 8543680 6168996 6129452 rw--- [ anon ]
.
.
mydbops@localhost:performance_schema> select (8543680/1024/1024) as 'innodb memory utilisation in GB';
+---------------------------------+
| innodb memory utilization in GB |
+---------------------------------+
| 8.14788818                      |
+---------------------------------+

2). Global Memory Utilisation:-

The memory utilization by all the global events in MySQL can be checked using the performance_schema. This includes the memory usage by the below MySQL events.

  • Events for engine based memory usage ( InnoDB, MyISAM, memory, blackhole, temptable and others. )
  • Events for the SQL Layer (memory/SQL)
  • Events for the client-server communication memory usage (memory/client)

To get these events output, the respective memory instrumentation in performance_schema. setup_instruments table were enabled.

update performance_schema.setup_instruments set ENABLED = 'YES' where NAME like 'memory/%';

After enabling the instruments, we can fetch the memory usage at all the global events. Observing the stats it is found that they events took only 172MB on an average.

mydbops@localhost:performance_schema> select (sum(current_number_of_bytes_used))/1024/1024 'Global memory usage in MB' from memory_summary_global_by_event_name;
+---------------------------+
| Global memory usage in MB |
+---------------------------+
| 172.50711823              |
+---------------------------+

In combination, MySQL engine (Innodb) [ 8.14GB] and global MySQL events memory usage[172MB] total usage is clocked around 8.3GB at MySQL level.

 But when we observed the Linux Kernel the total memory usage of MySQL process is around 10GB.

$ sudo pmap -x $(pidof mysqld) | tail -n 1
total kB 10395760

So, now the question is,

  • Where the remaining 1.7GB is utilized ?
  • Which process inside MySQL is consuming this memory?
  • Do we have events in Performance Schema to measure this memory leak?

Here is the answer, The remaining 1.7GB is used by the Group Replication Cache. Let’s see how.

3). Group Replication Memory Utilisation:-

Commonly in Group Replication, the primary node will capture the recent events in the memory cache. Using this cache other nodes will get the state transfer. In MySQL version 5.7, there is no variable to control this memory cache utilization, it grows exponentially and also it doesn’t have any events to calculate the memory utilization,

Because of this limitation, we have faced the OOM in MySQL 5.7 GR.

To overcome this issue, the XCOM Cache Management has been introduced from MySQL 8.0.16. From this version, where we can control the cache used by Group Replication using the variable group_replication_message_cache_size. And also MySQL has introduced one more event with the name of ‘GCS_XCom::xcom_cache‘. By fetching the details from this event in the performance_schema.memory_summary_global_by_event_name table, we can get the memory consumption by the Group Replication Cache.

Query:-

mysql>SELECT EVENT_NAME,CURRENT_NUMBER_OF_BYTES_USED FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME LIKE 'memory/group_rpl/GCS_XCom::xcom_cache';
+---------------------------------------+------------------------------+
| EVENT_NAME                            | CURRENT_NUMBER_OF_BYTES_USED |
+---------------------------------------+------------------------------+
| memory/group_rpl/GCS_XCom::xcom_cache | 1071827570                   |
+---------------------------------------+------------------------------+

Conclusion:-

  • Memory Leak in GR 5.7 was due to the Group replication message cache.
  • To avoid this issue,
    • Permanent Fix – Upgrade to MySQL 8.0.16 and get the benefits from the XCOM Cache Management.
    • Temporary Fix – Reduce the Global Buffer Innodb_buffer_pool_size.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s