Troubleshooting an OLAP system on InnoDB

As a part of Mydbops Consulting we have a below problem statement from one of our client.

We have a high powered server for reporting which in turn powers our internal dashboard for viewing the logistics status.Even with a high end hardware, we had a heavy CPU usage and which in turn triggers spikes in replication lag and slowness. Below is the hardware configuration.

OS : Debian 9 (Stretch)
CPU : 40
RAM : 220G (Usable)
Disk : 3T SSD with 80K sustained IOPS.
MySQL : 5.6.43-84.3-log Percona Server (GPL)
Datasize : 2.2TB

Below is the graph on CPU utilisation from Grafana.

Since the work load is purely reporting(OLAP) we could observe a similar type of queries with different ranges. Below is the Execution plan of the query. It is a join query over 6 tables.

Explain Plan:

+----+-------------+-------+--------+--------------------------------------------------------------------------------+---------------+---------+---------------------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------------------------------+---------------+---------+---------------------------------+------+------------------------------------+
| 1 | SIMPLE | cf | ref | PRIMARY,name_x | name_x | 103 | const | 1 | Using where; Using index |
| 1 | SIMPLE | scf | ref | sip_idx,active_idx,flag_idx | flag_idx | 8 | logis.cf.flagId | 5820 | Using where |
| 1 | SIMPLE | sre | ref | staId,sre_idx1,shelfId_hubId,updateDate_statusId_statId_idx,statId_statusId_x | sre_idx1 | 18 | logis.scf.sipId | 1 | Using index condition; Using where |
| 1 | SIMPLE | scam | eq_ref | mappingId | mappingId | 8 | logis.scf.mapId | 1 | NULL |
| 1 | SIMPLE | ssdm | ref | sipIdIdx,SDetailIdIdx | shipmentIdIdx | 17 | logis.sre.sipId | 1 | NULL |
| 1 | SIMPLE | sd | eq_ref | PRIMARY,mrchIdIdx | PRIMARY | 8 | logis.ssdm.SDId | 1 | Using where |
+----+-------------+-------------------------------------------------

With the initial screening it looks normal as it’s a reporting (OLAP) query and its bound to run longer and this has started to bit our system resources (CPU) and the replication lag cause stale or obsolete data in internal dashboard.

As the execution plan depicts the query is using index and the columns used are being perfectly indexed. Only with table “SCF” we could see a scan of “5820 “ , The index used here has a less cardinality.

Now we should tweak the optimizer to choose the right index. Optimizer chooses the index based on the stats collected for the table and stored under mysql.innodb_table_stats and mysql.innodb_index_stats. .

The default value of innodb_stats_persistent_sample_pages and innodb_stats_transient_sample_pages are 20 and 8 respectively, which is too low for our dataset, This works wells for smaller tables, but in our case tables are in few 100’s of GB. We increased the values below globally since its a dynamic variable by 10X approx.

mysql> set global innodb_stats_persistent_sample_pages=200;
Query OK, 0 rows affected (0.00 sec)

mysql> set global innodb_stats_transient_sample_pages=100;
Query OK, 0 rows affected (0.00 sec)

Below is the definition from the official documentation on these variables,

The number of index pages to sample when estimating cardinality and other statistics for an indexed column, such as those calculated by ANALYZE TABLE. Increasing the value improves the accuracy of index statistics, which can improve the query execution plan, at the expense of increased I/O during the execution of ANALYZE TABLE for an InnoDB table.

Now we will have to force the index stats recalculation by running a “Analyze table table_name” on all the table involved in the query or else you can make variables persistent and invoke a DB restart to calculate stats for all the tables , we chose the first method since its less harming.

Let us review the execution plan now, we could see a reduced row scans and better index usage with the optimiser as below:

+----+-------------+-------+--------+--------------------------------------------------------------------------------+---------------+---------+---------------------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------------------------------+---------------+---------+---------------------------------+------+------------------------------------+
| 1 | SIMPLE | cf | ref | PRIMARY,name_x | name_x | 103 | const | 1 | Using where; Using index |
| 1 | SIMPLE | sre | ref | staId,sre_idx1,shelfId_hubId,updateDate_statusId_statId_idx,statId_statusId_x | shelfId_hubId | 9 | const | 2936 | Using where |
| 1 | SIMPLE | scf | ref | sip_idx,active_idx,flag_idx | sip_idx | 22 | logis.sre.shipmentId | 1 | Using index condition; Using where |
| 1 | SIMPLE | scam | eq_ref | mappingId | mappingId | 8 | logis.scf.mappingId | 1 | NULL |
| 1 | SIMPLE | ssdm | ref | shipmentIdIdx,sellerDetailIdIdx | sipIdIdx | 17 | logis.sre.shipmentId | 1 | NULL |
| 1 | SIMPLE | sd | eq_ref | PRIMARY,merchantIdIdx | PRIMARY | 8 | logis.ssdm.sellerDetailId | 1 | Using where |
+----+-------------+-------+--------+--------------------------------------------------------------------------------+---------------+---------+---------------------------------+------+------------------------------------+

Note the new index plan is applicable for the new incoming queries. Within a short span of time the CPU usage has dropped down drastically there is huge boost in performance, please find the graph below. Now the dashboards are faster than ever before.

Key Takeaways:

Below are points to note while setting this variable.

  • Too high value can result in longer time for stats calculation .
  • Too low can have inaccurate stats and leads to a situation discussed above
  • As the table grows, InnoDB allegedly re-ANALYZEs ie., re-calculates stats after 10% growth, so no manual action needed.
  • Most tables have decent stats from sampling 20 pages (Smaller tables)
  • Tables with uneven distribution won’t benefit from changing the ’20’, to tackle that we have “Histograms” from MySQL 8.0

One thought on “Troubleshooting an OLAP system on InnoDB

  1. The big secret to good olap performance on mysql are materialized views and query parallelism. Check out Flexviews for the former and shard-query for the latter.

    Liked by 1 person

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s