“A MySQL to be explored” — that is what I felt when I worked with Aurora, and here I am going to share my experience with it.
Recently we had a call for MySQL consulting from one of the famous fantasy gaming sites in India, stating that “the database (MySQL) is our bottleneck”; they wanted it stabilised before this year’s (2017) major league match.
As a fast-growing startup they have hosted their infra on the cloud (AWS). On further discussion it was revealed that they were using AWS Aurora MySQL for their production DB. The next question from our consulting team was why Aurora, and their reasons were:
- Near-synchronous data across nodes, with replication lag mostly under 100 ms.
- Auto scaling and performance.
- High availability of the database.
- Support for up to 15 read replicas.
Most of the above justifies their choice of Aurora MySQL: as a startup, it reduces operational overhead and eases management.
Their data team and the Mydbops consultants sat together and discussed their DB architecture and performance issues.
They were on the highest band of the Aurora family (db.r3.8xlarge), with 244 GB RAM and 32 vCPUs and storage auto-extended by Aurora. They were running an Aurora cluster of one master and three read replicas, with Redis as the caching layer.
Though they were on the highest possible hardware (db.r3.8xlarge), they faced the issues below:
- Slowness at High Concurrency
- Occasional Stalls and DB Reboot.
- Increased Response Time
- IO Spike at 3K QPS.
- Need for a better archival policy
DB Size: 300 GB (of course, InnoDB only)
Monitoring & Graphing of Aurora Cluster:
The first thing I did on my first day of consulting was to set up a monitoring and performance-graphing solution that could help with proactive monitoring and with capacity planning of the hardware. I would recommend simple Nagios for alerting and PMM (Percona Monitoring & Management) for performance graphing and analysis of MySQL. PMM does a pretty good job on AWS alongside CloudWatch.
CloudWatch and Aurora metrics also provide good insight into performance stats. I have also customised a few Nagios plugins for Aurora monitoring.
Performance Benefits of Aurora:
I analysed the server and suggested a few major variable-level changes, and also partitioned their hot tables. Then I made changes to queries (mostly at the procedure level) to ensure use of partition pruning, indexed a few columns, and proposed some schema changes too. These settings were extensively load tested before being taken into production. We were able to achieve a QPS of 196K with their custom load test using their own application queries, which might not be feasible on other MySQL cloud solutions. The DB was able to handle a concurrency of 60K queries per second (QPS), a huge performance gain from their previous record of 3K QPS, with lower resource consumption and without any major changes to the apps. It was a huge win for our consulting. Below is a DB performance snippet from a peak production hour (on the writer) after implementing the proposed changes.
The database size is now 750 GB (and growing), and it is scaling well.
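To give a flavour of the partitioning and partition-pruning work mentioned above, here is a hedged sketch; the table, columns, and partition ranges are hypothetical and not the client’s actual schema.

```sql
-- Illustrative hot table, RANGE-partitioned by date so that old data
-- can later be archived by dropping whole partitions cheaply.
-- On a partitioned InnoDB table the partitioning column must be part
-- of every unique key, hence the composite primary key.
CREATE TABLE player_scores (
  id         BIGINT NOT NULL AUTO_INCREMENT,
  match_date DATE   NOT NULL,
  score      INT    NOT NULL,
  PRIMARY KEY (id, match_date)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(match_date)) (
  PARTITION p201706 VALUES LESS THAN (TO_DAYS('2017-07-01')),
  PARTITION p201707 VALUES LESS THAN (TO_DAYS('2017-08-01')),
  PARTITION p201708 VALUES LESS THAN (TO_DAYS('2017-09-01'))
);

-- Queries must filter on the partitioning column for pruning to apply.
-- On MySQL 5.6, EXPLAIN PARTITIONS shows which partitions are scanned.
EXPLAIN PARTITIONS
SELECT SUM(score)
FROM player_scores
WHERE match_date BETWEEN '2017-07-01' AND '2017-07-31';
```

If the `partitions` column of the EXPLAIN output lists only `p201707`, pruning is working; the procedures were adjusted so their WHERE clauses always carry the partitioning column.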
Tool sets in Aurora:
Aurora has more customised and richer tool sets that make things easier:
- Aurora metrics through the AWS console.
- Simulation of failover scenarios.
- Automated backups and incremental backups.
- Aurora status variables.
- CloudWatch stats for Aurora.
Load Balancing in Aurora:
Aurora provides two endpoints, a writer endpoint and a reader endpoint. By connecting to the writer endpoint (i.e., the master DB) we can perform both reads and writes. Aurora itself does not support read-write splitting. Please find below a snippet showing the writer and reader endpoints.
So for efficient usage of Aurora, reads and writes should be segregated at the application level.
Read-write splitting for an Aurora cluster is provided by the MariaDB MaxScale load balancer from version 2.1; MaxScale 2.1 (beta) also provides a module for efficient monitoring of the Aurora cluster. Under the BSL, MaxScale supports only three instances (1 writer + 2 readers) without a licensing fee.
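A minimal MaxScale 2.1 configuration along these lines wires the readwritesplit router together with the Aurora monitor module. Server names, credentials, and ports below are placeholders, and the exact option names should be checked against the MaxScale 2.1 documentation.

```ini
# Backend servers: the Aurora writer and a reader (addresses are placeholders).
[node1]
type=server
address=aurora-node-1.example.com
port=3306
protocol=MySQLBackend

[node2]
type=server
address=aurora-node-2.example.com
port=3306
protocol=MySQLBackend

# Aurora-aware monitor module, introduced in MaxScale 2.1.
[Aurora-Monitor]
type=monitor
module=auroramon
servers=node1,node2
user=maxscale_monitor
passwd=change_me

# Read-write splitting service: writes go to the writer, reads to replicas.
[RW-Split-Service]
type=service
router=readwritesplit
servers=node1,node2
user=maxscale
passwd=change_me

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MySQLClient
port=4006
```

Applications then connect to MaxScale on port 4006 as if it were a single MySQL server, and the router directs each statement to the writer or a reader.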
A few Drawbacks noted:
Threads sleep beyond ‘wait_timeout’:
Sleeping connections inside Aurora sleep longer than the default wait_timeout of 28800 seconds (8 hours); we have seen queries inside Aurora sleeping for more than 95K seconds on Aurora version 1.11. This has recently been fixed in Aurora version 1.12. Moreover, the value of “wait_timeout” is not configurable, which makes it more painful: the sleeping threads keep holding memory, a real problem for a high-concurrency system like this one.
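Since wait_timeout cannot be lowered on these Aurora versions, one possible workaround (not something the Aurora docs prescribe) is to find and kill long-sleeping connections by hand; a sketch:

```sql
-- List connections sleeping longer than the default wait_timeout (8 hours).
SELECT id, user, host, time AS seconds_asleep
FROM information_schema.processlist
WHERE command = 'Sleep'
  AND time > 28800
ORDER BY time DESC;

-- Kill an offending connection by id (uncomment with a real id):
-- KILL 12345;
```

This could be cron-scheduled against the writer endpoint, though the cleaner fix is to make the application close or pool its connections properly.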
Memory Issues and Buffer Pool:
The first thing we noticed was the memory allocation for the variables: innodb_buffer_pool_size was set to just 118 GB out of 244 GB of RAM, which is less than 50% of the available memory. When we enquired about this, the reply was that it was the suggestion from the Amazon team, because freeable memory goes down at high concurrency, which ultimately leads to a reboot of the cluster when freeable memory drops below 5%.
We then suggested keeping it at 70%, and they finally agreed to set it to 65% (i.e., 153 GB) of the available RAM, which ultimately gave a great performance gain.
But the memory is still not released back to the OS; freeable memory goes down gradually and is never released back, which forced occasional cluster restarts during off-production hours to regain memory.
As you can see above, after the restart around ~45 GB of memory was regained, and it gets reduced gradually over time.
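After each parameter group change (and after a restart), it is worth verifying the effective setting from the MySQL side rather than trusting the console alone; for example:

```sql
-- Quick sanity check of the effective buffer pool size, in GB
-- (the variable itself is reported in bytes).
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;
```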
Disabling Query Cache:
The query cache has been heavily modified by the Aurora team, which is a great performance benefit for reads. In our case the query cache was configured to use 10 GB of RAM; since we had a lot of procedure calls, we decided to disable the query cache, and it was disabled by editing the cluster parameter group in the AWS console. But the problem was that whenever we rebooted the cluster, query_cache_size got 10 GB allocated again, in spite of query_cache_size being set to 0 in the parameter group. We then had to apply the parameter group again to revert the size to zero. This was with Aurora version 1.11 and is fixed in Aurora 1.12; people on Aurora 1.11 will still face this issue.
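A minimal check for anyone still on Aurora 1.11: after every reboot, confirm that the query cache really stayed disabled.

```sql
-- On Aurora 1.11, verify the query cache settings after each reboot; if
-- query_cache_size has silently reverted to ~10G, re-apply the cluster
-- parameter group value (query_cache_size = 0) and check again.
SHOW GLOBAL VARIABLES LIKE 'query_cache%';
```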
Replication Lag and Reader Reboot:
Aurora states that replication lag is normally under 100 ms for the underlying replicas, which is great: almost synchronous with the master. This is achieved by shipping the transactional log, one of the greatest features of Aurora. The replicas sit on the same shared storage as the master, hence shipping the log is very fast and achieves this minimal replication lag. However, while trying to create partitions for an existing large table using pt-osc, we faced a read-replica reboot during the drop-and-swap of the table (DDL). On checking, we found the replication lag had increased beyond 50 seconds, and ultimately the replica nodes were rebooted to apply the data changes (transactional logs).
DDLs are not completely online in Aurora. Even a simple empty partition can trigger replication lag and a reader reboot. When a reader reboot happens, a master switch is also triggered, and it can cause a downtime of 60 to 120 seconds. This happened in our case with the addition of an empty partition. So it is better to schedule any DDL changes during off hours to avoid these surprises.
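For reference, the kind of DDL that triggered the lag and reader reboot was as trivial as adding one empty partition; the table and partition names below are illustrative:

```sql
-- Adding a single empty partition to a RANGE-partitioned table; even this
-- caused replication lag beyond acceptable limits and a reader reboot,
-- so such changes were moved to off-peak hours.
ALTER TABLE player_scores
  ADD PARTITION (PARTITION p201709 VALUES LESS THAN (TO_DAYS('2017-10-01')));
```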
Tuning of IO variables:
Aurora comes with standard SSD storage, which provides high IOPS, but innodb_io_capacity is left at the default of 200 and innodb_io_capacity_max at 2000; these are defaults chosen for magnetic disks, and the values are not modifiable in Aurora.
mysql> show global variables like 'innodb_io_capa%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_io_capacity     | 200   |
| innodb_io_capacity_max | 2000  |
+------------------------+-------+
Also, the InnoDB read and write IO thread counts are not modifiable in Aurora.
While checking from the MySQL command line, we found the variables set as below:
mysql> show global variables like 'innodb_write_io_threads';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_write_io_threads | 4     |
+-------------------------+-------+
1 row in set (0.00 sec)

mysql> show global variables like 'innodb_read_io_threads';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_read_io_threads | 64    |
+------------------------+-------+
The read IO thread count is set much higher; I believe this is part of how they claim 5x speed on reads, though that claim is not entirely true.
More on InnoDB IO capacity: see Sveta’s post on the Percona blog.
Aurora is built on MySQL 5.6.10, which was released about 3.5 years ago, but the internal Aurora version is heavily patched and bug-fixed. If you just check the version, it will be shown as:
mysql> select version();
+------------+
| version()  |
+------------+
| 5.6.10-log |
+------------+
The Aurora version is shown as below:
mysql> show global variables like 'aurora_version';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| aurora_version | 1.12  |
+----------------+-------+
The recent Aurora version (1.12), released on 5th April 2017, comes with bug fixes and supports faster DDL. Version 1.12 also came with changes to the data dictionary that affected the optimiser plans of most queries; a new patch was released by the Aurora team shortly after, a quick response from AWS Support.
Support for MySQL 5.7 ?
It has been quite a long time since MySQL 5.7 was released, with enhanced performance features in the optimiser, improved multi-threaded replication (Aurora synchronisation is different), JSON data type support, geo data types, virtual columns, and many more enhanced features. It is still a big topic of debate as to when these features will be supported in Aurora MySQL.
Aurora is a good piece of engineering for MySQL on the cloud (the best solution available), and it is very easy to get started with. Of course it has a few setbacks, but it should overcome these in the fast-evolving MySQL ecosystem. It is a MySQL to be explored in depth by DBAs.
Safe Harbour: the views expressed are my own and not those of my employer.