Aurora has a salient feature “Parallel query“, Which will be more beneficial for analytical workload environments.
Before going to deep dive on this particular feature, let us understand the basis of Aurora.
- The key feature of Aurora is simple data synchronisation among the nodes. The sync latency will be too low when compared to RDS because the synchronisation is happening on storage volumes among the nodes. Also all the server will available in different zone, even when a zone goes down we can able to maintain will other server present in other zone with auto failure.
- Auto healing volume, Each volume has it’s own backup volume. So if the volume got corrupted, the instance will up with the backup volume.
- The writer node will support writes and reads, But the reader node supports only the reads
- We can have up to 15 reader node, where as in RDS we can have only 5 replica server.
- Automated failure , when a writer node goes down for any reason, the reader node will automatically promote as a writer node. And also we can set the priority for the reader node to become a writer, if failover occurs.
- Automatic storage sccaling, the storage will automatically get increase and will reduce back once the need if done.
- Auto-scaling of reader node, when there is a load has been increased in server, the cluster will automatically create a new reader and will add to the cluster until the load reduce back to normal. Post that , newly created instance will be auto-terminated. We are handle this by custom endpoint in aurora as well, the newly created instance will be auto joined to the cluster endpoint for query routing.
- So far we have seen what is the main feature of aurora, now let’s how it differ from native MySQL.
Prominent Aurora only feature
- In native MySQL, hash join is enabled in MySQL 8.0.18, But in Aurora has the hash join feature is available in MySQL 5.7 .
- Parallel query is not available with native MySQL.
- Storage autoscaling and reader auto scaling is not available in native MySQL.
- In the same server it won’t split the query. Instead of that, it will split the query among the cluster server and it will organise the data in the same server and will give the output.
- When sharing the data among the network, it won’t send whole table instead of that it will send only the required value (based on the query condition) among the among the network.
Use cases of Parallel Query
- It works better when queries have RANGE or IN or EQUAL (=) conditions.
- It is more optimised for analytical query, since it split the query among the nodes and then merge the result set.
- Due the splitting of query among the nodes, the resource utilisation is uniform among the nodes and it avoids heavy utilisation on a single node.
To enable this parallel query feature, we need to satisfy some needs.
- Aurora version should be greater than 2.09 or 1.23.
- Instance type should be R-series, other instance types are not supported.
- The tables should be non-partitioned, If any table has partition parallel query won’t support for that.
- We need to hash join optimisation for the same.
To enable this, we need to enable these two variables.
- aurora_parallel_query = ON
- aurora_disable_hash_join = OFF
Both of the variables are dynamic variables only, So we can enable it without any downtime. By default parallel query was disabled and aurora_disable_hash_join will be ON ( enabled ).
Of course, every feature has some limitations,
- Parallel query won’t work if the row format is compressed, it will work with dynamic row format only.
- It will work for all the tables, if the table size is smaller it will bypass the parallel query.
- Concurrent parallel query depends on the instance type, i.e if the instance type is r3.large only one parallel query is supported, if the instance type is r3.xlarge only 2 parallel query is supported.
- The aurora version should be > 2.09.
In the next blog, we will see how the performance has been increased by using a parallel query.