Recently one of our customers ran into an issue, wherein a bad actor(code) from the application had made the wrong update to 16 M records of a critical table in the database, causing the entire production process to go down. The application Team was able to find the bad actor and block it, our Remote DBA was involved in the Data Recovery/Rollback.
Here I would like to discuss possible recovery methods for the above said scenario
A simple and effective way to recover is by using a delayed slave, RDS started supporting this feature from version 5.6.40 and 5.7.22 i.e., you can induce a SQL thread delay-interval for applying the writes to a slave, detailed implementation is covered in our blog here. It’s quite simple if the impact or the wrong update/delete from the application found within the delayed interval defined, just proceed to pause the replication in slave and recover the data. This option is crossed-out in our case since we don’t have one.
Recovery From Binlog:
Binlogs are very trivial in MySQL as they are the heart of replication as well as the source of point in time recovery irrespective of the engine used. With RDS binlogs are purged as soon as possible and also once the replicas are in sync. To increase the retention of binlogs set the retention adequately based on the requirement and disk availability as below
call mysql.rds_set_configuration('binlog retention hours', 24);
When you have enough retention of binlogs and Binlog row image set to FULL, you can proceed to make a flash-back i.e, Rollback of the updates as mentioned here. This option is also crossed-out since binlogs was purged
Now we had to fall back on the “Point-In-Time” recovery with MySQL RDS.
How Point In Time Recovery (PTIR) in RDS works?
This works in combination with both the automated backups(snapshots) as well as binlogs. Binlogs are backed up every 5 mins to S3(AWS object store) by default in RDS when you have it enabled in fact backup and binlogs are interdependent with RDS, and also backup retention(max 35 days) defines the max time you can go back in time with your data.
For example, say you have the backup retention of 7 days ie., you can move back and forth in time with data within 7 days interval. Lets say now you require to restore to a point in time say 10.00 Am on day 5 below are the steps it follows
- The nearest full backup(automated) would be restored first ie., day 4 full backup taken at 12:00 AM
- Now the incremental data gets applied by restoring the Binlogs post the full backup to the requested time ie., from 12:00 Am to 10:00 AM
I will now walk you through the steps that we followed for restoring a production instance
- Step 1
- Choose the instance below for which you need to do a point-in-time recovery
- Step 2
- Now click Actions —-> Restore to point in time as shown below
- Step 3
- Launch DB instance Wizard will appear as below, you can see “Last restorable time” defines the max time to which recovery is possible and “Custom” help you in choosing the custom time within the restorable window ie., within backup retention set
- Step 4
- Choose “Custom” and you can note the date and time wizard. Set the time to the point based on the requirement to have your data restored..
- Step 5
- Now proceed to chose a name for the instance i have made it as “recovery” engine type would be same as the source instance.
- Step 6
- You can also proceed to choose the storage, connectivity and log options am not mentioning all the options here , finally proceed with the “continue” option at the bottom
Now you can see the backup getting restored to a new instance with the provided name. Restoration time greatly depend upon the number of binlogs to be applied ie., lesser the number of binlogs to be applied between the snapshot to recovery time faster would be the restore. On successful restore, you can see the below event notification “Restored DB instance instance_name to time”
We were able to successfully recover/rollback the data which was updated wrongly on to a new instance and restore it back to the production
- With PITR which greatly reduces RPO(Recovery Point Object) of RDS to 5 mins, and also comes with some cost.
- Always make sure to have the durability settings enabled with RDS for best results with PITR again it comes with a performance penalty on IO.
- The higher the number of backup retention greater is the ability to move much longer back in time again it adds storage cost for snapshots.