Backups may seem like a complicated subject, but they can be distilled into a single multi-variable question: what $DATA is so important that you cannot afford to be without it for longer than $TTR, or to lose more than $RPO of it?
Throughout this post, TTR stands for the time-to-recovery objective and RPO for the recovery point objective.
Assuming you are willing to segregate data into classes based on TTR and RPO requirements (and you should), you can answer this question along the following lines (a sketch of how such a classification might be expressed follows the examples):
My business will suffer untold harm if I am unable to access my {order data} for more than {2 hours} and I lose more than {10 minutes} of data.
My business will suffer significant harm if I am unable to access my {blog comments} for more than {12 hours} and I lose more than {6 hours} of data.
My business will suffer significant harm if I am unable to access my {financial data} for more than {20 minutes} and I lose more than {0 minutes} of data.
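Once you have statements like these, the classification can be written down and used to drive the rest of the plan. Here is a minimal sketch in Python; the class names, TTR, and RPO values simply restate the examples above, and the cadence rule is an illustrative assumption rather than a prescription:

    # A minimal sketch of a data-classification table driving backup cadence.
    # The class names, TTR, and RPO values are hypothetical examples, not recommendations.
    from dataclasses import dataclass
    from datetime import timedelta

    @dataclass
    class DataClass:
        name: str
        ttr: timedelta   # maximum tolerable time to recovery
        rpo: timedelta   # maximum tolerable data loss

    CLASSES = [
        DataClass("order_data",     ttr=timedelta(hours=2),    rpo=timedelta(minutes=10)),
        DataClass("blog_comments",  ttr=timedelta(hours=12),   rpo=timedelta(hours=6)),
        DataClass("financial_data", ttr=timedelta(minutes=20), rpo=timedelta(minutes=0)),
    ]

    for c in CLASSES:
        # The RPO caps how far apart backups (or binary log ships) may be;
        # an RPO of zero implies a synchronous standby rather than periodic copies.
        if c.rpo == timedelta(0):
            print(f"{c.name}: needs a synchronous standby; periodic backups alone cannot meet RPO=0")
        else:
            print(f"{c.name}: back up (or ship binary logs) at least every {c.rpo}")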
Many customers take the position that they cannot afford any downtime or data loss. While that is an ideal to aspire to, the cost of building and maintaining the database and backup infrastructure necessary to achieve it would far outstrip the lost revenue or reputational damage that could result from a more realistic definition of $TTR and $RPO. There are, of course, exceptions, financial data among them. I for one would find it unacceptable for my banking institution to lose 15 minutes of my transactions, but merely annoying to lose a tweet on Twitter.
Keep in mind that setting realistic targets doesn't mean you will ever actually lose $RPO of data or be down for $TTR. These are simply the worst-case targets we use when designing a backup solution.
It is also valuable to understand what sorts of failures you are mitigating while you develop your backup and recovery plan. There need to be plans for city-level failure, datacenter failure (Seattle's Fisher Plaza fire caused 30 hours of downtime for that entire facility), network failure, and server failure. Above all, a successful backup plan must address the more common but less spectacular failures, such as block-level corruption and user error.
The simplest backups copy data to removable media and ship it to an offsite location, where it remains for a specified period of time before the media is rotated back into service and overwritten. This not only achieves the basic goals of protecting against corruption and user error, but also provides a copy of the data in an alternate location to protect against total site failure (the tsunami/earthquake/sinkhole scenario).
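The rotation itself is simple bookkeeping. For illustration only (the retention period and set count below are made-up numbers):

    # A toy sketch of a fixed-length offsite rotation: media sets cycle back into
    # service once the retention period has passed. Numbers are illustrative only.
    RETENTION_WEEKS = 4          # how long a set stays offsite before reuse (assumption)

    def media_set_for_week(week_number: int) -> int:
        """Return which removable-media set to write (and overwrite) this week."""
        return week_number % RETENTION_WEEKS

    for week in range(1, 9):
        print(f"week {week}: write backup to media set {media_set_for_week(week)}")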
MySQL provides multiple levels of backup protection. In the most basic cases, for databases with very low data change velocity, a logical mysqldump may be sufficient. If you are truly concerned about data loss and recoverability, backups of the InnoDB storage engine's datafiles combined with binary log backups deliver more robust and comprehensive recovery capabilities.
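As a rough sketch (the paths, credentials, and choice of options are placeholders, not recommendations), a simple scripted logical backup might look like this:

    # Sketch of a logical backup with mysqldump; connection details and file paths
    # are placeholders. --single-transaction gives a consistent InnoDB snapshot
    # without locking, and --master-data records the binary log position so the
    # dump can be rolled forward with binary logs to narrow the RPO window.
    import subprocess
    from datetime import datetime

    dump_file = f"/backups/mysqldump-{datetime.now():%Y%m%d%H%M}.sql"

    with open(dump_file, "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction", "--master-data=2",
             "--all-databases", "--user=backup", "--password=secret"],  # placeholder credentials
            stdout=out, check=True)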
At Blue Gecko, we often implement MySQL backups using a standby database (which is not a backup in and of itself) with LVM volumes to take snapshots of the datafiles for backup. The standby not only provides a failover in the event of database or hardware failure, but also allows us to quiesce the database for a consistent backup. In MySQL, quiescence can be achieved either by stopping the database for an interval or by locking all the tables in order to stop all write activity.
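To make the mechanics concrete, here is a rough sketch of that quiesce-and-snapshot sequence on a standby server. The volume names, snapshot size, credentials, and the choice of the pymysql client library are all assumptions; a production script would also need error handling and snapshot cleanup:

    # Rough sketch of a quiesce-and-snapshot backup on a standby server.
    # Volume group/LV names, snapshot size, and credentials are placeholders;
    # pymysql is just one of several client libraries that could issue the statements.
    import subprocess
    import pymysql

    conn = pymysql.connect(host="standby-db", user="backup", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("FLUSH TABLES WITH READ LOCK")   # stop all write activity
            cur.execute("SHOW MASTER STATUS")            # note binlog position for point-in-time recovery
            print("binlog position:", cur.fetchone())
            # Take the LVM snapshot while the database is quiesced.
            subprocess.run(
                ["lvcreate", "--snapshot", "--size", "10G",
                 "--name", "mysql_backup_snap", "/dev/vg0/mysql_data"],
                check=True)
            cur.execute("UNLOCK TABLES")                 # writes resume; the lock is held only briefly
    finally:
        conn.close()
    # The snapshot can now be mounted and copied offsite at leisure, then removed with lvremove.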
Hot (open) backups of InnoDB datafiles can be taken using InnoDB Hot Backup, a commercial product, or XtraBackup, an open source alternative from Percona. MySQL hot backups are not a panacea, however: single-table or single-schema recoveries can be challenging for large databases, since the entire database must be restored before a potentially small subset of data can be extracted.
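For comparison, a hot physical backup might be driven like this; the directories and credentials are placeholders, and the options shown reflect the innobackupex wrapper as commonly used, so check your version's documentation:

    # Sketch of a hot (open) InnoDB backup using XtraBackup's innobackupex wrapper.
    # Credentials and directories are placeholders.
    import subprocess

    backup_dir = "/backups/xtrabackup"

    # Stream the datafiles while MySQL stays open for reads and writes.
    subprocess.run(
        ["innobackupex", "--user=backup", "--password=secret", backup_dir],
        check=True)

    # Before a restore, the copied datafiles must be made consistent by replaying
    # the InnoDB log captured during the backup (the "apply log" / prepare step):
    # subprocess.run(["innobackupex", "--apply-log", backup_dir + "/<timestamped-dir>"], check=True)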
When developing a backup plan, consider the following:
- How long can you truly afford to be down?
- How current must the recovered data be?
- Which storage engine(s) are you using?
- How important is complete consistency?
- Which failure cases are most common in your systems?
- How much can you afford to spend on the infrastructure?
The role of operations and engineering is only partially about preventing failure. The rest should be about planning so that in the event of a failure there is a graceful path to recovery.
Backup Best Practices
- Know your data: its volatility, importance, and availability requirements.
- Segregate your data into classes based on those factors.
- Define backup plans based on those factors.
- Back up volatile and important data frequently.
- Create standby databases for data that needs to be highly available.
- Back up and offsite everything weekly at a minimum.
- Back up and offsite the more volatile and important data more frequently.
- Test restores (a sketch of an automated restore test follows this list).
- Test failure cases.
- Test restores.
- And lastly, test restores.
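Restore testing can be automated. As a minimal sketch, a scheduled job might load the latest logical backup into a scratch instance and run a sanity check; the host names, file paths, and the orders table queried here are purely hypothetical:

    # Minimal sketch of an automated restore test: load the latest logical backup
    # into a scratch MySQL instance and run a sanity check. All names are placeholders.
    import glob
    import subprocess
    import pymysql

    latest_dump = sorted(glob.glob("/backups/mysqldump-*.sql"))[-1]

    # Restore into a throwaway instance dedicated to restore testing.
    with open(latest_dump, "rb") as dump:
        subprocess.run(
            ["mysql", "--host=restore-test-db", "--user=backup", "--password=secret"],
            stdin=dump, check=True)

    # Sanity check: the restored data should contain recent rows, not just load cleanly.
    conn = pymysql.connect(host="restore-test-db", user="backup",
                           password="secret", database="orders")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY")
        (recent_rows,) = cur.fetchone()
        print("rows created in the last day:", recent_rows)
    conn.close()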