This is a time-honored topic, and there’s no shortage of articles on the topic on this blog. I wanted to write a post trying to condense and clarify those posts, as it has taken me a while to really understand this relationship.
Some basic facts
- Most of us know that writing into Innodb updates buffer pool pages in memory and records page operations in the transaction (redo) log.
- Behind the scenes those updated (dirty) buffer pool pages are flushed down the to the tablespace.
- If Innodb stops (read: crashes) with dirty buffer pool pages, Innodb recovery must be done to rebuild the last consistent picture of the database.
- Recovery uses the transaction log by redoing (hence the name ‘redo log’) the page operations in the log that had not already been flushed to the tablespaces.
Ultimately this mechanism was an optimization for slow drives: if you can sequentially write all the changes into a log, it will be faster to do on the fly as transactions come in than trying to randomly write the changes across the tablespaces. Sequential IO trumps Random IO.
However, even today in our modern flash storage world where random IO is significantly less expensive (from a latency perspective, not dollars), this is still an optimization because the longer we delay updating the tablespace, the more IOPs we can potentially conserve, condense, merge, etc. This is because:
- The same row may be written multiple times before the page is flushed
- Multiple rows within the same page can be written before the page is flushed
Innodb Log Checkpointing
So, first of all, what can we see about Innodb log checkpointing and what does it tell us?
mysql> SHOW ENGINE INNODB STATUS\G --- LOG --- Log sequence number 9682004056 Log flushed up to 9682004056 Last checkpoint at 9682002296
This shows us the virtual head of our log (Log sequence Number), the last place the log was flushed to disk (Log flushed up to), and our last Checkpoint. The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion. Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting our ‘Log sequence number’ from the ‘Last checkpoint at’ value. More on what a Checkpoint is in a minute. If you use Percona server, it does the math for you by including some more output:
--- LOG --- Log sequence number 9682004056 Log flushed up to 9682004056 Last checkpoint at 9682002296 Max checkpoint age 108005254 Checkpoint age target 104630090 Modified age 1760 Checkpoint age 1760
Probably most interesting here is the Checkpoint age, which is the subtraction I described above. I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without blocking client operations in Innodb to flush dirty buffers. Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.
Remember our transaction logs are circular, and the checkpoint age represents how far back the oldest unflushed transaction is in the log. We cannot overwrite that without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until the space is available to continue (safely) writing in the log.
Dirty Buffer Pool Pages
On the other side, we have dirty buffers. These two numbers are relevant from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS:
Database pages 65530 ... Modified db pages 3
So we have 3 pages that have modified data in them, and that (in this case) is a very small percentage of the total buffer pool. A page in Innodb contains rows, indexes, etc., while a transaction may modify 1 or millions of rows. Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.
As I said before, dirty pages are flushed to disk in the background. The order in which they are flushed really has little to nothing to do with the transaction they are associated with, nor with the position associated with their modification in the transaction log. The effect of this is that as the thread managing the dirty page flushing goes about its business, it is not necessarily flushing to optimize the Checkpoint age, it is flushing to try to optimize IO and to obey the LRU in the buffer pool.
Since buffers can and will be flushed out of order, it may be the case that there are a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with said transaction are clean), but there still could be older transactions that are not flushed. This, in essence, is what fuzzy checkpointing is.
The checkpoint process is really a logical operation. It occasionally (as chunks of dirty pages get flushed) has a look through the dirty pages in the buffer pool to find the one with the oldest LSN, and that’s the Checkpoint. Everything older must be fully flushed.
The main reason this is important is if the Checkpoint Age is not a factor in dirty buffer flushing, it can get too big and cause stalls in client operations: the algorithm that decides which dirty pages to flush does not optimize for this [well] and sometimes it is not good enough on its own.
So, how can we optimize here? The short of it is: make innodb flush more dirty pages. However, I can’t help but wonder if some tweaks could be made to the page flushing algorithm to be more effective there in choosing older dirty pages. It is clear how that algorithm works without reading the source code.
There are a lot of ways to tune this, here is a list of the most signficant, roughly ordered from oldest to newest, and simultaneously listed from least effective to most effective:
- innodb_max_dirty_pages_pct: This attempts to keep the percentage of dirty pages under control, and before the Innodb plugin this was really the only way to tune dirty buffer flushing. However, I have seen servers with 3% dirty buffers and they are hitting their max checkpoint age. The way this increases dirty buffer flushing also doesn’t scale well on high io subsystems, it effectively just doubles the dirty buffer flushing per second when the % dirty pages exceeds this amount.
- innodb_io_capacity: This setting, in spite of all our grand hopes that it would allow Innodb to make better use of our IO in all operations, simply controls the amount of dirty page flushing per second (and other background tasks like read-ahead). Make this bigger, you flush more per second. This does not adapt, it simply does that many iops every second if there are dirty buffers to flush. It will effectively eliminate any optimization of IO consolidation if you have a low enough write workload (that is, dirty pages get flushed almost immediately, we might be better off without a transaction log in this case). It also can quickly starve data reads and writes to the transaction log if you set this too high.
- innodb_write_io_threads: Controls how many threads will have writes in progress to the disk. I’m not sure why this is still useful if you can use Linux native AIO. These can also be rendered useless by filesystems that don’t allow parallel writing to the same file by more than one thread (particularly if you have relatively few tables and/or use the global tablespaces) *cough ext3 cough*.
- innodb_adaptive_flushing: An Innodb plugin/5.5 setting that tries to be smarter about flushing more aggressively based on the number of dirty pages and the rate of transaction log growth.
- innodb_adaptive_flushing_method: (Percona Server only) This adds a few new different algorithms, but the more effective ones adjust the amount of dirty page flushing using a formula that considers the Checkpoint age and the Checkpoint age target (something you can set in Percona Server, otherwise it is effectively just a hair under the Max Checkpoint age). The two main methods here would be ‘estimate’ (good default) and ‘keep_average’ designed for SSD type storage. Running Percona Server and using this method is my go-to setting for managing Checkpoint age.
PlanetMySQL Voting: Vote UP / Vote DOWN