While doing a performance audit for a customer a few weeks ago, I tried to improve the response time of their top slow query according to pt-query-digest's report. This query was run very frequently and had very unstable performance: during the time data was collected, its response time varied from 50µs to 1s.
When I ran the query myself (a two-table join with a WHERE clause; the whole dataset was in memory), I always got a consistent response time of about 160ms. Of course, I wanted to know more about how MySQL executed this query, so I used commands you're probably familiar with: EXPLAIN, SHOW PROFILE, and SHOW STATUS LIKE 'Handler%'.
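For reference, this is roughly the session I used to gather those diagnostics. It is only a sketch: the table and column names are made up, since the real statement was a two-table join with a WHERE clause that I cannot reproduce here.

SET profiling = 1;

-- Hypothetical stand-in for the customer's two-table join.
EXPLAIN SELECT SQL_NO_CACHE a.*, b.*
FROM t1 a JOIN t2 b ON b.t1_id = a.id
WHERE a.col = 42;

FLUSH STATUS;                  -- zero the session Handler% counters
SELECT SQL_NO_CACHE a.*, b.*
FROM t1 a JOIN t2 b ON b.t1_id = a.id
WHERE a.col = 42;
SHOW STATUS LIKE 'Handler%';   -- how rows were actually fetched
SHOW PROFILES;                 -- find the query ID of the SELECT above
SHOW PROFILE FOR QUERY 3;      -- adjust the ID to match SHOW PROFILES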
EXPLAIN and the Handler counters only confirmed that the execution plan seemed reasonable and that the fields were correctly indexed.
With SHOW PROFILE, I saw that most of the time was spent sending the result set to the client, which was not surprising as the result set was around 30,000 rows:
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000075 |
| checking permissions | 0.000004 |
| checking permissions | 0.000004 |
| Opening tables       | 0.000021 |
| System lock          | 0.000009 |
| init                 | 0.000022 |
| optimizing           | 0.000013 |
| statistics           | 0.000075 |
| preparing            | 0.000016 |
| executing            | 0.000003 |
| Sending data         | 0.162272 |
| end                  | 0.000008 |
| query end            | 0.000004 |
| closing tables       | 0.000032 |
| freeing items        | 0.000035 |
| logging slow query   | 0.000004 |
| cleaning up          | 0.000005 |
+----------------------+----------+
So the unstable response times did not come from a bad execution plan, but rather from contention or excessive waiting somewhere in the server. A good candidate for such contention was the query cache, as it was enabled: contention on the query cache mutex when checking whether a result set can be served from the cache is quite common.
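Checking whether the query cache is in play at all is straightforward; these are the standard variable and counter names (output omitted here):

SHOW GLOBAL VARIABLES LIKE 'query_cache%';   -- is it enabled, and how big is it?
SHOW GLOBAL STATUS LIKE 'Qcache%';           -- hits, inserts, prunes, free memory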
Then I remembered that I had run the query for SHOW PROFILE and SHOW STATUS LIKE 'Handler%' with the SQL_NO_CACHE hint. Would the output be different without SQL_NO_CACHE?
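To compare the two cases, it is enough to profile the same statement twice, once with the hint and once without. Again, the join below is a hypothetical stand-in for the real query:

SET profiling = 1;
RESET QUERY CACHE;   -- start from an empty cache (requires the RELOAD privilege)

-- With the hint: the query cache is bypassed entirely.
SELECT SQL_NO_CACHE a.*, b.* FROM t1 a JOIN t2 b ON b.t1_id = a.id WHERE a.col = 42;

-- Without the hint: the result set is looked up in, then written to, the cache.
SELECT a.*, b.* FROM t1 a JOIN t2 b ON b.t1_id = a.id WHERE a.col = 42;

SHOW PROFILES;   -- compare SHOW PROFILE FOR QUERY <id> for the two SELECTs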
Indeed it was. While the Handler counters were the same, I got around 200 lines of output from SHOW PROFILE instead of the 17 lines above. Particularly interesting, the same sequence was repeated over and over:
[...]
| Sending data                  | 0.003067 |
| Waiting for query cache lock  | 0.000004 |
| Waiting on query cache mutex  | 0.000002 |
| Sending data                  | 0.003407 |
| Waiting for query cache lock  | 0.000003 |
| Waiting on query cache mutex  | 0.000003 |
| Sending data                  | 0.003515 |
| Waiting for query cache lock  | 0.000003 |
| Waiting on query cache mutex  | 0.000002 |
| Sending data                  | 0.003365 |
| Waiting for query cache lock  | 0.000003 |
| Waiting on query cache mutex  | 0.000002 |
| Sending data                  | 0.003380 |
| Waiting for query cache lock  | 0.000003 |
| Waiting on query cache mutex  | 0.000002 |
| Sending data                  | 0.003474 |
| Waiting for query cache lock  | 0.000003 |
| Waiting on query cache mutex  | 0.000002 |
[...]
Total response time was still around 160ms, but with so many waits on the query cache lock, it was easy to imagine that this query could take ages as soon as there was competition for the lock.
Of course, the question was: why did MySQL need so many accesses to the query cache lock?
The answer lies in the way the query cache works. Simply stated, the server must lock the query cache both when checking whether a result is in the cache and when writing a result set into the cache. When writing, locking can occur several times: the server sends results to the cache before the entire result set has been computed (so its total size is not yet known), and the cache therefore has to allocate memory block by block. If a block is full and the server keeps sending rows, a new block must be allocated, which again requires locking the cache!
More specifically, the size of these blocks is set by the query_cache_min_res_unit variable (4KB by default). For the query I was working on, each row was about 6 bytes, so the result set was around 180KB. This means that around 50 accesses to the query cache were needed to cache the entire result set, which is exactly what SHOW PROFILE revealed.
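As a quick sanity check on that estimate, the arithmetic can be reproduced in the client (the 6-byte row size and the 4KB default block size are the figures mentioned above):

SELECT 30000 * 6 AS result_set_bytes,            -- roughly 180KB
       CEIL(30000 * 6 / 4096) AS blocks_needed;  -- ~44 allocations, consistent with ~50 lock waits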
One solution could have been to raise the value of query_cache_min_res_unit, but that could increase fragmentation of the cache and degrade performance for other queries. The best solution was simply to turn off the query cache.
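For reference, here is what both options look like; the 16KB value is purely illustrative, and a permanent change would of course also go into my.cnf:

-- Option 1: larger allocation blocks, at the cost of potentially more fragmentation.
SET GLOBAL query_cache_min_res_unit = 16384;

-- Option 2 (what we chose): turn the query cache off entirely.
SET GLOBAL query_cache_type = OFF;
SET GLOBAL query_cache_size = 0;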
Conclusions:
- Once again, the query cache showed poor scalability, though in an unusual way this time.
- When you have unstable response times for a query, you should suspect either some change in data access pattern (reads on disk vs reads in memory, execution plan instability) or contention somewhere while the query is executing.
- You should not forget that improving a query also means improving the stability of its response time, not just achieving a low response time some of the time. In this case, a stable 160ms response time was preferable to one ranging from 50µs to 1s.
And a final note for sharp-eyed readers: why are there two lines showing waits on the query cache lock in the SHOW PROFILE output instead of one? This is specific to Percona Server 5.5 and is used only for debugging purposes, to provide finer granularity.