Quantcast
Viewing all articles
Browse latest Browse all 18769

From Months to Seconds with Subquery Materialization

In an earlier blog post, I showed how optimizer improvements in MySQL 5.6 gave better performance for several of the queries in the DBT-3 benchmark.

However, for one of the queries, Query 18, I was not able to give exact numbers for the improvement since the query took very long in MySQL 5.5. I decided to try to find out exactly how long the query would take, but when the query had run for one month, I gave up. How can a query take so long? Especially, when I had set up InnoDB with a buffer pool that should be large enough to hold the entire database. Let's have a look at the query:

select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from customer, orders, lineitem
where o_orderkey in (
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 313
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
LIMIT 100;

This query will find the orders from customers that have placed big orders. The reason that this takes so long in MySQL 5.5, is that the sub-query in the WHERE clause will be executed for each row produced by the three table join of the (outer) query. This can be seen from the EXPLAIN output for this query:
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| 1 | PRIMARY | customer | ALL | PRIMARY | NULL | NULL | NULL | 150000 | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ref | PRIMARY,i_o_custkey | i_o_custkey | 5 | dbt3.customer.c_custkey | 7 | Using where |
| 1 | PRIMARY | lineitem | ref | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4 | dbt3.orders.o_orderkey | 2 | Using index |
| 2 | DEPENDENT SUBQUERY | lineitem | index | NULL | PRIMARY | 8 | NULL | 6001215 | NULL |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The select_type for the subquery is DEPENDENT SUBQUERY. This implies that the index scan of the lineitem table will be performed once for every row produced by the three preceeding tables. In other words, it will be executed more than 2 million times (150000*7*2). I have measured that one execution of the sub-query takes about 3.5 seconds when all the data is in memory. Hence, it will take more than 80 days to execute it 2 million times.

In MySQL 5.6, Subquery Materialization may be used to avoid the repeated execution of subqueries. This implies that the subquery is executed once and the result stored (materialized) in a temporary table. Then, for each row produced by the outer query, a hash-based look-up will be made into the temporary table to check whether there is a match. The EXPLAIN output for Query 18, looks like this in MySQL 5.6:

+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| 1 | PRIMARY | customer | ALL | PRIMARY | NULL | NULL | NULL | 150000 | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ref | PRIMARY,i_o_custkey | i_o_custkey | 5 | dbt3.customer.c_custkey | 7 | Using where |
| 1 | PRIMARY | lineitem | ref | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4 | dbt3.orders.o_orderkey | 2 | Using index |
| 2 | SUBQUERY | lineitem | index | NULL | PRIMARY | 8 | NULL | 6001215 | NULL |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The only difference is that select_type of the subquery now is SUBQUERY. In other words, it is no longer dependent on the preceding tables, and can be executed only once. In my run of DBT-3 with MySQL 5.6, Query 18 takes only 6.8 seconds. Hence, we have gone from a couple of months to just a few seconds!

My colleague Guilhem has earlier written about another DBT-3 query that is improved with Subquery Materialization.

Image may be NSFW.
Clik here to view.

PlanetMySQL Voting: Vote UP / Vote DOWN

Viewing all articles
Browse latest Browse all 18769

Trending Articles