
In MySQL 8.0.18 there is a new feature called Hash Joins, and I wanted to see how it works and in which situations it can help us. Here you can find a nice detailed explanation about how it works under the hood.
The high-level basics are the following: if there is a join, it will create an in-memory hash table based on one of the tables and will read the other table row by row, calculate a hash, and do a lookup on the in-memory hash table.
Great, but does this give us any performance benefits?
First of all, this only works on fields that are not indexed, so that is an immediate table scan and we usually do not recommend doing joins without indexes because it is slow. Here is where Hash Joins in MySQL can help because it will use an in-memory hash table instead of Nested Loop.
Let’s do some tests and see. First I created the following tables:
CREATE TABLE `t1` ( `id` int(11) NOT NULL AUTO_INCREMENT , `c1` int(11) NOT NULL DEFAULT '0', `c2` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`), KEY `idx_c1` (`c1`) ) ENGINE=InnoDB; CREATE TABLE `t2` ( `id` int(11) NOT NULL AUTO_INCREMENT , `c1` int(11) NOT NULL DEFAULT '0', `c2` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`), KEY `idx_c1` (`c1`) ) ENGINE=InnoDB;
I have inserted 131072
random rows into both tables.
mysql> select count(*) from t1; +----------+ | count(*) | +----------+ | 131072 | +----------+
First test – Hash Joins
Run a join based on c2
which is not indexed.
mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2\G *************************** 1. row *************************** EXPLAIN: -> Aggregate: count(0) -> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=1728488704) -> Table scan on t2 (cost=0.01 rows=131472) -> Hash -> Table scan on t1 (cost=13219.45 rows=131472) 1 row in set (0.00 sec)
We have to use explain format=tree
to see if Hash Join will be used or not, as normal explain still says it is going to be a Nested Loop, which I think it is very misleading. I have already filed a bug report because of this and in the ticket, you can see some comments from developers saying:
The solution is to stop using traditional EXPLAIN (it will eventually go away).
So this is not going to be fixed in traditional explain and we should start using the new way.
Back to the query; we can see it is going to use Hash Join for this query, but how fast is it?
mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2; +----------+ | count(*) | +----------+ | 17172231 | +----------+ 1 row in set (0.73 sec)
0.73s
for a more than 17m rows join table. Looks promising.
Second Test – Non-Hash Joins
We can disable it with an optimizer switch or optimizer hint.
mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2; +----------+ | count(*) | +----------+ | 17172231 | +----------+ 1 row in set (13 min 36.36 sec)
Now the same query takes more than 13 minutes. That is a huge difference and we can see Hash Join helps a lot here.
Third Test – Joins Based on Indexes
Let’s create indexes and see how fast a join based on indexes is.
create index idx_c2 on t1(c2); create index idx_c2 on t2(c2); mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2; +----------+ | count(*) | +----------+ | 17172231 | +----------+ 1 row in set (2.63 sec)
2.6sHash Join is even faster than the Index-based join in this case.
However, I was able to force the optimizer to use Hash Joins even if an index is available by using ignore index
:
mysql> explain format=tree select count(*) from t1 ignore index (idx_c2) join t2 ignore index (idx_c2) on t1.c2 = t2.c2 where t1.c2=t2.c2\G *************************** 1. row *************************** EXPLAIN: -> Aggregate: count(0) -> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=17336898) -> Table scan on t2 (cost=0.00 rows=131472) -> Hash -> Table scan on t1 (cost=13219.45 rows=131472) 1 row in set (0.00 sec)
I still think it would be nice if I can tell the optimizer with a hint to use Hash Joins even if an index is available, so we do not have to ignore indexes on all the tables. I have created a feature request for this.
However, if you read my first bug report carefully you can see a comment from a MySQL developer which indicates this might not be necessary:
BNL (Block Nested-Loop) will also go away entirely at some point, at which point this hint will be ignored.
That could mean they are planning to remove BNL joins in the future and maybe replace it with Hash join.
Limitations
We can see Hash Join can be powerful, but there are limitations:
- As I mentioned it only works on columns that do not have indexes (or you have to ignore them).
- It only works with equi-join conditions.
- It does not work with LEFT or RIGHT JOIN.
I would like to see a status metric as well to monitor how many times Hash Join was used, and for this, I filled another feature request.
Conclusion
Hash Join seems a very powerful new join option, and we should keep an eye on this because I would not be surprised if we get some other features in the future as well. In theory, it would be able to do Left and Right joins as well and as we can see in the comments on the bug report that Oracle has plans for it in the future.