It's frustrating to see examples of MySQL being slow used as a reason why you should use NoSQL. If you have a vested interest[1] in comparing two technologies that are already apples to oranges, the least you can do is optimize both. If you can't do that, don't share it.
This came out of a talk on Cassandra. "MySQL" is not on the slide, but this query was mentioned in reference to MySQL:

SELECT * FROM tweets WHERE user_id IN (SELECT follower FROM followers WHERE user_id = ?) ORDER BY time_tweeted DESC LIMIT 40;

Let me simulate that for you:
CREATE TABLE `tweets` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `info` text,
  `time_tweeted` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  INDEX (time_tweeted),
  INDEX (user_id)
) ENGINE=InnoDB;

CREATE TABLE `followers` (
  `user_id` int(11) NOT NULL,
  `follower` int(11) NOT NULL,
  PRIMARY KEY (`user_id`,`follower`)
) ENGINE=InnoDB;

INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM dual;
-- Each of the following doubles the table; 16 doublings = 65536 rows:
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;

INSERT IGNORE INTO followers (user_id, follower) SELECT floor(rand()*10000), floor(rand()*10000) FROM tweets;

mysql> select count(*) from tweets;
+----------+
| count(*) |
+----------+
|    65536 |
+----------+
1 row in set (0.03 sec)

mysql> select count(*) from followers; # there are some duplicates.
+----------+
| count(*) |
+----------+
|    65521 |
+----------+
1 row in set (0.03 sec)

mysql> select count(*) from followers where user_id = 55; <-- Ok, found a user
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM tweets WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 55) ORDER BY time_tweeted DESC LIMIT 40\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: tweets
         type: index
possible_keys: NULL
          key: time_tweeted
      key_len: 4
          ref: NULL
         rows: 40
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: followers
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const,func
         rows: 1
        Extra: Using index
2 rows in set (0.00 sec)

I should note that even though it says "40 rows" in the index scan, it's really a lot more, due to a bug I've previously reported:
mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 160     |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 142     |
| Handler_read_first         | 56      |
| Handler_read_key           | 131364  |
| Handler_read_next          | 900495  |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 42      |
| Handler_read_rnd_next      | 750736  |
| Handler_rollback           | 7       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1454583 |
+----------------------------+---------+
15 rows in set (0.00 sec)

.. Run query ..

mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 161     |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 142     |
| Handler_read_first         | 56      |
| Handler_read_key           | 218955  | <--- Something like 87591 rows!
| Handler_read_next          | 900495  |
| Handler_read_prev          | 43793   |
| Handler_read_rnd           | 42      |
| Handler_read_rnd_next      | 750736  |
| Handler_rollback           | 7       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1454583 |
+----------------------------+---------+
15 rows in set (0.00 sec)

And repeating to get a profile shows the awesome dependent subquery:

mysql> show profile for query 1;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000099 |
| checking permissions | 0.000008 |
| checking permissions | 0.000010 |
| Opening tables       | 0.000034 |
| System lock          | 0.000015 |
| init                 | 0.000045 |
| optimizing           | 0.000012 |
| statistics           | 0.000013 |
| preparing            | 0.000013 |
| executing            | 0.000004 |
| Sorting result       | 0.000007 |
| Sending data         | 0.000135 |
| optimizing           | 0.000016 |
| statistics           | 0.000057 |
| preparing            | 0.000020 |
| executing            | 0.000005 |
| Sending data         | 0.000036 |
| executing            | 0.000005 |
..
| Sending data         | 0.000027 |
| executing            | 0.000005 |
..
| end                  | 0.000007 |
| query end            | 0.000005 |
| freeing items        | 0.000149 |
| logging slow query   | 0.000006 |
| cleaning up          | 0.000006 |
+----------------------+----------+
161966 rows in set (1.33 sec)

The problems with this query are:
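Diffing those two SHOW SESSION STATUS snapshots by hand gets old quickly; a minimal sketch of the comparison in plain Python (the part that actually fetches the counters from MySQL is left out, the two example snapshots are copied from the output above):

```python
def handler_delta(before, after):
    """Return only the counters that changed between two
    SHOW SESSION STATUS LIKE 'Handler%' snapshots (dicts of name -> int)."""
    return {name: after[name] - before[name]
            for name in before
            if after.get(name, before[name]) != before[name]}

# Two of the counters from the snapshots above, before and after the query:
before = {"Handler_read_key": 131364, "Handler_read_prev": 0}
after  = {"Handler_read_key": 218955, "Handler_read_prev": 43793}

print(handler_delta(before, after))
# {'Handler_read_key': 87591, 'Handler_read_prev': 43793}
```

The Handler_read_key delta is where the "something like 87591 rows" figure comes from.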
- The IN() subquery cannot be optimized. This is one of the optimizations MySQL fails at. The secret is that even though MySQL has subqueries, there are few forms of them a DBA will recommend using in production. Please pretend they don't exist when making comparisons.
- Assuming a sort were used (it isn't), id should be monotonic, and ordering by it _is_ the better choice (assuming it is the primary key of an InnoDB table). If you could get a plan that accesses followers first (as described) and then sorts tweets by time_tweeted, the data would be pre-sorted in exactly the wrong order, since all newer tweets have newer timestamps. That is the worst case for a sort.
- The visual description of how the query executes in MySQL is incorrect. The tables are clearly joined in the complete opposite order, and there is no sort. The dependent subquery is why MySQL sucks here.
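To show that the IN()-to-JOIN rewrite is semantically safe, here is a minimal sketch using SQLite from Python as a stand-in engine (SQLite's planner is nothing like MySQL's, so this only demonstrates that both forms return the same rows, not the performance difference; the join condition here is on the follower column, since we want the tweets of the people user 55 follows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INT, time_tweeted INT);
CREATE TABLE followers (user_id INT, follower INT, PRIMARY KEY (user_id, follower));
INSERT INTO tweets VALUES (1, 7, 100), (2, 8, 200), (3, 7, 300), (4, 9, 400);
INSERT INTO followers VALUES (55, 7), (55, 8);  -- user 55 follows users 7 and 8
""")

# The form from the slide:
subquery = con.execute("""
    SELECT * FROM tweets
    WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 55)
    ORDER BY time_tweeted DESC LIMIT 40
""").fetchall()

# The JOIN rewrite:
joined = con.execute("""
    SELECT tweets.* FROM followers
    INNER JOIN tweets ON tweets.user_id = followers.follower
    WHERE followers.user_id = 55
    ORDER BY time_tweeted DESC LIMIT 40
""").fetchall()

print(subquery == joined)  # True: both return tweets 3, 2, 1 (user 9 is not followed)
```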
mysql> EXPLAIN SELECT tweets.* FROM followers INNER JOIN tweets ON tweets.user_id=followers.user_id WHERE followers.user_id = 55 ORDER BY id DESC\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tweets
         type: ref
possible_keys: user_id
          key: user_id
      key_len: 4
          ref: const
         rows: 111
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: followers
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 4
        Extra: Using index
2 rows in set (0.00 sec)
Note: this might be a bit difficult to read if you're not familiar with EXPLAIN's subtleties. The followers table appears second, but both rows share id=1 (a single join, not a subquery), so the constant filter on followers is still applied.
mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 75      |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 60      |
| Handler_read_first         | 31      |
| Handler_read_key           | 324231  |
| Handler_read_next          | 2198205 |
| Handler_read_prev          | 162173  |
| Handler_read_rnd           | 444     |
| Handler_read_rnd_next      | 2529171 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 2579264 |
+----------------------------+---------+
15 rows in set (0.00 sec)

.. run query ..

mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 76      |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 60      |
| Handler_read_first         | 31      |
| Handler_read_key           | 324347  | <-- +116
| Handler_read_next          | 2198649 | <-- +444
| Handler_read_prev          | 162284  | <-- +111
| Handler_read_rnd           | 444     |
| Handler_read_rnd_next      | 2529171 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 2579264 |
+----------------------------+---------+
15 rows in set (0.01 sec)

And the show profiles result:

mysql> show profile for query 13;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000136 |
| checking permissions | 0.000007 |
| checking permissions | 0.000007 |
| Opening tables       | 0.000069 |
| System lock          | 0.000014 |
| init                 | 0.000028 |
| optimizing           | 0.000016 |
| statistics           | 0.000095 |
| preparing            | 0.000024 |
| executing            | 0.000004 |
| Sorting result       | 0.000005 |
| Sending data         | 0.003517 |
| end                  | 0.000006 |
| query end            | 0.000003 |
| freeing items        | 0.000035 |
| logging slow query   | 0.000004 |
| cleaning up          | 0.000004 |
+----------------------+----------+
17 rows in set (0.01 sec)
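When a profile is short like this one, the hot spot is obvious, but the same reading can be mechanized by aggregating durations per status; a minimal sketch in plain Python, with the (status, duration) pairs copied from the profile above:

```python
from collections import defaultdict

# (status, duration) pairs as printed by SHOW PROFILE FOR QUERY 13:
profile = [
    ("starting", 0.000136), ("checking permissions", 0.000007),
    ("checking permissions", 0.000007), ("Opening tables", 0.000069),
    ("System lock", 0.000014), ("init", 0.000028),
    ("optimizing", 0.000016), ("statistics", 0.000095),
    ("preparing", 0.000024), ("executing", 0.000004),
    ("Sorting result", 0.000005), ("Sending data", 0.003517),
    ("end", 0.000006), ("query end", 0.000003),
    ("freeing items", 0.000035), ("logging slow query", 0.000004),
    ("cleaning up", 0.000004),
]

# Sum time per status, then list the statuses from most to least expensive:
totals = defaultdict(float)
for status, duration in profile:
    totals[status] += duration

for status, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{status:24s} {total:.6f}")
```

"Sending data" dominates, which is what you want to see from a single-pass join plan: the work is returning rows, not re-executing a subquery thousands of times.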
I want to learn more about Cassandra, and I want speakers to be able to add context by comparing it to something people are familiar with, but sheesh... you've got to get your descriptions right.
I have heard people tell me MySQL is difficult to optimize. That is true, but it's not an excuse. If you really are at "massive scale", you can afford to get a second opinion.
[1] I wouldn't heckle someone who was neutral and made the mistake accidentally.