Yesterday I heckled a speaker.

It's frustrating to see MySQL being slow used as an example of why you should use NoSQL. If you have a vested interest[1] in comparing two technologies that are already apples to oranges, the least you can do is optimize both. If you can't do that, don't share the comparison.

This came out of a talk on Cassandra. MySQL is not named on the slide itself, but the following query was presented in reference to MySQL:

SELECT * FROM tweets WHERE user_id IN (SELECT follower FROM followers WHERE user_id = ?) ORDER BY time_tweeted DESC LIMIT 40;
Let me simulate that for you:
CREATE TABLE `tweets` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `info` text,
  `time_tweeted` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  INDEX (time_tweeted),
  INDEX (user_id)
) ENGINE=InnoDB;

CREATE TABLE `followers` (
  `user_id` int(11) NOT NULL,
  `follower` int(11) NOT NULL,
  PRIMARY KEY (`user_id`,`follower`)
) ENGINE=InnoDB;
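
Note the composite primary key on followers: a lookup on user_id can be answered entirely from the index, which is why you'll see "Using index" for that table in the EXPLAIN output below.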

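To generate test data, I seed one row and then double the table sixteen times, for 2^16 = 65536 tweets: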
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM dual;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;
INSERT INTO tweets (user_id, info) SELECT floor(rand()*10000), REPEAT('a', 160) FROM tweets;

INSERT IGNORE INTO followers (user_id, follower) SELECT floor(rand()*10000), floor(rand()*10000) FROM tweets;
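
A note if you're following along at home: the SHOW PROFILE output later in this post requires profiling to be enabled in the session first. That step isn't shown in the transcript, but it amounts to:

SET profiling = 1;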

mysql> select count(*) from tweets;
+----------+
| count(*) |
+----------+
|    65536 |
+----------+
1 row in set (0.03 sec)

mysql> select count(*) from followers; # there are some duplicates.
+----------+
| count(*) |
+----------+
|    65521 |
+----------+
1 row in set (0.03 sec)

mysql> select count(*) from followers where user_id = 55; <-- Ok, found a user
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.00 sec)


mysql> EXPLAIN SELECT * FROM tweets  WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 55) order by time_tweeted DESC LIMIT 40\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: tweets
         type: index
possible_keys: NULL
          key: time_tweeted
      key_len: 4
          ref: NULL
         rows: 40
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: followers
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const,func
         rows: 1
        Extra: Using index
2 rows in set (0.00 sec)
I should note that even though EXPLAIN estimates 40 rows for the index scan, it really reads a lot more, due to a bug I've previously reported:
mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 160     |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 142     |
| Handler_read_first         | 56      |
| Handler_read_key           | 131364  |
| Handler_read_next          | 900495  |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 42      |
| Handler_read_rnd_next      | 750736  |
| Handler_rollback           | 7       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1454583 |
+----------------------------+---------+
15 rows in set (0.00 sec)

.. Run query ..

mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 161     |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 142     |
| Handler_read_first         | 56      |
| Handler_read_key           | 218955  | <--- Something like 87591 rows!
| Handler_read_next          | 900495  |
| Handler_read_prev          | 43793   |
| Handler_read_rnd           | 42      |
| Handler_read_rnd_next      | 750736  |
| Handler_rollback           | 7       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1454583 |
+----------------------------+---------+
15 rows in set (0.00 sec)
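
Doing the arithmetic on the deltas: 218955 - 131364 = 87591 Handler_read_key calls, plus 43793 Handler_read_prev calls. That is over 130,000 index reads to return 40 tweets.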

And re-running the query with profiling enabled shows the awesome dependent subquery:

mysql> show profile for query 1;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000099 |
| checking permissions | 0.000008 |
| checking permissions | 0.000010 |
| Opening tables       | 0.000034 |
| System lock          | 0.000015 |
| init                 | 0.000045 |
| optimizing           | 0.000012 |
| statistics           | 0.000013 |
| preparing            | 0.000013 |
| executing            | 0.000004 |
| Sorting result       | 0.000007 |
| Sending data         | 0.000135 |
| optimizing           | 0.000016 |
| statistics           | 0.000057 |
| preparing            | 0.000020 |
| executing            | 0.000005 |
| Sending data         | 0.000036 |
| executing            | 0.000005 |
..
| Sending data         | 0.000027 |
| executing            | 0.000005 |
..
| end                  | 0.000007 |
| query end            | 0.000005 |
| freeing items        | 0.000149 |
| logging slow query   | 0.000006 |
| cleaning up          | 0.000006 |
+----------------------+----------+
161966 rows in set (1.33 sec)
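
That is not a typo: 161966 profile rows for a single statement. The repeating executing / Sending data pairs are the dependent subquery being run again for every row the outer index scan examines.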

The problems with this query are:
  • The IN() subquery cannot be optimized; MySQL executes it as a dependent subquery, once for every row the outer query examines. This is one of the optimizations MySQL fails at. The secret is that even though MySQL has subqueries, there are few that a DBA will recommend using in production. Please pretend they don't exist when making comparisons.
  • Assuming a sort were used (it isn't), id is monotonic and ordering by it _is_ the better choice (given it is the InnoDB primary key). If you could get a plan that accessed followers first (as the speaker described) and then sorted the matching tweets by time_tweeted, the data would arrive pre-sorted in exactly the wrong order, since all newer tweets have newer timestamps. That is the worst-case input for a sort.
  • The speaker's visual description of how the query executes in MySQL was incorrect. The tables join in exactly the opposite order, and there is no sort. The dependent subquery is why MySQL sucks here.
Here is the query written as the join it should have been, i.e.:
mysql> EXPLAIN SELECT tweets.* FROM followers INNER JOIN tweets  ON tweets.user_id=followers.user_id WHERE followers.user_id = 55 ORDER BY id DESC\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tweets
         type: ref
possible_keys: user_id
          key: user_id
      key_len: 4
          ref: const
         rows: 111
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: followers
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 4
        Extra: Using index
2 rows in set (0.00 sec)

Note: this might be a bit difficult to read if you're not familiar with subtle optimizations. Both rows share id=1, i.e. this is a single join rather than a subquery, and the optimizer has propagated the constant followers.user_id = 55 through the join condition to tweets.user_id, which is why both tables show ref: const.

mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 75      |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 60      |
| Handler_read_first         | 31      |
| Handler_read_key           | 324231  |
| Handler_read_next          | 2198205 |
| Handler_read_prev          | 162173  |
| Handler_read_rnd           | 444     |
| Handler_read_rnd_next      | 2529171 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 2579264 |
+----------------------------+---------+
15 rows in set (0.00 sec)

.. run query ..

mysql> show session status like 'Handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 76      |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_prepare            | 60      |
| Handler_read_first         | 31      |
| Handler_read_key           | 324347  | <-- +116
| Handler_read_next          | 2198649 | <-- +444
| Handler_read_prev          | 162284  | <-- +111
| Handler_read_rnd           | 444     |
| Handler_read_rnd_next      | 2529171 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 2579264 |
+----------------------------+---------+
15 rows in set (0.01 sec)
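
The deltas tell the whole story: 116 key reads, 444 next reads and 111 prev reads. A few hundred handler calls for the join, versus roughly 130,000 for the dependent subquery above.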

And the SHOW PROFILE result:

mysql> show profile for query 13;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000136 |
| checking permissions | 0.000007 |
| checking permissions | 0.000007 |
| Opening tables       | 0.000069 |
| System lock          | 0.000014 |
| init                 | 0.000028 |
| optimizing           | 0.000016 |
| statistics           | 0.000095 |
| preparing            | 0.000024 |
| executing            | 0.000004 |
| Sorting result       | 0.000005 |
| Sending data         | 0.003517 |
| end                  | 0.000006 |
| query end            | 0.000003 |
| freeing items        | 0.000035 |
| logging slow query   | 0.000004 |
| cleaning up          | 0.000004 |
+----------------------+----------+
17 rows in set (0.01 sec)
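
One nit for anyone copying this rewrite: the join above matches tweets.user_id to followers.user_id, so it returns user 55's own tweets, whereas the slide's subquery returned tweets written by the accounts in the follower column. A faithful rewrite (a sketch against the same schema; it is not the statement profiled above) joins on follower and keeps the LIMIT:

SELECT tweets.*
FROM followers
INNER JOIN tweets ON tweets.user_id = followers.follower
WHERE followers.user_id = 55
ORDER BY tweets.id DESC
LIMIT 40;

With several followers MySQL will likely need a small sort over the matching tweets rather than reading one index backwards, but the important part holds either way: followers is read by its primary key and there is no dependent subquery.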

I want to learn more about Cassandra, and I want speakers to provide context by comparing it to something people are familiar with, but sheesh... you've got to get your descriptions right.

People have told me that MySQL is difficult to optimize. That's true, but it's not an excuse. If you really are at "massive scale", you can afford to get a second opinion.

[1] I wouldn't heckle someone who was neutral and made the mistake accidentally.