A few days ago I was working on a case where we needed to modify a lot of data before pushing it to Sphinx. MySQL did not have a built-in function for the job, so I thought I would write a MySQL stored function and we would be good to go. It worked, but not that well: building the index, which used to take 10 minutes, now took 16 minutes. Then we added another stored function for a different set of attributes, and indexing time went from 16 minutes to 26 minutes. I knew a UDF would be faster, but I had no idea by how much. Have you ever wondered?
So what were the modifications we needed? A couple of very simple things: (1) two varchar columns needed leading non-alpha characters trimmed, so “123 ^&* and some text” would become “and some text”, and (2) the same two varchar columns needed certain doubled letters reduced to a single one, so “Picasso” becomes “Picaso”, “Wesselmann” becomes “Weselman” and so on. Why we needed that is another story, which this blog post is not about. Note, however, that only a very small portion of the data really needed to be modified.
Here are the two MySQL stored functions I wrote to do the job: ltrim_junk_mysql() and remove_dups_mysql(). Although processing a single row seemed instantaneous, we needed to process far more than that, and that wasn’t as fast. For example, here’s how long it took to process 100k rows:
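To make rule (1) concrete, here is a minimal C sketch of the trimming logic. It is not the code we actually ran: the name ltrim_junk_sketch() is hypothetical, and treating bytes >= 0x80 as characters to keep (so that a leading UTF-8 letter is not cut off) is my assumption, not something the rule above spells out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns a pointer to the first character of s that
 * should be kept. Leading bytes are skipped until an ASCII letter is found.
 * Assumption: bytes >= 0x80 (e.g. part of a UTF-8 letter) also stop the trim.
 */
const char *ltrim_junk_sketch(const char *s)
{
    assert(s != NULL);
    while (*s != '\0') {
        unsigned char c = (unsigned char)*s;
        if (isalpha(c) || c >= 0x80)
            break;              /* first character worth keeping */
        s++;                    /* digit, space, punctuation: skip it */
    }
    return s;
}
```

Since the function only moves a pointer forward, it needs no allocation and no copying; a string with nothing but junk yields an empty string.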
mysql> select ltrim_junk_mysql(author), ltrim_junk_mysql(title) from paintings limit 100000;
100000 rows in set (2.97 sec)
mysql> select remove_dups_mysql(author), remove_dups_mysql(title) from paintings limit 100000;
100000 rows in set (2.04 sec)
If you looked carefully at the second function, though, you may have noticed that I did not necessarily have to write a function; I could have written it as a plain SQL statement:
mysql> select
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( LOWER(author), 'aa', 'a'), 'bb', 'b'), 'cc', 'c'),
'dd', 'd'), 'ff', 'f'), 'gg', 'g'), 'll', 'l'), 'mm', 'm'), 'nn', 'n'),
'oo', 'o'), 'pp', 'p'), 'rr', 'r'), 'ss', 's'), 'tt', 't'), 'vv', 'v'),
'zz', 'z'),
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( LOWER(title), 'aa', 'a'), 'bb', 'b'), 'cc', 'c'),
'dd', 'd'), 'ff', 'f'), 'gg', 'g'), 'll', 'l'), 'mm', 'm'), 'nn', 'n'),
'oo', 'o'), 'pp', 'p'), 'rr', 'r'), 'ss', 's'), 'tt', 't'), 'vv', 'v'),
'zz', 'z') FROM paintings LIMIT 100000;
100000 rows in set (0.33 sec)
It doesn’t look nice, but it already executes more than 6 times faster than the stored function, which is interesting because it shows how much overhead the MySQL stored routine interface adds. So anyway, I asked my colleague Sasha to help me out by rewriting these as UDFs: ltrim_junk() and remove_dups(). Well, guess what:
mysql> select ltrim_junk(author), ltrim_junk(title) from paintings limit 100000;
100000 rows in set (0.13 sec)
mysql> select remove_dups(author), remove_dups(title) from paintings limit 100000;
100000 rows in set (0.17 sec)
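To give an idea of why plain C is so cheap here, below is a minimal sketch of the duplicate-removal logic. It is not Sasha's UDF: the name remove_dups_sketch() is hypothetical, it contains only the core string logic without the MySQL UDF entry points, and it lowercases ASCII only. The letter list and the pair-by-pair behaviour are taken from the nested REPLACE() statement above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Letters collapsed by the nested REPLACE() statement above. */
static const char DUP_LETTERS[] = "abcdfglmnoprstvz";

/*
 * Hypothetical helper: lowercases src (ASCII only) and turns each listed
 * doubled letter into a single one, writing a NUL-terminated result to dst.
 * dst must have room for strlen(src) + 1 bytes. Returns the result length.
 */
size_t remove_dups_sketch(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; ) {
        char c = (char)tolower((unsigned char)src[i]);
        char next = (char)tolower((unsigned char)src[i + 1]);
        dst[out++] = c;
        /* Skip the second letter of a listed pair, like REPLACE(.., 'ss', 's'). */
        if (next == c && strchr(DUP_LETTERS, c) != NULL)
            i += 2;
        else
            i += 1;
    }
    dst[out] = '\0';
    return out;
}
```

The whole job is a single pass over the string with one output buffer, whereas the SQL version runs sixteen REPLACE() calls per column. In a real UDF this logic would presumably sit behind the usual init, main and deinit entry points and be registered with CREATE FUNCTION ... SONAME.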
So for ltrim_junk() we got an almost 23x improvement, and for remove_dups() a 12x improvement compared to the stored function, or about 2x compared to the nested built-in REPLACE() calls. With that speed I could even scan the whole table of 7 million records:
mysql> select count(*) from paintings where title != ltrim_junk(title);
+----------+
| count(*) |
+----------+
| 101533 |
+----------+
1 row in set (6.82 sec)
mysql> select count(*) from paintings where author != ltrim_junk(author);
+----------+
| count(*) |
+----------+
| 28335 |
+----------+
1 row in set (6.63 sec)
mysql> select count(*) from paintings where author != remove_dups(author) OR title != remove_dups(title);
+----------+
| count(*) |
+----------+
| 2720414 |
+----------+
1 row in set (11.19 sec)
Using the stored functions, these queries used to take minutes!
I don’t mean to say stored functions are bad and that you should now rewrite all of them as UDFs. If you only need to process a few records per request and you are not burning racks of CPUs doing it constantly, the speed difference is negligible. However, in a case like this one, where we have to process many records constantly and every second counts, a UDF can really save your day. If you need one and don’t feel confident writing C, you know who to call!
Entry posted by Aurimas Mikalauskas