
UUIDs are Popular, but Bad for Performance — Let’s Discuss


If you do a quick web search about UUIDs and MySQL, you’ll get a fair number of results. Here are just a few examples:

So, does a well-covered topic like this one need any more attention? Well, apparently – yes. Even though most posts warn people against the use of UUIDs, they are still very popular. This popularity comes from the fact that these values can easily be generated by remote devices, with a very low probability of collision. With this post, my goal is to summarize what has already been written by others and, hopefully, bring in a few new ideas.

What are UUIDs?

UUID stands for Universally Unique IDentifier and is defined in RFC 4122. It is a 128-bit number, normally written in hexadecimal and split by dashes into five groups. A typical UUID value looks like:

yves@laptop:~$ uuidgen 
83fda883-86d9-4913-9729-91f20973fa52

There are officially 5 UUID versions, 1 to 5, but the most common are: time-based (version 1 or version 2) and purely random (version 4). The time-based UUIDs encode the number of 100ns intervals since October 15, 1582 in 7.5 bytes (60 bits), which is split in a “time-low”-“time-mid”-“time-hi” fashion. The missing 4 bits are the version number, used as a prefix to the time-hi field. This yields the 64 bits of the first 3 groups. The last 2 groups are the clock sequence, a value incremented every time the clock is modified, and a host unique identifier. Most of the time, the MAC address of the main network interface of the host is used as the unique identifier.

There are important points to consider when you use time-based UUID values:

  • It is possible to determine the approximate time when the value was generated from the first 3 fields (see the sketch after this list)
  • There are many repetitive fields between consecutive UUID values
  • The first field, “time-low”, rolls over every 429s
  • The MySQL UUID function produces version one values
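
As a rough illustration of the first point, here is a sketch that recovers the approximate generation time from a version 1 UUID, such as those produced by MySQL's UUID() function. The constant 122192928000000000 is the number of 100ns intervals between the RFC 4122 epoch (October 15, 1582) and the Unix epoch:

SET @uuid = UUID();
SELECT @uuid AS uuid_value,
       FROM_UNIXTIME(
         (CAST(CONV(CONCAT(SUBSTR(@uuid, 16, 3),   -- time-hi (without the version nibble)
                           SUBSTR(@uuid, 10, 4),   -- time-mid
                           SUBSTR(@uuid,  1, 8)),  -- time-low
                    16, 10) AS UNSIGNED)
          - 122192928000000000)                    -- 100ns intervals from 1582-10-15 to 1970-01-01
         DIV 10000000                              -- convert 100ns units to seconds
       ) AS approximate_generation_time;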

Here’s an example using the “uuidgen” Unix tool to generate time-based values:

yves@laptop:~$ for i in $(seq 1 500); do echo "$(date +%s): $(uuidgen -t)"; sleep 1; done
1573656803: 572e4122-0625-11ea-9f44-8c16456798f1
1573656804: 57c8019a-0625-11ea-9f44-8c16456798f1
1573656805: 586202b8-0625-11ea-9f44-8c16456798f1
...
1573657085: ff86e090-0625-11ea-9f44-8c16456798f1
1573657086: 0020a216-0626-11ea-9f44-8c16456798f1
...
1573657232: 56b943b2-0626-11ea-9f44-8c16456798f1
1573657233: 57534782-0626-11ea-9f44-8c16456798f1
1573657234: 57ed593a-0626-11ea-9f44-8c16456798f1
...

The first field rolls over (at t=1573657086) and the second field is incremented. It takes about 429s before similar values reappear in the first field. The third field changes only about once a year. The last field is static on a given host; the MAC address is used on my laptop:

yves@laptop:~$ ifconfig | grep ether | grep 8c
    	ether 8c:16:45:67:98:f1  txqueuelen 1000  (Ethernet)

The other frequently seen UUID version is 4, the purely random one. By default, the Unix “uuidgen” tool produces UUID version 4 values:

yves@laptop:~$ for i in $(seq 1 3); do uuidgen; done
6102ef39-c3f4-4977-80d4-742d15eefe66
14d6e343-028d-48a3-9ec6-77f1b703dc8f
ac9c7139-34a1-48cf-86cf-a2c823689a91

The only “repeated” value is the version, “4”, at the beginning of the 3rd field. All the other 124 bits are random.

What is so Wrong with UUID Values?

In order to appreciate the impact of using UUID values as a primary key, it is important to review how InnoDB organizes the data. InnoDB stores the rows of a table in the b-tree of the primary key. In database terminology, we call this a clustered index. The clustered index orders the rows automatically by the primary key.

When you insert a new row with a random primary key value, InnoDB has to find the page where the row belongs, load it in the buffer pool if it is not already there, insert the row and then, eventually, flush the page back to disk. With purely random values and large tables, any b-tree leaf page may receive the new row; there are no hot pages. Rows inserted out of primary key order cause page splits, resulting in a low fill factor. For tables much larger than the buffer pool, an insert will very likely need to read a table page from disk. The page in the buffer pool where the new row has been inserted will then be dirty. The odds the page will receive a second row before it needs to be flushed to disk are very low. Most of the time, every insert will cause two IOPs – one read and one write. The first major impact is on the rate of IOPs, and it is a major limiting factor for scalability.

The only way to get decent performance is thus to use storage with low latency and high endurance. That’s where you’ll find the second major performance impact. With a clustered index, the secondary indexes use the primary key values as pointers. While the leaves of the b-tree of the primary key store rows, the leaves of the b-tree of a secondary index store primary key values.

Let’s assume a table of 1B rows having UUID values as primary key and five secondary indexes. If you read the previous paragraph, you know the primary key values are stored six times for each row. That means a total of 6B char(36) values representing 216 GB. That is just the tip of the iceberg, as tables normally have foreign keys, explicit or not, pointing to other tables. When the schema is based on UUID values, all these columns and the indexes supporting them are char(36). I recently analyzed a UUID-based schema and found that about 70 percent of the storage was used for these values.

As if that’s not enough, there’s a third important impact of using UUID values. Integer values are compared up to 8 bytes at a time by the CPU, but UUID values are compared character by character. Databases are rarely CPU bound, but nevertheless this adds to the latency of queries. If you are not convinced, look at this performance comparison between integers and strings:

mysql> select benchmark(100000000,2=3);
+--------------------------+
| benchmark(100000000,2=3) |
+--------------------------+
|                        0 |
+--------------------------+
1 row in set (0.96 sec)

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003');
+----------------------------------------------------------------------------------------------------+
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003') |
+----------------------------------------------------------------------------------------------------+
|                                                                                                  0 |
+----------------------------------------------------------------------------------------------------+
1 row in set (27.67 sec)

Of course, the above example is a worst-case scenario, but it at least gives a sense of the scale of the issue. Comparing integers is about 28 times faster. Even when the difference appears early in the char values, the comparison is still about 2.5 times slower:

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003');
+----------------------------------------------------------------------------------------------------+
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003') |
+----------------------------------------------------------------------------------------------------+
|                                                                                                  0 |
+----------------------------------------------------------------------------------------------------+
1 row in set (2.45 sec)

Let’s explore a few solutions to address those issues.

Size of the Values

The default representation for UUID, hash, and token values is often the hexadecimal notation. With a cardinality (the number of possible values) of only 16 per byte, it is far from efficient. What about using another representation like base64 or even straight binary? How much do we save? How is the performance affected?

Let’s begin with the base64 notation. The cardinality of each character is 64, so each base64 character encodes 6 bits; it takes 4 characters to represent 3 bytes of actual value. A UUID value consists of 16 bytes of data; if we divide by 3, there is a remainder of 1. To handle that, the base64 encoding adds ‘=’ padding at the end:

mysql> select to_base64(unhex(replace(uuid(),'-','')));
+------------------------------------------+
| to_base64(unhex(replace(uuid(),'-',''))) |
+------------------------------------------+
| clJ4xvczEeml1FJUAJ7+Fg==                 |
+------------------------------------------+
1 row in set (0.00 sec)

If the length of the encoded entity is known, like for a UUID, we can remove the ‘==’, as it is just dead weight. A UUID encoded in base64 thus has a length of 22.
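
For example, a minimal sketch of producing the trimmed 22-character form directly in SQL:

-- Strip the constant '==' padding to get the 22-character representation.
SELECT TRIM(TRAILING '=' FROM TO_BASE64(UNHEX(REPLACE(UUID(), '-', '')))) AS uuid_b64;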

The next logical step is to directly store the value in binary format. This is the most compact format, but displaying the values in the mysql client is less convenient.
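
If you are on MySQL 8.0, a minimal sketch of the binary approach could rely on the built-in UUID_TO_BIN() and BIN_TO_UUID() functions; the table name below is only illustrative:

-- Store the value as binary(16) and convert at the edges.
CREATE TABLE data_uuid_bin (
  id binary(16) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO data_uuid_bin (id) VALUES (UUID_TO_BIN(UUID()));

-- Convert back to the familiar text form when reading.
SELECT BIN_TO_UUID(id) FROM data_uuid_bin LIMIT 1;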

So, how does the size impact performance? To illustrate the impact, I inserted random UUID values in a table with the following definition…

CREATE TABLE `data_uuid` (
  `id` char(36) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

… for the default hexadecimal representation. For base64, the ‘id’ column is defined as char(22), while binary(16) is used for the binary example. The database server has a buffer pool size of 128M and its IOPS are limited to 500. The insertions are done from a single thread.

Insertion rates for tables using different representation for UUID values

In all cases, the insertion rate is at first CPU bound, but as soon as the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool, but in the long run it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.

Option 1: Saving IOPs with Pseudo-Random Order

As we have seen, the most important issue is the random nature of the values. A new row may end up in any of the table leaf pages. So unless the whole table is loaded in the buffer pool, it means a read IOP and eventually a write IOP. My colleague David Ducos gave a nice solution to this problem but some customers do not want to allow for the possibility of extracting information from the UUID values, like, for example, the generation timestamp.

What if we just reduce the randomness of the values so that a prefix of a few bytes is constant for a time interval? During the time interval, only a fraction of the whole table, corresponding to the cardinality of the prefix, would be required to be in memory to save the read IOPs. This would also increase the likelihood that a page receives a second write before being flushed to disk, thus reducing the write load. Let’s consider the following UUID generation function:

drop function if exists f_new_uuid; 
delimiter ;;
CREATE DEFINER=`root`@`%` FUNCTION `f_new_uuid`() RETURNS char(36)
    NOT DETERMINISTIC
BEGIN
    DECLARE cNewUUID char(36);
    DECLARE cMd5Val char(32);


    set cMd5Val = md5(concat(rand(),now(6)));
    set cNewUUID = concat(left(md5(concat(year(now()),week(now()))),4),left(cMd5Val,4),'-',
        mid(cMd5Val,5,4),'-4',mid(cMd5Val,9,3),'-',mid(cMd5Val,13,4),'-',mid(cMd5Val,17,12));

    RETURN cNewUUID;
END;;
delimiter ;

The first four characters of the UUID value come from the MD5 hash of the concatenation of the current year and week number. This value is, of course, static over a week. The remainder of the UUID value comes from the MD5 of a random value and the current time at a precision of 1us. The third field is prefixed with a “4” to indicate it is a version 4 UUID type. There are 65536 possible prefixes so, during a week, only 1/65536 of the table rows is required in memory to avoid a read IOP upon insertion. That’s much easier to manage: a 1TB table will need only about 16MB in the buffer pool to support the inserts.
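
As a quick sanity check, a sketch like the following should show consecutive values sharing the same 4-character prefix during a given week, so new rows land in a narrow range of the index:

SELECT f_new_uuid() AS u1, f_new_uuid() AS u2, f_new_uuid() AS u3;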

Option 2: Mapping UUIDs to Integers

Even if you use pseudo-ordered UUID values stored using binary(16), it is still a very large data type which will inflate the size of the dataset. Remember the primary key values are used as pointers in the secondary indexes by InnoDB. What if we store all the UUID values of a schema in a mapping table? The mapping table will be defined as:

CREATE TABLE `uuid_to_id` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `uuid` char(36) NOT NULL,
  `uuid_hash` int(10) unsigned GENERATED ALWAYS AS (crc32(`uuid`)) STORED NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_hash` (`uuid_hash`)
) ENGINE=InnoDB AUTO_INCREMENT=2590857 DEFAULT CHARSET=latin1;

It is important to notice the uuid_to_id table does not enforce the uniqueness of uuid. The idx_hash index acts a bit like a bloom filter. We’ll know for sure a UUID value is not present in the table when there is no matching hash value but we’ll have to validate with the stored UUID value when there is a matching hash. To help us here, let’s create a SQL function:

DELIMITER ;;
CREATE DEFINER=`root`@`%` FUNCTION `f_uuid_to_id`(pUUID char(36)) RETURNS int(10) unsigned
    DETERMINISTIC
BEGIN
        DECLARE iID int unsigned;
        DECLARE iOUT int unsigned;

        select get_lock('uuid_lock',10) INTO iOUT;

        SELECT id INTO iID
        FROM uuid_to_id WHERE uuid_hash = crc32(pUUID) and uuid = pUUID;

        IF iID IS NOT NULL THEN
            select release_lock('uuid_lock') INTO iOUT;
            SIGNAL SQLSTATE '23000'
                SET MESSAGE_TEXT = 'Duplicate entry', MYSQL_ERRNO = 1062;
        ELSE
            insert into uuid_to_id (uuid) values (pUUID);
            select release_lock('uuid_lock') INTO iOUT;
            set iID = last_insert_id();
        END IF;

        RETURN iID;
END ;;
DELIMITER ;

The function checks if the UUID value passed exists in the uuid_to_id table; if it does, it returns the matching id value, otherwise it inserts the UUID value and returns the last_insert_id. To protect against the concurrent submission of the same UUID value, I added a database lock. The database lock limits the scalability of the solution. If your application cannot submit the same request twice within a very short time frame, the lock could be removed. I also have another version of the function with no lock calls, using a small dedup table where recent rows are kept for only a few seconds. See my github if you are interested.
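
A usage sketch: child tables store the small integer returned by the mapping function instead of a char(36) value. The table, column names, and the UUID literal below are only illustrative:

CREATE TABLE orders (
  id int unsigned NOT NULL AUTO_INCREMENT,
  customer_id int unsigned NOT NULL,  -- points to uuid_to_id.id, not a char(36)
  PRIMARY KEY (id),
  KEY idx_customer (customer_id)
) ENGINE=InnoDB;

INSERT INTO orders (customer_id)
VALUES (f_uuid_to_id('3f06af63-a93c-11e4-9797-00505690773f'));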

Results for the Alternate Approaches

Now, let’s have a look at the insertion rates using these alternate approaches.

Insertion on tables using UUID values as primary keys, alternative solutions

The pseudo-order results are great. Here I modified the algorithm to keep the UUID prefix constant for one minute instead of one week in order to better fit the test environment. Even though the pseudo-order solution performs well, keep in mind it still bloats the schema, and overall the performance gains may not be that great.

The mapping to integer values, although the insert rates are smaller due to the additional DMLs required, decouples the schema from the UUID values. The tables now use integers as primary keys. This mapping removes nearly all the scalability concerns of using UUID values. Still, even on a small VM with limited CPU and IOPS, the UUID mapping technique yields nearly 4000 inserts/s. Put into context, this means 14M rows per hour, 345M rows per day and 126B rows per year. Such rates likely fit most requirements. The only limiting factor for growth is the size of the hash index. When the hash index becomes too large to fit in the buffer pool, performance will start to decrease.

Other Options than UUID Values?

Of course, there are other possibilities to generate unique IDs.  The method used by the MySQL function UUID_SHORT() is interesting. A remote device like a smartphone could use the UTC time instead of the server uptime. Here’s a proposal:

(Seconds since January 1st 1970) << 32
+ (lower 2 bytes of the wifi MAC address) << 16
+ 16_bits_unsigned_int++;

The 16-bit counter should be initialized to a random value and allowed to roll over. The odds of two devices producing the same ID are very small: it would have to happen at approximately the same time, both devices would need the same lower MAC bytes, and their 16-bit counters would need to be at the same value.
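
As an illustration only, the proposed layout can be computed in SQL; @mac_low16 and @counter16 are hypothetical stand-ins for the lower two bytes of the device's MAC address and its 16-bit rolling counter:

SET @mac_low16 = CAST(0x98f1 AS UNSIGNED), @counter16 = 12345;
SELECT (UNIX_TIMESTAMP() << 32)   -- seconds since 1970-01-01 in the high 32 bits
     | (@mac_low16 << 16)         -- lower 2 bytes of the MAC address
     | (@counter16 & 0xFFFF)      -- 16-bit rolling counter
       AS proposed_id;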

Notes

All the data related to this post can be found in my github.


How to setup a GUI via VNC for your Oracle Linux Compute Instance in Oracle Cloud Infrastructure (OCI)


In a couple previous posts, I explained how to get an “Always Free” Oracle Cloud compute instance and how to install MySQL on it – as well as how to add a web server.

I started my IT career (way back in 1989) using a (dumb) terminal and a 2400-baud modem to access a server. While I still use a terminal window and the command-line, it is always nice to have access to a GUI. In this post, I will show you how to install and use a GUI on your Oracle Cloud compute instance so you can use a Virtual Network Computing (VNC) application to connect to your “Always Free” (or not-free) Oracle Cloud compute instance.

VNC is a graphical desktop-sharing system that uses the Remote Frame Buffer protocol to remotely control another computer. In other words, it is (almost) like having a monitor connected to your compute instance. Installing everything you need should take about twenty minutes (only because one yum install takes 13-15 minutes).

First, you will need to create your “Always Free” Oracle Cloud account, and at least one free compute instance. (Of course, this will also work on a paid compute instance.) If you need help creating your free compute instance, you can follow the instructions in the first part of this post (installing MySQL is optional).

Once you have your compute instance ready to go, or if you already have a compute instance running, you can continue with this post.

VNC Viewer

I am using a Mac, so I can use the Screen Sharing application that comes with the operating system (OS). If you don’t have a Mac, you will need to find a VNC application for your OS. I have also used the free (non-commercial-use only) version of VNC Connect from RealVNC, but you will need to buy a copy if you are using it for work. But there are several free ones available, such as TeamViewer, TightVNC and TigerVNC.

If you don’t use a Mac, I won’t be able to show you how to install or set up the VNC viewer you decide to use, but it should be easy to do. Whichever VNC app you choose should provide you with instructions. You should only have to input localhost and the port number of 5901.

Installing what you need on your compute instance

Login to your compute instance. When I created my compute instance, I chose to install Oracle Linux. These instructions should work for any other flavor of Linux, but if not, you can look for the similar packages for your OS and you might have to modify a few things.

Change your directory to the yum repo directory, and then download the yum repo file from yum.oracle.com for your version of Oracle Linux.

$ cd /etc/yum.repos.d
$ sudo wget http://yum.oracle.com/public-yum-ol7.repo
--2019-11-20 00:01:31--  http://yum.oracle.com/public-yum-ol7.repo
Resolving yum.oracle.com (yum.oracle.com)... 69.192.108.102
Connecting to yum.oracle.com (yum.oracle.com)|69.192.108.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16402 (16K) [text/plain]
Saving to: ‘public-yum-ol7.repo’

100%[=======================================>] 16,402      --.-K/s   in 0s      

2019-11-20 00:01:31 (412 MB/s) - ‘public-yum-ol7.repo’ saved [16402/16402]

Next, install the GNOME desktop via yum. This installation is 678 megabytes in size, and it will take about 13-15 minutes. You can remove the -y option from your yum command if you want to answer the single installation question of “Is this ok?” yourself.

Note: Normally I would post the entire output from a command, but the output is over 6,000 lines long. I will replace the majority of the screen output with three dots (…).

$ sudo yum -y groups install "Server with GUI" --skip-broken 
Loaded plugins: langpacks, ulninfo
Repository ol7_latest is listed more than once in the configuration
...
Transaction Summary
==================================================
Install  209 Packages (+659 Dependent packages)
Upgrade               (   3 Dependent packages)

Total download size: 678 M
Is this ok [y/d/N]: y
Downloading packages:
...
Complete!

Install the TigerVNC server. (I will suppress most of this output as well)

$ sudo yum -y install tigervnc-server
Loaded plugins: langpacks, ulninfo
...
Resolving Dependencies
--> Running transaction check
---> Package tigervnc-server.x86_64 0:1.8.0-17.0.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

====================================================================
 Package           Arch      Version             Repository    Size
====================================================================
Installing:
 tigervnc-server   x86_64    1.8.0-17.0.1.el7    ol7_latest   215 k
Transaction Summary
====================================================================
Install  1 Package

Total download size: 215 k
Installed size: 509 k
Downloading packages:
tigervnc-server-1.8.0-17.0.1.el7.x86_64.rpm         | 215 kB  00:00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : tigervnc-server-1.8.0-17.0.1.el7.x86_64       1/1 
  Verifying  : tigervnc-server-1.8.0-17.0.1.el7.x86_64       1/1 

Installed:
  tigervnc-server.x86_64 0:1.8.0-17.0.1.el7
Complete!

Note: I believe the display of duplicate messages “Repository xxx is listed more than once in the configuration” is a bug in yum. You can ignore these messages.

Configure the VNC server

You will be required to provide a password that you will need to remember to be able to access this server via VNC. You can also enter a “view-only password” if you want someone to be able to connect to the server, but you don’t want them to be able to control anything (they can only view the screen). I skipped this option.

$ vncserver
You will require a password to access your desktops.
Password:
Verify:
Would you like to enter a view-only password (y/n)? n
A view-only password is not used
xauth:  file /home/opc/.Xauthority does not exist

New 'instance-20191113-1544:1 (opc)' desktop is instance-20191113-1544:1

Creating default startup script /home/opc/.vnc/xstartup
Creating default config /home/opc/.vnc/config
Starting applications specified in /home/opc/.vnc/xstartup
Log file is /home/opc/.vnc/instance-20191113-1544:1.log

SSH Tunnel for VNC

I am going to create a tunnel for VNC through SSH, so I can punch through the VNC port, and also so I will be sending all of the data through an encrypted tunnel.

Note: There is an alternate way to access your compute instance via VNC by creating an Instance Console Connection, but it only provides you with a console connection (non-GUI). If you want to do that, instructions are available via this blog.

In a terminal window, issue the following command, with your public IP address at the end. This will create the SSH tunnel for you to use for VNC.

$ ssh -L 5901:localhost:5901 opc@150.136.199.99

Now you are ready to connect to the instance using VNC. For the Mac, I open the Screen Sharing application, click on the menu option “Connection” then down to “New”. In the “Connect to” box, I enter localhost:5901 and press “Connect”.

And then enter the password you used when you ran the vncserver command earlier.

If you are using another VNC viewer, simply enter localhost:5901, or you might have to enter localhost and the port 5901 in separate fields.

Then, just open the connection with your VNC application, and you should see the Oracle Linux GUI appear:



 

Creating a security rule for multiple connections

Because the VNC traffic goes through the SSH tunnel, I don’t have to create a security rule for port 5901. But there is one issue you will have if you want to open multiple VNC windows at a time. By using the localhost:5901 SSH tunnel, you can only open one VNC window at a time on port 5901 on your local computer. But you can use port forwarding on your Oracle Cloud compute instances (via the security list) so you can connect to multiple compute instances at a time from your local computer. To do this, create a Security List Ingress Rule so that your compute instance will accept another incoming port – for example, port 5902 – but you direct the connection to port 5901.

You will need to create a stateless security rule to allow ingress traffic on port 5902. From the Oracle Cloud menu (top left of your screen), go down to Networking and over to Virtual Cloud Networks.

You will be presented with a list of the Virtual Cloud Networks (VCN) you have already created, and if you are doing this from the beginning, you should only have one VCN listed. Click on the VCN name that begins with VirtualCloudNetwork.

On the left, you will see a menu like this. Click on “Security Lists”:

To the right of the above menu, you will see a list of the security lists you have already created, and if you are doing this from the beginning, you should only have one security list available. Click on the security list name that begins with Default Security List for VirtualCloudNetwork – where the VirtualCloudNetwork name matches your VirtualCloudNetwork name.

You are going to need to add an Ingress Rule, so click on the “Add Ingress Rules” button:

Fill out the form like this, and then click on “Add Ingress Rules”.

Note: You do not want to click on the “Stateless” box. A stateless rule means that you will also need to create an egress rule for the outbound port 5901 traffic. If you leave this unchecked, the rule that is created will be a “stateful” rule, which means that if you allow inbound traffic on port 5902, outbound traffic is also automatically allowed via the redirect on port 5902.

From Oracle’s documentation:

“Marking a security rule as stateful indicates that you want to use connection tracking for any traffic that matches that rule. This means that when an instance receives traffic matching the stateful ingress rule, the response is tracked and automatically allowed back to the originating host, regardless of any egress rules applicable to the instance. And when an instance sends traffic that matches a stateful egress rule, the incoming response is automatically allowed, regardless of any ingress rules. For more details, see Connection Tracking Details for Stateful Rules.”

To use port 5902, which is redirected to port 5901, your security list should look like this:

Then, in a terminal window, you will need to use this command to open the SSH tunnel, where the outgoing port is 5902, and the destination/incoming port is 5901.

$ ssh -L 5902:localhost:5901 opc@150.136.199.98

This syntax follows the ssh man page: -L [bind_address:]port:host:hostport.

Now you know how to use VNC to connect to your Oracle Compute Cloud Instance.

 


Tony Darnell is a Principal Sales Consultant for MySQL, a division of Oracle, Inc. MySQL is the world’s most popular open-source database program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn.
Tony is the author of Twenty Forty-Four: The League of Patriots 
Visit http://2044thebook.com for more information.
Tony is the editor/illustrator for NASA Graphics Standards Manual Remastered Edition 
Visit https://amzn.to/2oPFLI0 for more information.

MySQL Shell Plugins: check (part 3)


What is great with MySQL Shell Plugins is that they provide you an infinite amount of possibilities. While I was writing part I and part II of the check plugin, I realized I could extend it even more.

The new methods I added to the plugin are especially useful when you are considering using MySQL InnoDB Cluster in Multi-Primary mode, but not only then 😉

Let’s have a look at these new methods:

These 4 new methods target large queries and large transactions. It’s also possible to find potential hot spots.

Let’s see the first two, which are more basic, in action:

The first method (getQueryUpdatingSamePK()) in fact shows the potential hot spot. The second one just shows the query updating the most records in the schema named big.

Now, let’s have a look at the other two methods that I find more interesting.

The first one will show the transaction modifying the most records; not the statement, but the transaction:

As you can see, the plugin shows you which thread it was and how many rows were affected. It also gives you the possibility to see all the statements inside that transaction! And I think this is cool!

And the last method provides the same behavior but for the transaction containing the largest number of statements:

MySQL 8.0 is really great and so is the Shell !

You can find this plugin and others on github: https://github.com/lefred/mysqlshell-plugins and don’t forget that pull requests are always welcome!

Using Kafka to throttle QPS on MySQL shards in bulk write APIs

Qi Li | Software Engineer, Real-time Analytics

At Pinterest, backend core services are in charge of various operations on pins, boards, and users from both Pinners and internal services. While Pinners’...

Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:
  1. For a given write rate stalls are more likely with a B-Tree than an LSM
  2. Many RocksDB write stalls can be avoided via configuration
  3. Write stalls with a B-Tree are smaller but more frequent versus an LSM
  4. Write stalls are more likely when the redo log isn't forced on commit
  5. The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
  6. Life is easier but more expensive when the working set fits in cache
  7. Less write amplification saves IO for other uses
Less short summary:
  1. Write stalls for an LSM occur when compaction has trouble keeping up with the incoming write rate. The worst stalls occur at write rates that a B-Tree could not sustain. One way to mitigate stalls is to reduce the write rate. Another way is to use an index structure that doesn't support or is inefficient for range scans (see index+log).
  2. The cost from configuring RocksDB to avoid write stalls is more CPU overhead on reads as there will be more data in the upper levels of the LSM. I am partly to blame for the default configuration in RocksDB that throttles writes when the LSM tree gets too much data in the L0, L1 and L2. But that configuration can be changed.
  3. SQLite4  has a clever LSM designed for systems that don't allow background threads. It implements a pay as you go approach to durability debt. A traditional LSM takes the opposite approach - it defers the IO cost to the background. RocksDB has optional write throttling and work has been done to smooth the impact from it but it is not solved. A B-Tree in the worst-case (buffer pool full & mostly dirty, working set not cached) also implements pay as you go approach.
  4. I almost always disable sync-on-commit for benchmarks because I want to observe how the DBMS behaves under stress, and less commit latency means more writes/second and more IO stress.
  5. See item #6 where I argue that it is good to not have the working set cached.
  6. A common rule of thumb has been to keep all indexes in cache or all of the working set in cache. That simplifies tuning and makes it easier to avoid performance problems. But that also might force a deployment to use 2X more HW than it needs because NAND flash SSDs are everywhere and the response time difference between reading from RAM and reading from NAND flash might not matter for many applications. But if you are using a DBMS in the cloud that charges by the IO, then keeping the working set in RAM might be a good idea.
  7. An LSM usually has less write-amp than a B-Tree. So the IO capacity it saves from that can be used elsewhere to support more read or write transactions.
Worst case behavior

I am wary of faster is better. I prefer nuance but I also know that people don't have time to read long blog posts like this or long performance reports. Here I explain worst case behavior in terms of IO overheads. Worst case behavior isn't the only way to judge an index structure but it helps me to explain performance. Another way is to measure the average amount of IO per transaction (in operations and KB) and treat IO efficiency as important.

I describe worst case behavior for a write operation under a few scenarios. By worst case I mean the largest amount of IO done in the foreground (the thread handling the write) as that determines the response time. I ignore the work done in the background which favors an LSM because that defers more work to the background. For a B-Tree I ignore undo and page splits. The write is a SQL update which is read-modify-write, as opposed to a blind-write like a Put with RocksDB. Finally, I assume the update isn't to an indexed column. The scenarios are:
  1. Cached, PK only - working set cached, PK index only
  2. Not cached, PK only - working set not cached, PK index only
  3. Cached, PK and secondary index - working set cached, PK and non-unique secondary index
  4. Not cached, PK and secondary index - working set not cached, PK and non-unique secondary index 
PK only

For the cached, PK only scenario neither an LSM nor a B-Tree do IO in the foreground with the exception of the redo log fsync. Stalls are unlikely for both but more likely with a B-Tree especially when the DBMS storage uses a spinning disk.
  • An LSM writes the redo log buffer, optionally syncs the redo log and then does an insert into the memtable. Both memtable flush and Ln:Ln+1 compaction are deferred to background threads. If memtable flush were too slow then there are write stalls until flush catches up to avoid too many memtables wasting memory.
  • A B-Tree modifies a page in the buffer pool, writes the redo log buffer and optionally syncs the redo log. If checkpoint were too slow a full redo log can't be rotated until checkpoint catches up and there are write stalls.
For the not cached, PK only scenario the work done in the foreground is 1 IO/update for an LSM and 2 IO/update for a B-Tree. Here a B-Tree uses a pay as you go model.
  • An LSM reads a page into the block cache and then repeats the work described in cached, PK only
  • A B-Tree finds a dirty page to evict, writes that page back to storage, then reads the desired page into that slot in the buffer pool and repeats the work described in cached, PK only.

PK and secondary index

For the cached, PK and secondary index scenario there is approximately twice as much work to be done per update compared to the cached, PK only scenario. Thus stalls are more likely here. But other than the optional redo fsync there is no foreground IO for the LSM and B-Tree.
  • An LSM repeats the work explained in the cached, PK only scenario. For the secondary index it does an additional insert to the memtable which is also logged as redo. This can double the demand for compaction.
  • A B-Tree repeats the work explained in the cached, PK only scenario. For the secondary index it makes an additional page dirty in the buffer pool. This can double the demand for page write back.
For the not cached, PK and secondary index scenario the foreground IO difference between an LSM and B-Tree is more significant -- 1 IO for the LSM vs 4 IO for the B-Tree -- ignoring the redo log overhead. The IO difference is reduced from 1:4 to approximately 1:2 for a B-Tree like InnoDB that implements a change buffer.
  • An LSM does the union of the work described in not cached, PK only and cached, PK and secondary index scenarios. Ignoring the optional redo fsync the cost is 1 read IO for the PK index and no reads for the secondary index because non-unique secondary index maintenance is read-free.
  • A B-Tree repeats the work explained in the cached, PK only scenario but this is done for both the PK and secondary indexes. Thus the cost is 2 IOs to write back dirty pages and then 2 IOs to read pages from the PK and secondary indexes into the buffer pool and then make them dirty -- which then requires redo log writes. So the cost for this is 4 IOs ignoring the redo log.

Make writes fast: LSM

Writes can be fast with an LSM because most of the IO cost is deferred but that also increases the need to throttle writes. Life is good as long as that deferred cost can be repaid fast enough, otherwise there will be more response time variance.

Flush and compaction are the deferred cost for an LSM write. Flush means writing the memtable to an SST on storage. Compaction means merging SSTs to move flushed data from the root to the leaf of the LSM tree. Compaction costs more than flush. RocksDB can stall writes when compaction doesn't keep up with ingest. Ingest creates durability debt, compaction reduces it and write stalls are there to bound the debt. Write stalls are enabled by default but can be disabled by configuration. Putting a bound on durability debt also puts a bound on read latency by reducing the number of SSTs that can exist in the L0, L1 and L2. So if you want to support extremely high write rates, then choose one of: read stalls or write stalls.

Make writes fast: B-Tree

Writes can also be fast with a B-Tree as there are no page reads/writes to/from storage when the working set is cached and background page write back is fast enough. In that case the only IO work in the foreground is the optional redo log fsync.

Page write back is the primary deferred cost for a B-Tree write. Most of my B-Tree experience is with InnoDB which does fuzzy checkpoint. The goal is to flush dirty pages before the current redo log segment gets full. Using larger redo log segments lets InnoDB defer write back for a longer time increasing the chance that more transactions will modify the page -- reducing write amplification and helping performance.

Purge can be an additional deferred cost for a B-Tree write. I use the InnoDB name here as Postgres calls this vacuum. This is the process of reclaiming space from deleted rows that are no longer visible by open MVCC snapshots. The LSM equivalent of purge is checking the open snapshot list during compaction for KV pairs that are not the latest version of a given key to determine whether that version is still needed.

When write back and purge are fast enough then write stalls should be infrequent with a B-Tree. But write back isn't always fast enough. A B-Tree write stall occurs when a write transaction must read a page into the buffer pool prior to modifying that page but 1) the buffer pool is full and 2) write back must be done for a dirty page before the memory can be reused.

Other

A few other comments that didn't have a place above:
  • In this post I assume the B-Tree uses no-force, but there is at least one nice OSS B-Tree that uses force.
  • Making commit slower is another way to throttle writes and reduce the chance of stalled writes. Examples of this include redo log fsync, semisync or synchronous replication.
  • The InnoDB change buffer is a wonderful feature that reduces the IO overhead for write-heavy workloads.
  • NAND flash GC stalls are another source of write stalls. I wish more were written about this topic.
  • Stalls during TRIM when using an LSM with NAND flash are another source of stalls. I wish there were more TRIM benchmarks. Smart friends tell me that NAND flash devices vary widely in their ability to handle TRIM. And they display different stall behavior when their TRIM capacity has been exceeded. Some of us were spoiled by FusionIO.

Repair GTID Based Slave on Percona Cluster


Problem:

We are running a 5-node Percona cluster on Ubuntu 16.04, configured with master-slave replication. Suddenly we got a replication-broken alert from the slave server, which had earlier been configured with normal replication.


We tried to sync the data and reconfigure the replication, but were unable to fix it immediately due to the huge transactions and the GTID-enabled servers. So we decided to rebuild the slave with the innobackupex tool, and the problem was fixed in 2 hours.

We followed the steps from the Percona documentation; below is the experience from my environment.

Steps involved in repairing the broken replication:

1. Backup the master server
2. Prepare the backup
3. Restore and configure the replication
4. Check the replication status

1. Backup the master server

We need to replicate the complete master server database onto the slave, so we are taking a full backup from the master server. Before proceeding with the backup, we should check the disk space available for it, because it is a system-level backup.


We created a specific user for taking the backup from the master server. Once the backup is completed, we will get an OK message like below:


2. Prepare the backup for Restore

We need to prepare the backup to apply the transaction logs to the data files; once this finishes OK, the data files are ready to restore.


Before moving the prepared files to the slave server, verify the GTID information from xtrabackup_binlog_info.


3. Restore and Configure the Replication

We can restore the backup in place, or else create a new data directory and move the files into it. We chose to create a new directory and change the datadir value in the mysqld.conf file.

Once the data directory is changed, we need to change the owner and permissions of the MySQL data directory:

chown mysql:mysql /mnt/mysqldatanew
Then restart the service with the new data directory. Once it has started, log in with the master's MySQL root password: because we took a file-level backup from the master, the metadata (including the user accounts) is the same as on the master.

Execute the commands below to configure the replication:
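
A minimal sketch of the usual GTID-based setup (the host, user, password, and GTID set below are placeholders; the GTID set comes from the xtrabackup_binlog_info file of the restored backup):

-- On the rebuilt slave: declare the GTIDs already contained in the backup,
-- then point the slave at the master using auto-positioning.
RESET MASTER;
SET GLOBAL gtid_purged = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:1-12345';
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '******',
  MASTER_AUTO_POSITION = 1;
START SLAVE;
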
4. Check Replication Status

Once the slave is configured, verify the replication status as below:
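
For reference, the replication status can be checked with:

SHOW SLAVE STATUS\G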


Also, the slave has retrieved new transactions:


Thanks for Reading !!!

Scaling ProxySQL rapidly in Kubernetes


It’s not uncommon these days for us to use a high availability stack for MySQL consisting of Orchestrator, Consul, and ProxySQL. You can read more details about this stack by reading Matthias Crauwels’ blog post How To Autoscale ProxySQL In The Cloud as well as Ivan Groenwold’s post on MySQL High Availability With ProxySQL, Consul And Orchestrator. But the high-level concept is simply that Orchestrator will monitor the state of the MySQL replication topology and report changes to Consul which in turn can update ProxySQL hosts using a tool called consul-template.

Until now we’ve typically implemented the ProxySQL portion of this stack using an autoscaling group of sorts due to the high levels of CPU usage that can be associated with ProxySQL. It’s better to be able to scale up and down as traffic increases and decreases because this ensures you’re not paying for resources that you don’t need; however, this comes with a few disadvantages. The first of which is the amount of time it takes to scale up. If you are using an autoscaling group and a new instance is launched, the following steps will need to be taken:

  1. There will be a request to your cloud service provider for a new VM instance.
  2. Once the instance is up and running as part of the group, it will need to install ProxySQL along with supporting packages such as consul (agent) and consul-template.
  3. Once the packages are installed, they will need to be configured to work with the consul server nodes as well as the ProxySQL nodes that are participating in the ProxySQL cluster.
  4. The new ProxySQL host will announce to Consul that it’s available, which in turn will update all the other participating nodes in the ProxySQL cluster.

This can take time. Provisioning a new VM instance usually happens fairly quickly, normally within a couple minutes, but sometimes there can be unexpected delays. You can speed up package installation by using a custom machine image, but there is an operational overhead with keeping images up to date with the latest versions of the installed packages, so it may be easier to do this using a script that always installs the latest versions. All in all, you can expect a scale up to take more than a minute.

The next issue is how deterministic this solution is. If you’re not using a custom machine image, you’ll need to pull down your config and template files from somewhere, most likely a storage bucket, and there’s a chance that those files could be overwritten – meaning that the next time an instance is launched by the autoscaler it may not necessarily have the same configuration as the rest of the hosts participating in the ProxySQL cluster.

We can take this already impressive stack and take it another step further using Docker containers and Kubernetes.

For those of you who are unfamiliar with containerization: a container is similar to a virtual machine snapshot but is not a full snapshot that would include the OS; instead, it contains just the binary that’s required to run your process. You create this image using a Dockerfile, typically starting from a specified Linux distribution, and then use verbs like RUN, COPY, and USER to specify what should be included in your container “image”. Once this image is constructed, it can be centrally located in a repository and made available for usage by machines using a containerization platform like Docker. This method of deployment has become more and more popular in recent years due to the fact that containers are lightweight, and you know that if the container works on one system it will work exactly the same way when it’s moved to a different system, thus reducing common issues like dependencies and configuration variations from host to host.

Given that we want to be able to scale up and down, it’s safe to say we’re going to want to run more than one container. That’s where Kubernetes comes into play. Kubernetes is a container management platform that operates on an array of hosts (virtual or physical) and distributes containers on them as specified by your configuration, typically a YAML-format Kubernetes deployment file. If you’re using Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), this is even easier as the vast majority of the work in creating a Kubernetes deployment (referred to as a ‘workload’ in GKE) YAML is handled for you via a simple UI within the GCP Console.

If you want to learn more about Docker or Kubernetes, I highly recommend Nigel Poulton’s video content on Pluralsight. For now, let’s stick to learning about ProxySQL on this platform.

If we want ProxySQL to run in Kubernetes and operate with our existing stack with Consul and Orchestrator, we’re going to need to keep best practices in mind for our containers.

  1. Each container should run only a single process. We know that we’re working with ProxySQL, consul (agent), and consul-template, so these will all need to be in their own containers.
  2. The primary process running in each container should run as PID 1.
  3. The primary process running in each container should not run as root.
  4. Log output from the primary process in the container should be sent to STDOUT so that it can be collected by Docker logs.
  5. Containers should be as deterministic as possible – meaning they should run the same (or at least as much as possible) regardless of what environment they are deployed in.

The first thing in the list above that popped out is the need to have ProxySQL, consul-template, and consul (agent) isolated within their own containers. These are going to need to work together given that consul (agent) is acting as our communication conduit back to consul (server) hosts and consul-template is what updates ProxySQL based on changes to keys and values in Consul. So how can they work together like this if they are in separate containers?

The solution is provided by Kubernetes. When you’re thinking about Docker, the smallest computational unit is the container; however, when you’re thinking about Kubernetes, the smallest computational unit is the pod which can contain one or more containers. Any containers operating within the same pod can communicate with one another using localhost ports. So in this case, assuming you’re using default ports, the consul-template container can communicate to the consul (agent) container using localhost port 8500 and it can communicate to the ProxySQL container using port 6032 given that these three containers will be working together in the same pod.

So let’s start looking at some code, starting with the simplest container and then working our way to the most complex.

Consul (Agent) Container

Below is a generic version of the Dockerfile that I’m using for consul (agent). The objective is to install Consul and then instruct it to connect as an agent to the Consul cluster comprised of the consul (server) nodes.

FROM centos:7
RUN yum install -q -y unzip wget && \
  yum clean all
RUN groupadd consul && \
  useradd -r -g consul -d /var/lib/consul consul
RUN mkdir /opt/consul && \
  mkdir /etc/consul && \
  mkdir /var/log/consul && \
  mkdir /var/lib/consul && \
  chown -R consul:consul /opt/consul && \
  chown -R consul:consul /etc/consul && \
  chown -R consul:consul /var/log/consul && \
  chown -R consul:consul /var/lib/consul
RUN wget -q -O /opt/consul/consul.zip https://releases.hashicorp.com/consul/1.6.1/consul_1.6.1_linux_amd64.zip && \
  unzip /opt/consul/consul.zip -d /opt/consul/ && \
  rm -f /opt/consul/consul.zip && \
  ln -s /opt/consul/consul /usr/local/bin/consul
COPY supportfiles/consul.conf.json /etc/consul/
USER consul
ENTRYPOINT ["/usr/local/bin/consul", "agent", "--config-file=/etc/consul/consul.conf.json"]

Simply put, the code above follows these instructions:

  1. Start from CentOS 7. This is a personal preference of mine. There are probably more lightweight distributions that can be considered, such as Alpine as recommended by Google, but I’m not the best OS nerd out there so I wanted to stick with what I know.
  2. Install our dependencies, which in this case is unzip and wget.
  3. Create our consul user, group, and directory structure.
  4. Install consul.
  5. Copy over the consul config file from the host where the Docker build is being performed.
  6. Switch to the consul user.
  7. Start consul (agent).

Now let’s check the code and see if it matches best practices.

  • Container runs a single process
    • The ENTRYPOINT runs Consul directly, meaning that nothing else is being run. Keep in mind that ENTRYPOINT specifies what should be run when the container starts. This means that when the container starts it won’t have to install anything because the packages come with the image as designated by the Dockerfile, but we still need to launch Consul when the container starts.
  • Process should be PID 1
    • Any process ran by ENTRYPOINT will run as PID 1.
  • Process should not be run as root
    • We switched to the Consul user prior to starting the entrypoint.
  • Log output should go to STDOUT
    • If you run Consul using the command noted in the ENTRYPOINT, you’ll see that log output goes to STDOUT
  • Should be as deterministic as possible
    • We’ve copied the configuration file into the container, meaning that the container doesn’t have to get support files from anywhere else before Consul starts. The only way the nature of Consul will change is if we recreate the container image with a new configuration file.

There’s really nothing special about the Consul configuration file that gets copied into the container. You can see an example of this by checking the aforementioned blog posts by Matthias or Ivan for this particular HA stack.

ProxySQL Container

Below is a generic version of the Dockerfile that I’m using for ProxySQL. The objective is to install ProxySQL and make it available to receive traffic requests on 6033 for write traffic, 6034 for read traffic, and 6032 for the admin console which is how consul-template will interface with ProxySQL.

FROM centos:7
RUN groupadd proxysql && \
  useradd -r -g proxysql proxysql
RUN yum install -q -y https://github.com/sysown/proxysql/releases/download/v2.0.6/proxysql-2.0.6-1-centos67.x86_64.rpm mysql curl && \
  yum clean all
COPY supportfiles/* /opt/supportfiles/
COPY startstop/* /opt/
RUN chmod +x /opt/entrypoint.sh
RUN chown proxysql:proxysql /etc/proxysql.cnf
USER proxysql
ENTRYPOINT ["/opt/entrypoint.sh"]

Simply put, the code above follows these instructions:

  1. Start from CentOS 7.
  2. Create our ProxySQL user and group.
  3. Install ProxySQL and its dependencies, which in this case include curl, which will be used to poll the GCP API in order to determine which region the ProxySQL cluster is in. We’ll cover this in more detail below.
  4. Move our configuration files and entrypoint script to the container.
  5. Make sure that the ProxySQL config file is readable by ProxySQL. 
  6. Switch to the ProxySQL user.
  7. Start ProxySQL via the entrypoint script that’s provided with the container.

In my use case, I have multiple ProxySQL clusters – one per GCP region. They have to be logically grouped together in order to ensure they route read traffic to replicas within the local region but send write traffic to the master regardless of what region it’s in. A hostgroup is defined for the read replicas in each region, so my mysql_query_rules table needs to be configured accordingly: the MySQL hosts are added to different hostgroups, but the routing to each hostgroup remains consistent. Given that it’s highly unlikely to change, I have mysql_query_rules configured in the configuration file. This means that I need to select the correct configuration file based on my region before starting ProxySQL, and this is where my entrypoint script comes into play. Let’s have a look at a simplified and more generic version of my code:

#!/bin/bash
dataCenter=$(curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H "Metadata-Flavor: Google" | awk -F "/" '{print $NF}' | cut -d- -f1,2)
...
case $dataCenter in
  us-central1)
    cp -f /opt/supportfiles/proxysql-us-central1.cnf /etc/proxysql.cnf
    ;;
  us-east1)
    cp -f /opt/supportfiles/proxysql-us-east1.cnf /etc/proxysql.cnf
    ;;
esac
...
exec proxysql -c /etc/proxysql.cnf -f -D /var/lib/proxysql

The script starts by polling the GCP API to determine what region the container has been launched in. Based on the result, it will copy the correct config file to the appropriate location and then start ProxySQL.

Let’s see how the combination of the Dockerfile and the entrypoint script allows us to meet best practices.

  • Container runs a single process
    • ENTRYPOINT calls the entrypoint.sh script, which does some conditional logic based on the regional location of the container and then ends by running ProxySQL. This means that at the end of the process ProxySQL will be the only process running.
  • Process should be PID 1
    • The command “exec” at the end of the entrypoint script will start ProxySQL as PID 1
  • Process should not be run as root
    • We switched to the proxysql user prior to starting the entrypoint.
  • Log output should go to STDOUT
    • If you run proxysql using the command noted at the end of the entrypoint script you’ll see that log output goes to STDOUT
  • Should be as deterministic as possible
    • We’ve copied the potential configuration files into the container. Unlike Consul, there are multiple configuration files and we need to determine which will be used based on the region that the container lives in, but the configuration files themselves will not change unless the container image itself is updated. This ensures that all containers running within the same region will behave the same.

Consul-template container

Below is a generic version of the Dockerfile that I’m using for consul-template. The objective is to install consul-template and have it act as the bridge between Consul via the consul (agent) container and ProxySQL, updating ProxySQL as needed when keys and values change in Consul.

FROM centos:7
RUN yum install -q -y unzip wget mysql nmap-ncat curl && \
  yum clean all
RUN groupadd consul && \
  useradd -r -g consul -d /var/lib/consul consul
RUN mkdir /opt/consul-template && \
  mkdir /etc/consul-template && \
  mkdir /etc/consul-template/templates && \
  mkdir /etc/consul-template/config && \
  mkdir /opt/supportfiles && \
  mkdir /var/log/consul/ && \
  chown -R consul:consul /etc/consul-template && \
  chown -R consul:consul /etc/consul-template/templates && \
  chown -R consul:consul /etc/consul-template/config && \
  chown -R consul:consul /var/log/consul
RUN wget -q -O /opt/consul-template/consul-template.zip https://releases.hashicorp.com/consul-template/0.22.0/consul-template_0.22.0_linux_amd64.zip && \
  unzip /opt/consul-template/consul-template.zip -d /opt/consul-template/ && \
  rm -f /opt/consul-template/consul-template.zip && \
  ln -s /opt/consul-template/consul-template /usr/local/bin/consul-template
RUN chown -R consul:consul /opt/consul-template
COPY supportfiles/* /opt/supportfiles/
COPY startstop/* /opt/
RUN chmod +x /opt/entrypoint.sh
USER consul
ENTRYPOINT ["/opt/entrypoint.sh"]

Simply put, the code above follows these instructions:

  1. Start from CentOS 7.
  2. Install our dependencies which are unzip, wget, mysql (client), nmap-ncat, and curl.
  3. Create our Consul user and group.
  4. Create the consul-template directory structure.
  5. Download and install consul-template.
  6. Copy the configuration file, template files, and entrypoint script to the container.
  7. Make the entrypoint script executable.
  8. Switch to the Consul user.
  9. Start consul-template via the entrypoint script that’s provided with the container.

Much like our ProxySQL container, we really need to look at the entrypoint here in order to get the whole story. Remember, this is multi-region so there is additional logic that has to be considered when working with template files.

#!/bin/bash
# Determine which GCP region this container runs in (same approach as the ProxySQL entrypoint).
dataCenter=$(curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H "Metadata-Flavor: Google" | awk -F "/" '{print $NF}' | cut -d- -f1,2)
...
# Put the consul-template configuration and the region-specific mysql_servers template in place.
cp /opt/supportfiles/consul-template-config /etc/consul-template/config/consul-template.conf.json
case $dataCenter in
  us-central1)
    cp /opt/supportfiles/template-mysql-servers-us-central1 /etc/consul-template/templates/mysql_servers.tpl
    ;;
  us-east1)
    cp /opt/supportfiles/template-mysql-servers-us-east1 /etc/consul-template/templates/mysql_servers.tpl
    ;;
esac
cp /opt/supportfiles/template-mysql-users /etc/consul-template/templates/mysql_users.tpl
### Ensure that proxysql has started
while ! nc -z localhost 6032; do
  sleep 1;
done
### Ensure that consul agent has started
while ! nc -z localhost 8500; do
  sleep 1;
done
exec /usr/local/bin/consul-template --config=/etc/consul-template/config/consul-template.conf.json

This code is very similar to the entrypoint file that was used for ProxySQL, in the sense that it checks which region the container is in and then moves configuration and template files into the appropriate locations, but there is some additional logic here that checks that ProxySQL is up and listening on port 6032 and that consul (agent) is up and listening on port 8500. The reason for this is that consul-template needs to be able to communicate with both of these services. You really have no assurance as to which container is going to start in which order in a pod, so to avoid excessive errors in the consul-template log, I have it wait until it knows that its dependent services are running.

Let’s go through our best practices checklist one more time against our consul-template container code.

  • Container runs a single process.
    • ENTRYPOINT calls the entrypoint.sh script, which does some conditional logic based on the regional location of the container and then ends by running consul-template. This means that at the end of the process consul-template will be the only process running.
  • Process should be PID 1.
    • The command “exec” at the end of the entrypoint script will start consul-template as PID 1.
  • Process should not be run as root.
    • We switched to the consul user prior to starting the entrypoint.
  • Log output should go to STDOUT.
    • If you run consul-template using the command noted at the end of the entrypoint script, you’ll see that log output goes to STDOUT.
  • Should be as deterministic as possible.
    • Just like ProxySQL and consul (agent), all the supporting files are packaged with the container. Yes, there is logic to determine what files should be used, but you have the assurance that the files won’t change unless you create a new version of the container image.

Putting it all together

Okay, we have three containers that represent the three processes that we need to package together so ProxySQL can work as part of our HA stack. Now we need to put it all together in a pod so that Kubernetes can run it on our resources.

In my use case, I’m running this on GCP, meaning that once my containers have been built they are going to need to be pushed up to the Google Container Registry. Once that’s done we can create our workload to run our pod and specify how many pods we want to run.

Getting this up and running can be done with just a few short and simple steps:

  1. Create a Kubernetes cluster if you don’t already have one. This is what provisions the Cloud Compute VMs that the pods will run on.
  2. Push your three Docker images to the Google Container Registry. This makes the images available for use by the Kubernetes engine (see the example commands after this list).
  3. Create your Kubernetes workload, which can be done simply via the user interface in the GCP console. All that’s required is selecting the latest version of the three containers that you’ve pushed up to the registry, optionally applying some metadata like an application name, Kubernetes namespace, and labels, then selecting which cluster you want to run the workload on.
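
If you prefer the command line over the GCP console UI, the flow looks roughly like the sketch below. Treat it as a sketch only: the project ID, cluster name, region, and manifest file name are placeholders, and the image names simply mirror the deployment YAML shown later in this post.

# Placeholders: adjust project, cluster name, region, and manifest to your environment.
PROJECT=my-gcp-project

# 1. Tag and push the three images to the Google Container Registry.
for img in pythian-proxysql-proxysql pythian-proxysql-consul-agent pythian-proxysql-consul-template; do
  docker tag ${img}:latest gcr.io/${PROJECT}/${img}:latest
  docker push gcr.io/${PROJECT}/${img}:latest
done

# 2. Create a Kubernetes cluster (skip if you already have one) and fetch credentials.
gcloud container clusters create proxysql-cluster --region=us-central1 --num-nodes=1
gcloud container clusters get-credentials proxysql-cluster --region=us-central1

# 3. Create the workload; the console UI generates an equivalent Deployment for you.
kubectl apply -f pythian-proxysql-deployment.yaml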

Once you click deploy, the containers will spin up and, assuming there are no issues bringing the containers online, you’ll quickly have a functioning ProxySQL pod in Kubernetes that follows these high-level steps (a quick status check example follows the list):

  1. The pod is started.
  2. The three containers start. In Kubernetes, pods are atomic: either all the containers start without error, or the pod does not consider itself started.
  3. The consul-template container will poll consul (agent) and ProxySQL on their respective ports until it’s confirmed that those processes have started and then consul-template will start.
  4. Consul-template will create the new SQL files meant to configure ProxySQL based on the contents of the Consul key/value store.
  5. Consul-template will run the newly created SQL files against ProxySQL via its admin interface.
  6. The pod is now ready to receive traffic.
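
Once the workload is deployed, a quick status check from the command line might look like this. The namespace, label, and container name below are taken from the deployment YAML in the next section; the pod name itself is a placeholder.

# Placeholders: the pod name varies per deployment; namespace/labels come from the YAML below.
kubectl get pods -n pythian-proxysql
kubectl describe pods -n pythian-proxysql -l app=pythian-proxysql
kubectl logs -n pythian-proxysql <pod-name> -c pythian-proxysql-proxysql-sha256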

The YAML

During the process of creating your workload, or even after the fact, you’ll be able to see the YAML that you’d normally have to create with standard Kubernetes deployments. Let’s have a look at the YAML that was created for my particular deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2019-10-16T15:41:37Z"
  generation: 64
  labels:
    app: pythian-proxysql
    env: sandbox
  name: pythian-proxysql
  namespace: pythian-proxysql
  resourceVersion: "7516809"
  selfLink: /apis/apps/v1/namespaces/pythian-proxysql/deployments/pythian-proxysql
  uid: 706c6284-f02b-11e9-8f3e-42010a800050
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: pythian-proxysql
      env: sandbox
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: pythian-proxysql
        env: sandbox
    spec:
      containers:
      - image: gcr.io/pythian-proxysql/pythian-proxysql-proxysql@sha256:3ba95101eb7a5aac58523e4c6489956869865452d1cbdbd32b4186a44f2a4500
        imagePullPolicy: IfNotPresent
        name: pythian-proxysql-proxysql-sha256
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - image: gcr.io/pythian-proxysql/pythian-proxysql-consul-agent@sha256:7c66fa5e630c4a0d70d662ec8e9d988c05bd471b43323a47e240294fc00a153d
        imagePullPolicy: IfNotPresent
        name: pythian-proxysql-consul-agent-sha256
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - image: gcr.io/pythian-proxysql/pythian-proxysql-consul-template@sha256:1e70f4b96614dfd865641bf75784d895a794775a6c51ce6b368387591f3f1918
        imagePullPolicy: IfNotPresent
        name: pythian-proxysql-consul-template-sha256
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 2
  collisionCount: 1
  conditions:
  - lastTransitionTime: "2019-10-16T15:41:37Z"
    lastUpdateTime: "2019-11-11T15:56:55Z"
    message: ReplicaSet "pythian-proxysql-8589fdbf54" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-11T20:41:31Z"
    lastUpdateTime: "2019-11-11T20:41:31Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 64
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

The first thing I have to point out is that this is a LOT of YAML that we didn’t have to create, given that it was all handled by the Google Kubernetes Engine. This is a huge part of easing the process, and it allows us to get our solution working so quickly.

However, despite the fact that we have a lot of YAML created for us, there are still some occasions where we may need to modify this manually, such as working with Kubernetes Container Lifecycle Hooks, or working with requests or limits for hardware resources for individual containers in our pod.

How do I access my ProxySQL instance?

One consideration for Kubernetes is that when pods are started and stopped they get an ephemeral IP address, so you don’t want to have your applications connect to your pods directly. Kubernetes has a feature called a “service” that allows your pods to be exposed via a consistent network interface. This service can also handle load balancing, which is what I’m planning on using with my Kubernetes deployment. Adding a service to your GKE workload is very simple and can be done with a few clicks.
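
As a rough illustration, exposing the deployment behind a load-balanced service can also be done with a single kubectl command. The service name and ports below are assumptions (6033 is ProxySQL’s default client port), not the exact values used in this deployment.

# Expose the ProxySQL deployment via a load-balanced service (illustrative values).
kubectl expose deployment pythian-proxysql \
  --namespace=pythian-proxysql \
  --name=pythian-proxysql-svc \
  --type=LoadBalancer \
  --port=3306 \
  --target-port=6033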

Autoscaling

As noted earlier in this post, before the implementation of Kubernetes for this solution, it was recommended to use cloud compute autoscaling groups in order to handle fluctuations in traffic. We’re going to want to include the same strategy with Kubernetes to ensure we have enough pods available to handle traffic demand. Including autoscaling in your workload is also fairly simple and can be done via the console UI.

One important thing to note about scaling with Kubernetes is the time it takes to scale up and down. In the intro section of this post, I noted the process of adding and removing nodes from an autoscaling group and how that can take minutes to achieve depending on how quickly your cloud provider can stand up a new instance and the complexity of your configuration. With Kubernetes, I’ve seen my pods scale up in as little as three seconds and scale down in less than one second. This is part of what makes this solution so powerful.
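
For reference, a Horizontal Pod Autoscaler can be attached to the deployment with one command. The CPU threshold and replica counts below are illustrative values only, not the ones used in this environment.

# Autoscale between 2 and 10 pods based on CPU utilization (illustrative values).
kubectl autoscale deployment pythian-proxysql \
  --namespace=pythian-proxysql \
  --cpu-percent=70 --min=2 --max=10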

Considerations for Connections During Scale Up and Down

One important thing to note is that, with pods being added to and removed from the workload, your connections to ProxySQL via the exposed service can be interrupted. It’s noted in the autoscaling documentation that this can cause disruption, and your application needs to be able to handle this much in the same way it would for a cloud compute autoscaling group. You’ll want to ensure that your application has retry-on-database-failure logic built in before incorporating Kubernetes autoscaling (or any autoscaling, for that matter) as part of your data platform.

Considerations for MySQL users in ProxySQL

There are three tables that are replicated when working with ProxySQL cluster: mysql_servers, mysql_query_rules, and mysql_users – meaning that when a change to any of these tables is made on one of the nodes in the cluster, it will be replicated to all the other nodes. 

We really don’t need to worry about this when working with mysql_servers given that all nodes will get their mysql_server information from Consul via consul-template, so I’ve disabled this clustering feature.

With my particular use case I don’t need to worry about mysql_query_rules either because, as noted earlier in this post, my traffic is being routed based on the port that traffic is being sent to. The rules for this are simple and should not change so I have it in the configuration file and I have disabled replicating this table, as well.

The last table to consider is mysql_users and this is where things get interesting. Remember that with Kubernetes it’s possible to have persistent storage, but we really want our containers to be as stateless as possible, so if we were to follow the Docker and Kubernetes philosophy as closely as possible we wouldn’t want to have our data persist. This falls into the whole cattle vs pets discussion when working with containers, but I digress.

Let’s assume we’ve opted NOT to persist our ProxySQL data, typically stored in SQLite, and we lose all of the pods in our Kubernetes cluster. Unlikely, but we always need to be ready for disaster. When the first pod comes up, it’s starting with a blank slate. This isn’t a problem considering it will get its initial set of mysql_server data from Consul via consul-template and it’ll get its mysql_query_rules data from the config file. However, there is no source of truth for mysql_users data, so all that data would be lost.

In this case, we need to incorporate some source of truth for the ProxySQL mysql_users table. It’s possible to use a cloud compute VM with ProxySQL installed that could be an ever-present member of the cluster which could seed data for new joining pods, but that breaks our construct of working specifically with containers. Plus, if you have a multi-cluster configuration like I do where there is one cluster in each region, then you need one ProxySQL “master host” in each region, which is a bit of a waste considering it’s just acting as a source of truth for mysql_users, which likely will be the same across all clusters.

My solution, in this case, is to leverage the source of truth that we already have in place: Consul. If it’s already acting as a source of truth for mysql_servers, there’s no reason why it can’t act as a source of truth for this as well.

All I need to do is have my MySQL users and password hashes (always stay secure) in Consul and use consul-template to create these on new ProxySQL hosts, or change them as keys and values change. You may have noticed this in the entrypoint script in my consul-template container.
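
To illustrate the idea, the SQL that consul-template renders and applies for mysql_users might look something like the sketch below. The Consul key layout, the hostgroup, the admin credentials, and the example hash are all assumptions for illustration, not the exact template used in this deployment.

# Hypothetical rendered output of mysql_users.tpl, applied to the ProxySQL admin interface.
mysql -h127.0.0.1 -P6032 -uadmin -padmin <<'SQL'
-- One row per user key found under the Consul KV prefix (e.g. proxysql/mysql_users/<user>).
REPLACE INTO mysql_users (username, password, default_hostgroup)
VALUES ('app_user', '*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19', 10);
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;
SQL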

To Cluster or Not To Cluster?

I mentioned before that ProxySQL cluster handles the replication of three tables: mysql_users, mysql_query_rules, and mysql_servers. Considering that all three of these tables now have their own source of truth, we really don’t need to worry about replicating this data. As changes are reported to Consul, it will update all the ProxySQL pods considering that all of them have consul (agent) and consul-template containers as part of the pod.

With this in mind, I’ve opted to rely on my constructed sources of truth and reduce solution complexity by removing ProxySQL clustering; however, this is going to vary from use case to use case.

Conclusion

The solution implemented in this use case has required the inclusion of a lot of new technologies that MySQL DBAs may or may not have familiarity with: ProxySQL, Orchestrator, Consul, GTIDs, etc. We’ve made this solution a little more complex by adding Docker and Kubernetes to the stack, but I personally believe this complexity is worth it considering the higher degree of idempotency that is built into the solution, the lack of need for ProxySQL clustering, and the speed in which scale up and scale down occurs.

One last consideration is the simple need for learning how to incorporate containers into your stack. This is not my first blog post on container philosophy and implementation, and I believe that containers are going to become a greater part of the landscape for all of us – even us, the database professionals with our highly stateful technological challenges. If you have not already started educating yourself on these technologies, I would highly encourage you to do so in order to better prepare yourself for the shift from “Database Administrator” to “Database Reliability Engineer”.

Comparing S3 Streaming Tools with Percona XtraBackup


Making backups over the network can be done in two ways: either save on disk and transfer or just transfer without saving. Both ways have their strong and weak points. The second way, particularly, is highly dependent on the upload speed, which would either reduce or increase the backup time. Other factors that influence it are chunk size and the number of upload threads.

Percona XtraBackup 2.4.14 has gained S3 streaming, which is the capability to upload backups directly to s3-compatible storage without saving locally first. This feature was developed because we wanted to improve the upload speeds of backups in Percona Operator for XtraDB Cluster.

There are many implementations of S3 Compatible Storage: AWS S3, Google Cloud Storage, Digital Ocean Spaces, Alibaba Cloud OSS, MinIO, and Wasabi.

We’ve measured the speed of AWS CLI, gsutil, MinIO client, rclone, gof3r and the xbcloud tool (part of Percona XtraBackup) on AWS (in single and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: a default configuration and one with a tuned chunk size and number of upload threads.

Here are the results.

AWS (Same Region)

The backup data was streamed from the AWS EC2 instance to the AWS S3, both in the us-east-1 region.

 

 

tool         | settings               | CPU  | max mem | speed    | speed comparison
AWS CLI      | not changeable         | 66%  | 149Mb   | 130MiB/s | baseline
MinIO client | not changeable         | 10%  | 679Mb   | 59MiB/s  | -55%
rclone rcat  | not changeable         | 102% | 7138Mb  | 139MiB/s | +7%
gof3r        | default settings       | 69%  | 252Mb   | 97MiB/s  | -25%
gof3r        | 10Mb block, 16 threads | 77%  | 520Mb   | 108MiB/s | -17%
xbcloud      | default settings       | 10%  | 96Mb    | 25MiB/s  | -81%
xbcloud      | 10Mb block, 16 threads | 60%  | 185Mb   | 134MiB/s | +3%

 

Tip: If you run MySQL on an EC2 instance to make backups inside one region, do snapshots instead.

AWS (From US to EU)

The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.

 

 

tool         | settings               | CPU  | max mem | speed    | speed comparison
AWS CLI      | not changeable         | 31%  | 149Mb   | 61MiB/s  | baseline
MinIO client | not changeable         | 3%   | 679Mb   | 20MiB/s  | -67%
rclone rcat  | not changeable         | 55%  | 9307Mb  | 77MiB/s  | +26%
gof3r        | default settings       | 69%  | 252Mb   | 97MiB/s  | +59%
gof3r        | 10Mb block, 16 threads | 77%  | 520Mb   | 108MiB/s | +77%
xbcloud      | default settings       | 4%   | 96Mb    | 10MiB/s  | -84%
xbcloud      | 10Mb block, 16 threads | 59%  | 417Mb   | 123MiB/s | +101%

 

Tip: Think about disaster recovery, and what you will do if the whole region becomes unavailable. It makes no sense to back up to the same region; always transfer backups to another region.

Google Cloud (From US to EU)

The backup data was streamed from a Compute Engine instance in us-east1 to Cloud Storage in europe-west3. Interestingly, Google Cloud Storage supports both its native protocol and the S3 (interoperability) API, so Percona XtraBackup can transfer data to Google Cloud Storage directly via the S3 (interoperability) API.

 

tool        | settings                            | CPU | max mem | speed    | speed comparison
gsutil      | not changeable, native protocol     | 8%  | 246Mb   | 23MiB/s  | baseline
rclone rcat | not changeable, native protocol     | 6%  | 61Mb    | 16MiB/s  | -30%
xbcloud     | default settings, s3 protocol       | 3%  | 97Mb    | 9MiB/s   | -61%
xbcloud     | 10Mb block, 16 threads, s3 protocol | 50% | 417Mb   | 133MiB/s | +478%

 

Tip: A cloud provider can block your account for many reasons, such as human or robot mistakes, inappropriate content or abuse after hacking, credit card expiration, sanctions, etc. Think about disaster recovery and what you will do if a cloud provider blocks your account; it may make sense to back up to another cloud provider or on-premises.

Conclusion

The xbcloud tool (part of Percona XtraBackup) is 2-5 times faster with tuned settings than native cloud vendor tools on long-distance transfers, and it is 14% faster while requiring 20% less memory than analogs with the same settings (an example invocation follows the list below). Also, xbcloud is the most reliable tool for transferring backups to S3-compatible storage for two reasons:

  • It calculates md5 sums during the uploading and puts them into a .md5/filename.md5 file and verifies sums on the download (gof3r does the same).
  • xbcloud sends data in 10mb chunks and resends them if any network failure happens.
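
As a reference point, a streamed backup with tuned parallelism looks roughly like the sketch below. The bucket, region, and credentials are placeholders; the chunk-size tuning option is omitted here, so check the xbcloud documentation for the exact flag in your XtraBackup version.

# Stream a backup directly to S3-compatible storage; --parallel sets the number of upload threads.
xtrabackup --backup --stream=xbstream --target-dir=/tmp/backup \
  | xbcloud put \
      --storage=s3 \
      --s3-region=us-east-1 \
      --s3-bucket=my-backup-bucket \
      --s3-access-key="${AWS_ACCESS_KEY_ID}" \
      --s3-secret-key="${AWS_SECRET_ACCESS_KEY}" \
      --parallel=16 \
      full-backup-$(date +%F)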

PS: Please find instructions on GitHub if you would like to reproduce this article’s results.


MySQL 8.0.18 New Features Summary

Presentation of some of the new features of MySQL 8.0.18 released on October 14, 2019.

The ProxySQL multiplexing wild goose chase


TL;DR – We encountered multiplexing issues with ProxySQL and, after going around in circles for a while, we found that the impact of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms was not documented.

At my present company we are using a multi-layer ProxySQL setup to route our traffic to the appropriate database cluster’s primary or replica hosts. For this post it doesn’t matter whether you run a single or a multi-layer setup, so I’ll simplify our architecture to a single ProxySQL layer where the application connects to all three proxies evenly:

[Figure: simplified single-layer ProxySQL topology with the application connecting to all three proxies]

The reason for having N+2 proxies is that this ensures we retain high availability after the failure of a single node. I’ve also added a fourth (canary) node and sent a small amount of traffic to it to see how any change would affect multiplexing.

One of the key features of ProxySQL is that it’s able to multiplex connections. In short: ProxySQL will attempt to send queries to your database servers over idle database connections. In effect, queries from separate incoming connections can be sent to the master with as few backend connections as possible, lowering the connection overhead on your database servers. This could also mean queries from the same incoming connection are multiplexed over multiple backend connections. The limitation is that you can’t multiplex queries within a transaction, nor if you place locks, as this would require the connection to stick with the same backend connection. See also the ProxySQL multiplexing documentation for every condition under which it will be disabled.

Problems encountered with high connection counts

After the migration we found that ProxySQL was hardly multiplexing connections at all, and due to the increase in connections from adding a shard this wasn’t a scalable solution. The number of ingress connections is about 600 per shard, while the number of egress connections stuck around 400 per shard. This meant that the ratio of ingress vs egress was about 66%, and that’s not a good sign as ProxySQL is supposed to be multiplexing. A good and effective ratio would be more in the lines of 5%. We were certain ProxySQL was multiplexing on other hostgroups before, as the ratio was far more favorable there.

[Figure: Incoming and outgoing ProxySQL connections are ridiculously high]
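
To check these numbers on your own ProxySQL hosts, the frontend and backend connection counters can be read from the admin interface; a quick sketch (default admin credentials and port assumed here):

# Compare client (frontend) vs server (backend) connection counts.
mysql -h127.0.0.1 -P6032 -uadmin -padmin -e "
SELECT Variable_Name, Variable_Value
FROM stats.stats_mysql_global
WHERE Variable_Name IN ('Client_Connections_connected', 'Server_Connections_connected');"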

Why is this a big issue then? For us, having more than 2000 active frontend connections and about 1400 backend connections meant ProxySQL was using a lot of CPU time. On a 4-core machine we noticed our CPUs were over 70% busy all the time, which wouldn’t allow us to lose a single ProxySQL node anymore. To stay within safe margins we first upgraded all ProxySQL hosts to 8 cores, and this kept the CPUs within the region of 40% to 50%. Adding a new shard would increase the incoming connections by 600 and the backend connections by 400. That would only allow us to add maybe one or two shards before we would once again be unable to lose a ProxySQL node and would have to upgrade once more. Adding more ProxySQL hosts would work, but as we have a multi-layer ProxySQL architecture we need to scale all layers to keep up with the incoming connections on the other layers as well. In other words: our sharded environment just faced a new challenge, as scaling out wasn’t an (easy) option.

When searching for ProxySQL and multiplexing you will end up at the documentation of ProxySQL Multiplexing or a few blog posts describing the wonders of ProxySQL multiplexing. The documentation describes a few conditions under which multiplexing is disabled, so naturally we examined whether our application was meeting any of them. We examined the general log, the query digest inside ProxySQL, and tcpdumps, and concluded it wasn’t meeting any of these conditions, and thus theoretically multiplexing should be functioning. But as we had already established from the high incoming and backend connection counts, it clearly wasn’t. Our first dead end.

Investigating ProxySQL bug reports

Then we started to investigate ProxySQL bug reports to see if anything matched there. At first we found issue #1897, where multiplexing was erroneously disabled, and there were a few other solved issues that hinted we should try to upgrade. However, upgrading a production environment without knowing the impact is never favorable, so instead we added a fourth ProxySQL host and sent 1% of the traffic towards it. This allowed us to easily upgrade the host to various ProxySQL versions and see if this would resolve the issue. Unfortunately we found none of the upgrades resolved our multiplexing issues. Another dead end.

We did notice the set names metric in ProxySQL was increasing fast, so this led us to issue #1019, where a multi-layered ProxySQL can have issues with set names when the database has a different character set than ProxySQL’s default character set. This was the case for us, but the documentation doesn’t mention that set names influences multiplexing, and the previous upgrade to 2.0.8 didn’t resolve our issues. At least we found out why the number of set names was increasing. Another dead end.

Then one of our developers pointed me towards ProxySQL issue #2339, in which user wjordan requests a documentation update on temporary conditions under which multiplexing is disabled. He describes the mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms variables as missing from this page. I totally ignored this when I searched through the ProxySQL issues as the title contained the word documentation. Insert face palm here.

The two variables basically are a workaround for issue #1048 (Problem with “LAST_INSERT_ID()” returning 0), issue #1421 (Forward SELECTs on LAST_INSERT_ID) and issue #1452 (select last_insert_id() as ID) and are bundled in issue #1828. Ironically issue #1828 is the source for issue #1897, so I went full circle there! Insert second facepalm here.

How ProxySQL delays multiplexing

So what do these two variables do then? First we have to understand the origin of the problem here: with multiplexing enabled, whenever an insert or update would cause an auto increment column to increase, the LAST_INSERT_ID() query would not be guaranteed to be run on the same connection or right after the insert happened. So either it would be proxied over another connection (returning a wrong identifier or 0) or it would be run out of order due to multiplexing (again returning a wrong identifier or 0). These two variables allow you to have the connection stop multiplexing for X-number of queries or X-number of milliseconds after an insert or update happened on an auto increment column.

But wait! How does ProxySQL detect an auto increment column increased then? That’s quite simple: on every successful query the MySQL binary protocol returns an OK packet to the client. Whenever an insert or update triggers an auto increment column, this OK packet also returns the last inserted identifier. This is what triggers ProxySQL to stop multiplexing for X-number of queries or milliseconds.

But wait! Why did the majority of applications still function correctly before ProxySQL 1.4.14 then? That’s also quite simple: most programming languages use the native MySQL drivers. For instance, PHP PDO makes use of mysqlnd, the native MySQL driver for PHP. Just like ProxySQL, mysqlnd reads the OK packet and stores the last inserted identifier internally. So when you make use of the lastInsertId function in PDO, it retrieves the value stored internally by mysqlnd rather than querying the database. In general you can assume database drivers never run a SELECT LAST_INSERT_ID() against a database. However, you should be cautious with some ORMs, like Hibernate, that actually depend on queries like this.

[Figure: mysqlnd stores the last insert identifier internally]

The default for mysql-auto_increment_delay_multiplex is 5 and for mysql-connection_delay_multiplex_ms it is 0. So whenever ProxySQL encounters an OK packet with the last inserted identifier set, it will disable multiplexing on that connection for 5 consecutive queries. This basically locks the frontend connection to the backend connection for at least 5 queries, to allow an ORM (or application) to run a LAST_INSERT_ID() query on the same connection. New incoming connections will then have to use a different connection from the connection pool. Obviously, whenever ProxySQL encounters another OK packet with a last inserted identifier, it resets the counter on this connection to 5 consecutive queries again.

Back to our own problem statement

So why did this happen only to the sharded databases then? The primary key of our sharded database contains a UUID and not an auto increment. There was, however, an auto increment column present on the table to facilitate a sequence number in our records. We clearly overlooked this in our investigation as well. There are various reasons why we currently have to keep this column, so it wasn’t feasible to remove the column for now.

Also our sharded databases have a read/write ratio of around 3:2. That means 2 out of 5 queries will cause an auto increment to trigger and thus lock the frontend connection to a backend connection for at least 5 consecutive queries! With that ratio we will hardly ever multiplex! Once we changed this variable on our canary ProxySQL we immediately saw a significant drop in backend connections on ProxySQL. First we set mysql-auto_increment_delay_multiplex to 0, which caused all 16 incoming connections to be proxied over an average of 0 connections! When we changed the value to 1 it averaged over 3 connections!

[Figure: Drop in active backend connections after setting mysql-auto_increment_delay_multiplex to 0]
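
For reference, changing the variable on a ProxySQL host is just a few admin statements; a minimal sketch (default admin credentials and port assumed here):

# Disable the post-insert multiplexing delay and persist the change.
mysql -h127.0.0.1 -P6032 -uadmin -padmin <<'SQL'
SET mysql-auto_increment_delay_multiplex = 0;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
SQL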

Before applying this to all other ProxySQL hosts, there was one more hurdle: both variables are global, so they apply to all hostgroups. Once we established our codebase never ran a LAST_INSERT_ID() anywhere we changed it on all other ProxySQL hosts. The difference is night and day:

[Figure: Effect of setting mysql-auto_increment_delay_multiplex to 0 on the first ProxySQL layer]

And the CPU usage dropped to a more sane 30%:

[Figure: CPU usage decreases dramatically after multiplexing works again]

Conclusion

There are clearly three problems with the introduction of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms:

  1. Both variables are global variables. If you change them, they will apply to all ProxySQL hostgroups, servers, users and connections
  2. The default of mysql-auto_increment_delay_multiplex is set to 5. This means multiplexing will be disabled or less effective on any write-intensive workload that contains an auto increment column, regardless of whether your application actually uses the LAST_INSERT_ID statement or not.
  3. With the introduction of both variables in version 1.4.14, the documentation on both variables was updated. However, their impact on multiplexing was never documented.

By the time this blog post goes public I have already made a change in the documentation of ProxySQL on multiplexing. I’ve also created a feature request for ProxySQL for the ability to control the delay per server or per user. But after I submitted the feature request I realized that this might actually make things worse: even more variables, configurations and query rules to check, and it doesn’t fix the problem at its root. It would be much cleaner for ProxySQL to store the last inserted identifier returned by the OK packet in a variable bound to the frontend connection that made the request. This identifier could then be returned whenever that frontend connection runs a query that contains the LAST_INSERT_ID function.

Where's the MySQL Team from December 2019 to February 2020


As a follow-up to the regular show announcements, we would like to inform you about the places and shows where you can find the MySQL Community team or MySQL experts during the December-to-February timeframe. Please find the list below:

December 2019:

  • UKOUG Tech Fest, Brighton, UK, December 1-4, 2019
    • Oracle has a big booth at this show; however, this time MySQL is not part of it. Instead, David Stokes, the MySQL Community Manager, has a MySQL talk on "MySQL 8.0" scheduled for December 2nd @16:45-17:30. Please see the schedule.
  • London Open Source DB meetup, London, UK, December 4, 2019
    • David Stokes, the MySQL Community Manager is going to give a talk at the London Open Source Database Meetup on Dec 4, 2019. Details:
      • Date: Dec 4, 2019
      • Time: From
      • Agenda:
        • 18:00 - Arrival, networking. Food and drinks offered by Oracle
        • 18:20 - Intro by Federico Razzoli, database consultant
        • 18:30 - Community lightning talks
        • 19:00 - "New MySQL Features and a Brief Look into 2020!" - by David Stokes, MySQL Community Manager 
      • Place: Innovation Warehouse 1 E Poultry Ave · London, UK
  • IT Tage, Frankfurt am Main, Germany, December 12, 2019
    • IT Tage is one of Germany’s largest IT conferences to date. It covers a wide range of topics, from agile development and microservices, through data management and usage, to application and container design and the Internet of Things.
    • More than 230 sessions on various topics make it the must-go event of 2019 for anyone interested in IT.
    • Carsten Thalheimer & Henry Kroeger will deliver a talk on “MySQL best practices – 8 easy steps to a more optimized MySQL instance”:
      • On the basis of RedHat Linux 8 we will demonstrate and discuss the creation of a scalable MySQL instance. Additionally, an overview of the clustering possibilities within MySQL 5.7 & MySQL 8.0 will be given. The target audience for this talk is database administrators as well as those possessing a good basic understanding of database mechanics.
    • You can find both Carsten and Henry at the Oracle booth at all times outside their session. Please come and talk to us!

January 2020:

  • OpenSource Conference Osaka, Japan, January 24-25, 2020
    • As is tradition, we are again part of OSC Osaka as a Gold sponsor. You can find us at the MySQL booth in the expo area, and there is also a MySQL 8.0 talk in the schedule. During the talk we will introduce MySQL 8.0, the latest version of MySQL; MySQL Enterprise Edition (which provides extended functions and support); MySQL Cluster, a high-performance and highly reliable distributed cluster; and Oracle MySQL Cloud Service. Please watch the conference website for talk timing & details.
    • We are looking forward to talking to you at OSC Osaka!

February 2020: 

  • FOSDEM 2020, Brussels, Belgium, February 1-2, 2020
    • As is tradition, this year the wider MySQL team is again part of FOSDEM. Find us at the MySQL booth in the expo area as well as in the approved FOSDEM MySQL, MariaDB & Friends Devroom. The CFP ended on Nov 20 and the full schedule will be announced soon.
  • SunShine PHP, Miami, FL, US, February 6-8, 2020
    • "MySQL New Features" given by David Stokes, the MySQL Community Manager. Please check the website schedule for the exact timing. You can also find MySQL at the shared booth in the expo area.
  • DeveloperWeek, SFO, US, February 12-16, 2020
    • This year, for the first time, you can find the MySQL Community team at this large developer show in the SFO Bay Area. You will be able to find us at the MySQL booth in the expo area, as well as a talk given by David Stokes, the MySQL Community Manager, on "MySQL without the SQL - Oh My!" in the program. The talk is tentatively scheduled for Feb 13, 2020 @3:30-3:55.

The dangers of replication filters in MySQL

MySQL supports replication filters and binlog filters. These features are powerful, but dangerous. Here you'll find out the risks, and how to mitigate them.

MySQL Shell Plugins: InnoDB


Today, we will cover a totally different MySQL Shell plugin: InnoDB.

Currently, only three methods have been created.

Those related to tablespace fragmentation have already been covered in this recent article.

Let’s discover the getAlterProgress() method. This method allows us to have an overview of the progress of ALTER statement stages, such as:

  • stage/innodb/alter table (end)
  • stage/innodb/alter table (flush)
  • stage/innodb/alter table (insert)
  • stage/innodb/alter table (log apply index)
  • stage/innodb/alter table (log apply table)
  • stage/innodb/alter table (merge sort)
  • stage/innodb/alter table (read PK and internal sort)
  • stage/innodb/alter tablespace (encryption)

This is an output of the method:

As you can see, the plugin displays an estimation of the % progress of the alter statement.
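
Under the hood this progress information comes from the Performance Schema stage events. A rough equivalent of what the plugin reports, shown here as a sketch rather than the plugin’s exact implementation, is:

# Approximate the plugin's output with a direct Performance Schema query.
mysql -e "
SELECT stmt.SQL_TEXT,
       stage.EVENT_NAME,
       CONCAT(ROUND(100 * stage.WORK_COMPLETED / stage.WORK_ESTIMATED, 2), '%') AS progress
FROM performance_schema.events_stages_current AS stage
JOIN performance_schema.events_statements_current AS stmt USING (THREAD_ID)
WHERE stage.EVENT_NAME LIKE 'stage/innodb/alter%';"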

I’ve also created a User-Defined Report to display the same output continuously:

But of course, the report needs to have all the required instruments and consumers enabled, and the plugin can enable them for you.
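
If you prefer to enable them yourself, the relevant instruments and consumers can be switched on directly; a minimal sketch:

# Enable the ALTER TABLE stage instruments and the stage consumers.
mysql -e "
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'stage/innodb/alter%';
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME LIKE 'events_stages_%';"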

Once again, this gives you an overview of how useful the MySQL Shell is for a MySQL DBA’s daily tasks.

The code of the plugin is available here: https://github.com/lefred/mysqlshell-plugins

And the code of the User-Defined Reports is also on GitHub: https://github.com/lefred/mysql-shell-udr

Don’t forget that it’s time to upgrade to MySQL 8.0 #MySQL8isGreat

Fun with Bugs #90 - On MySQL Bug Reports I am Subscribed to, Part XXIV

The previous post in this series was published 3 months ago, and the last bug from it, Bug #96340, is already closed as fixed in the upcoming MySQL 8.0.19. I've picked up 50+ more bugs to follow since that time, so I think I should send a quick status update about interesting public MySQL bug reports that are still active.

As usual I concentrate mostly on InnoDB, replication and optimizer bugs. Here is the list, starting from the oldest:
  • Bug #96374  - "binlog rotation deadlock when innodb concurrency limit setted". This bug was reported by Jia Liu, who used gdb to show threads deadlock details. I admit that recently more bug reporters use gdb and sysbench with custom(ized) Lua scripts to prove the point, and I am happy to see this happening!
  • Bug #96378 - "Subquery with parameter is exponentially slower than hard-coded value". In my primitive test with user variables replaced by constants (on MariaDB 10.3.7) I get the same plan for the query, so I am not 100% sure that the analysis by my dear friend Sinisa Milivojevic was right and it's about optimization (and not comparing values with different collations, for example). But anyway, this problem reported by Jeff Johnson ended up as a verified feature request. Let's see what may happen to it next.
  • Bug #96379 - "First query successful, second - ERROR 1270 (HY000): Illegal mix of collations ". This really funny bug was reported by Владислав Сокол.
  • Bug #96400 - "MTS STOP SLAVE takes over a minute when master crashed during event logging". Nice bug report by Przemyslaw Malkowski from Percona, who used sysbench and dbdeployer to demonstrate the problem. Later Przemysław Skibiński (also from Percona) provided a patch to resolve the problem.
  • Bug #96412 - "Mess usages of latch meta data for InnoDB latches (mutex and rw_lock)". Fungo Wang had to make a detailed code analysis to get this bug verified. I am not sure why it ended up with severity S6 (Debug Builds) though.
  • Bug #96414 - "CREATE TABLE events in wrong order in a binary log.". This bug was reported by Iwo P. His test case to demonstrate the problem included a small source code modification, but (unlike with some other bug reports) this had NOT prevented accepting it as a true, verified bug. The bug does not affect MySQL 8.0.3+ thanks to WL#6049 "Meta-data locking for FOREIGN KEY tables" implemented there.
  • Bug #96472 - "Memory leak after 'innodb.alter_crash'". Yet another bug affecting only MySQL 5.7 and not MySQL 8.0. It was reported by Yura Sorokin from Percona.
  • Bug #96475 - "ALTER TABLE t IMPORT TABLESPACE blocks SELECT on I_S.tables.".  Clear and simple "How to repeat" instructions (using dbdeployer) by Jean-François Gagné. See also his related Bug #96477 - "FLUSH TABLE t FOR EXPORT or ALTER TABLE t2 IMPORT TABLESPACE broken in 8.0.17" for MySQL 8. The latter is a regression bug (without a regression tag), and I just do not get how GA releases with such new bugs introduced can happen.
  • Bug #96504 - "Refine atomics and barriers for weak memory order platform". Detailed analysis, with links to code etc from Cai Yibo.
  • Bug #96525 - "Huge malloc when open file limit is high". Looks more like a systemd problem (in versions < 240) to me. Anyway, useful report from Andreas Hasenack.
  • Bug #96615 - "mysql server cannot handle write operations after set system time to the past". A lot of arguments were needed to get this verified, but Shangshang Yu was not going to give up. This is the first time I have seen gstack used in a bug report to get a stack trace quickly. It's a part of the gdb RPM on CentOS 6+. I have to try it vs gdb and pstack one day and decide what is the easiest and most efficient way to get backtraces of all threads in production...
  • Bug #96637 - "Clone fails on just upgraded server from 5.7". I had not used MySQL 8 famous clone plugin yet in practice, but I already know that it has bugs. This bug was reported by Satya Bodapati, who also suggested a patch.
  • Bug #96644 - "Set read_only on a master waiting for semi-sync ACK blocked on global read lock". Yet another problem (documented limitation) report from Przemyslaw Malkowski. Not sure why it was not verified on MySQL 8.0. Without a workaround to set master to read only it is unsafe to use long rpl_semi_sync_master_timeout values, as we may end up with that long downtime.
  • Bug #96677 - ""SELECT ... INTO var_name FOR UPDATE" not working in MySQL 8". This regression bug was reported by Vinodh Krish. Some analysis and patch were later suggested by Zsolt Parragi.
  • Bug #96690 - "sql_require_primary_key should not apply to temporary tables". This bug was also reported by Przemyslaw Malkowski from Percona. It ended up as a verified feature request, but not everyone in community is happy with this. Let me quote:
    "[30 Aug 8:08] Jean-François Gagné
    Could we know what was the original severity of this bug as reported by Przemyslaw ? This is now hidden as it has been reclassified as S4 (Feature Request).

    From my point of view, this is actually a bug, not a feature request and it should be classified as S2. A perfectly working application would break for no reason when a temporary table does not have a Primary Key, so this is actually a big hurdle for using sql_require_primary_key, hence serious bug in the implementation of this otherwise very nice and useful feature.
    "
That's all about bugs I've subscribed to in summer.
Winter is coming, so why not remember nice warm sunny days and the interesting MySQL bugs reported back then.
To summarize:
  1. We still see some strange "games" played during bug processing and a trend towards decreasing the severity of reports. I think this is a waste of time for both Oracle engineers and community bug reporters.
  2. I am still not sure if Oracle's QA does not use ASan or just ignore problems reported for MTR test cases. Anyway, Percona engineers do this for them, and report related bugs :)
  3. dbdeployer and sysbench are really popular among MySQL bug reporters recently!
  4. Importing of InnoDB tablespaces is broken in MySQL 8.0.17+ at least.
  5. There are many interesting MySQL bugs reported during the last 3 months, so I expect more posts in this series soon.

pre-FOSDEM MySQL Days 2020: registration is now open !


Hello dear MySQL Community! As you know, FOSDEM 2020 will take place February 1st and 2nd. After having received many, many requests we decided to organize, for the 4th year in a row, the pre-FOSDEM MySQL Day… with a big change: for this edition the event will be called “pre-FOSDEM MySQL Days”!

We have decided to extend that extra day related to the world’s most popular open source database to 2 days: 30th and 31st of January at the usual location in Brussels.

We will also have the usual sessions track, and most probably a second room for those who would like to get their hands dirty in the code!

Please don’t forget to register as soon as possible; as you may already know, the seats are limited!

Register on eventbrite: https://mysqldays2020.eventbrite.com

And please don’t forget: if you have registered for the event and you cannot make it, please free up your ticket for somebody else.

The schedule will be available soon, stay tuned !


Monitoring Results in MySQL Performance Gains


Author: Robert Agar

MySQL is one of the most popular and widely used database platforms in the world. If you are a DBA or database developer, there is a very high probability that at least some of the systems under your purview are powered by MySQL. The standard tasks such as user administration and ensuring that the databases are backed up and can be restored are important facets of your daily responsibilities. You will also be charged with maintaining a high level of performance that addresses the concerns of the database’s users.

Creating backup jobs and setting up new user accounts are fairly straightforward tasks that should not be overly challenging to an experienced DBA. Even if you do not have extensive experience with MySQL, you will very quickly become comfortable with any idiosyncrasies that the platform presents. Performance tuning, on the other hand, can be a complicated undertaking. It can be difficult to identify the particular modifications that are required to speed up database response time and minimize calls from dissatisfied users. Most DBAs would welcome some assistance in optimizing their database performance.

Monitoring to the Rescue

Finding the areas in a database that need to be addressed to improve performance cannot be done randomly. You could spend months implementing hit or miss changes that do little to make things better. In some cases, your incorrect guesses can make things much worse. The issue of performance tuning needs to be done systematically. Information with which to make tuning decisions is vitally important.

The most common reason for user complaints when interacting with a web application is the speed at which the program returns the desired results. This might be a report which takes an inordinate amount of time to produce or queries that demonstrate a painfully slow response time. The problems are often not associated with the front-end application but are the results of issues with the underlying MySQL database. As a DBA, your job is to address this issue and find out where and why the database speed is being impacted.

Performance issues can suddenly pop up in places where everything was previously running smoothly. Slow and inefficiently coded queries are usually the prime culprit when performance lags on a MySQL database. Quickly identifying the top three queries that are causing the problems is a great place to begin your tuning efforts. The ability to look back at least three hours to find these queries gives you the best chance of addressing the real problems affecting your database.

Realtime monitoring will display the current state of your database and allows you to see which queries are being executed and which ones are slow. You can drill down on a particular query to analyze its details. This can shed light on ways that the query can be optimized. Perhaps the query needs to have an index added to it to streamline its performance. A built-in query analyzer can expedite this process by allowing you to locate performance gains directly from monitored data.

Generating alerts is a fundamental feature of a comprehensive monitoring tool that informs the database team of problems before they start affecting users. The ability to customize the monitoring application allows you to determine exactly what gets monitored and how warning messages are created. You want to control alert generation to avoid overload which can eventually lead to important messages being ignored.

The Right Tool for MySQL Monitoring

The monitoring tips outlined above make use of SQL Diagnostic Manager for MySQL. It offers an agentless MySQL monitoring solution which enables your database team to identify the problem queries that need to be tuned to optimize performance. Proactive alerting, the ability to quickly find slow-running queries, and the capacity to kill locked queries are features that make this an excellent tool for MySQL DBAs.

IDERA’s SQL Diagnostic Manager for MySQL was recently renamed from its previous title, Monyog. Don’t be confused when the application is referenced by that name in this instructional video which demonstrates how the tool can be used to tune database performance. The video takes a deeper look at the points discussed in this post and is well worth watching if you want to improve the performance of your MySQL databases. The 60 minutes spent viewing the video will repay you handsomely in ideas for tuning your systems and keeping your users happy.

The post Monitoring Results in MySQL Performance Gains appeared first on Monyog Blog.

Geo-Scale MySQL for Continuous Global Operations & Fast Response Times


Geo-scale MySQL – or how to build a global, multi-region MySQL cloud back-end capable of serving hundreds of millions of player accounts

This blog introduces a series of blogs we’ll be publishing over the next few months that discuss a number of different customer use cases that our solutions support and that centre around achieving continuous MySQL operations with commercial-grade high availability (HA), geographically redundant disaster recovery (DR) and global scaling.

This first use case looks at a customer of ours, a global gaming company with several hundred million worldwide player accounts.

What is the challenge?

How do you reliably, and quickly, cater to hundreds of millions of game players around the world? The challenge here is to serve a game application to a geographically-distributed audience; in other words, a pretty unique challenge.

It requires fast, local response times for read traffic, a limited number of updates, and a single consolidated view of the data across the world, which is very typical for gaming applications, and for all account/subscription management systems in general.

What is the solution?

Continuent  Tungsten Clustering. The solution we implemented for this customer is comprised of four (4) geo-distributed Composite Tungsten Clusters, with one active cluster in USA West accepting writes and updates and handling local read traffic, and three passive Tungsten clusters in USA East, EMEA and APAC providing very fast local reads to access the player accounts.

What are the benefits?

Continuous operations. The benefits that this solution provides are clear: geo-scale, availability, and disaster recovery.

More specifically, it includes low-latency, geo-distributed data access providing fast response times for read traffic as well as local, rapid-failover automated high availability.

This, combined with simple administration, system visibility and stability also helps create high return on investment.

Watch the webinar replay

We’ve covered this particular use case in a recent webinar. You can watch the replay of the webinar here.

About Tungsten Clustering

Tungsten Clustering allows enterprises running business-critical MySQL database applications to cost-effectively achieve continuous operations with commercial-grade high availability (HA), geographically redundant disaster recovery (DR) and global scaling.

To find out more, visit our Tungsten Clustering product page.

Webinar 12/5: Introduction to MySQL Query Tuning for DevOps


MySQL does its best to return requested bytes as fast as possible. However, it needs human help to identify what is important and should be accessed in the first place. Queries, written smartly, can significantly outperform automatically generated ones. Indexes and optimizer statistics, not limited to histograms, help increase query speed.

Join Percona’s Principal Support Engineer for MySQL Sveta Smirnova on Thurs, Dec 5th from 10 to 11 am PST to learn how MySQL query performance can be improved through the utilization of Developer and DevOps tools. In addition, you’ll learn troubleshooting techniques to help identify and solve query performance issues.

Register Now

If you can’t attend, sign up anyway and we’ll send you the slides and recording afterward.

Percona Live 2020: Call For Papers


The Call For Papers (CFP) for Percona Live 2020 is now open!

Percona Live will be held in Austin, Texas from Monday, May 18 through Wednesday, May 20, 2020 at a new venue, the AT&T Hotel and Conference Center. The CFP is open for submissions from November 27, 2019, through January 13, 2020. We invite abstracts covering any and all aspects of open source databases, including on-premise, in the cloud, and across the multi-verse!

Hot Open Source Topics for 2020

All Open Source database themes are welcome, but these are our hot topics for 2020:

  • Success in the multi-verse: How to optimize performance, architecture, high-availability, replication, and more in a multi-cloud, multi-database environment.
  • Support of cloud-native applications in database environments: How you burst scale and performance when you need it.
  • Managing systems at scale: How to manage 1000’s of databases effectively.
  • Finding and solving problems quickly: How you keep systems up and running in the heat of an outage or a slowdown.
  • Data security: How you prevent your database from leaking data.
  • Development: Best practices for enabling developers to self-support and make database choices.

All abstracts will get a full, fair, and competitive assessment by our Conference Program Committee of open source database experts. We’re currently finalizing our committee membership, which will be announced soon.

Sponsorship Opportunities

The conference will also include presentations by sponsoring companies that operate on the leading edge of open source technology. Many of our sponsors are pivotal players in the industry and make important contributions to open source projects. To learn more about sponsorship opportunities, contact Bronwyn Campbell.

Key Points

  • Proposals can be for half-day or full-day tutorials, 50-minute conference sessions, 25-minute conference sessions, or 10-minute lightning talks.
  • A talk can be shared by up to four speakers.
  • All speakers, except lightning talks, receive a full, free pass to Percona Live.
  • The closing date for proposals is 11:59 p.m. AoE (GMT -12) on Monday, January 13, 2020.
  • We may select some proposals early, before the CFP closes, to announce an agenda sneak peek. So the earlier you submit, the better your chance of success.

If you have any questions about the CFP or the conference, don’t hesitate to get in touch! You can contact me via email at community-team@percona.com. Meanwhile, if you are ready to register then sign up now – you can save your submission in progress, so there’s no need to do it all in one session.

Good luck!

MySQL Workbench now using Casmine for unit and integration testing

Starting with version 8.0.18, the MySQL Workbench source package finally also ships all of our GPL unit and integration tests, which we use internally to control code quality. For that we first had to replace our old, outdated testing framework with something new and more appropriate. We evaluated quite a few C++ testing frameworks but found them either insufficient or difficult to use. What we had in mind instead was something close to the Jasmine framework, which is widely used among JS developers. The way it hides all the boring test management details, and the clear structure it imposes, were quite an inspiration for us, so we decided to develop our own testing framework modeled after it.

Casmine – C++17 BDD Testing Framework

Casmine is a C++ unit and integration testing framework written in C++17 that enables testing C and C++ code with matchers, similar to the way Jasmine does for JavaScript (hence the similar name). Care has been taken to make the framework very flexible while keeping tests easy to write and, later, easy to read. The framework runs on any platform with a C++17 compiler.

Casmine hides most of the boilerplate code necessary to run the tests and comes with a set of elements that help structure your test code in a way that is easily consumable by humans. Tests are organized in spec files (one test case per file) and are self-registering: once you add a spec file to your testing project, it will automatically execute on each run. Two main elements structure a spec file: $describe and $it. The first covers an entire test case and consists of zero or more $it blocks, which are executed in definition order ($describe execution order is random because of static initialization). In addition to that, you can specify blocks that are executed before and after all tests, or before/after each test – useful for setting up and shutting down tests or test groups.

Each test case and each test can be individually disabled or focused on (which disables all non-focused tests). Focused tests in disabled test cases are still disabled, however. Additionally, a test can be marked as pending, which causes Casmine to ignore any result from it and produces a record in the test output that lets you identify it easily. This is a great help when you start with spec skeletons and fill them in later without missing a test, or when you want to record a reason why a test is not executed right now (which $xit does not allow).
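
To make this concrete, here is a minimal, hypothetical spec sketch using those macros (the spec structure itself is introduced in the Getting Started section below); the reason string passed to $pending is an assumption based on the description above:

$describe("Focus and pending demo") {
  $fit("runs because it is focused", []() {   // all non-focused tests are skipped
    $expect(2 + 2).toBe(4);
  });

  $xit("is disabled and never runs", []() {
    $expect(false).toBeTrue();                // never executed
  });

  $it("is marked pending, so its result is ignored", []() {
    $pending("waiting for the server mock");  // assumed to accept a reason string
  });
});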

Main features

  • Written in C++17.
  • Type safe tests with type specific matchers.
  • Comprehensive collection of matchers and expects that should cover most of your needs.
  • Very simple, yet powerful and flexible test spec structure, using only $describe and $it.
  • Run test management code by using $beforeAll, $afterAll, $beforeEach and $afterEach.
  • Focus on specific tests or entire test cases by using $fit and $fdescribe.
  • Exclude specific tests or entire test cases by using $xit and $xdescribe.
  • Mark tests explicitly as succeeded or failed by using $success() and $fail().
  • Mark tests as pending by using $pending().
  • Built-in console reporter with colored output.
  • Built-in JSON reporter, which generates a JSON result file for easy integration with any CI system.
  • Easily extend the matchers and the expects, in case your objects don’t fit the stock classes.

Getting the Code + Compile It

Currently Casmine is part of MySQL Workbench and you can find its code in the MySQL Workbench GitHub repository. It consists of only a handful of files, which are easy to compile (VC++ 2017, clang 11 or gcc 8.0). On macOS you can only build Casmine on macOS 10.15, because we use the C++17 std::filesystem namespace for test data and output folder preparation. There's a cmake file for Casmine in that folder, which you can use, or you can simply add all files in the folder to your project/solution to compile the testing framework.

Getting Started

In its simplest form a test spec file looks like this:

#include "casmine.h"

using namespace casmine;

namespace {

$ModuleEnvironment() {};

$describe("Test Case Description") {
  $it("Test Description", []() {
    $expect(1).toBe(1);
  });
});

}

The $ModuleEnvironment macro ensures the necessary structures are in place for the spec file, which are then used by the $describe and $it blocks. Since this implements the same class in each spec file it is necessary to wrap the entire test spec with an anonymous namespace, to avoid linker problems. On the other hand, this macro enables you to easily extend Casmine for your own project structure (more about that below).

Macros are used to record file name + line number of the executing code block. This is also important for possible (unexpected) exceptions, which otherwise would not have any source location attached.

The $it block takes a std::function for execution, which consists of $expect calls that do the actual test steps (checks). Each $expect call creates a temporary expectation object, runs the associated matcher (here toBe) and records the result. After that the expectation object is freed.

Typically a test case also consists of setup and shutdown code, which could be like this:

$describe("Test Case Description") {
  $beforeAll([]() {
    myServer.start();
  });
  
  $afterAll([]() {
    myServer.stop();
  });
  
  $beforeEach([]() {
    myServer.reset();
  });
  
  $afterEach([]() {
    // Remove temporary data.
  });
  
  $it("Test Description)", []() {
    $expect(1).toBe(1);
  });
});

To run the test suite, call the Casmine context, which manages everything, in your main.cpp file:

#include "casmine.h"

using namespace casmine;

int main(int argc, const char *argv[]) {
  auto context = CasmineContext::get();
  
#ifdef _MSC_VER
  SetConsoleOutputCP(CP_UTF8);
#endif

  context->runTests(/* specify the base path here */);

  return 0;
}

The runTests path parameter is used to locate the data and output directories. Additionally, all file names reported in the results are made relative to this path. Use CasmineContext::basePath() to get this path in your spec files. Casmine expects a data folder in the base directory (<base>/data) and will copy its entire content to a temporary data dir before running any test. The member Casmine::tmpDataDir() can be used to get the name of that folder when you want to access your data. Because this is a copy of the original data, it is safe to modify it without affecting the original data.

Casmine creates an output folder during the test execution, which you can use to store any output your tests might produce. The actual path can be read from Casmine::outputDir().

The temporary data dir, as well as the output dir, are removed when Casmine finishes (and also on startup, if they were left over from a previous run, which would indicate a crash of the application).
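
As a rough sketch (assuming tmpDataDir() and outputDir() return std::string and are reachable through the Casmine context, and using hypothetical file names; requires <fstream>), a test could work on the copied data like this:

$it("works on a copy of the test data", []() {
  auto context = CasmineContext::get();

  // Read from the copied data folder; changes here never touch the original data.
  std::ifstream input(context->tmpDataDir() + "/sample-input.txt");
  $expect(input.good()).toBeTrue();

  // Write generated artifacts to the output folder.
  std::ofstream output(context->outputDir() + "/result.txt");
  output << "done";
  $expect(output.good()).toBeTrue();
});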

Overview

A test step (which is part of an $it block) has a pretty simple form:

$expect(<actual value>).[Not.]toXXX(<expected value>[, <custom failure message>]);

$expect is a macro that records the line number and file name where it is called from and then calls one of the factory functions (selected by the type of the actual value), which instantiates the associated Expect class, executes the check, records the result and frees the Expect instance. The toXXX call usually performs a relational operation, but can in special cases also do more complex processing (like matching a regular expression or comparing to file content). The member Not inverts the expectation. Results are not stored in Casmine, but only sent to the registered reporters. These can then just count successes + failures or do something else with the results.

All calls to record results are synchronized (and hence thread-safe), to avoid logging results to the wrong test spec or test. Other access is not guarded because that’s either read-only or using temporary objects.

Casmine catches every exception that might occur in the test code, be it wanted or unexpected. Exceptions are logged with the line info and file name of the last successful call to either $describe, $it or $expect, whichever was called last. Additionally, you can check for exceptions by placing the code that throws them in a std::function and using that in an $expect call. By default, test execution is stopped if an unexpected exception occurs; however, this can be changed with the continueOnException setting (see also the Configuration section).
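
A minimal sketch of such an exception check (requires <functional> and <stdexcept>), using the toThrow matcher from the matcher overview further down:

$it("checks exception behavior", []() {
  std::function<void ()> failing = []() { throw std::runtime_error("broken"); };
  std::function<void ()> safe = []() { /* never throws */ };

  $expect(failing).toThrow();
  $expect(safe).Not.toThrow();
});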

It is not necessary that $expect calls are written in the same file where their surrounding $describe or $it block is located. Casmine is a stateful implementation that records test results from anywhere to the currently executing test. This enables sharing of test code between tests, by moving it out to a separate file, and then calling that in different places.

Important: while you can write normal code in a $describe block, outside of any $it, $beforeAll etc. call, you must not access any of it from within these calls via capture by reference, because the outer $describe block is executed during registration and its variables are no longer available when the inner blocks are executed later. Code like the following will crash:

$describe("Test Case Description") {
  size_t count = 42;

  $it("Test Description", [&]() {
    // Crash here because this closure is executed outside of the outer block.
    $expect(count).toBe(42); 
  });
});

See the "Test Data" section below for details on how to hold test values and other state for use in tests.

Note: it is not necessary for either the actual or the expected value to be copy-assignable. Internally, all provided values are held as const references, making it possible to also check non-copyable objects like unique_ptr. However, for strings, sometimes a copy is necessary to allow comparison of different string types (string against char* and char[] etc.). Obviously, both the actual and the expected value must stay valid for the duration of the $expect call.
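
For example, a non-copyable smart pointer can be checked directly (a small sketch, requiring <memory>; toBeValid comes from the smart pointer matchers listed below):

$it("checks a non-copyable value", []() {
  auto ptr = std::make_unique<std::string>("casmine");

  $expect(ptr).toBeValid();         // the unique_ptr itself is never copied
  $expect(*ptr).toEqual("casmine"); // the pointed-to string uses the string matchers
});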

Casmine comes with a number of predefined Expect classes, which are automatically selected via type deduction and provide a number of matcher functions, specific to the deduced type. The framework contains Expect classes for:

  • Scalars (integral types, pointers (except string pointers) and object references)
  • Strings (basic_string, as well as char, wchar_t, char16_t and char32_t pointers and arrays)
  • Classes (all classes/objects, except standard containers, exceptions and strings as they have specialized matchers)
  • Exceptions (exceptions thrown in a std::function)
  • Containers (array, vector, deque, forward_list, list, set, multiset and their unordered variants)
  • Associative Containers (map, multi_map and their unordered variants)
  • Smart Pointers (unique_ptr, shared_ptr, weak_ptr)

Each of the Expect classes is derived from one or more of the following predefined matchers, which provide the type-dependent match functions:

  • MatcherScalar (used by the scalar and pointer Expects)
      • toBe (value equality)
      • toEqual (value equality or object identity for classes that override the == operator)
      • toBeLessThan (< comparison, also for classes overriding that operator)
      • toBeLessThanOrEqual (<= comparison)
      • toBeGreaterThan (> comparison)
      • toBeGreaterThanOrEqual (>= comparison)
      • toBeTrue (actual value cast to boolean)
      • toBeFalse (actual value cast to boolean)
      • toBeCloseTo (numeric value with a maximum distance to a given value)
      • toBeOddNumber (for integral types, see std::is_integral)
      • toBeEvenNumber (for integral types)
      • toBeWholeNumber (for arithmetic types, see std::is_arithmetic)
      • toBeWithinRange (numeric value within a range)
      • toBeInf (infinite number)
      • toBeNan (not a number)
  • MatcherString (used by the string Expect)
      • toBe (alias to toEqual)
      • toEqual (string equality using std::basic_string::compare)
      • toBeLessThan (< string comparison using the string compare function)
      • toBeLessThanOrEqual (<= string comparison)
      • toBeGreaterThan (> string comparison)
      • toBeGreaterThanOrEqual (>= string comparison)
      • toContain (find text in the actual value)
      • toStartWith (test that the actual value begins with a string)
      • toEndWith (test that the actual value ends with a string)
      • toContainOnlyWhitespaces (only tab, vertical tab, space, CR/LF and form feed)
      • toContainNoWhitespaces (all but the aforementioned whitespaces)
      • toBeSameLengthAs (actual value length == expected value length)
      • toBeLongerThan (actual value length > expected value length)
      • toBeShorterThan (actual value length < expected value length)
      • toMatch (actual value matches a regular expression, given either as string or regex)
      • toEqualContentOfFile (loads a text file and compares it line-by-line to the test value)
  • MatcherTypeSupport (used by class and pointer Expects)
      • toBeInstanceOf (object is an instance of a specific class)
      • toBeSameType (class type derivation check)
  • MatcherException (used by exception Expect)
      • toThrow (function throws any exception)
      • toThrowError (function throws a specific exception)
  • MatcherContainer (used by container Expect)
      • toContain (actual value contains a specific element)
      • toContainValues (actual value contains a list of values with no implied order)
      • toHaveSize (actual value size)
      • toEqual (actual value content comparison via == operator)
  • MatcherAssociativeContainer (used by associative container Expect)
      • toContainKey (key search)
      • toContainValue (value lookup for a given key)
      • toHaveSize (container size)
      • toEqual (content comparison via == operator)
  • MatcherNull (used by pointer and smart pointer Expects)
      • toBeNull (equals nullptr)
      • toBeValid (non-null value for smart pointers or nullptr equality)
  • MatcherClass (used by class Expect)
      • toEqual (object identity via == operator)
      • toBeLessThan (< comparison)
      • toBeLessThanOrEqual (<= comparison)
      • toBeGreaterThan (> comparison)
      • toBeGreaterThanOrEqual (>= comparison)

Matcher functions generate nice, human-readable output in case of test failures, including source location information. The failure message can be overridden by specifying a custom message in the matcher function, for example:

$expect(1).toBe(1, "Surprising result");

To invert a test check, use the Not member of the Expect class:

$expect(1).Not.toBe(1, "Less surprising");
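
Pulling a few of these matchers together, a short hypothetical test could look like this (the parameter order for toBeCloseTo is an assumption; requires <string> and <vector>):

$it("uses a few of the stock matchers", []() {
  $expect(3.14).toBeCloseTo(3.1, 0.05);               // value, maximum distance (order assumed)
  $expect(std::string("casmine")).toStartWith("cas");

  std::vector<int> values = { 1, 2, 3 };
  $expect(values).toContain(2);
  $expect(values).toHaveSize(3);
});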

Test Data

Most of the time, when you run tests, you probably need some data to test on or to share between tests. Casmine supports that with a separate macro, named $TestData. This is actually a simple struct that can hold whatever you need. A typical use case looks like this:

#include "casmine.h"

using namespace casmine;

namespace {

$ModuleEnvironment() {};

$TestData {
  std::string testValue = "ABC";
};

$describe("Test Case Description") {
  $it("Test Description", [this]() {
    $expect(data->testValue).toBe("ABC");
  });

  $it("Test Description 2", [this]() {
    $expect(data->testValue).Not.toBe("XYZ");
  });
});

}

As you can see, this struct is made available as the data member of the object behind $describe. You can access all its values via data->member, provided you capture the this pointer in the $it call. The data member is not available (and produces a compiler error) if there is no definition of the $TestData struct.

The $TestData macro can be placed anywhere in the file provided it appears before the $describe call (where a member of this type is declared, as explained above).

Configuration

Casmine provides two types of configurations: settings to control the execution of itself and configuration values that are used by the tests.

Settings

The following settings are used by Casmine:

  • continueOnException (bool, default: false) see below
  • verbose (bool, default: false) free to use in your tests, for example for debug messages
  • no-colors (bool, default: false) do not use terminal colors in the console reporter
  • only-selected (bool, default: false) see below

Even though all unexpected exceptions are caught during a test run (to ensure proper shutdown), Casmine will still stop execution if one occurs, unless continueOnException is set to true. This does not apply to exception checks in your test code, of course, which are expected exceptions. It is also guaranteed that $afterAll and $afterEach are called if an unexpected exception comes up during the run of a test spec.

Normally test specs run in random order and according to their type (normal, focused, disabled). If you want to change the type, you have to recompile your tests. Sometimes it might be necessary to run specs in a certain order (for example, while working on a specific spec or when switching between specs frequently). In this case you can enable the only-selected setting and call CasmineContext::forceSpecList() with a list of spec file names (no path, no extension). This could be implemented as an application parameter, taking the user input and forwarding it to this function. For a list of registered specs, call CasmineContext::getSpecList().
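
A sketch of wiring this up (the spec names are placeholders, and the exact parameter and return types of forceSpecList()/getSpecList() are assumptions):

auto context = CasmineContext::get();
context->settings["only-selected"] = true;

// Run only these two specs, in exactly this order.
context->forceSpecList({ "connection_spec", "parser_spec" });

// List everything that is registered and could be selected.
for (auto const& name : context->getSpecList())
  std::cout << name << std::endl;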

Casmine settings are available in the CasmineContext::settings member, which contains key/value pairs of the form:

typedef std::variant<std::string, int, double, bool> ConfigMapVariableTypes;
std::map<std::string, ConfigMapVariableTypes> settings;

This member is flexible enough to introduce further settings in the future.
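
Reading or writing a setting is then just a matter of working with that map, for example (a sketch using the keys listed above):

auto context = CasmineContext::get();
context->settings["verbose"] = true;                         // store the bool alternative
bool verbose = std::get<bool>(context->settings["verbose"]); // read it back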

JSON Configuration

For test configuration you sometimes need sensitive data (like passwords). Passing it on the command line or hard-coding it in the test code is not recommended and unsafe. Instead, Casmine comes with a JSON document member (provided by RapidJSON) that can be loaded from an external JSON file. A typical approach is to let the user specify the path to such a config file as an application parameter and then load it directly into Casmine, like this:

  auto context = CasmineContext::get();
  std::ifstream configStream;
  configFile = expandPath(configFile);

  if (!configFile.empty())
    configStream.open(configFile);
  if (configStream.good()) {
    rapidjson::IStreamWrapper streamWrapper(configStream);

    rapidjson::ParseResult parseResult = context->configuration.ParseStream(streamWrapper);
    if (parseResult.IsError()) {
      const RAPIDJSON_ERROR_CHARTYPE *message = GetParseError_En(parseResult.Code());
      std::cerr << "Error while parsing the configuration file: " << message << std::endl;
      return 1;
    }

    if (!context->configuration.IsObject()) {
      std::cerr << "The configuration file does not contain a top level JSON object" << std::endl;
      return 1;
    }
  }

Casmine supports easy access to this config data via CasmineContext::configuration. Additionally, you can use helper functions to get values from a specific path through the JSON file (nothing fancy like XPath, just a simple recursive name lookup, which limits this to named elements, i.e. no array elements). The names of these helpers are pretty self-explanatory:

  std::string getConfigurationStringValue(std::string const& path, std::string const& defaultValue = "") const;
  int getConfigurationIntValue(std::string const& path, int defaultValue = 0) const;
  double getConfigurationDoubleValue(std::string const& path, double defaultValue = 0.0) const;
  bool getConfigurationBoolValue(std::string const& path, bool defaultValue = false) const;
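
For example (a sketch; the path separator and the key names shown here are assumptions about a hypothetical config file):

auto context = CasmineContext::get();
std::string user = context->getConfigurationStringValue("ftp/user", "anonymous");
int port = context->getConfigurationIntValue("ftp/port", 21);
bool useTls = context->getConfigurationBoolValue("ftp/useTls", false);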

Customizations

Casmine enables you to customize several aspects in a consistent way, without compromising the overall readability or handling.

Reporters

There are 2 predefined reporters currently:

  • Terminal/console with color support (where possible and meaningful, which means no ugly escape codes in log files)
  • JSON result file (for easy consumption by CI and other automated systems)

Color support is automatically determined (but can also be switched off by assigning true to the settings value no-colors or setting the environment variable CASMINE_NO_COLORS=1).

It's easy to add your own reporters (for example, to log results in a database) by deriving from the abstract Reporter class and implementing the necessary functions that do the specific handling. The new reporter must be registered via CasmineContext::addReporter. Registered reporters can be removed using CasmineContext::clearReporters.

Expects and Matchers

Casmine already covers a large part of possible test types, but sometimes it is necessary to expand on that, for example to provide your own matcher functions for already covered cases or for your own custom types. The main anchor point for extensions is the already mentioned $ModuleEnvironment macro. It consists of two parts:

  1. Local extensions
  2. Library extensions

Local extensions are valid only in the file where you specify them, while library extensions are reusable by all test specs. The technique behind this extension mechanism uses variadic base types, a variation of parameter packs (see the paragraph titled Base specifiers and member initializer lists). In short: the factory struct for the Expect class creation functions is composed of three parts:

  1. A struct with all default factory functions (named DefaultExpects).
  2. A struct in the local file (the { } block of the $ModuleEnvironment macro).
  3. A list of custom structs, which can be implemented outside the current spec file. The names of these structs must be specified in the parentheses of the $ModuleEnvironment macro.

A full environment call can be similar to this:

#include "casmine.h"

using namespace casmine;

namespace {

$ModuleEnvironment(MyExpects1, MyExpects2) {
  static auto makeExpect(const char *file, size_t line, MyType const& value) {
    return ExpectTemplate<MyExpect1<MyType>>(file, line, value);
  }
};

...

}

The two library extensions (MyExpects1 and MyExpects2), as well as the code in the environment block are all combined together with the default expects into a single local environment struct and take part in type deduction for your custom types automatically.

The implementation of a custom Expect class gives you a place to pull in standard matchers or your own matcher functions, by defining it (for example) as:

template<typename T>
class MyExpect1 : public Expect<T>, public MatcherClass<T>, public MySuperMatcher<T> {
public:
  MyExpect1(const char *file, size_t line, T const& value, bool inverse) :
    MatcherBase<T>(file, line, value, inverse),
    Expect<T>(),
    MatcherClass<T>(file, line, value, inverse),
    MySuperMatcher<T>(file, line, value, inverse) {}
};

A custom Expect class always has to derive from two classes: Expect, which is the base for all Expect classes, and MatcherBase, which is the base for all matcher classes. Due to the virtual inheritance used for the matchers, it is necessary to explicitly call the MatcherBase constructor, even if the custom Expect class does not directly derive from it.

In addition to these 2 base classes you can derive from any number of your own (or built-in) matcher classes. Each of them can provide a specific set of possible matcher functions. Take a closer look at the built-in matchers to learn how to write such a class.

In summary: the actual testing functionality is provided by the matcher classes, while the expect classes are used to select those matchers that provide the correct functionality for a given C/C++ type and provide the environment to easily execute normal and inverted expectations.

To supplement the previous description of Casmine, see the following concrete example for a custom class:

// The class to test (implementation omitted for brevity).
class FtpClass {
public:
  void connect();
  void disconnect();
  bool isConnected();
  std::string toString();
};

// The matcher providing specific check functions for this custom class.
template<typename T>
class MatcherFtp : public MatcherBase<T> {
public:
  MatcherFtp(const char *file, size_t line, T const& actual, bool inverse) :
    MatcherBase<T>(file, line, actual, inverse) {}

  using MatcherBase<T>::actualValue;
  using MatcherBase<T>::invertComparison;
  using MatcherBase<T>::processResult;

  void toBeConnected() {
    bool result = actualValue.isConnected();
    if (invertComparison)
      result = !result;
    processResult(result, invertComparison ? "The connection should be closed" : "The connection should be open");
  }
};

// The wrapping Expect class for the FTP matcher.
template<typename T>
class ExpectFtp : public Expect<T>, public MatcherFtp<T> {
public:
  ExpectFtp(const char *file, size_t line, T const& value, bool inverse) :
    MatcherBase<T>(file, line, value, inverse),
    Expect<T>(),
    MatcherFtp<T>(file, line, value, inverse) {}
};

$ModuleEnvironment() { 
  // This factory method takes part in the type deduction process to find an Expect class
  // for a specific custom type.
  static auto makeExpect(const char *file, size_t line, const FtpClass &value) {
    return ExpectTemplate<ExpectFtp<FtpClass>>(file, line, value);
  }
};

FtpClass ftp;

$describe("Test FTP") {
  $beforeAll([]() {
    ftp.connect();
  });

  $it("Test Description)", []() {
    $expect(ftp).toBeConnected();
  });
});

Library Functions

In addition to the main functionality for tests, there are some helper functions/structs to ease certain tasks. They are all used in Casmine, but are available also for general use in your tests or custom matchers.

ANSI Styles

If you want to print custom messages in your tests or reporters that are colored and/or have a certain style (bold, italic etc.), then you can use the output stream operators in the file ansi-styles.h. The example below only sketches the idea (the manipulator names shown are placeholders for the operators actually defined in ansi-styles.h):

std::cout << styleBold << "All tests passed" << styleReset << std::endl;  // placeholder style names

These operators automatically take care to print nothing if the current output is not a terminal (for example, when the output is redirected to a file).

Type Deduction

Correct type deduction is a key aspect of Casmine: it provides the right set of matcher functions for a specific group of types. The file common.h contains helper templates for more concise factory functions, for example EnableIf, EnableIfNot, IsContainer, IsSmartPointer and others. Here's how the scalar Expect factory method uses some of them:

  template<
    typename T,
    typename = EnableIf<std::is_scalar<T>>,
    typename = EnableIfNot<std::is_pointer<T>>,
    typename = EnableIfNot<IsSmartPointer<T>>
  >
  static auto makeExpect(const char *file, size_t line, T value, int *dummy = nullptr) {
    return ExpectTemplate<ExpectScalar<T>>(file, line, value);
  }

It is possible to freely mix the helper templates and the standard ones (as shown here for the pointer switch).

Others

The header file helpers.h contains a number of other helper functions:

  • splitBySet: splits a string into components using the passed-in separators.
  • utf32ToUtf8: converts std::u32string to std::string.
  • utf16ToUtf8: converts std::u16string to std::string.
  • utf8ToUtf16: converts std::string to std::u16string.
  • randomString: generates a random string from ASCII characters + digits with a maximum length.
  • getEnvVar: safe variant to get an environment variable, returns the given default value if a variable doesn't exist.
  • expandPath: expands all environment variables and special chars (like the tilde on Linux/macOS) in the given string. Converts only those values that are valid on the current platform.
  • relativePath: returns the second path relative to the given base path, provided both have a common ancestor.
  • length: determines the length of a string literal at compile time (for instance for static assertions).
  • typeToString(): platform-agnostic C++ type name demangling.
  • toString(): recursive value-to-string conversion beyond the abilities of std::to_string. Output is always UTF-8 encoded, also for UTF-16 and UTF-32 input.
  • containerToString: converts any container object (set, map etc.) into a string.
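
A couple of these in action (a sketch; the exact parameter lists are assumptions based on the descriptions above, except expandPath, which matches its use in the configuration example earlier):

std::string config = expandPath("~/casmine/config.json");  // expands ~ and environment variables
std::string home = getEnvVar("HOME", "/tmp");              // default returned if HOME is unset
std::string name = randomString(16);                       // random ASCII letters + digits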