Channel: Planet MySQL

PHP 7.2.8 & MySQL 8.0


Good news for all PHP CMS users (Drupal, Joomla!, etc.): PHP 7.2.8 (available on Remi’s repo for those using rpms) supports the new MySQL 8.0 default authentication plugin, ‘caching_sha2_password‘ !

So, I’ve installed PHP 7.2.8:

And my user (here joomla) uses caching_sha2_password:

I grant this user SELECT access on the mysql.user table:
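The original post shows these steps as screenshots; a minimal sketch of the equivalent SQL, assuming the example user, password and database names used in this post, could look like this:

```sql
-- Create the user with the MySQL 8.0 default authentication plugin.
-- 'joomla' / 'joomla' are the example credentials from this post.
CREATE USER 'joomla'@'localhost'
    IDENTIFIED WITH caching_sha2_password BY 'joomla';

-- Allow the demo query below to read the authentication plugin in use.
GRANT SELECT ON mysql.user TO 'joomla'@'localhost';

-- Verify which plugin the account uses.
SELECT user, plugin FROM mysql.user WHERE user = 'joomla';
```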

Now let me share my wonderful PHP skills:

<?php
$mysqli = new mysqli('localhost', 'joomla', 'joomla', 'joomla_db');
if ($mysqli->connect_errno) {
   echo "Sorry, this website is experiencing problems.<p>";
   echo "Error: Failed to make a MySQL connection, here is why: <br>";
   echo "Errno: " . $mysqli->connect_errno . "<br>";
   echo "Error: " . $mysqli->connect_error . "<br>";
   exit;
}

echo "Wooohoooo it works with PHP" . phpversion() ."!!<br><hr>";

$sql = "select user, plugin from mysql.user where user = 'joomla'";
if (!$result = $mysqli->query($sql)) {
   echo "Error: Our query failed to execute and here is why: <br>";
   echo "Query: " . $sql . "<br>";
   echo "Errno: " . $mysqli->errno . "<br>";
   echo "Error: " . $mysqli->error . "<br>";
   exit;
}

$user = $result->fetch_assoc();
echo "user: " . $user['user'] . "<br>";
echo "plugin: " . $user['plugin'];

$result->free();
$mysqli->close();
?>

And here is the result:

In conclusion, it seems the PHP team is taking good care of MySQL 8.0 support, and when you check the release tags in the GitHub commit, you can see that this is not only for PHP 7.2.8! Good job! So please upgrade to a PHP version that supports MySQL 8!


Comparing TokuDB, RocksDB and InnoDB Performance on Intel(R) Xeon(R) Gold 6140 CPU


Recently one of our customers asked us to benchmark InnoDB, TokuDB and RocksDB for performance on an Intel(R) Xeon(R) Gold 6140 CPU (72 CPUs), NVMe SSD (7 TB) and 530 GB of RAM. We used Ubuntu Xenial 16.04.4, Percona Server 5.7 (which bundles the InnoDB/XtraDB, TokuDB and RocksDB storage engines) and Sysbench 1.0.15 with custom Lua scripts for this exercise. The benchmarking covered bulk INSERT, WRITE, READ and READ-WRITE workloads. We have tried our best to capture maximum information about the hardware infrastructure, and we have copied/shared the scripts we used for benchmarking. This is not a paid or sponsored benchmarking effort by any software or hardware vendor; we will remain a vendor-neutral and independent web-scale database infrastructure operations company with core expertise in performance, scalability, high availability and database reliability engineering. This benchmarking was conducted by Shiv Iyer; you can contact him directly at shiv@minervadb.com to discuss this benchmarking project.

Hardware information 

We have captured detailed information about the infrastructure (CPU, disk and memory) used for this benchmarking; this really helps anyone doing capacity planning / sizing of their database infrastructure.

CPU details (Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 72 CPUs)

root@blr1p01-pfm-008:/home/t-minervadb# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:              4
CPU MHz:               1000.000
CPU max MHz:           2301.0000
CPU min MHz:           1000.0000
BogoMIPS:              4601.52
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt spec_ctrl retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

Storage devices used for benchmarking (we used NVME SSD)

root@blr1p01-pfm-008:/home/t-minervadb# lsblk -io NAME,TYPE,SIZE,MOUNTPOINT,FSTYPE,MODEL
NAME        TYPE   SIZE MOUNTPOINT FSTYPE MODEL
sda         disk 446.1G                   LSI2208         
|-sda1      part 438.7G /          ext4   
`-sda2      part   7.5G [SWAP]     swap   
nvme0n1     disk   2.9T /mnt       ext4   Micron_9200_MTFDHAL3T2TCU               
nvme1n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
nvme2n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
nvme3n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
`-nvme3n1p1 part   128M                   
nvme4n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
`-nvme4n1p1 part   128M                   
nvme5n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
`-nvme5n1p1 part   128M                   
nvme6n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
`-nvme6n1p1 part   128M                   
nvme7n1     disk   2.9T                   Micron_9200_MTFDHAL3T2TCU               
`-nvme7n1p1 part   128M

Memory

root@blr1p01-pfm-008:/home/t-minervadb# free
              total        used        free      shared  buff/cache   available
Mem:      527993080    33848440   480213336       18304    13931304   492519988
Swap:       7810044           0     7810044
root@blr1p01-pfm-008:/home/t-minervadb#
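For capacity-planning purposes, the kB figures printed by `free` above convert to GiB as follows (a quick sketch; the values are the ones from the output above):

```python
# Convert the kB values reported by `free` (above) into GiB.
KIB_PER_GIB = 1024 * 1024  # `free` prints kibibytes by default

total_kb = 527_993_080
available_kb = 492_519_988

print(f"total RAM:  {total_kb / KIB_PER_GIB:.1f} GiB")      # ~503.5 GiB
print(f"available:  {available_kb / KIB_PER_GIB:.1f} GiB")  # ~469.7 GiB
```

In decimal units this is roughly 528 GB, which matches the "530 GB RAM" figure quoted in the introduction.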

Benchmarking OLTP INSERT performance on TokuDB, RocksDB and InnoDB

TokuDB OLTP INSERT performance benchmarking using Sysbench

Building the benchmark dataset (Percona Server with TokuDB) for the INSERT workload:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD --mysql-storage-engine=tokudb prepare
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Initializing worker threads...

Creating table 'sbtest1'...
Inserting 100000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...

“sbtest1” schema structure ( TokuDB storage engine with 100M rows)

mysql> show table status like 'sbtest1%'\G;
*************************** 1. row ***************************
           Name: sbtest1
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 100000000
 Avg_row_length: 189
    Data_length: 18900000000
Max_data_length: 9223372036854775807
   Index_length: 860808942
      Data_free: 18446744065817975570
 Auto_increment: 100000001
    Create_time: 2018-08-03 23:03:35
    Update_time: 2018-08-03 23:23:51
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Benchmarking TokuDB (with 100M rows) INSERT using Sysbench (oltp_insert.lua)

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD --mysql-storage-engine=tokudb run

Monitoring the benchmarking

mysql> show full processlist\G;
*************************** 1. row ***************************
           Id: 106
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 49754892, '62632931051-58961919101-49940198850-21078424594-43546312816-91483171956-63147821178-73320074434-75390450161-85244468625', '72758152721-79346997448-32739052749-09956023061-33461120469')
    Rows_sent: 0
Rows_examined: 0
*************************** 2. row ***************************
           Id: 107
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 38299901, '73492364485-17164009439-13897782190-82384134069-56725118845-05888552123-04466761496-73013947541-76946111000-82170241506', '57825848902-56599269429-55553620227-85565361679-86108748354')
    Rows_sent: 0
Rows_examined: 0
*************************** 3. row ***************************
           Id: 108
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: closing tables
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 50461359, '82489034494-43306780333-31830745333-81619557910-15670574031-38606658735-35015531633-82686313168-29930813640-55800112343', '98734612239-15166737116-32153746057-36526618555-01917900606')
    Rows_sent: 0
Rows_examined: 0
*************************** 4. row ***************************
           Id: 109
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 50305368, '26004165285-71866035101-19429620467-21730816230-28360163045-85578016857-31504027785-22011080750-52188150293-29047779256', '40086488864-24563838334-16649832399-35567929449-35827527600')
    Rows_sent: 0
Rows_examined: 0

*************************** 98. row ***************************
           Id: 203
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 50008367, '08590860349-55330969614-92736003669-70093680275-08791372163-86879862146-65906035624-31616634007-39285699730-30091204027', '03546380555-08125979095-56416888610-57364610871-45465441885')
    Rows_sent: 0
Rows_examined: 0
*************************** 99. row ***************************
           Id: 204
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 54541565, '62284574810-41408816172-84693515960-17097326417-15199773762-35816031089-51785557714-03836189148-75055812047-57404275889', '89419445215-23758954221-31182195029-89303506158-96423989766')
    Rows_sent: 0
Rows_examined: 0
*************************** 100. row ***************************
           Id: 205
         User: root
         Host: localhost
           db: test
      Command: Query
         Time: 0
        State: update
         Info: INSERT INTO sbtest1 (id, k, c, pad) VALUES (0, 49961655, '04968809340-71773840704-69257717063-97968863839-17701720758-38065324563-11587467460-13905955489-57279753705-77707929689', '02758577051-41889982054-46749141829-07683639044-92209230468')
    Rows_sent: 0
Rows_examined: 0
*************************** 101. row ***************************
           Id: 206
         User: root
         Host: localhost
           db: NULL
      Command: Query
         Time: 0
        State: starting
         Info: show full processlist
    Rows_sent: 0
Rows_examined: 0
101 rows in set (0.00 sec)

ERROR: 
No query specified

Result

When interpreting the benchmarking results, I look for transactions / queries per second (in this case, 10048.74 per sec.) and average latency (9.95 ms):

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD --mysql-storage-engine=tokudb run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            0
        write:                           18088064
        other:                           0
        total:                           18088064
    transactions:                        18088064 (10048.74 per sec.)
    queries:                             18088064 (10048.74 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0299s
    total number of events:              18088064

Latency (ms):
         min:                                    0.24
         avg:                                    9.95
         max:                                  210.80
         95th percentile:                       22.28
         sum:                            179905047.86

Threads fairness:
    events (avg/stddev):           180880.6400/323.88
    execution time (avg/stddev):   1799.0505/0.01
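As a sanity check on how sysbench derives these headline numbers, the throughput and average latency can be recomputed from the totals in the report above (a quick sketch using the printed figures; sysbench's own elapsed-time bookkeeping differs by a hair, so expect tiny rounding differences):

```python
# Figures from the sysbench TokuDB OLTP INSERT report above.
events = 18_088_064          # total number of events (1 query per transaction here)
total_time_s = 1800.0299     # total time
latency_sum_ms = 179_905_047.86

tps = events / total_time_s               # ~10048.7 transactions/sec
avg_latency_ms = latency_sum_ms / events  # ~9.95 ms

print(f"{tps:.2f} tps, {avg_latency_ms:.2f} ms avg latency")
```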

Benchmarking OLTP INSERT performance on RocksDB using Sysbench 

Step 1 – Prepare data

sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD --mysql-storage-engine=rocksdb prepare

Step 2 – “sbtest1” schema structure ( RocksDB storage engine with 100M rows)

mysql> show table status like 'sbtest1'\G;
*************************** 1. row ***************************
           Name: sbtest1
         Engine: ROCKSDB
        Version: 10
     Row_format: Fixed
           Rows: 100000000
 Avg_row_length: 198
    Data_length: 19855730417
Max_data_length: 0
   Index_length: 750521287
      Data_free: 0
 Auto_increment: 100000001
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.01 sec)

ERROR: 
No query specified

Step 3 – Benchmarking OLTP INSERT performance on RocksDB

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD --mysql-storage-engine=rocksdb run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            0
        write:                           137298161
        other:                           0
        total:                           137298161
    transactions:                        137298161 (76275.15 per sec.)
    queries:                             137298161 (76275.15 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0344s
    total number of events:              137298161

Latency (ms):
         min:                                    0.29
         avg:                                    1.31
         max:                                   66.32
         95th percentile:                        1.67
         sum:                            179465859.14

Threads fairness:
    events (avg/stddev):           1372981.6100/73.07
    execution time (avg/stddev):   1794.6586/0.02

Interpreting results 

Transactions / Queries  (per second) – 76275.15

Average latency (ms) – 1.31

Benchmarking OLTP INSERT performance on InnoDB using Sysbench 

Step 1 – prepare data for benchmarking

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD  prepare
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Initializing worker threads...

Creating table 'sbtest1'...
Inserting 100000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...

Step 2 – “sbtest1” schema structure ( InnoDB storage engine with 100M rows)

mysql> show table status like 'sbtest1%'\G; 
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 98682155
 Avg_row_length: 218
    Data_length: 21611151360
Max_data_length: 0
   Index_length: 0
      Data_free: 3145728
 Auto_increment: 100000001
    Create_time: 2018-08-04 17:14:04
    Update_time: 2018-08-04 17:11:01
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking OLTP INSERT performance on InnoDB

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_insert.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD  run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            0
        write:                           42243914
        other:                           0
        total:                           42243914
    transactions:                        42243914 (23468.40 per sec.)
    queries:                             42243914 (23468.40 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0319s
    total number of events:              42243914

Latency (ms):
         min:                                    0.12
         avg:                                    4.26
         max:                                 1051.64
         95th percentile:                       21.50
         sum:                            179801087.85

Threads fairness:
    events (avg/stddev):           422439.1400/1171.09
    execution time (avg/stddev):   1798.0109/0.01

Interpreting results 

Transactions / Queries  (per second) – 23468.40

Average latency (ms) – 4.26

Graphical representation of  OLTP INSERT performance in TokuDB, RocksDB and InnoDB: 
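Since the chart image is not reproduced here, the INSERT throughput figures reported above can be tabulated with a small script (the numbers are the per-second transaction rates from the three runs):

```python
# OLTP INSERT throughput (transactions/sec) from the three runs above.
insert_tps = {
    "TokuDB": 10048.74,
    "InnoDB": 23468.40,
    "RocksDB": 76275.15,
}

# Print engines from fastest to slowest with a crude text bar (1 char ~ 2000 tps).
for engine, tps in sorted(insert_tps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{engine:<8} {tps:>9.2f}  {'#' * round(tps / 2000)}")
```

For this INSERT-only workload, RocksDB leads by a wide margin, followed by InnoDB and then TokuDB.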

Benchmarking OLTP READ-ONLY transactions performance on TokuDB, RocksDB and InnoDB

Benchmarking READ-ONLY OLTP transactions (100M records using oltp_read_only.lua) on TokuDB:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=tokudb --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2- Confirm TokuDB schema is available with 100M records:

mysql> show table status like 'sbtest1%'\G;
*************************** 1. row ***************************
           Name: sbtest1
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 100000000
 Avg_row_length: 189
    Data_length: 18900000000
Max_data_length: 9223372036854775807
   Index_length: 860426496
      Data_free: 18446744065835135232
 Auto_increment: 100000001
    Create_time: 2018-08-05 12:53:50
    Update_time: 2018-08-05 13:13:38
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking TokuDB OLTP READ-ONLY transaction performance:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=tokudb --mysql-user=root --mysql-password=USEYOURPASSWORD run  
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            231960820
        write:                           0
        other:                           33137260
        total:                           265098080
    transactions:                        16568630 (9204.59 per sec.)
    queries:                             265098080 (147273.50 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0348s
    total number of events:              16568630

Latency (ms):
         min:                                    1.71
         avg:                                   10.86
         max:                                   51.11
         95th percentile:                       13.22
         sum:                            179951191.99

Threads fairness:
    events (avg/stddev):           165686.3000/481.89
    execution time (avg/stddev):   1799.5119/0.01

Interpreting results 

QPS  (Queries per second) – 147273.50

Average latency (ms) – 10.86

Benchmarking READ-ONLY OLTP transactions on RocksDB 

Step 1 – Build data (100M records using oltp_read_only.lua) for benchmarking:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=rocksdb --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2- Confirm RocksDB schema is available with 100M records:

mysql> show table status like 'sbtest%'\G; 
*************************** 1. row ***************************
           Name: sbtest1
         Engine: ROCKSDB
        Version: 10
     Row_format: Fixed
           Rows: 100000000
 Avg_row_length: 198
    Data_length: 19855730417
Max_data_length: 0
   Index_length: 750521333
      Data_free: 0
 Auto_increment: 100000001
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking RocksDB OLTP READ-ONLY transaction performance:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=rocksdb --mysql-user=root --mysql-password=USEYOURPASSWORD run 
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            494461100
        write:                           0
        other:                           70637300
        total:                           565098400
    transactions:                        35318650 (19621.05 per sec.)
    queries:                             565098400 (313936.76 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0349s
    total number of events:              35318650

Latency (ms):
         min:                                    1.80
         avg:                                    5.09
         max:                                  323.58
         95th percentile:                        7.70
         sum:                            179898262.01

Threads fairness:
    events (avg/stddev):           353186.5000/2619.22
    execution time (avg/stddev):   1798.9826/0.02

Interpreting results 

QPS  (Queries per second) – 313936.76

Average latency (ms) – 5.09

Benchmarking READ-ONLY OLTP transactions on InnoDB

Step 1: Build data (100M records using oltp_read_only.lua) for benchmarking:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock  --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2 – Confirm InnoDB schema is available with 100M records:

mysql> show table status like 'sbtest1'\G;
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 98650703
 Avg_row_length: 224
    Data_length: 22126002176
Max_data_length: 0
   Index_length: 0
      Data_free: 3145728
 Auto_increment: 100000001
    Create_time: 2018-08-05 17:20:48
    Update_time: 2018-08-05 17:18:19
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking InnoDB OLTP READ-ONLY transaction performance:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_only.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock  --mysql-user=root --mysql-password=USEYOURPASSWORD run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            251061874
        write:                           0
        other:                           35865982
        total:                           286927856
    transactions:                        17932991 (9962.59 per sec.)
    queries:                             286927856 (159401.44 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0300s
    total number of events:              17932991

Latency (ms):
         min:                                    1.66
         avg:                                   10.03
         max:                                 1478.79
         95th percentile:                       33.12
         sum:                            179947481.25

Threads fairness:
    events (avg/stddev):           179329.9100/1283.20
    execution time (avg/stddev):   1799.4748/0.01

Interpreting results 

QPS  (Queries per second) – 159401.44

Average latency (ms) – 10.03

Graphical representation of  OLTP READ-ONLY transactions performance in TokuDB, RocksDB and InnoDB: 
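As with the INSERT chart, the READ-ONLY throughput figures reported above can be tabulated in lieu of the missing image (the numbers are the QPS values from the three runs):

```python
# OLTP READ-ONLY throughput (queries/sec) from the three runs above.
read_only_qps = {
    "TokuDB": 147273.50,
    "InnoDB": 159401.44,
    "RocksDB": 313936.76,
}

# Print engines from fastest to slowest with a crude text bar (1 char ~ 10000 qps).
for engine, qps in sorted(read_only_qps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{engine:<8} {qps:>10.2f}  {'#' * round(qps / 10000)}")
```

RocksDB again comes out on top, roughly doubling the InnoDB and TokuDB numbers, which sit much closer to each other than in the INSERT test.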

Benchmarking OLTP READ-WRITE transactions performance on TokuDB, RocksDB and InnoDB

Benchmarking READ-WRITE OLTP transactions on TokuDB

Step 1: Build data (100M records using oltp_read_write.lua) for benchmarking:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=tokudb --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2- Confirm TokuDB schema is available with 100M records:

mysql> show table status like 'sbtest1%'\G; 
*************************** 1. row ***************************
           Name: sbtest1
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 100000000
 Avg_row_length: 189
    Data_length: 18900000000
Max_data_length: 9223372036854775807
   Index_length: 860645232
      Data_free: 18446744065834916496
 Auto_increment: 100000001
    Create_time: 2018-08-05 22:41:43
    Update_time: 2018-08-05 23:01:00
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking OLTP READ-WRITE performance on TokuDB:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=tokudb --mysql-user=root --mysql-password=USEYOURPASSWORD run 
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            19844342
        write:                           5669812
        other:                           2834906
        total:                           28349060
    transactions:                        1417453 (787.44 per sec.)
    queries:                             28349060 (15748.86 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0668s
    total number of events:              1417453

Latency (ms):
         min:                                    3.90
         avg:                                  126.99
         max:                                  426.41
         95th percentile:                      147.61
         sum:                            179997357.31

Threads fairness:
    events (avg/stddev):           14174.5300/7.61
    execution time (avg/stddev):   1799.9736/0.02

Interpreting results 

QPS  (Queries per second) – 15748.86

Average latency (ms) – 126.99
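As a side note on what oltp_read_write actually measures: dividing the query counts above by the transaction count recovers the standard per-transaction mix of the default sysbench OLTP profile (a quick check with the TokuDB figures):

```python
# Per-transaction query mix for the TokuDB oltp_read_write run above.
transactions = 1_417_453
reads, writes, other = 19_844_342, 5_669_812, 2_834_906

# Each count divides exactly by the transaction count.
assert reads % transactions == 0
print(reads // transactions, writes // transactions, other // transactions)
# -> 14 4 2: 14 reads, 4 writes and 2 "other" (BEGIN/COMMIT) per transaction,
#    i.e. 20 queries per transaction in total.
```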

Benchmarking READ-WRITE OLTP transactions on RocksDB

Step 1: Build data (100M records using oltp_read_write.lua) for benchmarking:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=rocksdb --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2- Confirm RocksDB schema is available with 100M records:

mysql> show table status like 'sbtest1%'\G; 
*************************** 1. row ***************************
           Name: sbtest1
         Engine: ROCKSDB
        Version: 10
     Row_format: Fixed
           Rows: 100000000
 Avg_row_length: 198
    Data_length: 19855694789
Max_data_length: 0
   Index_length: 750521319
      Data_free: 0
 Auto_increment: 100000001
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step 3 – Benchmarking OLTP READ-WRITE performance on RocksDB:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000  --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-storage-engine=rocksdb --mysql-user=root --mysql-password=USEYOURPASSWORD run 
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            286818014
        write:                           81910410
        other:                           40961372
        total:                           409689796
    transactions:                        20474371 (11374.39 per sec.)
    queries:                             409689796 (227600.23 per sec.)
    ignored errors:                      12630  (7.02 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0375s
    total number of events:              20474371

Latency (ms):
         min:                                    2.50
         avg:                                    8.79
         max:                                  402.68
         95th percentile:                       12.75
         sum:                            179935638.52

Threads fairness:
    events (avg/stddev):           204743.7100/2264.14
    execution time (avg/stddev):   1799.3564/0.01

Interpreting results 

QPS  (Queries per second) – 227600.23

Average latency (ms) – 8.79

Benchmarking READ-WRITE OLTP transactions on InnoDB

Step 1: Build data (100M records using oltp_read_write.lua) for benchmarking:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000 --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD prepare

Step 2- Confirm InnoDB schema is available with 100M records:

mysql> show table status like 'sbtest1%'\G; 
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 100000000
 Avg_row_length: 221
    Data_length: 21885878272
Max_data_length: 0
   Index_length: 0
      Data_free: 6291456
 Auto_increment: 100000001
    Create_time: 2018-08-06 10:24:54
    Update_time: 2018-08-06 10:31:53
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)

ERROR: 
No query specified

Step3 – Benchmarking OLTP READ-WRITE performance on InnoDB:

root@blr1p01-pfm-008:/usr/share/sysbench# sysbench oltp_read_write.lua --threads=100 --time=1800 --table-size=100000000 --db-driver=mysql --mysql-db=test --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=root --mysql-password=USEYOURPASSWORD run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 100
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            67383470
        write:                           19251931
        other:                           9626043
        total:                           96261444
    transactions:                        4812938 (2673.78 per sec.)
    queries:                             96261444 (53477.03 per sec.)
    ignored errors:                      167    (0.09 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0491s
    total number of events:              4812938

Latency (ms):
         min:                                    2.28
         avg:                                   37.40
         max:                                 1177.78
         95th percentile:                       71.83
         sum:                            179981855.37

Threads fairness:
    events (avg/stddev):           48129.3800/110.24
    execution time (avg/stddev):   1799.8186/0.00

Interpreting results 

QPS  (Queries per second) – 53477.03

Average latency (ms) – 37.40

Graphical representation of  OLTP READ-WRITE transactions performance in TokuDB, RocksDB and InnoDB: 

Conclusion

The benchmark results show that RocksDB is the most suitable candidate for SSD-based storage infrastructure compared to InnoDB and TokuDB. The most compelling reasons for using RocksDB on SSD are storage efficiency, compression and much smaller write amplification compared to InnoDB or TokuDB.
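Tabulating the three OLTP read-write runs side by side makes the gap concrete. This small sketch uses only the figures reported in this post, normalized against InnoDB:

```python
# QPS and average latency reported above for the 100-thread
# OLTP read-write benchmark on each storage engine.
results = {
    "RocksDB": (227600.23, 8.79),
    "InnoDB":  (53477.03, 37.40),
    "TokuDB":  (15748.86, 126.99),
}

baseline_qps = results["InnoDB"][0]
for engine, (qps, avg_ms) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{engine:8s} {qps:>10.2f} qps  {avg_ms:>7.2f} ms  "
          f"{qps / baseline_qps:.2f}x vs InnoDB")
```

On these numbers RocksDB delivers roughly 4.3x the throughput of InnoDB, while TokuDB reaches under a third of it.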

The post Comparing TokuDB, RocksDB and InnoDB Performance on Intel(R) Xeon(R) Gold 6140 CPU appeared first on MySQL Consulting, Support and Remote DBA Services.

Webinar Tues 8/7: Utilizing ProxySQL for Connection Pooling in PHP

Please join Percona’s Architect, Tibi Köröcz as he presents Utilizing ProxySQL for Connection Pooling in PHP on Tuesday August 7, 2018, at 7:00 am PDT (UTC-7) / 10:00 am EDT (UTC-4).

 

ProxySQL is a very powerful tool, with extended capabilities. This presentation will demonstrate how to use ProxySQL to gain functionality (seamless database backend switch) and correct problems (applications missing connection pooling).

The presentation will be a real-life study on how we use ProxySQL for connection pooling, database failover and load balancing the communication between our (third party) PHP-application and our master-master MySQL-cluster.
Also, we will show monitoring and statistics using Percona Monitoring and Management (PMM).

Register Now!

Tibor Köröcz

Architect


Tibi joined Percona in 2015 as a Consultant. Before joining Percona, among many other things, he worked at the world’s largest car hire booking service as a Senior Database Engineer. He enjoys trying and working with the latest technologies and applications which can help or work with MySQL together. In his spare time he likes to spend time with his friends, travel around the world and play ultimate frisbee.

 

The post Webinar Tues 8/7: Utilizing ProxySQL for Connection Pooling in PHP appeared first on Percona Database Performance Blog.

Multi-Cloud SaaS Applications: Speed + Availability = Success!


In this blog post, we talk about how to run applications across multiple clouds (i.e. AWS, Google Cloud, Microsoft Azure) using Continuent Clustering. You want your business-critical applications to withstand node, datacenter, availability-zone or regional failures. For SaaS apps, you also want to bring data close to your application users for faster response times and a better user experience. With cross-cloud capability, Continuent also helps avoid lock-in to any particular cloud provider.

The key to success for the database layer is to be available and respond rapidly.

From both a business and operational perspective, spreading the application across cloud environments from different vendors provides significant protection against vendor-specific outages and vendor lock-in. Running on multiple platforms provides greater bargaining leverage with each vendor, because they do not have a monopoly on your cloud operations.

Continuent Clustering is flexible and platform-agnostic, so it runs across many different environments. You may mix and match any and all of them (i.e. one or more clusters of three nodes each per environment), so a single cluster can span AWS, GCS, Azure and even a bare-metal datacenter. For cost savings, test/dev environments can be put on VMs, Docker containers or even VirtualBox.

There are many challenges to running an application over large distances, including network latency for reads and getting local writes distributed to other regions.

Continuent Clustering provides a cohesive solution which addresses the various concerns when running a geo-distributed application.

Let’s look at the various factors, and how each is handled:

Be available

  • (local) – when looked at from a local perspective, this is considered high availability (HA). If the MySQL database, which is handling writes (and reads) should become unavailable, automatically switch to another server with the same information and proceed to serve writes and reads to the application layer with as little down-time as possible.
  • (global) – when looked at from a global perspective, this is called disaster recovery (DR). Should an entire site, region, availability zone or even cloud become unavailable, allow another site with the same information to serve writes and reads to the application layer with as little down-time as possible.
  • (global) – the Tungsten Replicator is cluster-aware, which means that in the event of the complete loss of a remote node to obtain data from, the Replicator is able to automatically switch to another source node and pick up where it left off.
  • (global) – the Tungsten Connector is cluster- and site-aware, and is able to route both read and write requests to both local and remote resources. When the Connector is installed directly on an application server, a Multimaster cluster topology can withstand the loss of the entire database layer at one site by redirecting reads and writes to another region.

Respond rapidly to requests

  • (local) – by using the failover replicas as read sources, we are able to offload the requests from the master, freeing up valuable resources on the master (i.e. CPU, memory, disk I/O, network bandwidth, etc.). This has the double effect of increasing performance on the write master and improving response time for reads.
  • (global) – employing active/active multimaster clustering, writes to each region are replicated to all other regions. This then makes the data available for local reads, so the database layer is able to respond much more quickly to requests for specific data which otherwise would have to be fetched from a remote site over the WAN, adding precious milliseconds to every query.
  • (global) – the built-in Tungsten Replicator provides loosely-coupled asynchronous data transfer both to local read replicas as well as to all remote sites. Given that WAN connections sometimes have high latency and even complete disconnects, the Replicator is able to track every event, pause when the link is down and resume when the link becomes available.

All of the above facets combine to make a polished diamond of a solution, fit for your company’s worldwide enterprise-quality deployment!

If you are interested in running a proof-of-concept, please contact us now!

Releasing ProxySQL 1.4.10


Proudly announcing the release of the latest stable release of ProxySQL 1.4.10 as of the 6th of August 2018.

ProxySQL is a high performance, high availability, protocol-aware proxy for MySQL. It is freely usable under the GPL license and can be downloaded here or installed from the APT / YUM repos listed in the project's GitHub wiki.

ProxySQL 1.4.10 includes a number of important improvements and bug fixes including:

  • Fixed a bug related to FreeBSD compile #1536
  • Various memory leaks addressed
  • Better handling of connect_timeout_server_max
  • ProxySQL now exits if it is unable to read the specified config file on startup
  • Fixed proxysql_galera_checker.sh /sbin in $PATH #1597
  • If a backend generates errors while running queries, apply the same error-handling logic used for errors during connections
  • Handled cases in which dbname in HandshakeResponse41 is not null terminated
  • Fixed STMT_SEND_LONG_DATA processing which was incorrectly reading data from the STMT_EXECUTE packet, causing corruption of any subsequent parameters.
  • Report a warning if mysql-query_digests=false #1591
  • Disabled unnecessary options from builtin curl compile
  • Use SELECT @@global.read_only for monitoring to avoid contention/locking issues #1621
  • Add randomness when scheduling backend monitor checks #1630
  • Do not decrease count of used connection when connection was rejected #1626
  • Kill backend connections using KILL when a client disconnects, added global variable mysql-kill_backend_connection_when_disconnect
  • Fixed a bug to prevent proxysql from hanging when sending query to server is slow
  • Define CLOCK_MONOTONIC as CLOCK_SYSTEM when not defined #1571

The related issues/commits can be found in the v1.4.10 release.

A special thanks to all the people that report bugs: this makes each version of ProxySQL better than the previous one.

Please report any bugs or feature requests on the GitHub issue tracker.

Authored by: Nick Vyzas

Replicating from MySQL 8.0 to MySQL 5.7


In this blog post, we’ll discuss how to set up replication from MySQL 8.0 to MySQL 5.7. There are some situations in which this configuration might help. For example, in the case of a MySQL upgrade, it can be useful to have a master running a newer version of MySQL replicating to an older-version slave as a rollback plan. Another example is upgrading a master-master replication topology.

Officially, replication is only supported between consecutive major MySQL versions, and only from a lower version master to a higher version slave. Here is an example of a supported scenario:

5.7 master –> 8.0 slave

while the opposite is not supported:

8.0 master –> 5.7 slave

In this blog post, I’ll walk through how to overcome the initial problems and get replication working in this scenario. I’ll also show some errors that can halt replication if a new feature from MySQL 8 is used.

Here is the initial set up that will be used to build the topology:

slave > select @@version;
+---------------+
| @@version     |
+---------------+
| 5.7.17-log    |
+---------------+
1 row in set (0.00 sec)
master > select @@version;
+-----------+
| @@version |
+-----------+
| 8.0.12    |
+-----------+
1 row in set (0.00 sec)

First, before executing the CHANGE MASTER command, you need to modify the collation on the master server. Otherwise the replication will run into this error:

slave > show slave status\G
                   Last_Errno: 22
                   Last_Error: Error 'Character set '#255' is not a compiled character set and is not specified in the '/opt/percona_server/5.7.17/share/charsets/Index.xml' file' on query. Default database: 'mysql8_1'. Query: 'create database mysql8_1'

This is because the default character_set and the collation has changed on MySQL 8. According to the documentation:

The default value of the character_set_server and character_set_database system variables has changed from latin1 to utf8mb4.

The default value of the collation_server and collation_database system variables has changed from latin1_swedish_ci to utf8mb4_0900_ai_ci.
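The '#255' in the slave error above is a collation ID: MySQL 8.0's new default collation, utf8mb4_0900_ai_ci, has ID 255, and the charsets/Index.xml shipped with 5.7 simply does not list the 0900 collations. A hedged illustration with a few well-known collation IDs (an illustrative subset of INFORMATION_SCHEMA.COLLATIONS; verify against your own server):

```python
# A few well-known MySQL collation IDs (illustrative subset).
collations = {
    8:   "latin1_swedish_ci",   # old default on 5.7
    33:  "utf8_general_ci",
    45:  "utf8mb4_general_ci",
    255: "utf8mb4_0900_ai_ci",  # new default on 8.0
}

# 5.7's Index.xml knows the pre-8.0 collations but not the 0900 family,
# so a binlog event tagged with collation 255 cannot be resolved there.
known_to_57 = {8, 33, 45}
event_collation = 255
print(event_collation in known_to_57)  # False -> "Character set '#255' ..." error
```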

Let’s change the collation and the character set to utf8 on MySQL 8 (it is possible to use any option that exists in both versions):

# master my.cnf
[client]
default-character-set=utf8
[mysqld]
character-set-server=utf8
collation-server=utf8_unicode_ci

You need to restart MySQL 8 to apply the changes. Next, after the restart, you have to create a replication user using mysql_native_password. This is because MySQL 8 changed the default authentication plugin to caching_sha2_password, which is not supported by MySQL 5.7. If you try to execute the CHANGE MASTER command with a user that uses the caching_sha2_password plugin, you will receive the error message below:

Last_IO_Errno: 2059
Last_IO_Error: error connecting to master 'root@127.0.0.1:19025' - retry-time: 60 retries: 1

To create a user using mysql_native_password:

master> CREATE USER 'replica_user'@'%' IDENTIFIED WITH mysql_native_password BY 'repli$cat';
master> GRANT REPLICATION SLAVE ON *.* TO 'replica_user'@'%';

Finally, we can proceed as usual to build the replication:

master > show master status\G
*************************** 1. row ***************************
File: mysql-bin.000007
Position: 155
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
1 row in set (0.00 sec)
slave > CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_USER='replica_user', MASTER_PASSWORD='repli$cat',MASTER_PORT=19025, MASTER_LOG_FILE='mysql-bin.000007', MASTER_LOG_POS=155; start slave;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
# This procedure works with GTIDs too
slave > CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_USER='replica_user', MASTER_PASSWORD='repli$cat',MASTER_PORT=19025,MASTER_AUTO_POSITION = 1 ; start slave;

Checking the replication status:

slave > show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: replica_user
Master_Port: 19025
Connect_Retry: 60
Master_Log_File: mysql-bin.000007
Read_Master_Log_Pos: 155
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 321
Relay_Master_Log_File: mysql-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 155
Relay_Log_Space: 524
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 100
Master_UUID: 00019025-1111-1111-1111-111111111111
Master_Info_File: /home/vinicius.grippa/sandboxes/rsandbox_5_7_17/master/data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.01 sec)

Executing a quick test to check if the replication is working:

master > create database vinnie;
Query OK, 1 row affected (0.06 sec)

slave > show databases like 'vinnie';
+-------------------+
| Database (vinnie) |
+-------------------+
| vinnie            |
+-------------------+
1 row in set (0.00 sec)

Caveats

Any attempt to use a new MySQL 8 feature, such as roles, invisible indexes or caching_sha2_password, will make replication stop with an error:

master > alter user replica_user identified with caching_sha2_password by 'sekret';
Query OK, 0 rows affected (0.01 sec)

slave > show slave status\G
               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation ALTER USER failed for 'replica_user'@'%'' on query. Default database: ''. Query: 'ALTER USER 'replica_user'@'%' IDENTIFIED WITH 'caching_sha2_password' AS '$A$005$H	MEDi\"gQ
                        wR{/I/VjlgBIUB08h1jIk4fBzV8kU1J2RTqeqMq8Q2aox0''

Summary

Replicating from MySQL 8 to MySQL 5.7 is possible. In some scenarios (especially upgrades), this might be helpful, but it is not advisable to keep a heterogeneous topology, because it is prone to errors and incompatibilities in some cases.


The post Replicating from MySQL 8.0 to MySQL 5.7 appeared first on Percona Database Performance Blog.

Upcoming Webinar Tuesday, 7/31: Using MySQL for Distributed Database Architectures

Please join Percona’s CEO, Peter Zaitsev as he presents Using MySQL for Distributed Database Architectures on Tuesday, July 31st, 2018 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

In modern data architectures, we’re increasingly moving from single-node design systems to distributed architectures using multiple nodes – often spread across multiple databases and multiple continents. Such architectures bring many benefits (such as scalability and resiliency), but can also bring a lot of pain if incorrectly designed and executed.

In this presentation, we will look at how we can use MySQL to engineer distributed multi-node systems.

Register for the webinar.

Peter Zaitsev, CEO and Co-Founder

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Inc. 5000 named Percona to their list in 2013, 2014, 2015 and 2016. Peter was an early employee at MySQL AB, eventually leading the company’s High-Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High-Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his ebook Practical MySQL Performance Optimization is one of Percona’s most popular downloads.

 

The post Upcoming Webinar Tuesday, 7/31: Using MySQL for Distributed Database Architectures appeared first on Percona Database Performance Blog.

MySQL to Amazon Redshift Replication.


In our work, we often receive requirements for replicating data from one data source to another. Our team has provided solutions to replicate data from MySQL to Vertica, Amazon Redshift and Hadoop. Of these, Amazon Redshift replication is a bit complicated, as Amazon Redshift is a Database as a Service (DBaaS) and the process is not straightforward.

So, I will take this opportunity to show how to replicate a specific set of tables from MySQL to AWS Redshift using Tungsten Replicator.

1.0. Tungsten Replicator:

Tungsten Replicator is an open source replication engine that supports data extraction from MySQL and MySQL variants such as RDS, Percona Server and MariaDB, as well as Oracle, and allows the extracted data to be applied to other data sources such as Vertica, Cassandra and Redshift.

Tungsten Replicator includes support for parallel replication, and advanced topologies such as fan-in and multi-master, and can be used efficiently in cross-site deployments.

1.1.0. General Architecture:


There are three major components in Tungsten Replicator:
1. Extractor / Master Service
2. Transaction History Log (THL)
3. Applier / Slave Service

1.1.1. Extractor / Master Service:

The extractor component reads data from MySQL’s binary log and writes that information into the Transaction History Log (THL).

1.1.2. Transaction History Log (THL):

The Transaction History Log (THL) acts as a translator between two different data sources. It stores transactional data from different data servers in a universal format via the replicator service acting as a master; it can then be processed by the Applier / Slave service.

1.1.3. Applier / Slave Service:

All the raw row data recorded in the THL is re-assembled into another format such as JSON, BSON or external CSV, which enables the data to be loaded in bulk batches into a variety of targets.

Statement-based events are not supported for heterogeneous deployments, so it is mandatory that the binary log format on MySQL is ROW.
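To see why ROW images matter here, consider a toy conversion of row events into a CSV batch (only an illustration of the idea, not Tungsten's actual THL or CSV layout): a statement event such as `UPDATE emp SET city = ...` carries no row values to serialize, while a row event carries the full row image:

```python
import csv
import io

# Toy illustration (not Tungsten's actual CSV layout): row-based events
# carry full row images, so they can be mechanically serialized into a
# CSV batch for a bulk loader such as Redshift's COPY.
row_events = [
    {"op": "I", "seqno": 1, "row": {"no": 1, "city": "Bangalore", "state": "KA"}},
    {"op": "I", "seqno": 2, "row": {"no": 2, "city": "Chennai",   "state": "TN"}},
]

buf = io.StringIO()
writer = csv.writer(buf)
for ev in row_events:
    r = ev["row"]
    writer.writerow([ev["op"], ev["seqno"], r["no"], r["city"], r["state"]])

csv_batch = buf.getvalue()
print(csv_batch)
```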

1.2.0. Prerequisites:

1.2.1. Server Packages:

  • JDK 7 or higher
  • Ant 1.8 or higher
  • Ruby
  • Net-SSH

1.2.2. MySQL:

  • All the tables to be replicated must have a primary key.
  • The following MySQL configuration should be set:
    binlog-format         = row
    binlog-row-image      = full
    collation-server      = utf8_general_ci
    character-set-server  = utf8

1.2.3. Redshift:

  • The Redshift database name and schema name should be the same as the MySQL database name of the tables to be replicated.

1.2.4. S3 Bucket:

  • Read & write access to an AWS S3 bucket (an access key and secret key are required).

2.0. Requirement:

  • Consider that the servers with the below details are used for this demo.

AWS EC2 MySQL Server  – 172.19.12.234
AWS Redshift                     – 172.19.12.116 (Database as a Service)
AWS S3 bucket                  – s3://mydbops-migration

As Redshift is a database as a service, we just have an endpoint to connect to. Therefore we will be installing both the Tungsten Master and Slave services on the MySQL server itself.

  • We need to replicate the tables emp and emp_records from the new_year database on the MySQL server to Redshift. The structures of the tables are given below.
CREATE TABLE `emp` (
`no` int(11) NOT NULL,
`city` varchar(50) DEFAULT NULL,
`state` varchar(50) DEFAULT NULL,
PRIMARY KEY (`no`)
) ENGINE=InnoDB;

CREATE TABLE `emp_records` (
`no` int(11) NOT NULL,
`name` varchar(50) DEFAULT NULL,
`address` varchar(50) DEFAULT NULL,
PRIMARY KEY (`no`)
) ENGINE=InnoDB;

 

3.0. Implementation:

The implementation consists of following steps.

  1. Installation / Building tungsten from source
  2. Preparing equivalent schema for Redshift
  3. Configuring Master service
  4. Configuring Slave service
  5. Generating worker tables (temp tables used by tungsten) for replication to be created on redshift
  6. Start the replication


3.1. Installation / Building From Source:

  • Download the source package from the GIT.
#git clone https://github.com/continuent/tungsten-replicator.git
  • Compile the package; it will generate the tungsten-replicator tar file.
#sh tungsten-replicator/builder/build.sh

#mkdir -p tungsten
  • Once the tar file is generated, extract it into the tungsten folder created above.
#tar --strip-components 1 -zxvf tungsten-replicator/builder/build/tungsten-replicator-5.2.1.tar.gz -C tungsten/
  • Now that we have the Tungsten binaries, clean up the source packages unless they are required.
#rm -rf tungsten-replicator

 

3.2. Preparing equivalent schema for Redshift:

  • Create database new_year on Redshift.
dev=# create database new_year;
CREATE DATABASE
  • The new database is created; now we will create the tables in a new schema.
  • Before creating the schema, first switch to the new_year database.
dev=# \c new_year
psql (9.2.24, server 8.0.2)
  • Then create the tables in the new_year schema.
new_year=# create table new_year.emp(no int primary key, city varchar(50),state varchar(50));
CREATE TABLE

new_year=# create table new_year.emp_records(no int primary key, name varchar(50),address varchar(50) );
CREATE TABLE

Note:

  • If you do not mention the schema name while creating the table, it will be created inside the public schema.
  • To check that the tables were created inside the correct new_year schema:
new_year=# \dt new_year.*;
List of relations
schema    | name         | type  |   owner
----------+--------------+-------+-----------
new_year  | emp          | table | redshift-usr
new_year  | emp_records  | table | redshift-usr
(2 rows)

 

3.3. Configuring Master Service:

  • Create a replication user on MySQL with Replication Slave privilege to stream binlog from MySQL to Tungsten Master service.
mysql> grant replication slave on *.* to 'tungsten'@'localhost' identified by 'secret';
  • Switch to tungsten directory and Reset the defaults configuration file.
#cd ~/tungsten
#./tools/tpm configure defaults --reset
  • Configure the Master service in the directory of your choice; we have used /opt/master
  • Following commands will prepare the configuration file for Master service.
#./tools/tpm configure master \
--install-directory=/opt/master \
--enable-heterogeneous-service=true \
--members=mysql-db-master \
--master=mysql-db-master
#./tools/tpm configure master --hosts=mysql-db-master \
--replication-user=tungsten \
--replication-password=tungsten \
--skip-validation-check=MySQLUnsupportedDataTypesCheck \
--property=replicator.filter.pkey.addColumnsToDeletes=true \
--property=replicator.filter.pkey.addPkeyToInserts=true
  • Once the configuration is prepared, we can install it using tpm.
#./tools/tpm install

Configuration is now complete.  For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE  >> Command successfully completed
  • Now Master service will be configured under /opt/master/
  • Start the tungsten Master service.
#/opt/master/tungsten/cluster-home/bin/startall

Starting Tungsten Replicator Service...
Waiting for Tungsten Replicator Service.
running: PID:22291
  • Verify it’s working by checking the master status.
#/opt/master/tungsten/tungsten-replicator/bin/trepctl services

Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 1.667
role            : master
serviceName     : master
serviceType     : local
started         : true
state           : ONLINE
Finished services command...
#/opt/master/tungsten/tungsten-replicator/bin/trepctl status

Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000011:0000000000000510;-1
appliedLastSeqno       : 0
appliedLatency         : 1.667
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : master
currentEventId         : mysql-bin.000011:0000000000000510
currentTimeMillis      : 1525355498784
dataServerHost         : mysql-db-master
extensions             : 
host                   : mysql-db-master
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://mysql-db-master:2112/
maximumStoredSeqNo     : 0
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : jdbc:mysql:thin://mysql-db-master:3306/tungsten_master?noPrepStmtCache=true
relativeLatency        : 21.784
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : master
serviceType            : local
simpleServiceName      : master
siteName               : default
sourceId               : mysql-db-master
state                  : ONLINE
timeInStateSeconds     : 21.219
timezone               : GMT
transitioningTo        : 
uptimeSeconds          : 21.741
useSSLConnection       : false
version                : Tungsten Replicator 5.2.1
Finished status command...
  • If the master does not start properly, refer to the error log (/opt/master/service_logs/trepsvc.log).

3.4. Configuring Slave service:

  • Switch to tungsten directory and Reset the defaults configuration file.
#cd ~/tungsten
#./tools/tpm configure defaults --reset
  • Create a JSON file named s3-config-slave.json in the format below, filling in your AWS S3 bucket details (access key, secret key, S3 bucket path).
{
"awsS3Path" : "s3://mydbops-migration",
"awsAccessKey" : "XXXXXX",
"awsSecretKey" : "YYYYYYY",
"gzipS3Files" : "false",
"cleanUpS3Files" : "true"
}
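Since a malformed JSON file only surfaces later during install, it can be worth sanity-checking s3-config-slave.json first. A minimal sketch (the required keys are the ones shown above; the credential values are placeholders):

```python
import json

# Placeholder contents mirroring the s3-config-slave.json shown above.
config_text = """
{
  "awsS3Path"     : "s3://mydbops-migration",
  "awsAccessKey"  : "XXXXXX",
  "awsSecretKey"  : "YYYYYYY",
  "gzipS3Files"   : "false",
  "cleanUpS3Files": "true"
}
"""

config = json.loads(config_text)          # raises ValueError on malformed JSON
required = {"awsS3Path", "awsAccessKey", "awsSecretKey"}
missing = required - config.keys()
assert not missing, f"missing keys: {missing}"
assert config["awsS3Path"].startswith("s3://"), "awsS3Path must be an s3:// URI"
print("s3-config-slave.json looks sane")
```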
  • Configure the Slave service in the directory of your choice; we have used /opt/slave
  • Following commands will prepare the configuration file for Slave service.
#./tools/tpm configure slave \
--install-directory=/opt/slave \
--enable-heterogeneous-service=true \
--members=mysql-db-master
  • Add the replication filter to replicate only those two tables, and use the Redshift host, user and password to configure the slave service.
#./tools/tpm configure slave --hosts=mysql-db-master \
--replication-host=172.19.12.116 \
--replication-user=redshift-usr \
--replication-password='redshift-pass' --datasource-type=redshift \
--batch-enabled=true \
--batch-load-template=redshift \
--redshift-dbname=new_year \
--svc-applier-filters=dropstatementdata,replicate \
--property=replicator.filter.replicate.do=new_year.emp,new_year.emp_records \
--svc-applier-block-commit-interval=10s \
--svc-applier-block-commit-size=5 \
--rmi-port=10002 \
--thl-port=2113 \
--master-thl-port=2112 \
--master-thl-host=mysql-db-master
  • Once the configuration is prepared, we can install it using tpm.
#./tools/tpm install
Configuration is now complete.  For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE  >> Command successfully completed
  • Once the installation completes, copy the s3-config-slave.json file to the slave's share directory.
#cp s3-config-slave.json /opt/slave/share/
  • Now the slave is configured. Before starting it, we need to create the worker/stage tables used by Tungsten to replicate data into Redshift.

3.5. Generating Worker / Stage tables To Be Created On Redshift:

  • Tungsten provides a utility named ddlscan to generate the worker/stage tables required for replication to work.
#/opt/slave/tungsten/tungsten-replicator/bin/ddlscan -db new_year -template ddl-mysql-redshift-staging.vm > staging_ddl
  • Apply the schema generated from the above operation on Redshift.
  • Now we have the worker/stage tables created on Redshift.
new_year=# \dt new_year.*;

List of relations
  schema  |            name            | type  |    owner    
----------+----------------------------+-------+-------------
 new_year | emp                        | table | redshift-usr
 new_year | emp_records                | table | redshift-usr
 new_year | stage_xxx_emp              | table | redshift-usr
 new_year | stage_xxx_emp_pkey         | table | redshift-usr
 new_year | stage_xxx_emp_records      | table | redshift-usr
 new_year | stage_xxx_emp_records_pkey | table | redshift-usr

(6 rows)

 

3.6. Starting Replication:

  • Once the slave is configured and the stage tables are created in Redshift, start the slave:
#/opt/slave/tungsten/cluster-home/bin/startall

Starting Tungsten Replicator Service...
Waiting for Tungsten Replicator Service.
running: PID:23968
  • Verify it’s working by checking the slave status.
#/opt/slave/tungsten/tungsten-replicator/bin/trepctl services

NAME              VALUE
----              -----
appliedLastSeqno: -1
appliedLatency  : -1.0
role            : slave
serviceName     : slave
serviceType     : local
started         : true
state           : ONLINE
Finished services command...
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status

Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000011:0000000000000510;-1
appliedLastSeqno       : 0
appliedLatency         : 251.018
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : slave
currentEventId         : NONE
currentTimeMillis      : 1525355728202
dataServerHost         : 172.19.12.116
extensions             : 
host                   : 172.19.12.116
latestEpochNumber      : 0
masterConnectUri       : thl://mysql-db-master:2112/
masterListenUri        : null
maximumStoredSeqNo     : 0
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://mysql-db-master:2112/
relativeLatency        : 251.202
resourcePrecedence     : 99
rmiPort                : 10002
role                   : slave
seqnoType              : java.lang.Long
serviceName            : slave
serviceType            : local
simpleServiceName      : slave
siteName               : default
sourceId               : 172.19.12.116
state                  : ONLINE
timeInStateSeconds     : 12.558
timezone               : GMT
transitioningTo        : 
uptimeSeconds          : 24.407
useSSLConnection       : false
version                : Tungsten Replicator 5.2.1
Finished status command...
  • If the slave did not start properly, refer to the error log at /opt/slave/service_logs/trepsvc.log.

4.0. Testing:

  • Now both the master and slave are in sync. Let's insert a few records into the emp and emp_records tables on the MySQL server.
insert into emp values(1,'chennai','tamilnadu');
insert into emp values (2,'Banglore','Karnataka');
insert into emp_records values(1,'suresh','Noth car street');
insert into emp_records values(2,'John','South car street');
  • These records were inserted on the master server. At the same time, I checked Redshift to verify that they were replicated.
new_year=# select * from new_year.emp;

no  | city      | state
----+-----------+----------
1   | chennai   | tamilnadu
2   | Banglore  | Karnataka

(2 rows)
new_year=# select * from new_year.emp_records;

no | name   | address
----+----------+---------
1 | suresh   | Noth car street
2 | John     | South car street

(2 rows)

 

5.0. Troubleshooting:

Replication can break due to incompatible data types. In such scenarios, we should analyze the issue, fix the data type, and resume replication.

Sample Error :

# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status
Processing status command...

NAME                     VALUE
----                     -----
appliedLastEventId     : NONE
appliedLastSeqno       : -1
appliedLatency         : -1.0
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : -1
clusterName            : slave
currentEventId         : NONE
currentTimeMillis      : 1526577299571
dataServerHost         : 172.25.12.119
extensions             : 
host                   : 172.25.12.119
latestEpochNumber      : -1
masterConnectUri       : thl://mysql-db-master:2112/
masterListenUri        : null
maximumStoredSeqNo     : -1
minimumStoredSeqNo     : -1
offlineRequests        : NONE
pendingError           : Stage task failed: stage=q-to-dbms seqno=75 fragno=0
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000027:0000000000072461;-1
pendingErrorSeqno      : 75
pendingExceptionMessage: CSV loading failed: schema=new table=doc_content CSV file=/opt/slave/tmp/staging/slave/staging0/yp-yp_user_doc_content-69.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Value too long for character type

                           Detail: 

                           -----------------------------------------------
                           error:  Value too long for character type
                           code:      8001
                           context:   Value too long for type character varying(256)
                           query:     1566568
                           location:  funcs_string.hpp:395
                           process:   query0_75_1566568 [pid=10475]
                           -----------------------------------------------
                          (/opt/slave/tungsten/tungsten-replicator/appliers/batch/redshift.js#256)
pipelineSource         : UNKNOWN
relativeLatency        : -1.0

This error explains that a value is too long for the character data type of the doc_content table in the new database on Redshift.

  • In MySQL, the table doc_content has a column “content” with the TEXT data type.
  • In Redshift, content was also created as a TEXT column.
  • Here's the catch: in Redshift, the TEXT data type is equivalent to varchar(256).
  • So writing anything longer than 256 bytes in MySQL will break replication.
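The 256-byte limit above suggests a simple pre-flight check before enabling replication: scan the source column for values that would overflow Redshift's varchar(256). A minimal Python sketch (the values would in practice come from a SELECT against the MySQL table; names here are illustrative):

```python
def overlong_values(values, limit=256):
    """Return (index, byte_length) for each value longer than `limit` bytes.

    Redshift measures varchar length in bytes, so we check the UTF-8
    encoded size rather than the character count.
    """
    return [
        (i, len(v.encode("utf-8")))
        for i, v in enumerate(values)
        if len(v.encode("utf-8")) > limit
    ]

# Example: the second value would break replication into varchar(256).
print(overlong_values(["short text", "x" * 300]))
```

Running such a check before the initial load makes it possible to widen the target columns up front instead of repairing a broken pipeline later.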

So the solution is to increase the varchar length from 256 to a larger size, such as varchar(2000). In Redshift, however, changing the data type in place will not work:

yp=# alter table new.doc_content ALTER COLUMN content TYPE varchar(2000);
ERROR:  ALTER COLUMN TYPE is not supported
  • We can’t increase the column size in Redshift without recreating the table.
  • The alternate solution is to add a new column with the required changes and move the data and then the old column can be dropped.
ALTER TABLE yp.yp_user_doc_content ADD COLUMN content_new VARCHAR(2000);
UPDATE yp.yp_user_doc_content SET content_new = content;
ALTER TABLE yp.yp_user_doc_content DROP COLUMN content;
ALTER TABLE yp.yp_user_doc_content RENAME COLUMN content_new TO content;
  • Now we’re good to restart the replication again.

6.0. Conclusion:

Tungsten Replicator is a great tool for replicating data between heterogeneous data sources. Once we understand how it works, it is easy to configure and operate.


A beginner’s guide to database multitenancy


Introduction In software terminology, multitenancy is an architectural pattern which allows you to isolate customers even if they are using the same hardware or software components. Multitenancy has become even more attractive with the widespread adoption of cloud computing. A relational database system provides a hierarchy structure of objects which, typically, looks like this: catalog … Continue reading A beginner’s guide to database multitenancy

The post A beginner’s guide to database multitenancy appeared first on Vlad Mihalcea.

Percona Monitoring and Management 1.13.0 Is Now Available


PMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL and MongoDB performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

The most significant feature in this release is Prometheus 2, however we also packed a lot of visual changes into release 1.13:

  • Prometheus 2 – Consumes less resources, and Dashboards load faster!
  • New Dashboard: Network Overview – New dashboard for all things IPv4!
  • New Dashboard: NUMA Overview – New Dashboard! Understand memory allocation across DIMMs
  • Snapshots and Updates Improvements – Clearer instructions for snapshot sharing, add ability to disable update reporting
  • System Overview Dashboard improvements – See high level summary, plus drill in on CPU, Memory, Disk, and Network
  • Improved SingleStat for percentages – Trend line now reflects percentage value

We addressed 13 new features and improvements, and fixed 13 bugs.

Prometheus 2

The long awaited Prometheus 2 release is here!  Percona’s internal testing has shown that by upgrading to PMM release 1.13 you will achieve a 3x-10x reduction in CPU usage, which translates into PMM Server being able to handle more instances than it could in 1.12.  You won’t see any gaps in graphs since internally PMM Server will run two instances of Prometheus and leverage remote_read in order to provide consistent graphs!

Our Engineering teams have worked very hard to make this upgrade as transparent as possible – hats off to them for their efforts!!

Lastly on Prometheus 2, we also included a new set of graphs to the Prometheus Dashboard to help you better understand when your PMM Server may run out of space. We hope you find this useful!

Network Overview Dashboard

We’re introducing a new dashboard that focuses on all things Networking – we placed a Last Hour panel highlighting high-level network metrics, and then drill into Network Traffic + Details, then focus on TCP, UDP, and ICMP behavior.

Snapshots and Updates Improvements

Of most interest to current Percona customers, we’ve clarified the instructions on how to take a snapshot of a Dashboard in order to highlight that you are securely sharing with Percona. We’ve also increased the sharing timeout to 30 seconds (up from 4 seconds) so that we more reliably share useful data with Percona Support Engineers, as the shorter timeout led to incomplete graphs being shared.

Packed into this feature is also a change to how we report installed version, latest version, and what’s new information:

Lastly, we modified the behavior of the docker environment option DISABLE_UPDATES to remove the Update button.  As a reminder, you can choose to disable update reporting for environments where you want tighter control over (i.e. lock down) who can initiate an update by launching the PMM docker container along with the environment variable as follows:

docker run ... -e DISABLE_UPDATES=TRUE

System Overview Dashboard Improvements

We’ve updated our System Overview Dashboard to focus on the four criteria of CPU, Memory, Disk, and Network, while also presenting a single panel row of high level information (uptime, count of CPUs, load average, etc)

The last feature we’re introducing in 1.13 is a fix to SingleStat panels where the percentage value is reflected in the level of the trend line in the background.  For example, if you have a stat panel at 20% and 86%, the line in the background should fill the respective amount of the box:

Improved SingleStat for percentages

New Features & Improvements

  • PMM-2225 – Add new Dashboard: Network Overview
  • PMM-2485 – Improve Singlestat for percentage values to accurately display trend line
  • PMM-2550 – Update to Prometheus 2
  • PMM-1667 – New Dashboard: NUMA Overview
  • PMM-1930 – Reduce Durability for MySQL
  • PMM-2291 – Add Prometheus Disk Space Utilization Information
  • PMM-2444 – Increase space for legends
  • PMM-2594 – Upgrade to Percona Toolkit 3.0.10
  • PMM-2610 – Configure Snapshot Timeout Default Higher and Update Instructions
  • PMM-2637 – Check for Updates and Disable Updates Improvements
  • PMM-2652 – Fix “Unexpected error” on Home dashboard after upgrade
  • PMM-2661 – Data resolution on Dashboards became 15sec min instead of 1sec
  • PMM-2663 – System Overview Dashboard Improvements

Bug Fixes

  • PMM-1977 – after upgrade pmm-client (1.6.1-1) can’t start mysql:metrics – can’t find .my.cnf
  • PMM-2379 – Invert colours for Memory Available graph
  • PMM-2413 – Charts on MySQL InnoDB metrics are not fully displayed
  • PMM-2427 – Information loss in CPU Graph with Grafana 5 upgrade
  • PMM-2476 – AWS PMM is broken on C5/M5 instances
  • PMM-2576 – Error in logs for MySQL 8 instance on CentOS
  • PMM-2612 – Wrong information in PMM Scrapes Task
  • PMM-2639 – mysql:metrics does not work on Ubuntu 18.04
  • PMM-2643 – Socket detection and MySQL 8
  • PMM-2698 – Misleading Graphs for Rare Events
  • PMM-2701 – MySQL 8 – Innodb Checkpoint Age
  • PMM-2722 – Memory auto-configuration for Prometheus evaluates to minimum of 128MB in entrypoint.sh

How to get PMM Server

PMM is available for installation using three methods:

The post Percona Monitoring and Management 1.13.0 Is Now Available appeared first on Percona Database Performance Blog.

This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit!


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Don’t wait, submit a talk for Percona Live Europe 2018 to be held in Frankfurt 5-7 November 2018. The call for proposals is ending soon, there is a committee being created, and it is a great conference to speak at, with a new city to boot!

Releases

  • A big release, MySQL 8.0.12, with INSTANT ADD COLUMN support, BLOB optimisations, changes around replication, the query rewrite plugin and lots more. Naturally this also means the connectors get bumped up to 8.0.12, including a nice new MySQL Shell.
  • A maintenance release, with security fixes, MySQL 5.5.61 as well as MariaDB 5.5.61.
  • repmgr v4.1 helps monitor PostgreSQL replication, and can handle switchovers and failovers.

Link List

  • Saving With MyRocks in The Cloud – a great MyRocks use case, as in the cloud, resources are major considerations and you can save on I/O with MyRocks. As long as your workload is I/O bound, you’re bound to benefit.
  • Hasura GraphQL Engine allows you to get an instant GraphQL API on any PostgreSQL based application. This is in addition to Graphile. For MySQL users, there is Prisma.

Industry Updates

  • Jeremy Cole (Linkedin) ended his sabbatical to start work at Shopify. He was previously hacking on MySQL and MariaDB Server at Google, and had stints at Twitter, Yahoo!, his co-owned firm Proven Scaling, as well as MySQL AB.
  • Dremio raises $30 million from the likes of Cisco and more for their Series B. They are a “data-as-a-service” company, having raised a total of $45m in two rounds (Crunchbase).

Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

 

The post This Week in Data with Colin Charles 47: MySQL 8.0.12 and It’s Time To Submit! appeared first on Percona Database Performance Blog.

Default configuration benchmarks

Default configuration benchmarks are an interesting problem. Most storage engines require some configuration tuning to get good performance and efficiency. We configure an engine to do the right thing for the expected workload and hardware. Unfortunately the configuration is done in the language of the engine (innodb_write_io_threads, rocksdb_default_cf_options) which requires a significant amount of time to understand.

Hardware comes in many sizes and engines frequently don't have code to figure out the size -- how many CPUs, how much RAM, how many GB of storage, how many IOPs from storage. Even when that code exists the engine might not be able to use everything it finds:
  • HW can be shared and the engine is only allowed a fraction of it. 
  • It might be running on a VM that gets more CPU when other VMs on the host are idle.
  • SSDs get slower when more full. It can take a long time to reach that state.

Minimal configuration

I assume there is a market for storage engines that have better performance with the default configuration, but it will take time to get there. A step in the right direction is to enhance engines to get great performance and efficiency with minimal configuration (minimal != default). I am still figuring out what minimal means. I prefer to use the language of the engine user (HW capacity and performance/efficiency goals) rather than the language of the engine. I'd rather not set engine-specific options, even easy to understand ones like innodb_buffer_pool_size. I want the engine to figure out its configuration given the minimal tuning. For now I have two levels for minimal:
  • HW-only - tell the engine how much HW it can use -- number of CPU cores, GB of RAM, storage capacity and IOPs. Optionally you can ask it to use all that it finds.
  • HW + goals - in addition to HW-only this supports goals for read, write, space and cache amplification. For now I will be vague about the goals. 
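As a purely illustrative sketch of the HW-only level (no real engine exposes this API; all names and ratios are assumptions), the engine would derive its own settings from a hardware description:

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    cpu_cores: int
    ram_gb: int
    storage_gb: int
    storage_iops: int

def derive_settings(hw: Hardware) -> dict:
    """Hypothetical engine-side derivation from an HW-only description."""
    return {
        # Give most of the RAM to the cache, leaving headroom for the OS.
        "cache_gb": max(1, int(hw.ram_gb * 0.75)),
        # Scale background I/O threads with the available cores.
        "io_threads": max(2, hw.cpu_cores // 4),
        # Reserve half of the device's IOPS budget for background work
        # (compaction, checkpointing), leaving the rest for user queries.
        "background_iops": hw.storage_iops // 2,
    }

print(derive_settings(Hardware(cpu_cores=32, ram_gb=128,
                               storage_gb=2000, storage_iops=100_000)))
```

The point is that the user speaks in HW capacity, and the engine-specific option names stay an internal detail.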

Things change

Another part of the configuration challenge is that database workloads change while configurations tend to be static. I prefer that the engine does the right thing, while respecting the advice provided via minimal configuration. I want the engine to adapt to the current workload without ruining performance for the future workload. Adapting by deferring index maintenance can make loads faster, but might hurt the queries that follow.

Types of change include:
  • The working set no longer fits in memory and the workload shifts from CPU to IO bound.
  • Daily maintenance (vacuum, reorg, defrag, DDL, reporting) runs during off-peak hours.
  • Web-scale workloads have daily peak cycles as people wake and sleep.
  • New features get popular, old features get deprecated. Their tables and indexes arrive, grow large, become read-only, get dropped and more. Some deprecated features get un-deprecated.
  • Access patterns to data changes. Rows might be write once, N times or forever and write once/N rows eventually become read-only. Rows might be read never, once, a few-times or forever.
  • Different types of data (see previous point) can live within the same index. Even if you were willing to tune per-index (some of us are) this isn't sufficient when there is workload diversity within an index.
Real workloads include the types of change listed above but benchmarks rarely include them. Any benchmark that includes such change is likely to need more than 24-hours to run which will limit its popularity -- but maybe that isn't a bad thing. I hope we see a few new benchmarks that include such types of change. I might even try to write one.
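For instance, a benchmark driver could model the daily peak cycle mentioned above with a simple sinusoidal QPS target (an illustrative sketch, not an existing tool):

```python
import math

def target_qps(hour: float, base: float = 2000.0, amplitude: float = 1500.0) -> float:
    """QPS target over a 24-hour cycle: trough at midnight, peak at noon."""
    # -cos places the minimum at hour 0 and the maximum at hour 12.
    return base - amplitude * math.cos(2 * math.pi * hour / 24)

print(round(target_qps(0)))   # midnight trough: 500
print(round(target_qps(12)))  # noon peak: 3500
```

A driver that sweeps this curve over a full simulated day would exercise both the load-shedding and the catch-up behavior of an engine, which a constant-rate benchmark never does.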

Resource Usage Improvements in Percona Monitoring and Management 1.13


In Percona Monitoring and Management (PMM) 1.13 we have adopted Prometheus 2, and with this comes a dramatic improvement in resource usage, along with performance improvements!

What does it mean for you? This means you can have a significantly larger number of servers and database instances monitored by the same PMM installation. Or you can reduce the instance size you use to monitor your environment and save some money.

Let’s look at some stats!

CPU Usage

PMM 1.13 reduction in CPU usage by 5x

Percona Monitoring and Management 1.13 reduction in CPU usage after adopting Prometheus 2 by 8x

We can see an approximate 5x and 8x reduction of CPU usage on these two PMM Servers. Depending on the workload, we see CPU usage reductions to range between 3x and 10x.

Disk Writes

There is also less disk write bandwidth required:

PMM 1.13 reduction in disk write bandwidth

On this instance, the bandwidth reduction is “just” 1.5x. Note this is disk IO for the entire PMM system, which includes more than only the Prometheus component. Prometheus 2 itself promises a much more significant IO bandwidth reduction, according to official benchmarks.

According to the same benchmark, you should expect disk space usage reduction by 33-50% for Prometheus 2 vs Prometheus 1.8. The numbers will be less for Percona Monitoring and Management, as it also stores Query Statistics outside of Prometheus.

Resource usage on the monitored hosts

Also, resource usage on the monitored hosts is significantly reduced:

Percona Monitoring and Management 1.13 reduction of resource usage by Prometheus 2

Why does CPU usage go down on a monitored host with a Prometheus 2 upgrade? This is because PMM uses TLS for the Prometheus to monitored host communication. Before Prometheus 2, a full handshake was performed for every scrape, taking a lot of CPU time. This was optimized with Prometheus 2, resulting in a dramatic CPU usage decrease.

Query performance is also much better with Prometheus 2, meaning dashboards load visibly faster, though we did not run specific benchmarks here to share hard numbers. Note that this improvement only applies when you are querying data stored in Prometheus 2.

If you’re querying data that was originally stored in Prometheus 1.8, it is served through the “Remote Read” interface, which is quite a bit slower and uses considerably more CPU and memory resources.

If you love better efficiency and performance, consider upgrading to PMM 1.13!

The post Resource Usage Improvements in Percona Monitoring and Management 1.13 appeared first on Percona Database Performance Blog.

Monitoring NDBCluster Copying Alter Progress


MySQL NDB Cluster has great support for online (inplace) schema changes, but it is still sometimes necessary to perform an offline (copying) ALTER TABLE. These are relatively expensive to make as the entire table is copied into a new table which eventually replace the old table.

One example where a copying ALTER TABLE is required is when upgrading from MySQL NDB Cluster 7.2 or earlier to MySQL NDB Cluster 7.3 or later. The format used for temporal columns changed between these version (corresponding to MySQL Server 5.5 to 5.6). In order to take advantage of the new temporal format, a table rebuild is required.

Note: Support for the old temporal format has been removed in MySQL 8.0. So, you must upgrade your tables before an upgrade is possible. There is at the time of writing no MySQL NDB Cluster releases based on MySQL Server 8.0.
Schematic representation of a copying ALTER TABLE

For long running operations, it can be useful to monitor the progress. There is no built-in way to do this like there is for InnoDB in MySQL 5.7 and later (I promise, I will soon write a blog about that), however the ndbinfo schema can give some information about the progress.

The ndbinfo schema is a virtual schema with views that show information from the data nodes. You can argue it is MySQL NDB Cluster’s answer to the Performance Schema. The ndbinfo schema was introduced in MySQL NDB Cluster 7.1 more than eight years ago and has steadily seen more and more information becoming available.

One of these changes arrived in MySQL NDB Cluster 7.4 where the memory_per_fragment view was added. This view shows detailed information about the memory used per fragment (in most cases the same as partitions). This can also be used to get an estimate of the progress of a copying ALTER TABLE.

As mentioned, a copying ALTER TABLE is similar to creating a new table with the new schema (which may potentially be the same as the old schema), then inserting all of the data from the old table into the new one. At the end, the two tables are swapped and the old table is dropped.

Note: Remember that a copying ALTER TABLE is an offline operation. Any changes made to the table during the operation may be lost! Make sure the table is read-only while the ALTER TABLE is executing.

The temporary table (that later become the real table) is an NDBCluster table like other user created tables. This means the table will show up in ndbinfo.memory_per_fragment as a normal table, just with a special table name.

Temporary tables are named like #sql-7f4b_4 where the part after the – is generated based on the operating system process ID of the mysqld process and the connection id of the connection executing the ALTER TABLE. The schema name for the temporary table is the same as for the original table. In the example the process ID is 32587 or 7f4b in hexadecimal notation and the connection ID is 4.

As an example consider a rebuild of the db1.t1 table. In this case the fully qualified name (the name used by NDB Cluster instead of the normal table name) is db1/def/t1, i.e. the schema name and table name with /def/ between them. You can choose to create the fully qualified name for the temporary table as described above. An alternative, if you just have one concurrent table rebuild in the schema is to just look for the fully qualified name matching db1/def/#sql-%.
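The fully qualified name of the temporary table can be constructed programmatically. A small Python sketch (the mysqld PID would come from the operating system and the connection ID from SHOW PROCESSLIST):

```python
def temp_table_fq_name(schema: str, mysqld_pid: int, connection_id: int) -> str:
    """Build the fully qualified NDB name of an ALTER TABLE's temporary table."""
    # The suffix is the mysqld process ID in lowercase hexadecimal,
    # an underscore, and the connection ID running the ALTER TABLE.
    return f"{schema}/def/#sql-{mysqld_pid:x}_{connection_id}"

# Matches the example in the text: PID 32587 is 0x7f4b, connection ID is 4.
print(temp_table_fq_name("db1", 32587, 4))  # db1/def/#sql-7f4b_4
```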

So, you can use the ndbinfo.memory_per_fragment table to see how much memory is allocated per fragment of the temporary table compared to the original table. For example:

mysql> SELECT fq_name, parent_fq_name, type, table_id,
              (fixed_elem_alloc_bytes-fixed_elem_free_bytes) AS FixedBytes,
              (var_elem_alloc_bytes-var_elem_free_bytes) AS VarBytes,
              hash_index_alloc_bytes
         FROM ndbinfo.memory_per_fragment
        WHERE fq_name = 'db1/def/t1' OR fq_name LIKE 'db1/def/#sql-%'
              OR parent_fq_name = 'db1/def/t1' OR parent_fq_name LIKE 'db1/def/#sql-%';
+------------------------+---------------------+-------------------+----------+------------+----------+------------------------+
| fq_name                | parent_fq_name      | type              | table_id | FixedBytes | VarBytes | hash_index_alloc_bytes |
+------------------------+---------------------+-------------------+----------+------------+----------+------------------------+
| db1/def/NDB$BLOB_45_3  | db1/def/t1          | User table        |       46 |     100580 |  1038088 |                  40960 |
| db1/def/NDB$BLOB_45_3  | db1/def/t1          | User table        |       46 |      99320 |  1056380 |                  40960 |
| db1/def/NDB$BLOB_45_3  | db1/def/t1          | User table        |       46 |     100580 |  1038088 |                  40960 |
| db1/def/NDB$BLOB_45_3  | db1/def/t1          | User table        |       46 |      99320 |  1056380 |                  40960 |
| sys/def/45/val1$unique | db1/def/t1          | Unique hash index |       49 |      77640 |        0 |                  40960 |
| sys/def/45/val1$unique | db1/def/t1          | Unique hash index |       49 |      76184 |        0 |                  40960 |
| sys/def/45/val1$unique | db1/def/t1          | Unique hash index |       49 |      77640 |        0 |                  40960 |
| sys/def/45/val1$unique | db1/def/t1          | Unique hash index |       49 |      76184 |        0 |                  40960 |
| sys/def/45/val1        | db1/def/t1          | Ordered index     |       48 |      39424 |        0 |                      0 |
| sys/def/45/val1        | db1/def/t1          | Ordered index     |       48 |      37792 |        0 |                      0 |
| sys/def/45/val1        | db1/def/t1          | Ordered index     |       48 |      39424 |        0 |                      0 |
| sys/def/45/val1        | db1/def/t1          | Ordered index     |       48 |      37792 |        0 |                      0 |
| sys/def/45/PRIMARY     | db1/def/t1          | Ordered index     |       47 |      39424 |        0 |                      0 |
| sys/def/45/PRIMARY     | db1/def/t1          | Ordered index     |       47 |      37792 |        0 |                      0 |
| sys/def/45/PRIMARY     | db1/def/t1          | Ordered index     |       47 |      39424 |        0 |                      0 |
| sys/def/45/PRIMARY     | db1/def/t1          | Ordered index     |       47 |      37792 |        0 |                      0 |
| db1/def/NDB$BLOB_14_3  | db1/def/#sql-7f4b_4 | User table        |       15 |      43180 |   446148 |                  24576 |
| db1/def/NDB$BLOB_14_3  | db1/def/#sql-7f4b_4 | User table        |       15 |      44404 |   471920 |                  24576 |
| db1/def/NDB$BLOB_14_3  | db1/def/#sql-7f4b_4 | User table        |       15 |      43360 |   450184 |                  24576 |
| db1/def/NDB$BLOB_14_3  | db1/def/#sql-7f4b_4 | User table        |       15 |      44404 |   471920 |                  24576 |
| sys/def/14/val1$unique | db1/def/#sql-7f4b_4 | Unique hash index |       44 |      33448 |        0 |                  24576 |
| sys/def/14/val1$unique | db1/def/#sql-7f4b_4 | Unique hash index |       44 |      34176 |        0 |                  24576 |
| sys/def/14/val1$unique | db1/def/#sql-7f4b_4 | Unique hash index |       44 |      33532 |        0 |                  24576 |
| sys/def/14/val1$unique | db1/def/#sql-7f4b_4 | Unique hash index |       44 |      34176 |        0 |                  24576 |
| sys/def/14/PRIMARY     | db1/def/#sql-7f4b_4 | Ordered index     |       42 |      15904 |        0 |                      0 |
| sys/def/14/PRIMARY     | db1/def/#sql-7f4b_4 | Ordered index     |       42 |      16992 |        0 |                      0 |
| sys/def/14/PRIMARY     | db1/def/#sql-7f4b_4 | Ordered index     |       42 |      15904 |        0 |                      0 |
| sys/def/14/PRIMARY     | db1/def/#sql-7f4b_4 | Ordered index     |       42 |      16992 |        0 |                      0 |
| sys/def/14/val1        | db1/def/#sql-7f4b_4 | Ordered index     |       43 |      15904 |        0 |                      0 |
| sys/def/14/val1        | db1/def/#sql-7f4b_4 | Ordered index     |       43 |      16992 |        0 |                      0 |
| sys/def/14/val1        | db1/def/#sql-7f4b_4 | Ordered index     |       43 |      15904 |        0 |                      0 |
| sys/def/14/val1        | db1/def/#sql-7f4b_4 | Ordered index     |       43 |      16992 |        0 |                      0 |
| db1/def/t1             | NULL                | User table        |       45 |     110792 |   775260 |                  40960 |
| db1/def/t1             | NULL                | User table        |       45 |     108712 |   760568 |                  40960 |
| db1/def/t1             | NULL                | User table        |       45 |     110792 |   775260 |                  40960 |
| db1/def/t1             | NULL                | User table        |       45 |     108712 |   760568 |                  40960 |
| db1/def/#sql-7f4b_4    | NULL                | User table        |       14 |      47536 |   332412 |                  24576 |
| db1/def/#sql-7f4b_4    | NULL                | User table        |       14 |      48656 |   340252 |                  24576 |
| db1/def/#sql-7f4b_4    | NULL                | User table        |       14 |      47696 |   333532 |                  24576 |
| db1/def/#sql-7f4b_4    | NULL                | User table        |       14 |      48656 |   340252 |                  24576 |
+------------------------+---------------------+-------------------+----------+------------+----------+------------------------+
40 rows in set (0.86 sec)

The columns with information about the node ID, block instance, and fragment number have been left out. This is why it looks like there are duplicate rows. It is also worth noticing that there are several “child tables” for the indexes and a blob column.

There are three memory columns. The first is for the fixed size column format, the second for the variable width columns format, and the last for hash indexes.

MySQL NDB Cluster supports two storage formats for the columns. The fixed format uses less memory for columns that are fixed width in nature (such as integers), while the variable format (specified as DYNAMIC in CREATE TABLE and ALTER TABLE statements) is more flexible. The variable/dynamic column format is also the only one supported when adding a column in place (online). See also the manual page for CREATE TABLE for more information about the column format.

The hash memory is the memory used by hash indexes (for the primary key and unique indexes).

For the fixed and variable element memory usages there are both allocated and free bytes. Here the free bytes are used as a measure of the amount of fragmentation. A copying ALTER TABLE defragments the table, so it is necessary to take the fragmentation into consideration when estimating the progress. In reality it is more complicated than the query suggests, so the memory values in the query result will not end up matching 100%, but in most cases it should be a reasonable estimate.

You can also choose to aggregate the memory, for example:

mysql> SELECT IF(fq_name LIKE 'db1/def/%'
                    AND fq_name NOT LIKE 'db1/def/NDB$BLOB%',
                 fq_name,
                 parent_fq_name
              ) AS FqName,
              sys.format_bytes(
                 SUM(fixed_elem_alloc_bytes-fixed_elem_free_bytes)
              ) AS FixedBytes,
              sys.format_bytes(
                 SUM(var_elem_alloc_bytes-var_elem_free_bytes)
              ) AS VarBytes,
              sys.format_bytes(
                 SUM(hash_index_alloc_bytes)
              ) AS HashBytes
         FROM ndbinfo.memory_per_fragment
        WHERE fq_name = 'db1/def/t1' OR fq_name LIKE 'db1/def/#sql-%'
              OR parent_fq_name = 'db1/def/t1' OR parent_fq_name LIKE 'db1/def/#sql-%'
        GROUP BY FqName;
+---------------------+------------+----------+------------+
| FqName              | FixedBytes | VarBytes | HashBytes  |
+---------------------+------------+----------+------------+
| db1/def/#sql-7f4b_4 | 629.20 KiB | 3.08 MiB | 288.00 KiB |
| db1/def/t1          | 1.39 MiB   | 6.92 MiB | 480.00 KiB |
+---------------------+------------+----------+------------+
2 rows in set (0.69 sec)

This aggregate query also uses the sys schema function format_bytes() to convert the number of bytes into human readable numbers. The sys schema is installed by default for MySQL NDB Cluster 7.5 and later and is available from MySQL’s repository on GitHub for MySQL NDB Cluster 7.3 and 7.4.

This way of estimating the progress of a copying ALTER TABLE is not perfect, but at least it can give an idea of how the operation progresses.
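As a rough illustration of that estimate (hypothetical helper name; the byte values are the FixedBytes and VarBytes figures from the aggregate result above), the progress can be computed by comparing the in-use bytes of the temporary #sql table against the original table:

```python
# Rough progress estimate for a copying ALTER TABLE in NDB, comparing the
# in-use (allocated minus free) bytes of the temporary #sql table against
# the original table, using the numbers from the aggregate query above.

def alter_progress(copied_bytes, original_bytes):
    """Return an estimated completion percentage (0-100)."""
    if original_bytes == 0:
        return 100.0
    return 100.0 * copied_bytes / original_bytes

KiB, MiB = 1024, 1024 * 1024
copied = 629.20 * KiB + 3.08 * MiB      # db1/def/#sql-7f4b_4
original = 1.39 * MiB + 6.92 * MiB      # db1/def/t1

print(f"Estimated progress: {alter_progress(copied, original):.0f}%")  # ~44%
```

As noted above, fragmentation and the mechanics of the copy mean this will not match 100% exactly, but it gives a usable ballpark figure.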

More automated control in MySQL Cluster 7.6.7

Apart from bug fixes the 7.6.7 version of MySQL Cluster also brings
a major improvement of restart times through adaptively controlling
checkpoint speed.

Many DBMSs work hard on automating management of the database nodes.
In NDB automated management was a design point from the very first
version. This means that nodes crash and restart without operator
assistance.

For the last few years we have also worked on developing algorithms that
require less configuration. This greatly simplifies the configuration
of NDB Cluster.

In 7.6.7 we have made it much easier to configure handling of checkpoints
(LCPs) and REDO logging.

In earlier versions of NDB the checkpoint speed has been controlled by
two things. The first is based on the following configuration variables:

MinDiskWriteSpeed: This is the minimum disk write speed we will attempt
during a checkpoint, even in the presence of CPU overload and disk
overload. It defaults to 10 MByte per second; this is the sum over all
LDM threads, i.e. for the entire data node.

MaxDiskWriteSpeed: This is the maximum disk write speed we will attempt
during a checkpoint when no CPU overload or disk overload is seen; this
is the checkpoint speed used in normal operation. It defaults to
20 MByte per second.

MaxDiskWriteSpeedOtherNodeRestart: This is the maximum disk write speed
we will write during a checkpoint when another node is restarting. It
defaults to 50 MByte per second.

MaxDiskWriteSpeedOwnNodeRestart: This is the maximum disk write speed
we will write during a checkpoint when our node is restarting. It defaults to
200 MByte per second.

The actual disk write speed achieved is using those configuration variables
in combination with an adaptive algorithm that will decrease the checkpoint
speed when the CPU or the disk is overloaded.
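A highly simplified sketch of such an adaptive control loop (illustrative only; this is not the actual NDB algorithm, and the back-off factors are made up):

```python
# Illustrative adaptive checkpoint write-speed control: stay between the
# configured minimum and maximum, backing off under CPU or disk overload
# and ramping back up otherwise.

MIN_DISK_WRITE_SPEED = 10 * 1024 * 1024   # MinDiskWriteSpeed (bytes/s)
MAX_DISK_WRITE_SPEED = 20 * 1024 * 1024   # MaxDiskWriteSpeed (bytes/s)

def adjust_speed(current, cpu_overloaded, disk_overloaded):
    if cpu_overloaded or disk_overloaded:
        return max(MIN_DISK_WRITE_SPEED, int(current * 0.9))  # back off
    return min(MAX_DISK_WRITE_SPEED, int(current * 1.1))      # ramp up

speed = MAX_DISK_WRITE_SPEED
speed = adjust_speed(speed, cpu_overloaded=True, disk_overloaded=False)
print(f"checkpoint speed now {speed / (1024 * 1024):.1f} MByte/s")
```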

These parameters exist also in 7.6.7, but there is very little reason to
change them from their default value if the new configuration variable
EnableRedoControl is set to 1. By default this variable is set to 0 to
avoid changes of behaviour in a GA released version of MySQL Cluster.

In earlier versions of NDB it was necessary to have very large REDO logs.
The reason is that earlier versions (7.5 and earlier) wrote the entire
database to disk in each checkpoint. This meant that checkpoints during
massive inserts got larger and larger, and to ensure successful insertion
of the entire data set it was necessary to have REDO logs that were about
twice the size of the DataMemory.

Now in 7.6.7 it should be quite enough to have 2-4 GByte of REDO log per
REDO log part (normally equal to the number of LDM threads). This REDO
log size works perfectly even when loading TBytes of data into NDB.
Remember that EnableRedoControl needs to be set to 1 for this to work.

Thus in MySQL Cluster 7.6.7 one can simplify the configuration of REDO logs
and checkpointing.

In earlier versions we need to set the following variables:
NoOfFragmentLogParts (always set equal to number of LDM threads)
NoOfFragmentLogFiles
FragmentLogFileSize
MinDiskWriteSpeed
MaxDiskWriteSpeed
MaxDiskWriteSpeedOtherNodeRestart
MaxDiskWriteSpeedOwnNodeRestart

The product of NoOfFragmentLogParts, NoOfFragmentLogFiles and
FragmentLogFileSize is the size of the REDO log. In earlier versions
this product should be roughly two times the setting of DataMemory.

The default setting of FragmentLogFileSize is 16 MByte. Personally I always
increase this setting to 256 MByte (set to 256M).

So e.g. with a DataMemory of 100 GByte and 8 LDM threads one can set those to
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=100
FragmentLogFileSize=256M

This gives a REDO log size of 200 GByte.

The setting of disk write speed will be discussed a bit more in a coming blog.

In 7.6.7 one can instead configure as follows.

EnableRedoControl=1
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M

The setting of disk write speed variables need not be considered. The setting
of NoOfFragmentLogFiles to 8 and FragmentLogFileSize to 256M should work for
almost all setups of NDB. Only when dealing with data nodes larger than
one terabyte could it be considered to increase the REDO log size. The
NoOfFragmentLogParts should still be set to the number of LDM threads.
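The sizing arithmetic for this recommended configuration is easy to verify (total REDO log = NoOfFragmentLogParts × NoOfFragmentLogFiles × FragmentLogFileSize, and 2-4 GByte per log part as recommended above):

```python
# REDO log sizing for the 7.6.7 recommended configuration above.
no_of_fragment_log_parts = 8                 # equal to the number of LDM threads
no_of_fragment_log_files = 8
fragment_log_file_size = 256 * 1024 * 1024   # FragmentLogFileSize=256M

per_part = no_of_fragment_log_files * fragment_log_file_size
total = no_of_fragment_log_parts * per_part

print(f"REDO log per part: {per_part / 2**30:.0f} GiB")  # 2 GiB, within 2-4 GiB
print(f"Total REDO log:    {total / 2**30:.0f} GiB")     # 16 GiB
```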

Thus in 7.6.7 a lot less thought has to go into configuration of REDO logs
and disk write speeds. Disk write speed still affects backup write speeds
as well though, so it could be a good idea to consider how fast you want to
write your backups using the variables MinDiskWriteSpeed and MaxDiskWriteSpeed.

The reason that disk write speeds for checkpoints are less important to consider
is that we calculate how fast we need to write the checkpoints based on the
write activity in NDB. This means that when setting EnableRedoControl the write
speed to the disk can be quite substantial. So this setting will not work very
well unless the disk subsystem is able to handle the load. The disk subsystem
should be able to handle around 100 MByte of disk writes per LDM thread.

With modern HW this should not be an issue, in particular not when using NVMe
drives. In our benchmarking we are using a RAID 0 setup of 6 SSD drives. With
8 LDM threads inserting at full speed we use about 50% of the disk bandwidth
in this case (500 MByte per second).

Analysis of restart improvements in MySQL Cluster 7.6.7

To test restart times I am using the DBT2 test suite that
I developed based on DBT2 0.37 since 2006.

The following test setup is used:
DataMemory: 100 GByte (90 Gbyte in 7.5 and earlier)
IndexMemory: 10 GByte (in 7.5 and earlier versions)
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=50
FragmentLogFileSize=256M
MinDiskWriteSpeed=20M (two times the default)
MaxDiskWriteSpeed=40M (two times the default)

I load the database using LOAD DATA FROM INFILE with
precreated CSV files that contains the data for one
warehouse and table. This means that there are 8 CSV
files per warehouse. I load 600 warehouses into NDB.
This means a database size of around 60 GByte with
around 280 million rows.
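Some quick arithmetic on this load (approximate per-warehouse figures derived from the totals above):

```python
# Back-of-the-envelope figures for the DBT2 data set described above.
warehouses = 600
csv_files_per_warehouse = 8        # one CSV file per warehouse and table
db_size_gb = 60                    # approximate total database size
rows = 280_000_000                 # approximate total row count

print(f"{warehouses * csv_files_per_warehouse} CSV files loaded")  # 4800
print(f"~{db_size_gb * 1024 // warehouses} MByte per warehouse")   # ~102
print(f"~{db_size_gb * 2**30 // rows} bytes per row on average")   # ~230
```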

Next I run the DBT2 from 2 MySQL servers using 16 threads
each for two minutes. This creates a load of almost
100k TPM.

The next step is to restart one of the 2 data nodes and
measure the time it takes to restart the data node.

Obviously an even more realistic benchmark would be to
restart while running the benchmark, but the effect on
the restart analysis would not be substantial.

I tested using the latest GA versions from 7.4 and 7.5
and also both 7.6 GA versions (7.6.6 and 7.6.7).

First the results:
7.4.21: 31 minutes 29 seconds
7.5.11: 44 minutes 9 seconds
7.6.6:  18 minutes 2 seconds
7.6.7:  4 minutes 45 seconds
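Converting these times to seconds makes the relative improvement easy to see:

```python
# Restart-time comparison from the measurements above.
times = {
    "7.4.21": 31 * 60 + 29,
    "7.5.11": 44 * 60 + 9,
    "7.6.6":  18 * 60 + 2,
    "7.6.7":   4 * 60 + 45,
}

baseline = times["7.6.7"]
for version, seconds in times.items():
    print(f"{version}: {seconds:5d}s  ({seconds / baseline:.1f}x the 7.6.7 time)")
```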

To understand these numbers better we will look at the most important
restart phases and analyse the numbers for each.

It takes about 3 seconds to stop a node and start it again.
This time is constant in all versions.

The next step is allocating memory and touching the memory.
Allocating memory doesn't actually commit the memory to RAM.
It only ensures that there is space in RAM or in swap file
allocated for the memory. So in order to commit the memory
to RAM, it is necessary to touch the memory (read or write
from it). The speed of this touching of memory is fairly
constant and depends on Linux version (slight speedup in
newer Linux versions). My measurements shows that this
touching of memory handles about 2.5-3.5 GByte of memory
per second. Thus the restart time is dependent on the
DataMemory size and other memory consuming parts of the
NDB data node.

NDB data nodes always allocate and commit all the memory
as part of the restart. It is even possible to lock the
memory to RAM through setting LockPagesInMemory to 1 in
the configuration.

This step takes 26 seconds for all versions.
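As a back-of-the-envelope check of those 26 seconds, using the 2.5-3.5 GByte per second touch rate quoted above and the 90 GByte DataMemory of the 7.5-and-earlier configuration:

```python
# Estimate the time to touch (commit) memory during a data node restart,
# using the 2.5-3.5 GByte/s rate observed above.
def touch_time_seconds(gbytes, rate_gb_per_s):
    return gbytes / rate_gb_per_s

memory_gb = 90  # DataMemory in the 7.5-and-earlier configuration above
print(f"between {touch_time_seconds(memory_gb, 3.5):.0f}s "
      f"and {touch_time_seconds(memory_gb, 2.5):.0f}s")  # 26s-36s
```

The measured 26 seconds sits at the fast end of this range.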

A major step in the recovery is to recreate the database that
was in the data node at the time of the node stop. This is
performed in 3 phases.

1) Restore data from a checkpoint
2) Execute REDO log
3) Rebuild ordered indexes

The time spent in all of these phases depends on the version.
In 7.3 and earlier versions there was also a lot of time
spent waiting for the metadata lock when copying the
metadata to the starting node. This meant waiting for the
current checkpoint to complete (a checkpoint in 7.3 with
these settings and the database size takes about 20 minutes).

Thus 7.3 would add approximately 10 minutes to the restart times.

After restoring the local database from disk the next major
phase is the synchronisation phase. This phase takes longer if there
have been updates during the restart. The time spent
in this phase is not expected to have changed in any material
fashion in 7.6.

The final phase is to wait for one checkpoint to complete to
ensure that the node is recoverable even if the other node
should fail completely.

The restore phase in 7.4.21 and 7.5.11 only takes about 5-10 seconds.
The reason is that the last completed checkpoint happened early
in the load phase. Thus almost the entire database has to be
restored from the REDO log.

The combined time of the restore phase and the REDO phase is
3 minutes and 48 seconds in 7.4.21 and 3 minutes and 30 seconds
in 7.5.11.

In 7.6.6 the restore phase takes considerably longer and thus
the REDO phase is shortened. 7.6.6 makes partial checkpoints and
can thus write checkpoints a bit faster. But the disk write speed
is too slow to keep up with the insert rate. Actually setting
the MaxDiskWriteSpeed to 100M in 7.6.6 speeds up restarts by a
factor of 3. The time for the restore phase in 7.6.6 is
1 minute and 12 seconds and the REDO phase is 2 minutes and
20 seconds. Thus the total time of these two phases is
3 minutes and 32 seconds.

So what we can conclude here is that 7.6.6 requires a higher
setting of the disk write speed to materially improve the
restart times in these two phases for restarts during massive
inserts.

Now the restore phase in 7.6.7 recovers almost all the data,
since checkpoints are executed at 15-20 second intervals.
The restore phase consumes 2 minutes and 48 seconds and the
REDO phase takes less than one second. The speedup of these
two phases comes from the fact that the restore phase is faster
per row than executing the REDO log, even though more data has
to be restored in 7.6.

Next we analyse the phase that rebuilds the ordered indexes.
The change here actually comes from configuration changes and
the ability to lock the index build threads to more CPUs.

In 7.6 we changed the default of BuildIndexThreads to 128.
This means that each fragment that requires rebuild of an
index for a table can be executed in parallel. The default
in 7.5 and earlier meant that all rebuild of indexes happened
in LDM threads. Thus in this case the 7.5 and 7.4 versions
could use 8 CPUs to rebuild indexes while 7.6 could use
16 CPUs to rebuild indexes. The parallelisation of index
rebuilds can happen in 7.5 as well, but 7.6 ensures that we
lock to more CPUs than in 7.5.

This change meant that the times of 7.4.21 (2 minutes
19 seconds) and 7.5.11 (2 minutes 12 seconds) were significantly
improved. In 7.6.6 the time was 1 minute 20 seconds and in
7.6.7 it was 1 minute and 19 seconds. Thus a significant improvement
of this phase in 7.6.

The synchronisation phase takes 1-2 seconds in all versions.
Since no changes happened during the restart, this time is spent
scanning the 280 million rows (in the live node) to check whether
any changes occurred during the restart.

Now we come to the phase where the big change in 7.6.6 happened
and where it happens even more in 7.6.7. This phase is where
we wait for a checkpoint to complete that we participated in.

Actually this means first waiting for the ongoing checkpoint to
complete and then participating in another checkpoint.

The time to execute a checkpoint is fairly constant in 7.5 and
earlier versions. In this particular setup it takes about 22
minutes. This means that this wait phase can take anywhere
between 22 minutes and 44 minutes dependent on the timing of
the restart.

This is the reason why 7.4.21 restarted so much faster than
7.5.11. It was pure luck in timing the checkpoints.

In 7.6.6 the time to execute checkpoints is much lower than in 7.5,
so this phase is much shorter here. However the time for a checkpoint
varies depending on how many changes have happened since the last
checkpoint. In this particular case the phase took
12 minutes and 26 seconds.

With 7.6.7 we adapt checkpoint speed to ensure that we can
survive even with very small REDO logs. The second reason
for this is to make checkpoints much faster. During execution of
this benchmark no checkpoint took more than 25 seconds, most
of them took about 15-20 seconds, and while the DBT2 benchmark
was running they took about 10-15 seconds. In idle mode a checkpoint
is executed within a few seconds.

This means that waiting for a checkpoint to complete and execute
a new one is very fast. In this benchmark it took only 5 seconds.

Thus in this particular case restarts of 7.6.7 were almost 10 times
faster compared to 7.4 and 7.5, and even 4 times faster than
restarts of 7.6.6.

Thus most of the restart times are now linearly dependent on the
database size and the size of the ordered indexes.

There are more phases in NDB restarts that can consume time, for
instance with disk data we have to UNDO disk data changes. This
phase was improved 5 times with 4 LDM threads in 7.6. There
are also various steps where occasionally the restart can be blocked
due to metadata locks and other reasons.

GLB: GitHub’s open source load balancer


At GitHub, we serve tens of thousands of requests every second out of our network edge, operating on GitHub’s metal cloud. We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters, which powers the majority of GitHub’s public web and git traffic, as well as fronting some of our most critical internal systems such as highly available MySQL clusters. Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

GLB Director is a Layer 4 load balancer which scales a single IP address across a large number of physical machines while attempting to minimise connection disruption during any change in servers. GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.

GLB

Scaling an IP using ECMP

The basic property of a Layer 4 load balancer is the ability to take a single IP address and spread inbound connections across multiple servers. To scale a single IP address to handle more traffic than any single machine can process, we need to not only split amongst backend servers, but also need to be able to scale up the servers that handle the load balancing themselves. This is essentially another layer of load balancing.

Typically we think of an IP address as referencing a single physical machine, and routers as moving a packet to the next closest router to that machine. In the simplest case where there’s always a single best next hop, routers pick that hop and forward all packets there until the destination is reached.

Next Hop Routing

In reality, most networks are far more complicated. There is often more than a single path available between two machines, for example where multiple ISPs are available or even when two routers are joined together with more than one physical cable to increase capacity and provide redundancy. This is where Equal-Cost Multi-Path (ECMP) routing comes into play - rather than routers picking a single best next hop, where they have multiple hops with the same cost (usually defined as the number of ASes to the destination), they instead hash traffic so that connections are balanced across all available paths of equal cost.

ECMP with the same destination server

ECMP is implemented by hashing each packet to determine a relatively consistent selection of one of the available paths. The hash function used here varies by device, but typically it’s a consistent hash based on the source and destination IP address as well as the source and destination port for TCP traffic. This means that multiple packets for the same ongoing TCP connection will typically traverse the same path, meaning that packets will arrive in the same order even when paths have different latencies. Notably in this case, the paths can change without any disruption to connections because they will always end up at the same destination server, and at that point the path it took is mostly irrelevant.
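A minimal sketch of such a per-flow consistent hash (illustrative only; real routers use vendor-specific hardware hash functions):

```python
# Illustrative ECMP path selection: hash the TCP/IP 4-tuple so that all
# packets of one connection pick the same path, while different connections
# spread across the available equal-cost paths.
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, num_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_paths

# Every packet of the same flow maps to the same path index:
path = ecmp_path("203.0.113.7", "192.0.2.1", 51234, 443, num_paths=4)
assert path == ecmp_path("203.0.113.7", "192.0.2.1", 51234, 443, num_paths=4)
print(f"flow pinned to path {path}")
```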

An alternative use of ECMP can come into play when we want to shard traffic across multiple servers rather than to the same server over multiple paths. Each server can announce the same IP address with BGP or another similar network protocol, causing connections to be sharded across those servers, with the routers blissfully unaware that the connections are being handled in different places, not all ending on the same machine as would traditionally be the case.

ECMP with multiple destination servers

While this shards traffic as we had hoped, it has one huge drawback: when the set of servers that are announcing the same IP changes (or any path or router along the way changes), connections must rebalance to maintain an equal balance of connections on each server. Routers are typically stateless devices, simply making the best decision for each packet without consideration of the connection it is a part of, which means some connections will break in this scenario.

ECMP redistribution breaking connections

In the above example on the left, we can imagine that each colour represents an active connection. A new proxy server is added to announce the same IP. The router diligently adjusts the consistent hash to move 1/3 of connections to the new server while keeping 2/3 of connections where they were. Unfortunately for those 1/3 of connections that were already in progress, the packets are now arriving on a server that doesn’t know about the connection, and so they fail.

Split director/proxy load balancer design

The issue with the previous ECMP-only solution is that it isn’t aware of the full context for a given packet, nor is it able to store data for each packet/connection. As it turns out, there are commonly used patterns to help out with this situation by implementing some stateful tracking in software, typically using a tool like Linux Virtual Server (LVS). We create a new tier of “director” servers that take packets from the router via ECMP, but rather than relying on the router’s ECMP hashing to choose the backend proxy server, we instead control the hashing and store state (which backend was chosen) for all in-progress connections. When we change the set of proxy tier servers, the director tier hopefully hasn’t changed, and our connection will continue.

ECMP redistribution with LVS director missing state

Although this works well in many cases, it does have some drawbacks. In the above example, we add both a LVS director and backend proxy server at the same time. The new director receives some set of packets, but doesn’t have any state yet (or has delayed state), so hashes it as a new connection and may get it wrong (and cause the connection to fail). A typical workaround with LVS is to use multicast connection syncing to keep the connection state shared amongst all LVS director servers. This still requires connection state to propagate, and also still requires duplicate state - not only does each proxy need state for each connection in the Linux kernel network stack, but every LVS director also needs to store a mapping of connection to backend proxy server.

Removing all state from the director tier

When we were designing GLB, we decided we wanted to improve on this situation and not duplicate state at all. GLB takes a different approach to that described above, by using the flow state already stored in the proxy servers as part of maintaining established Linux TCP connections from clients.

For each incoming connection, we pick a primary and secondary server that could handle that connection. When a packet arrives on the primary server and isn’t valid, it is forwarded to the secondary server. The hashing to choose the primary/secondary server is done once, up front, and is stored in a lookup table, and so doesn’t need to be recalculated on a per-flow or per-packet basis. When a new proxy server is added, for 1/N connections it becomes the new primary, and the old primary becomes the secondary. This allows existing flows to complete, because the proxy server can make the decisions with its local state, the single source of truth. Essentially this gives packets a “second chance” at arriving at the expected server that holds their state.

ECMP redistribution with GLB

Even though the director will still send connections to the wrong server, that server will then know how to forward the packet on to the correct server. The GLB director tier is completely stateless in terms of TCP flows: director servers can come and go at any time, and will always pick the same primary/secondary server provided their forwarding tables match (but they rarely change). To change proxies, some care needs to be taken, which we describe below.
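The director's stateless lookup and the proxy's "second chance" decision can be sketched roughly as follows (hypothetical names; the real logic lives in GLB Director and on the proxy hosts):

```python
# Sketch of GLB's stateless director plus second-chance forwarding. The
# forwarding table maps a hash of the packet's flow to a (primary, secondary)
# pair; only the proxies hold real TCP connection state.
import hashlib

# Tiny two-row forwarding table; real GLB tables have 65k rows.
forwarding_table = [("proxy1", "proxy2"), ("proxy2", "proxy1")]

def director_route(src_ip, src_port):
    """Director: derive (primary, secondary) purely from packet data."""
    digest = hashlib.sha256(f"{src_ip}:{src_port}".encode()).digest()
    row = int.from_bytes(digest[:8], "big") % len(forwarding_table)
    return forwarding_table[row]

def proxy_receive(proxy, flow_id, is_syn, secondary, local_flows):
    """Proxy: accept SYNs and locally known flows; otherwise give the
    packet a second chance on the secondary server."""
    if is_syn or flow_id in local_flows[proxy]:
        return proxy
    return secondary

local_flows = {"proxy1": set(), "proxy2": {"flow-42"}}
primary, secondary = director_route("203.0.113.7", 51234)
# An established flow owned by proxy2 still reaches proxy2, even if the
# director now picks proxy1 as primary for it:
handled_by = proxy_receive(primary, "flow-42", False, secondary, local_flows)
print(f"flow-42 handled by {handled_by}")
```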

Maintaining invariants: rendezvous hashing

The core of the GLB Director design comes down to picking that primary and secondary server consistently, and to allow the proxy tier servers to drain and fill as needed. We consider each proxy server to have a state, and carefully adjust the state as a way of adding and removing servers.

We create a static binary forwarding table, which is generated identically on each director server, to map incoming flows to a given primary and secondary server. Rather than having complex logic to pick from all available servers at packet processing time, we instead use some indirection by creating a table (65k rows), with each row containing a primary and secondary server IP address. This is stored in memory as flat array of binary data, taking about 512kb per table. When a packet arrives, we consistently hash it (based on packet data alone) to the same row in that table (using the hash as an index into the array), which provides a consistent primary and secondary server pair.

GLB Forwarding Table with active servers

We want each server to appear approximately equally in both the primary and secondary fields, and to never appear in both in the same row. When we add a new server, we desire some rows to have their primary become secondary, and the new server become primary. Similarly, we desire the new server to become secondary in some rows. When we remove a server, in any rows where it was primary, we want the secondary to become primary, and another server to pick up secondary.

This sounds complex, but can be summarised succinctly with a couple of invariants:

  • As we change the set of servers, the relative order of existing servers should be maintained.
  • The order of servers should be computable without any state other than the list of servers (and maybe some predefined seeds).
  • Each server should appear at most once in each row.
  • Each server should appear approximately an equal number of times in each column.

Reading the problem that way, Rendezvous hashing is an ideal choice, since it can trivially satisfy these invariants. Each server (in our case, the IP) is hashed along with the row number, the servers are sorted by that hash (which is just a number), and we get a unique order for servers for that given row. We take the first two as the primary and secondary respectively.

Relative order will be maintained because the hash for each server will be the same regardless of which other servers are included. The only information required to generate the table is the IPs of the servers. Since we’re just sorting a set of servers, the servers only appear once. Finally, if we use a good hash function that is pseudo-random, the ordering will be pseudo-random, and so the distribution will be even as we expect.
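A sketch of building such a forwarding table with rendezvous hashing (64 rows here instead of 65k, and SHA-256 standing in for whichever seedable hash a real implementation would use):

```python
# Rendezvous ("highest random weight") hashing: for each row, rank the
# servers by hash(server_ip, row) and take the top two as primary and
# secondary. Adding or removing a server never reorders the remaining ones.
import hashlib

def weight(server_ip, row):
    digest = hashlib.sha256(f"{server_ip}|{row}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_table(servers, rows=64):
    table = []
    for row in range(rows):
        ranked = sorted(servers, key=lambda ip: weight(ip, row), reverse=True)
        table.append((ranked[0], ranked[1]))  # (primary, secondary)
    return table

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
before = build_table(servers)
after = build_table(servers + ["10.0.0.4"])

# Invariants: no row lists the same server twice, and adding a server only
# ever pushes an old primary down to secondary (it never reshuffles the rest).
assert all(p != s for p, s in before)
for (p0, _), (p1, s1) in zip(before, after):
    assert p1 == p0 or (p1 == "10.0.0.4" and s1 == p0)
```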

Draining, filling, adding and removing proxies

Adding or removing proxy servers require some care in our design. This is because a forwarding table entry only defines a primary/secondary proxy, so the draining/failover only works with at most a single proxy host in draining. We define the following valid states and state transitions for a proxy server:

GLB Proxy server state machine

When a proxy server is active, draining or filling, it is included in the forwarding table entries. In a stable state, all proxy servers are active, and the rendezvous hashing described above will have an approximately even and random distribution of each proxy server in both the primary and secondary columns.

As a proxy server transitions to draining, we adjust the entries in the forwarding table by swapping the primary and secondary entries we would have otherwise included:

GLB Forwarding Table with a draining server

This has the effect of sending packets to the server that was previously secondary first. Since it receives the packets first, it will accept SYN packets and therefore take any new connections. For any packet it doesn’t understand as relating to a local flow, it forwards it to the other server (the previous primary), which allows existing connections to complete.

This has the effect of draining the desired server of connections gracefully, after which point it can be removed completely, and proxies can shuffle in to fill the empty secondary slots:

GLB Forwarding Table with removed server

A node in filling looks just like active, since the table inherently allows a second chance:

GLB Forwarding Table with filling server

This implementation requires that no more than one proxy server at a time is in any state other than active, which in practice has worked well at GitHub. The state changes to proxy servers can happen as quickly as the longest connection duration that needs to be maintained. We’re working on extensions to the design that support more than just a primary and secondary, and some components (like the header listed below) already include initial support for arbitrary server lists.
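Continuing the sketch above, draining can be expressed as swapping the pair in every row where the draining server would have been primary (assumed logic mirroring the description; filling needs no table change at all):

```python
# Sketch of adjusting forwarding-table entries for a draining proxy: where
# the draining server would have been primary, swap primary and secondary so
# that new SYNs land on the other server, while existing flows still get
# forwarded back via the second-chance mechanism.

def apply_states(table, states):
    adjusted = []
    for primary, secondary in table:
        if states.get(primary) == "draining":
            primary, secondary = secondary, primary  # swap the pair
        adjusted.append((primary, secondary))
    return adjusted

table = [("proxy1", "proxy2"), ("proxy2", "proxy3"), ("proxy3", "proxy1")]
states = {"proxy1": "active", "proxy2": "draining", "proxy3": "active"}

print(apply_states(table, states))
# proxy2 is no longer primary anywhere, but stays secondary so its existing
# connections can complete before it is removed.
```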

Encapsulation within the datacenter

We now have an algorithm to consistently pick backend proxy servers and operate on them, but how do we actually move packets around the datacenter? How do we encode the secondary server inside the packet so the primary can forward a packet it doesn’t understand?

Traditionally in the LVS setup, an IP over IP (IPIP) tunnel is used. The client IP packet is encapsulated inside an internal datacenter IP packet and forwarded on to the proxy server, which decapsulates it. We found that it was difficult to encode the additional server metadata inside IPIP packets, as the only standard space available was the IP Options, and our datacenter routers passed packets with unknown IP options to software for processing (which they called “Layer 2 slow path”), taking speeds from millions to thousands of packets per second.

To avoid this, we needed to hide the data inside a different packet format that the router wouldn’t try to understand. We initially adopted raw Foo-over-UDP (FOU) with a custom Generic Route Encapsulation (GRE) payload, essentially encapsulating everything inside a UDP packet. We recently transitioned to Generic UDP Encapsulation (GUE), which is a layer on top of FOU which provides a standard for encapsulating IP protocols inside a UDP packet. We place our secondary server’s IP inside the private data of the GUE header. From a router’s perspective, these packets are all internal datacenter UDP packets between two normal servers.

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+\
|          Source port          |        Destination port       | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ UDP
|             Length            |            Checksum           | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
| 0 |C|   Hlen  |  Proto/ctype  |             Flags             | GUE
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Private data type (0)     |  Next hop idx |   Hop count   |\
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
|                             Hop 0                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ GLB
|                              ...                              | private
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ data
|                             Hop N                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
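As a rough illustration, the GLB private data section of the header above could be packed and unpacked like this in Python. The field widths follow the diagram, but the function names are hypothetical; this is a sketch, not the actual glb-director code:

```python
import struct

def pack_glb_private_data(next_hop_idx, hops):
    # Private data layout from the diagram (all fields big-endian):
    #   type (16 bits, always 0) | next hop idx (8) | hop count (8) | one 32-bit IPv4 per hop
    header = struct.pack("!HBB", 0, next_hop_idx, len(hops))
    for ip in hops:
        header += struct.pack("!4B", *(int(o) for o in ip.split(".")))
    return header

def unpack_glb_private_data(data):
    ptype, idx, count = struct.unpack("!HBB", data[:4])
    hops = []
    for i in range(count):
        octets = struct.unpack("!4B", data[4 + 4 * i: 8 + 4 * i])
        hops.append(".".join(str(o) for o in octets))
    return idx, hops

# A primary/secondary hop list for one forwarding table entry.
blob = pack_glb_private_data(0, ["10.0.0.1", "10.0.0.2"])
idx, hops = unpack_glb_private_data(blob)
```

Each proxy that forwards the packet would bump the next hop index before sending it on, so the chain terminates.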

Another benefit of using UDP is that the source port can be filled in with a per-connection hash, so that flows travel over different paths within the datacenter (where ECMP is used) and are received on different RX queues on the proxy server’s NIC (which similarly uses a hash of TCP/IP header fields). This is not possible with IPIP because most commodity datacenter NICs are only able to understand plain IP, TCP/IP and UDP/IP (and a few others). Notably, the NICs we use cannot look inside IP/IP packets.
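A minimal sketch of this idea, assuming a CRC32 hash over the inner TCP/IP 4-tuple (the actual hash function used by glb-director may differ):

```python
import struct
import zlib

def encap_source_port(src_ip, src_port, dst_ip, dst_port):
    # Hash the inner TCP/IP 4-tuple so each flow gets a stable outer UDP
    # source port; ECMP and NIC RSS then spread distinct flows across
    # paths and RX queues while keeping any single flow on one path.
    key = struct.pack("!4s4sHH",
                      bytes(int(o) for o in src_ip.split(".")),
                      bytes(int(o) for o in dst_ip.split(".")),
                      src_port, dst_port)
    h = zlib.crc32(key)
    # Map into the ephemeral port range to stay clear of well-known ports.
    return 32768 + (h % 28232)

p1 = encap_source_port("203.0.113.5", 50000, "192.0.2.10", 443)
p2 = encap_source_port("203.0.113.5", 50000, "192.0.2.10", 443)
```

The same flow always hashes to the same outer source port, which is what keeps its packets on one path and one RX queue.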

When the proxy server wants to send a packet back to the client, it doesn’t need to be encapsulated or travel back through our director tier, it can be sent directly to the client (often called “Direct Server Return”). This is typical of this sort of load balancer design and is especially useful for content providers where the majority of traffic flows outbound with a relatively small amount of traffic inbound.

This leaves us with a packet flow that looks like the following:

GLB second chance packet flow

DPDK for 10G+ line rate packet processing

Since we first publicly discussed our initial design, we’ve completely rewritten glb-director to use DPDK, an open source project that allows very fast packet processing from userland by bypassing the Linux kernel. This has allowed us to achieve NIC line rate processing on commodity NICs with commodity CPUs, and allows us to trivially scale our director tier to handle as much inbound traffic as our public connectivity requires. This is particularly important during DDoS attacks, where we do not want our load balancer to be a bottleneck.

One of our initial goals with GLB was that our load balancer could run on commodity datacenter hardware without any server-specific physical configuration. Both GLB director and proxy servers are provisioned like normal servers in our datacenter. Each server has a bonded pair of network interfaces, and those interfaces are shared between DPDK and Linux on GLB director servers.

Modern NICs support SR-IOV, a technology that enables a single NIC to act like multiple NICs from the perspective of the operating system. This is typically used by virtual machine hypervisors to ask the real NIC (“Physical Function”) to create multiple pretend NICs for each VM (“Virtual Functions”). To enable DPDK and the Linux kernel to share NICs, we use flow bifurcation, which sends specific traffic (destined to GLB-run IP addresses) to our DPDK process on a Virtual Function while leaving the rest of the packets with the Linux kernel’s networking stack on the Physical Function.

We’ve found that the packet processing rates of DPDK on a Virtual Function are acceptable for our requirements. GLB Director uses a DPDK Packet Distributor pattern to spread the work of encapsulating packets across any number of CPU cores on the machine, and since it is stateless this can be highly parallelised.

GLB Flow Paths

GLB Director supports matching and forwarding inbound IPv4 and IPv6 packets containing TCP payloads, as well as inbound ICMP Fragmentation Required messages used as part of Path MTU Discovery, by peeking into the inner layers of the packet during matching.

Bringing test suites to DPDK with Scapy

One problem that typically arises in creating (or using) technologies that operate at high speeds due to using low-level primitives (like communicating with the NIC directly) is that they become significantly more difficult to test. As part of creating the GLB Director, we also created a test environment that supports simple end-to-end packet flow testing of our DPDK application, by leveraging the way DPDK provides an Environment Abstraction Layer (EAL) that allows a physical NIC and a libpcap-based local interface to appear the same from the view of the application.

This allowed us to write tests in Scapy, a wonderfully simple Python library for reading, manipulating and writing packet data. By creating a Linux Virtual Ethernet Device, with Scapy on one side and DPDK on the other, we were able to pass in custom crafted packets and validate what our software would provide on the other side, being a fully GUE-encapsulated packet directed to the expected backend proxy server.

GLB's Scapy test setup

This allows us to test more complex behaviours such as traversing layers of ICMPv4/ICMPv6 headers to retrieve the original IPs and TCP ports for correct forwarding of ICMP messages from external routers.

Healthchecking of proxies for auto-failover

Part of the design of GLB is to handle server failure gracefully. The current design of having a designated primary/secondary for a given forwarding table entry / client means that we can work around single-server failure by running health checks from the perspective of each director. We run a service called glb-healthcheck which continually validates each backend server’s GUE tunnel and arbitrary HTTP port.

When a server fails, we swap the primary/secondary entries anywhere that server is primary. This performs a “soft drain” of the server, which provides the best chance for connections to gracefully fail over. If the healthcheck failure is a false positive, connections won’t be disrupted, they will just traverse a slightly different path.
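The swap described above can be sketched as follows; the table layout and function name are illustrative rather than glb-healthcheck's actual data model:

```python
def soft_drain(forwarding_table, failed_server):
    # Swap primary and secondary wherever the failed server is primary.
    # Established connections keep working because the old primary is
    # still present as the secondary hop; new SYNs go to the new primary.
    for entry in forwarding_table:
        if entry["primary"] == failed_server:
            entry["primary"], entry["secondary"] = \
                entry["secondary"], entry["primary"]
    return forwarding_table

table = [
    {"primary": "proxy1", "secondary": "proxy2"},
    {"primary": "proxy2", "secondary": "proxy3"},
]
soft_drain(table, "proxy1")
```

If the healthcheck was a false positive, packets simply take the slightly longer primary-then-secondary path until the entry is swapped back.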

Second chance on proxies with iptables

The final component that makes up GLB is a Netfilter module and iptables target that runs on every proxy server and allows the “second chance” design to function.

This module performs a simple task: deciding whether the inner TCP/IP packet inside every GUE packet is valid locally according to the Linux kernel TCP stack and, if it isn’t, forwarding it to the next proxy server (the secondary) rather than decapsulating it locally.

In the case where a packet is a SYN (new connection) or is valid locally for an established connection, it simply accepts it locally. We then use the Linux kernel 4.x GUE support provided as part of the fou module to receive the GUE packet and process it locally.
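In Python pseudocode, the module's decision looks roughly like this; note that the real Netfilter module asks the kernel TCP stack whether the packet is valid locally rather than consulting an explicit connection set:

```python
def second_chance(packet, local_connections):
    # Accept SYNs (new connections) and packets belonging to a locally
    # established connection; anything else is left encapsulated and
    # forwarded on to the secondary hop from the GUE private data.
    flow = (packet["src_ip"], packet["src_port"],
            packet["dst_ip"], packet["dst_port"])
    if packet["syn"] or flow in local_connections:
        return "ACCEPT_LOCALLY"
    return "FORWARD_TO_SECONDARY"

conns = {("203.0.113.5", 50000, "192.0.2.10", 443)}
a = second_chance({"src_ip": "203.0.113.5", "src_port": 50000,
                   "dst_ip": "192.0.2.10", "dst_port": 443,
                   "syn": False}, conns)
b = second_chance({"src_ip": "198.51.100.7", "src_port": 1234,
                   "dst_ip": "192.0.2.10", "dst_port": 443,
                   "syn": False}, conns)
```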

Available today as open source

When we started down the path of writing a better datacenter load balancer, we decided that we wanted to release it open source so that others could benefit from and share in our work. We’re excited to be releasing all the components discussed here as open source at github/glb-director. We hope this will allow others to reuse our work and contribute to a common standard software load balancing solution that runs on commodity hardware in physical datacenter environments.

GLB component overview

Also, we’re hiring!

GLB and the GLB Director has been an ongoing project designed, authored, reviewed and supported by various members of GitHub’s Production Engineering organisation, including @joewilliams, @nautalice, @ross, @theojulienne and many others. If you’re interested in joining us in building great infrastructure projects like GLB, our Data Center team is hiring production engineers specialising in Traffic Systems, Network and Facilities.

Loading data into TByte sized NDB data nodes

One of the main design goals with MySQL Cluster 7.6 was to
support much larger data sets in each data node. The
checkpoint algorithm makes it hard to manage data nodes in
7.5 and earlier versions with many hundreds of GBytes of
data. Using MySQL Cluster 7.6.7 the algorithms scale to
multiple TBytes per data node.

However, the largest machines currently at my disposal have
1 TByte of RAM. Thus I went about testing loading data into
a cluster on two such machines. There is no comparison to
older versions here: it is possible to load data into earlier
versions, but with such large data nodes it is not very
practical.

7.6.7 comes with both partial checkpoints and adaptive
control of disk write speed. This means that we can load
terabytes of data into the cluster even with moderately
sized REDO logs.

To experiment with these I used the same test as in the
previous blog.

Here is the setup.

2 data nodes, one per machine where each machine is equipped
with 1 TByte of RAM, 60 CPU cores distributed on 4 CPU sockets.

The NDB configuration used 750 GByte of DataMemory, 8 LDM threads,
the data node only used CPUs from one CPU socket. It was essential
to set Numa=1 in the configuration to be able to use memory from
all four CPU sockets.

We used the default settings for the MinDiskWriteSpeed and MaxDiskWriteSpeed*
configuration parameters. The REDO log size was 2 GByte per LDM thread.
We thus set the following REDO log parameters:
NoOfFragmentLogParts=8 (equal to number of LDM threads)
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M
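A quick arithmetic check that these parameters give the 2 GByte of REDO log per LDM thread mentioned above:

```python
# REDO log capacity implied by the configuration above.
parts = 8            # NoOfFragmentLogParts (one per LDM thread)
files_per_part = 8   # NoOfFragmentLogFiles
file_size_mb = 256   # FragmentLogFileSize

per_part_gb = files_per_part * file_size_mb / 1024  # per LDM thread
total_gb = parts * per_part_gb                      # whole data node
```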

To enable the new algorithm for adaptive disk write speed it is
necessary to set EnableRedoControl=1 in configuration.

Interestingly, when loading the data we were limited by the read
bandwidth of the disk where the CSV files of DBT2 were stored. We
therefore also created one CSV file per table using SELECT INTO
OUTFILE and used ndb_import to load the data into NDB. Using
ndb_import we were able to load data faster than using LOAD DATA
INFILE, since we could then use files stored on RAIDed SSD drives.

One challenge with loading data through DBT2 is that we use the
warehouse id as the partition key. This means that all loads from one
CSV file go into one LDM thread, so the load among the LDM threads
isn't balanced. This complicates the checkpointing scheme a bit: we
make sure that when any REDO log part becomes critical, all other LDM
threads in the cluster know that we are in a critical state.

Loading through ndb_import is thus easier since the load is balanced over
the LDM threads.

With 8 LDM threads we load 1-2 warehouses per second; one warehouse is
around 100 MByte of data spread across about 370k rows.

In this test we loaded 6000 DBT2 warehouses, thus around 600 GByte in
database size.
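From the figures above we can sketch the expected totals and load time:

```python
warehouses = 6000
mb_per_warehouse = 100        # ~100 MByte per warehouse
rows_per_warehouse = 370_000  # ~370k rows per warehouse

total_gb = warehouses * mb_per_warehouse / 1000   # database size
total_rows = warehouses * rows_per_warehouse

# At the observed 1-2 warehouses per second with 8 LDM threads:
minutes_fast = warehouses / 2 / 60
minutes_slow = warehouses / 1 / 60
```

So the 600 GByte load takes somewhere between roughly 50 and 100 minutes, depending on where in the 1-2 warehouses/second range the load runs.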

The restart is very similar to the recovery after loading 600 warehouses.
Touching the memory now takes a bit longer: 4 minutes and 44 seconds,
plus another 28 seconds when starting the first recovery phases.

The restore phase took 27 minutes and 40 seconds, the REDO phase was still less
than a second and finally rebuilding the ordered indexes took 12 minutes and
13 seconds.

The copy fragment phase increased to 17 seconds since we needed to
scan more data this time. The phase waiting for checkpoints to
complete was still 5 seconds, and the phase that waits for replication
subscriptions to be configured also took 5 seconds. The latter is
usually around 5-6 seconds, unless a MySQL server is down, in which
case it can take up to 2 minutes to complete.

Total restart time thus became 45 minutes and 32 seconds.
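Summing the phase durations above (counting the sub-second REDO phase as zero) reproduces this total:

```python
# Restart phase durations, in seconds, as reported above.
phases_seconds = {
    "touch memory":               4 * 60 + 44,
    "first recovery phases":      28,
    "restore":                    27 * 60 + 40,
    "REDO execution":             0,   # "still less than a second"
    "rebuild ordered indexes":    12 * 60 + 13,
    "copy fragment":              17,
    "wait for checkpoint":        5,
    "replication subscriptions":  5,
}
total = sum(phases_seconds.values())
minutes, seconds = divmod(total, 60)
```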

As we can see, we can recover a TByte-sized data node within an hour,
and this is with 8 LDM threads. If we instead use 24 LDM threads, the
restore and rebuild-index phases will go about 3 times faster, cutting
restart time by another 25 minutes to about 20 minutes. We would then
even be able to restart more than 2 TBytes within an hour.

Scheduling challenges of checkpoints in NDB

The NDB data nodes are implemented using asynchronous programming. The model is
quite simple. One can send asynchronous messages on two priority levels, the
A-level is high priority messages that are mainly used for various management
actions. The B-level represents the normal priority level where all normal
messages handling transactions are executed.

It is also possible to send delayed signals that will wait for a certain
number of milliseconds before being delivered to the receiver.

When developing MySQL Cluster 7.4 we noted a problem with local
checkpoints: if transaction load was extremely high, the checkpoints
almost stopped. If such a situation persists for too long, we run out
of REDO log.

To handle this we introduced a special version of delayed signals. This new
signal will be scheduled such that at most around 75 messages are executed
before this message is delivered. There can be thousands of messages waiting
in queue, so this gives a higher priority to this signal type.

This feature was used to get control of checkpoint execution and introduced in
MySQL Cluster 7.4.7. With this feature each LDM thread will at least be able
to deliver 10 MBytes of checkpoint writes per second.
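A simplified sketch of this bounded-delay scheduling; the queue model here is illustrative and not the actual NDB job scheduler:

```python
from collections import deque

MAX_MESSAGES_BEFORE_DELIVERY = 75  # the ~75-message bound described above

def run_until_special(queue, special_signal):
    # Execute queued normal-priority (B-level) messages, but guarantee
    # the special "bounded delay" signal is delivered after at most 75
    # of them, even if thousands more are waiting behind it.
    executed = 0
    while queue:
        if executed >= MAX_MESSAGES_BEFORE_DELIVERY:
            return special_signal, executed
        queue.popleft()()   # execute one normal message
        executed += 1
    return special_signal, executed

# Thousands of messages waiting, yet the signal arrives after 75.
jobs = deque([lambda: None] * 10_000)
signal, n = run_until_special(jobs, "CONTINUE_LCP")
```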

With the introduction of adaptive checkpoint speed this wasn't enough. In a
situation where we load data into NDB Cluster we might need to write much
more data to the checkpoints.

To solve this we keep track of how much data we need to write per second to
ensure that we don't run out of REDO log.

If the REDO log comes to a critical point where the risk of running out of
REDO log is high, we will raise priority of checkpointing even higher such
that we can ensure that we don't run out of REDO log.

This means that during a critical situation, normal transaction throughput
will decrease since we will put a lot of effort into ensuring that we don't
get into a situation of a complete stop due to running out of REDO log.

We solve this by executing checkpoint scans without real-time breaks
for a number of rows, and if we need to continue writing checkpoints
we send a message on A-level to ourselves to continue without giving
transactions a chance to come in. When we have written enough, we give
the transactions a chance again by sending the new special delayed
signal.
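The escalation could be sketched as a policy like the following; the thresholds here are made up for illustration, and the real adaptive algorithm in 7.6.7 is more elaborate:

```python
def checkpoint_mode(redo_used_fraction):
    # Sketch of the adaptive policy: the fuller the REDO log, the more
    # aggressively checkpoint writes pre-empt normal transactions.
    if redo_used_fraction >= 0.9:
        # Critical: loop on A-level self-messages with no real-time
        # breaks; transactions barely get a chance to run.
        return "A_LEVEL_NO_BREAKS"
    if redo_used_fraction >= 0.6:
        # Elevated: use the special delayed signal (~75-message bound).
        return "BOUNDED_DELAY"
    # Normal: plain delayed signals, transactions run freely.
    return "NORMAL"
```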

The challenge that we get here is that checkpoints must be prioritised over
normal transactions in many situations. At the same time we want the
prioritisation to be smooth to avoid start and stop situations that can
easily cause ripple effects in a large cluster.

This improved scheduling of checkpoints was one part of the solution to
the adaptive checkpoint speed that is introduced in MySQL Cluster 7.6.7.

Optimising scan filter for checkpoints in NDB

When loading massive amounts of data into NDB when testing the new
adaptive checkpoint speed I noted that checkpoints slowed down as the
database size grew.

I could note in debug logs that the amount of checkpoint writes was
dropping significantly at times. After some investigation I discovered
the root cause.

The checkpoint algorithm in NDB requires all changed rows to be
written to the checkpoint, even if they belong to a part that is not
fully checkpointed. This means that each row has to be scanned to
discover whether it has been changed.

When loading 600 GByte of DBT2 data we have more than two billion rows
in the database. Scanning two billion rows takes around 15-20 seconds
when simultaneously handling lots of inserts.

This slowed down checkpoints and, in addition, used a lot of CPU.
Thus we wanted a more efficient scanning algorithm for this case.

The solution is based on dividing the database into larger segments.
When updating a row, one has to ensure that a flag on the larger
segment is also updated. A simple first approach is to implement this
at the page level for our fixed size pages. Every row has an entry in
the fixed size part, which contains the row header and all fixed size
columns that are not defined as using DYNAMIC storage.

In DBT2 this means that most fixed size pages have around 300 row
entries. Thus we can check one page and if no row has been changed
we can skip checking 300 row entries.

When data size grows to TBytes and we checkpoint every 10-20 seconds,
the risk of a row in a page being updated is actually fairly low.
Thus this simple optimisation brings down the slowdown of the
checkpoints to small parts of a second.
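A sketch of the page-level flag, assuming ~300 row entries per fixed size page as in DBT2; the class and function names are illustrative:

```python
class FixedPage:
    def __init__(self, rows_per_page=300):
        self.changed = False                    # coarse page-level flag
        self.row_changed = [False] * rows_per_page

    def update_row(self, row_idx):
        # Updating any row also sets the page-level flag.
        self.row_changed[row_idx] = True
        self.changed = True

def checkpoint_scan(pages):
    rows_examined = 0
    changed_rows = []
    for page_no, page in enumerate(pages):
        if not page.changed:
            continue                # skip ~300 row checks in one test
        for row_idx, dirty in enumerate(page.row_changed):
            rows_examined += 1
            if dirty:
                changed_rows.append((page_no, row_idx))
    return changed_rows, rows_examined

# 1000 pages (~300k row entries), only one row changed since the
# last checkpoint: the scan examines a single page's rows.
pages = [FixedPage() for _ in range(1000)]
pages[42].update_row(7)
rows, examined = checkpoint_scan(pages)
```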

Obviously it is possible to use smaller regions and also larger regions
to control this if required.

This is an important improvement of the checkpointing in
MySQL Cluster 7.6.7.