
Historical - design doc for semisync replication

This can be read along with the initial semisync post. This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

Semisync replication was designed and implemented by Wei Li. He did a lot of work to make replication better for web-scale and then moved away from MySQL. Upstream reimplemented the feature which was a good decision given the constraints on our implementation time.

Introduction

Semi-sync replication blocks return from commit on a master until at least one slave acknowledges receipt of all replication events for that transaction. Note that the transaction is committed on the master first.

Background

MySQL replication is asynchronous. If a master fails after committing a transaction but before a slave has copied the replication events for that transaction, the transaction might be lost forever. For some deployments, we prefer to reduce the chance of this.

The asynchronous replication model might lose user-visible transactions during an unplanned failover. If the master crashes and we let a slave take over, then the application must be prepared to check which transactions actually made it to the slave, and rerun the ones that did not.

Overview

To solve the asynchronous problem, we can add different degrees of synchronicity: fully synchronous replication would wait for the slave to process the transaction first, before telling the client that it has been committed. The downside: delays in commits.

We propose to do semi-synchronous replication: before telling the client that a transaction has been committed, make sure that the slave receives its replication events first. This is also called 2-safe replication. (The original post included a diagram of the semi-synchronous commit flow here.)


MySQL commit protocol

The commit protocol is different between MySQL-4.x and MySQL-5.0. The main reason is that MySQL-5.0 uses two-phase commit to make sure the binlog status conforms to the transactional storage engines' internal status.
  • MySQL-4.x
    • write the transaction in the binlog file
    • commit the transaction in InnoDB or other storage engine
  • MySQL-5.0:
    • prepare the transaction in InnoDB or other storage engines
    • write the transaction in the binlog file - this is considered the commit point
    • commit the transaction in InnoDB or other storage engines
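
To make the ordering concrete, here is a minimal C++ sketch of the MySQL-5.0 flow; the StorageEngine and Binlog types are hypothetical stand-ins for the real server classes, so treat this as pseudocode rather than actual server code.

// Sketch of the MySQL-5.0 commit ordering described above.
#include <cstdio>

struct StorageEngine {
  void prepare() { std::puts("engine: prepare transaction"); }   // phase 1
  void commit()  { std::puts("engine: commit transaction"); }    // phase 2
};

struct Binlog {
  // Writing the transaction to the binlog is the commit point in MySQL-5.0:
  // crash after this write and recovery commits the prepared transaction,
  // crash before it and recovery rolls the transaction back.
  void write_transaction() { std::puts("binlog: write transaction events"); }
};

void commit_transaction_5_0(StorageEngine& engine, Binlog& binlog) {
  engine.prepare();            // prepare in InnoDB or another storage engine
  binlog.write_transaction();  // the commit point
  engine.commit();             // commit inside the storage engine
}

int main() {
  StorageEngine innodb;
  Binlog binlog;
  commit_transaction_5_0(innodb, binlog);
  return 0;
}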

Semi-synchronous commit protocol

Our proposed semi-synchronous replication works in the following way:
  • commit the transaction
  • wait for a replica database to acknowledge that it has received the transaction - this step has a timeout
  • tell the client that the commit has been processed

The committing transaction does not wait indefinitely for the replication thread to send the binlog events. If it did, the commit would never return when there are network issues or the slave database is down. In step 2, the committing transaction times out after a predefined waiting time.

After a timeout, semi-synchronous replication is disabled. When a replication thread catches up, it enables semi-synchronous replication again.

During the wait for the network acknowledgment, other transactions are not blocked and can still proceed.

The following global counters will be added:
  • transaction failure timeouts
  • transactions committed without semi-synchronous replication
  • network timeouts

TCP/IP is not enough for acknowledgment

The tricky thing is that the replication thread uses TCP/IP to send the replication events. Note that TCP/IP, even with the TCP_NODELAY option, does not guarantee that the slave has received the data. Thus, to make sure that the slave database has got the transaction, the slave database must send a reply to indicate that. This means a transaction commit requires at least a TCP round-trip time. Considering that the round-trip time within one data center is about 0.5ms, this should not prevent MySQL from achieving hundreds of transactions per second.

We will also provide the option of sending the transaction without waiting for the confirmation. We can measure the performance difference to understand the network overhead in the synchronous replication. A parameter will be provided to dynamically change the timeout.

Replication protocol changes

To guarantee that a slave database has got the transaction, the slave database must send one reply message back. This is the situation:
  • the master database needs to know when to wait for the reply from the slave database; right now, the master database never waits
  • the slave database needs to know when it should send a reply message to the master database
  • we cannot do this ping-pong process for every replication event; it should happen only once per transaction to minimize the network overhead

Both the master and the slave must know when to start this confirmation process. Any design without replication event changes or replication protocol changes is not possible because the slave database can only figure out this information from the received messages. Initially, we wanted to change the replication events by appending one special event after each transaction to tell the slave to reply. However, since replication logs are served at least once for each replica while we only wait once at transaction commit time, this turned out to be a bad idea.

The only solution after this is to make replication protocol changes. This is the current MySQL replication login process:
  • on the slave database side:
    • a slave database calls safe_connect() to login to the master database
    • COM_BINLOG_DUMP command is sent to the master database to request for binlogs with the following information: binlog_filename, binlog_pos, binlog_flag, server_id
  • on the master database side:
    • COM_BINLOG_DUMP is handled to recognize the requested dump information
    • mysql_binlog_send() is called to send the requested binlog events

Because binlog_flag is sent from the slave database and processed in the master database, semi-synchronous replication will be initiated by the slave and the replication thread will trigger the synchronous operation in the master database. We add one bit to binlog_flag so that the slave database can register itself as a semi-synchronous replication target.

    #define BINLOG_SEMI_SYNC 0x02

If BINLOG_SEMI_SYNC is set for the replication thread, then every event sent from the master database to the slave database will always have one byte extra header. The one byte indicates whether the replication thread is expecting the reply from the slave database. In this way, the new replication protocol's usage is session-based.
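
Here is a rough C++ sketch of that framing; only BINLOG_SEMI_SYNC comes from the design, the helper functions and buffer types are made up for illustration.

#include <cstdint>
#include <vector>

// The slave sets this bit in binlog_flag when it sends COM_BINLOG_DUMP.
#define BINLOG_SEMI_SYNC 0x02

// Master side: for a semi-sync session, prepend a one-byte header to every
// event; the byte says whether the master expects a reply for this event.
std::vector<uint8_t> frame_event(const std::vector<uint8_t>& event,
                                 bool semi_sync_session, bool need_reply) {
  std::vector<uint8_t> out;
  if (semi_sync_session)
    out.push_back(need_reply ? 1 : 0);          // the one-byte extra header
  out.insert(out.end(), event.begin(), event.end());
  return out;
}

// Slave side: strip the header and decide whether to send an acknowledgment.
bool slave_should_reply(std::vector<uint8_t>& event, bool semi_sync_session) {
  if (!semi_sync_session || event.empty()) return false;
  const bool reply = (event.front() == 1);
  event.erase(event.begin());                   // remove the extra header byte
  return reply;
}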

Work on the master database side

We will create a search tree that records all waiting transactions. The tree is keyed on (binlog_filename, binlog_pos). At transaction commit time, after all transaction events have been written into the binlog file, we insert (binlog_filename, binlog_pos) into the search tree. The purpose of the search tree is to let the replication thread recognize the currently waiting transactions. When a transaction stops waiting for the acknowledgment of its binlog events, the transaction's position is removed from the tree.

The replication thread reads a binlog event from the file and probes the binlog position in the search tree. Depending on whether the position is in the search tree, the replication thread sets the one-byte extra header before sending the event.
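
A minimal sketch of the waiting-transaction tree, assuming std::set as the search tree; the real structure and names in the server differ.

#include <cstdint>
#include <set>
#include <string>
#include <utility>

// A waiting transaction is identified by (binlog_filename, binlog_pos).
using BinlogPos = std::pair<std::string, uint64_t>;

struct WaitTree {
  std::set<BinlogPos> waiting;   // transactions blocked in commit

  // Called by a committing session after its events are in the binlog.
  void add_waiter(const BinlogPos& pos) { waiting.insert(pos); }

  // Called when the session stops waiting (acknowledged or timed out).
  void remove_waiter(const BinlogPos& pos) { waiting.erase(pos); }

  // Called by the replication thread before sending an event: is any session
  // waiting for a position at or before the position about to be sent?
  bool reply_needed(const BinlogPos& send_pos) const {
    auto it = waiting.begin();                  // smallest waiting position
    return it != waiting.end() && *it <= send_pos;
  }
};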

Work on the slave database side

If a slave database connects in semi-synchronous replication mode, it checks the one-byte header to decide whether to reply to the replication event. Otherwise, it works as before.

Currently, the master database uses one mutex LOCK_log to synchronize all operations on the binlog:
  • a transaction acquires LOCK_log before writing transaction events to a binlog
  • the transaction releases LOCK_log after committing and flushing the binlog to the file system
  • replication thread acquires LOCK_log before reading each event and releases the lock afterwards

In semi-synchronous replication, we are planning to add one mutex and one condition variable:
  • innobase_repl_semi_cond: this variable is signaled when enough binlog has been sent to slave, so that a waiting transaction can return the 'ok' message to the client for a commit
  • innobase_repl_semi_cond_mutex: the mutex that is associated with the above condition variable

Code flow for each MySQL session during transaction commit
  • write all binlog events, append the transaction-end event and flush the file to the filesystem
  • commit the transaction inside InnoDB
  • acquire innobase_repl_semi_cond_mutex
  • while true:
    • if semi-synchronous replication has been disabled by timeout:
      • update the asynchronous transaction counter
      • release innobase_repl_semi_cond_mutex and return from the commit
    • check the current binlog sending status
    • if the binlog sending status is ahead of my transaction's waiting position
      • release innobase_repl_semi_cond_mutex and return from the commit
    • set my binlog waiting position to my committed transaction position
    • wait for innobase_repl_semi_cond with a timeout
    • if a timeout occurs while waiting on innobase_repl_semi_cond, or if semi-synchronous replication is disabled after wake-up
      • print the error message
      • update failed timeout counter
      • disable the semi-synchronous replication until the replication thread enables it again
      • release innobase_repl_semi_cond_mutex and return from the commit
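
A minimal sketch of the commit-side wait, using std::mutex and std::condition_variable in place of innobase_repl_semi_cond_mutex and innobase_repl_semi_cond; the member names and the single acknowledged position are illustrative only.

#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct SemiSyncMaster {
  std::mutex mu;                      // stands in for innobase_repl_semi_cond_mutex
  std::condition_variable cond;       // stands in for innobase_repl_semi_cond
  bool enabled = true;                // cleared after a timeout
  uint64_t acked_pos = 0;             // highest binlog position acked by a slave
  std::chrono::milliseconds timeout{1000};

  // Called by a session after committing inside InnoDB.
  void wait_after_commit(uint64_t my_pos) {
    std::unique_lock<std::mutex> lock(mu);
    while (true) {
      if (!enabled) return;            // fall back to asynchronous behavior
      if (acked_pos >= my_pos) return; // a slave already has my events
      if (cond.wait_for(lock, timeout) == std::cv_status::timeout) {
        enabled = false;               // disable until a replication thread re-enables it
        return;
      }
    }
  }
};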


Code flow for replication thread

This is the work done by the replication thread when sending binlog events to support the semi-synchronous protocol.
  • if this replication connection is not a semi-synchronous target, then do nothing and simply return
  • if the most recently sent event is NOT a transaction-end event, then do nothing and simply return
  • wait for the confirmation from the slave database with a network timeout
  • remember whether network timeout occurs
  • acquire innobase_repl_semi_cond_mutex
  • if the network timeout occurs:
    • update failed timeout counter
    • disable the semi-synchronous replication until the replication thread enables it again
    • release innobase_repl_semi_cond_mutex and return
  • if the semi-synchronous replication is disabled, then enable the semi-synchronous replication again
  • check whether any session is waiting for the current sending position
  • if there exist such sessions, wake them up through innobase_repl_semi_cond
  • release innobase_repl_semi_cond_mutex and return

The single mutex/condition variable creates one synchronization point because every committing transaction needs to wait on innobase_repl_semi_cond. When the replication thread wakes up innobase_repl_semi_cond, it has to use broadcast. This might be changed in the future if there are performance issues around the single mutex wait.
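
And a companion sketch for the replication-thread side, again with standard-library primitives standing in for the InnoDB ones: after waiting for the slave's reply, the thread updates the shared state under the mutex and broadcasts to every waiting session.

#include <condition_variable>
#include <cstdint>
#include <mutex>

struct SemiSyncAckState {
  std::mutex mu;
  std::condition_variable cond;
  bool enabled = true;
  uint64_t acked_pos = 0;

  // Called by the replication thread after waiting for a reply from the slave.
  void on_reply(bool net_timeout, uint64_t sent_pos) {
    std::lock_guard<std::mutex> lock(mu);
    if (net_timeout) {
      enabled = false;                 // disable semi-sync until we catch up again
      return;
    }
    enabled = true;                    // re-enable if it had been disabled
    if (sent_pos > acked_pos) acked_pos = sent_pos;
    cond.notify_all();                 // broadcast: wake every waiting session
  }
};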

Code flow for replication I/O thread connection to the primary database

When a replica connects to the primary database, it is an opportunity for the primary database to understand the replica's progress. Based on the progress, the primary database will adjust semi-synchronous replication's progress. If the replica is too far behind, semi-synchronous replication might be suspended until the replica is fully caught up.

If there is only one semi-synchronous target, meaning just one thread is sending the binlog to the slave for which we want synchronous replication, then the replication position should increase monotonically. However, we want to have more than one semi-synchronous replica target to increase the primary database's transaction availability. In that sense, a replica that falls behind should not affect the status on the primary if other replicas are caught up.

Network group commit

Replication threads can do group commit to minimize network overhead. When the thread finds that the current event is a transaction-end event, it does not have to request a reply from the slave database immediately. Instead, it can look at the tail of the binlog file to check whether there are more transactions, or it can wait for a while before making that check. If there are more transactions in the file, the replication thread can send all waiting transactions and wait for only one reply. In effect, we are doing group commit on the network.

The benefit is that we can reduce network round trip by batching transaction replies. However, it also reduces the reliability of semi-synchronous replication. If we acknowledge each transaction, we can only lose at most one transaction during failure. If we do group commit, we might lose all transactions in the batch. We need to trade off between performance and reliability.
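
A tiny sketch of the batching idea with illustrative names: send every transaction that is already sitting in the binlog and request a single reply for the last one.

#include <cstdint>
#include <vector>

struct PendingTxn { uint64_t end_pos; };   // binlog position of a transaction-end event

// Returns the position for which one reply should be requested, or 0 if there
// is nothing to send. The actual event sending is omitted.
uint64_t send_batch(const std::vector<PendingTxn>& ready_in_binlog) {
  uint64_t reply_pos = 0;
  for (const PendingTxn& txn : ready_in_binlog) {
    // send the events for txn here; only the last transaction asks for a reply
    reply_pos = txn.end_pos;
  }
  return reply_pos;   // one network round trip acknowledges the whole batch
}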

Historical - SHOW STATUS changes

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

MySQL circa 2008 was hard to monitor so we added many things to SHOW STATUS and SHOW INNODB STATUS along with support for user, table and index statistics.

I added a counter for failures of calls to gettimeofday. That used to be a thing. We also changed mysqld to catch cross-socket differences in hardware clocks on old AMD motherboards. Fun times.

Overview

We have added extra values for monitoring. Much of the data from SHOW INNODB STATUS is now available in SHOW STATUS.

We have also added rate limiting for both SHOW STATUS and SHOW INNODB STATUS to reduce the overhead from overzealous monitoring tools. This limits how frequently the expensive operations are done for these SHOW commands.
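
The caching idea looks roughly like the sketch below; the class and helper are hypothetical, not the server's actual code.

#include <chrono>
#include <mutex>
#include <string>

class CachedStatus {
 public:
  explicit CachedStatus(std::chrono::seconds interval) : interval_(interval) {}

  // SHOW STATUS / SHOW INNODB STATUS call this; the expensive work runs at
  // most once per interval, otherwise the cached copy is returned.
  std::string get() {
    std::lock_guard<std::mutex> lock(mu_);
    const auto now = std::chrono::steady_clock::now();
    if (cached_.empty() || now - last_refresh_ >= interval_) {
      cached_ = compute_expensive_status();   // walk threads, lock hot mutexes, ...
      last_refresh_ = now;
    }
    return cached_;
  }

 private:
  std::string compute_expensive_status() { return "status snapshot"; }

  std::mutex mu_;
  std::chrono::seconds interval_;
  std::chrono::steady_clock::time_point last_refresh_{};
  std::string cached_;
};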

Changes

General
  • Binlog_events - number of replication events written to the binlog
  • Binlog_largest_event - largest event in the current binlog
  • Denied_connections - number of connection attempts that fail because of the max_connections limit
  • Malloc_sbrk_bytes_alloc, Malloc_chunks_free, Malloc_mmap_chunks_alloc, Malloc_mmap_bytes_alloc, Malloc_bytes_used, Malloc_bytes_free - values reported from mallinfo()
  • Gettimeofday_errors - errors for gettimeofday calls (yes, this happens)
  • Sort_filesort_old - number of times the old filesort algorithm is used
  • Sort_filesort_new - number of times the new filesort algorithm is used

Replication
  • Replication_fail_io_connections - on a slave, number of times the IO thread has disconnected from the master because of an error
  • Replication_total_io_connections - number of connections made by the IO thread to the master
  • Replication_last_event_buffered - on a slave, time when last replication event received
  • Replication_last_event_done - on a slave, time when last replication event replayed

Semi-synchronous replication
  • Rpl_semi_sync_clients - number of semi-sync clients connected to a master
  • Rpl_semi_sync_net_avg_wait_time(us) - average time to wait for an acknowledgement of a replication event from a semi-sync slave
  • Rpl_semi_sync_net_wait_time - total time waiting for acknowledgement
  • Rpl_semi_sync_net_waits
  • Rpl_semi_sync_no_times  
  • Rpl_semi_sync_no_tx - number of transactions not acknowledged by semi-sync slaves
  • Rpl_semi_sync_status - indicates whether semi-sync is enabled
  • Rpl_semi_sync_slave_status 
  • Rpl_semi_sync_timefunc_failures
  • Rpl_semi_sync_tx_avg_wait_time(us) - average time a sessions waits for commit to finish
  • Rpl_semi_sync_tx_wait_time
  • Rpl_semi_sync_tx_waits
  • Rpl_semi_sync_wait_pos_backtraverse
  • Rpl_semi_sync_wait_sessions
  • Rpl_semi_sync_yes_tx - number of transactions acknowledged by semi-sync slaves
  • Rpl_transaction_support

Innodb
  • Innodb_dict_size - number of bytes used for the InnoDB dictionary
  • Innodb_have_atomic_builtins - indicates whether InnoDB uses atomic memory operations in place of pthreads synchronization functions
  • Innodb_heap_enabled - indicates  whether the InnoDB malloc heap was enabled -- see bug 38531
  • Innodb_long_lock_wait - set when there is a long lock wait on an internal lock. These usually indicate an InnoDB bug. They also occur because the adaptive hash latch is not always released when it should be (such as during an external sort).
  • Innodb_long_lock_waits - incremented once for each internal long lock wait
  • Innodb_os_read_requests - from SHOW INNODB STATUS
  • Innodb_os_write_requests - from SHOW INNODB STATUS
  • Innodb_os_pages_read - from SHOW INNODB STATUS
  • Innodb_os_pages_written - from SHOW INNODB STATUS
  • Innodb_os_read_time - from SHOW INNODB STATUS
  • Innodb_os_write_time - from SHOW INNODB STATUS
  • Innodb_time_per_read - average microseconds per read
  • Innodb_time_per_write - average microseconds per write
  • Innodb_deadlocks - application deadlocks, detected automatically
  • Innodb_transaction_count - from SHOW INNODB STATUS
  • Innodb_transaction_purge_count - from SHOW INNODB STATUS
  • Innodb_transaction_purge_lag - count of work to be done by the InnoDB purge thread, see this post
  • Innodb_active_transactions - from SHOW INNODB STATUS
  • Innodb_summed_transaction_age - from SHOW INNODB STATUS
  • Innodb_longest_transaction_age - from SHOW INNODB STATUS
  • Innodb_lock_wait_timeouts - count of lock wait timeouts
  • Innodb_lock_waiters - from SHOW INNODB STATUS
  • Innodb_summed_lock_wait_time - from SHOW INNODB STATUS
  • Innodb_longest_lock_wait - from SHOW INNODB STATUS
  • Innodb_pending_normal_aio_reads - from SHOW INNODB STATUS
  • Innodb_pending_normal_aio_writes - from SHOW INNODB STATUS
  • Innodb_pending_ibuf_aio_reads - from SHOW INNODB STATUS
  • Innodb_pending_log_ios - from SHOW INNODB STATUS
  • Innodb_pending_sync_ios - from SHOW INNODB STATUS
  • Innodb_os_reads - from SHOW INNODB STATUS
  • Innodb_os_writes - from SHOW INNODB STATUS
  • Innodb_os_fsyncs - from SHOW INNODB STATUS
  • Innodb_ibuf_inserts - from SHOW INNODB STATUS
  • Innodb_ibuf_size - counts work to be done by the insert buffer, see here
  • Innodb_ibuf_merged_recs - from SHOW INNODB STATUS
  • Innodb_ibuf_merges - from SHOW INNODB STATUS
  • Innodb_log_ios_done - from SHOW INNODB STATUS
  • Innodb_buffer_pool_hit_rate - from SHOW INNODB STATUS

Historical - SHOW INNODB STATUS

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

MySQL circa 2008 was hard to monitor so we added many things to SHOW STATUS and SHOW INNODB STATUS along with support for user, table and index statistics. Most of the changes we made to SHOW INNODB STATUS are not listed here. I am not sure whether I ever described them. The most important changes were:
  • list transactions last in the output in case the output was too long and truncated by InnoDB
  • report average and worst-case IO latencies

Introduction

We have added more output to SHOW INNODB STATUS, reordered the output so that the list of transactions is printed last, and increased the maximum size of the output that may be returned.

Background threads:
  • srv_master_thread_loops - counts work done by main background thread
  • spinlock delay displays the number of milliseconds that the spinlock will spin before going to sleep
  • fsync callers displays the source of calls to fsync()
----------
BACKGROUND THREAD
----------
srv_master_thread loops: 28488 1_second, 28487 sleeps, 2730 10_second, 1182 background, 761 flush
srv_master_thread log flush: 29146 sync, 2982 async
srv_wait_thread_mics 0 microseconds, 0.0 seconds
spinlock delay for 5 delay 20 rounds is 5 mics
fsync callers: 1034231 buffer pool, 39227 other, 73053 checkpoint, 10737 log aio, 80994 log sync, 0 archive
Semaphores

New output includes:
  • lock wait timeouts counter
  • number of spinlock rounds per OS wait for a mutex
----------
SEMAPHORES
----------
Lock wait timeouts 0
...
Spin rounds per wait: 2.90 mutex, 1.27 RW-shared, 3.04 RW-excl

Disk IO

New output includes:
  • number of pages read/written
  • number of read/write system calls used to read/write those pages
  • time in milliseconds to complete the IO requests
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread) reads 24 writes 0 requests 14 io secs 0.033997 io msecs/request 2.428357 max_io_wait 18.598000
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 10737 requests 10737 io secs 30.626824 io msecs/request 2.852456 max_io_wait 710.588000
I/O thread 2 state: waiting for i/o request (read thread) reads 136659 writes 0 requests 118296 io secs 3093.412099 io msecs/request 26.149761 max_io_wait 2631.029000
I/O thread 3 state: waiting for i/o request (read thread) reads 91262 writes 0 requests 71709 io secs 1900.155508 io msecs/request 26.498145 max_io_wait 1626.209000
I/O thread 6 state: waiting for i/o request (write thread) reads 0 writes 1847360 requests 7065434 io secs 1063.904923 io msecs/request 0.150579 max_io_wait 2569.244000

This is from another post

There are more details on InnoDB status in the output from SHOW INNODB STATUS and SHOW STATUS.

New details for SHOW INNODB STATUS include:
  • frequency at which the main background IO thread runs
  • IO latency for each background IO thread
  • per-file IO statistics
  • insert buffer prefetch reads
  • statistics on checkpoint related IO
  • statistics on prefetches
  • statistics on sources of background IO

Main background IO thread

This includes:

  • srv_master_thread loops - number of iterations of the main background loop including the tasks per second (1_second) and the tasks per 10 seconds (10_second).
  • Seconds in background IO thread: number of seconds performing different background IO tasks

BACKGROUND THREAD
----------
srv_master_thread loops: 1623 1_second, 1623 sleeps, 162 10_second, 1 background, 1 flush
srv_master_thread log flush: 1785 sync, 1 async
srv_wait_thread_mics 0 microseconds, 0.0 seconds
spinlock delay for 5 delay 20 rounds is 2 mics
Seconds in background IO thread: 5.10 insert buffer, 49.02 buffer pool, 0.00 adaptive checkpoint, 52.34 purge
fsync callers: 0 buffer pool, 189 other, 1323 checkpoint, 263 log aio, 5179 log sync, 0 archive

Background IO thread statistics

This includes:
  • reads, writes - number of pages read and written
  • requests - number of pwrite/pread system calls. There may be fewer of these than reads and writes because of request merging.
  • msecs/r - average number of milliseconds per *request*. For the *io:* section this is the time for the pwrite/pread system call. For the *svc:* section this is the time from when the page is submitted to the background thread until it is completed.
  • secs - total seconds for all pread/pwrite calls
  • old - number of pages for which the service time is greater than 2 seconds
  • Sync reads, Sync writes - IO operations done synchronously. These share code with the background IO threads, but the IO calls are done directly rather than being put in the request array and handled by a background IO thread.

FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread) reads 2177 writes 0 io: requests 125 secs 2.99 msecs/r 23.90 max msecs 82.07 svc: 106.58 msecs/r 48.96 max msecs 128.76 old 0
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 263 io: requests 263 secs 0.13 msecs/r 0.49 max msecs 30.43 svc: secs 0.14 msecs/r 0.54 max msecs 30.48 old 0
I/O thread 2 state: doing file i/o (read thread) reads 116513 writes 0 io: requests 35777 secs 564.96 msecs/r 15.79 max msecs 251.04 svc: secs 7643.21 msecs/r 65.60 max msecs 2492.18 old 111 ev set
I/O thread 6 state: waiting for i/o request (write thread) reads 0 writes 391586 io: requests 256597 secs 1169.16 msecs/r 4.56 max msecs 336.70 svc: secs 104498.79 msecs/r 266.86 max msecs 3001.04 old 169
Sync reads: requests 10126259, pages 10126278, bytes 165912465408, seconds 171656.02, msecs/r 16.95
Sync writes: requests 2849234, pages 3029512, bytes 11289789952, seconds 77.81, msecs/r 0.03

File IO statistics

This includes statistics per file. It is much more useful when InnoDB is run with innodb_file_per_table. The first two columns are the tablespace name and tablespace ID. There are separate sections for reads and writes per file:
  • pages - number of pages read or written
  • requests - number of pwrite/pread system calls. There may be fewer of these than reads and writes because of request merging
  • msecs/r - average number of milliseconds per request
  • secs - total seconds for all pread/pwrite calls

File IO statistics
  ./test/warehouse.ibd 10 -- read: 3 requests, 3 pages, 0.01 secs, 4.36 msecs/r, write: 30 requests, 30 pages, 0.11 secs, 3.70 msecs/r
  ./ibdata1 0 -- read: 1123 requests, 3349 pages, 22.97 secs, 20.46 msecs/r, write: 2662 requests, 86526 pages, 32.86 secs, 12.34 msecs/r
  ./test/orders.ibd 29 -- read: 26301 requests, 28759 pages, 450.63 secs, 17.13 msecs/r, write: 82089 requests, 101564 pages, 425.44 secs, 5.18 msecs/r
  ./test/customer.ibd 28 -- read: 333186 requests, 338048 pages, 5955.39 secs, 17.87 msecs/r, write: 185378 requests, 200494 pages, 883.61 secs, 4.77 msecs/r
  ./test/stock.ibd 27 -- read: 902675 requests, 1179864 pages, 16036.91 secs, 17.77 msecs/r, write: 577970 requests, 790063 pages, 2473.27 secs, 4.28 msecs/r
  ./test/order_line.ibd 25 -- read: 74232 requests, 92644 pages, 1217.65 secs, 16.40 msecs/r, write: 141432 requests, 274155 pages, 643.97 secs, 4.55 msecs/r
  ./test/new_orders.ibd 22 -- read: 4642 requests, 4960 pages, 81.02 secs, 17.45 msecs/r, write: 11482 requests, 60368 pages, 103.86 secs, 9.05 msecs/r
  ./test/history.ibd 21 -- read: 8006 requests, 11323 pages, 123.86 secs, 15.47 msecs/r, write: 24640 requests, 52809 pages, 119.01 secs, 4.83 msecs/r
  ./test/district.ibd 18 -- read: 14 requests, 14 pages, 0.14 secs, 10.35 msecs/r, write: 39 requests, 249 pages, 0.43 secs, 10.96 msecs/r
  ./test/item.ibd 16 -- read: 2892 requests, 3033 pages, 51.96 secs, 17.97 msecs/r, write: 0 requests, 0 pages, 0.00 secs, 0.00 msecs/r
  ./ib_logfile0 4294967280 -- read: 6 requests, 9 pages, 0.00 secs, 0.02 msecs/r, write: 314701 requests, 316680 pages, 6.73 secs, 0.02 msecs/w

Insert Buffer Statistics

New output includes:
  • Ibuf read pages - number of requested and actual prefetch reads done to merge insert buffer records. InnoDB chooses entries to merge at random. If the number requested is much higher than the actual number, then the random algorithm is inefficient.
  • Ibuf merge - rate at which work is done for the insert buffer

INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 3776, free list len 1895, seg size 5672,
984975 inserts, 454561 merged recs, 58782 merges
Ibuf read pages: 32960 requested, 36280 actual
Ibuf merge: 1229.9 requested_io/s, 39269.4 records_in/s, 19639.5 records_out/s, 2210.6 page_reads/s

Log Statistics

This includes:
  • Foreground (Background) page flushes - page flushes done synchronously (asynchronously) by user sessions to maintain a small number of clean buffer pool pages. 

LOG
---
Foreground page flushes:  sync 0 async 0
Background adaptive page flushes: 0
Foreground flush margins: sync 3025130459 async 2823455095
Space to flush margin:     sync 3000381113 async 2798705749
Current_LSN - Min_LSN     24749346
Checkpoint age            25858470
Max checkpoint age        3226805822

Buffer Pool Statistics

This includes:
  • LRU_old pages - number of *old* pages on the LRU list
  • Total writes - sources of pages for dirty page writes
  • Write sources - callers from which dirty page write requests are submitted
  • Foreground flushed dirty - number of dirty page writes submitted from a user session because the main background IO thread was too slow
  • Read ahead - number of prefetch read requests submitted because random or sequential access to an extent was detected
  • Pct_dirty - percent of pages in the buffer pool that are dirty

BUFFER POOL AND MEMORY
----------------------
LRU_old pages      48109
Total writes: LRU 23271, flush list 1491363, single page 0
Write sources: free margin 23271, bg dirty 552374, bg lsn 0, bg extra 2742, recv 0, preflush 0
Foreground flushed dirty 935903
Read ahead: 44312 random, 355282 sequential
Pct_dirty 25.83

Historical - changes to my.cnf

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

TODO - find the linked pages including:
  • MysqlHttp - we added an http server to mysql for exporting monitoring. This was work by Nick Burrett
  • InnodbAsyncIo - this explains perf improvements we made for InnoDB
  • InnoDbIoTuning - explains more perf improvements we made for InnoDB

We added these options:
  • http_enable - start the embedded HTTP daemon when ON, see MysqlHttp
  • http_port - port on which HTTP listens, see MysqlHttp
  • innodb_max_merged_io - max number of IO requests merged into one large request by a background IO thread
  • innodb_read_io_threads, innodb_write_io_threads - number of background IO threads for prefetch reads and dirty page writes, see InnodbAsyncIo
  • show_command_compatible_mysql4 - make output from some SHOW commands match that used by MySQL4
  • show_default_global - make SHOW STATUS use global statistics
  • global_status_update_interval - the interval at which per-thread stats are read for SHOW STATUS. When SHOW STATUS is run more frequently cached values are used rather than locking and reading data from each thread.
  • google_profile[=name] - enable profiling using Google Perftools and write output to this file. Server must have been compiled to use Google Perftools.
  • equality_propagation - enables use of equality propagation in the optimizer because the overhead was too much in a few releases (bug filed & fixed)
  • trim_trailing_blanks - trim trailing blanks on varchar fields when set
  • allow_view_trigger_sp_subquery - allow use of views, triggers, stored procedures and subqueries when set
  • allow_delayed_write - allow use of delayed insert and replace statements
  • local-infile-needs-file - LOAD DATA LOCAL INFILE requires the FILE privilege when set  
  • audit_log[=name] - log logins, queries against specified tables, and startup
  • audit_log_tables=name - log queries that use these tables to the audit log (comma separated)
  • log_root - log DML done by users with the SUPER privilege
  • repl_port[=#] - extra port on which mysqld listens for connections from users with SUPER and replication privileges
  • rpl_always_reconnect_on_error - slave IO thread always tries to reconnect on error when set
  • rpl_always_enter_innodb - slave SQL thread always enters InnoDB when set, regardless of the InnoDB concurrency ticket count
  • rpl_event_buffer_size=# - size of the per-connection buffer used on the master to copy events to a slave. Avoids allocating/deallocating a buffer for each event.
  • reserved_super_connections=# - number of reserved connections for users with SUPER privileges.
  • rpl_always_begin_event - always add a BEGIN event at the beginning of each transaction block written to the binlog. This fixes a bug.
  • rpl_semi_sync_enabled - enable semisync replication on a master
  • rpl_semi_sync_slave_enabled - enable semisync replication on a slave
  • rpl_semi_sync_timeout - timeout in milliseconds for semisync replication in the master
  • rpl_semi_sync_trace_level - trace level for debugging for semisync replication
  • rpl_transaction_enabled - use transactional replication on a slave
  • innodb_crash_if_init_fails - crash if InnoDB initialization fails
  • innodb_io_capacity - number of disk IOPs the server can do, see InnodbIoTuning
  • innodb_extra_dirty_writes - flush dirty buffer pages when dirty pct is less than max dirty pct
  • connect_must_have_super - only connections with SUPER_ACL, REPL_SLAVE_ACL or REPL_CLIENT_ACL are accepted (yes, this is dynamic)
  • readonly_databases - prevents writes to any DB except for mysql
  • readonly_mysql - prevents writes to the mysql DB
  • fixup_binlog_end_pos - fix for MySQL bug 23171 which updates the end_log_pos of binlog events as they are written to the binlog
  • log_slave_connects - log connect and disconnect messages for replication slaves
  • mapped_users - use the mapped_user table to map users to roles
  • xa_enabled - enable support for XA transactions (I like to disable this)

Historical - Adding Roles to MySQL

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

I added support for roles to MySQL circa 2008. They arrived upstream with MySQL 8 in 2018. I wasn't able to wait. I enjoyed the project more than expected. It wasn't hard in terms of algorithms or performance but I had to avoid mistakes to avoid security bugs and the upstream code was well written. I had a similar experience implementing BINARY_FLOAT and BINARY_DOUBLE at Oracle. There I got to learn about the IEEE754 standard and had to go out of my way to catch all of the corner cases. Plus I enjoyed working with Minghui Yang who did the PL/SQL part of it.

MySQL roles and mapped users

The access control model in MySQL does not scale for a deployment with thousands of accounts and thousands of tables. The problems are that similar privileges are specified for many accounts and that the only way to limit an account from accessing a table is to grant privileges at the table or column level in which case the mysql.user table has millions of entries.

Privileges may be associated once with a role, and then many accounts may be mapped to that role. When many accounts have the same privileges, this avoids the need to specify the privileges for each account.

We have implemented mapped users in the MySQL access control model. These are used to simulate roles and solve one of these problems. A mapped user provides authentication credentials and is mapped to a _role_ for access control. A new table, mysql.mapped_user, has been added to define mapped users. Entries in an existing table, mysql.user, are reused for roles when there are entries from mysql.mapped_user that reference them.

To avoid confusion:
  • mapped user - one row in mysql.mapped_user
  • role - one row in mysql.user referenced by at least one row in mysql.mapped_user

This provides several features:
  • multiple passwords per account
  • manual password expiration
  • roles
  • transparent to users (mysql -uuser -ppassword works regardless of whether authentication is done using entries in mysql.mapped_user or mysql.user)

Use Case

Create a role account in mysql.user. Create thousands of private accounts in mysql.mapped_user that map to the role. By map to I mean that the value of mysql.mapped_user.Role is the account name for the role.

Implementation

Authentication in MySQL is implemented using the _mysql.user_ table. mysqld sorts these entries and when a connection is attempted, the first entry in the sorted list that matches the account name and hostname/IP of the client is used for authentication. A challenge response protocol is done using the password hash for that entry.
A new table is added to support mapped users. This table does not have columns for privileges. Instead, each row references an account name from mysql.user that provides the privileges. The new table has a subset of the columns from mysql.user:
  • User - the name for this mapped user
  • Role - the name of the account in mysql.user from which this account gets its privileges
  • Password - the password hash for authenticating a connection
  • PasswordChanged - the timestamp when this entry was last updated or created. This is intended to support manual password expiration via a script that deletes all entries where PasswordChanged less than the cutoff.
  • ssl_type, ssl_cipher, x509_issuer, x509_subject - values for SSL authentication, note that code has yet to be added in the server to handle these values

DDL for the new table:
CREATE TABLE mapped_user (
  User char(16) binary DEFAULT '' NOT NULL,
  Role char(16) binary DEFAULT '' NOT NULL,
  Password char(41) character set latin1 collate latin1_bin DEFAULT '' NOT NULL,
  PasswordChanged Timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP NOT NULL,
  ssl_type enum('','ANY','X509','SPECIFIED') character set utf8 NOT NULL default '',
  ssl_cipher blob NOT NULL,
  x509_issuer blob NOT NULL,
  x509_subject blob NOT NULL,
  PRIMARY KEY (User, Role, Password)
) engine=MyISAM
CHARACTER SET utf8 COLLATE utf8_bin
comment='Mapped users';

Authentication

Entries from mysql.mapped_user are used to authenticate connection attempts only when authentication fails with entries in mysql.user. The failure may have occurred because there was no entry in mysql.user for the user/host or because the password was wrong. If authentication succeeds using an entry in mysql.mapped_user, the mysql.mapped_user.Role column in that entry and the client's hostname/IP are used to search mysql.user for a matching entry. And if one is found, that entry provides the privileges for the connection. By provides the privileges I mean that:
  • the values of mysql.user.User and mysql.user.Host are used to search the other privilege tables
  • the global privileges stored in mysql.user for the matching entry are used

The mysql.mapped_user table supports multiple passwords per account. When a user tries to create a connection with a username that is in the mysql.mapped_user table and there are multiple entries with a matching value in mysql.mapped_user.User, then authentication is attempted for one entry at a time using the password hash in mysql.mapped_user.Password until authentication succeeds or there are no more entries. Note that the order in which the entries from mysql.mapped_user are checked is *not* defined, but this is only an issue when there are entries in mysql.mapped_user with the same value for _User_ and different values for _Role_ and that deployment model should not be used. Also note that this does not require additional RPCs during client authentication.
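
A rough C++ sketch of that lookup order, with hypothetical types and a placeholder hash check (the real code uses MySQL's challenge/response scramble):

#include <optional>
#include <string>
#include <vector>

struct MappedUserRow { std::string user, role, password_hash; };

// Stand-in for MySQL's scramble check; the real comparison is not a string equality.
static bool check_password(const std::string& client_reply, const std::string& hash) {
  return client_reply == hash;   // placeholder only
}

// Try each mysql.mapped_user row with a matching User until one password
// verifies; the matched row's Role names the mysql.user account that supplies
// the privileges. Returns the role, or nothing if authentication fails.
std::optional<std::string> authenticate_mapped_user(
    const std::string& login_user, const std::string& client_reply,
    const std::vector<MappedUserRow>& mapped_users) {
  for (const MappedUserRow& row : mapped_users) {
    if (row.user != login_user) continue;
    if (row.user.empty() || row.role.empty() || row.password_hash.empty())
      continue;                                  // ignored entries (see below)
    if (check_password(client_reply, row.password_hash))
      return row.role;
  }
  return std::nullopt;                           // fall through to access denied
}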

Entries are ignored from mysql.mapped_user when:
  • Role is the empty string
  • User is the empty string
  • Password is the empty string

There is no constraint between the values in mysql.mapped_user.Role and mysql.user.User. Thus, a bogus mapping (Role references an account that does not exist in mysql.user) can be created. In that case, the entry in mysql.mapped_user cannot be used to create connections and will get access denied errors.

There is a primary key index on mysql.mapped_user, but that is not sufficient to enforce all of the integrity constraints that are needed. Entries with the same values for User and Role but different passwords are allowed, and the primary key forces the password to be different. Entries with the same value for User but different values for _Role_ should not be allowed. However, this can only be enforced with a check constraint on the table and MySQL does not enforce check constraints. We can write a tool to find such entries.

SQL Interfaces

Roles can be added via the _create mapped user_ command that is similar to create user but extended to support options for SSL connections. Roles can be dropped by the drop mapped user command that is similar to drop user. These commands update internal data structures and update the mysql.mapped_user table. There is no need to run flush privileges with these commands.

The following have been changed to print the value of mysql.mapped_user.User rather than the value of mysql.user.User when a role is used to create a connection.
  • error messages related to access control
  • select current_user()
  • select user()
  • show user_statistics
  • show processlist

The output of show grants has not been changed and will display the privileges for the role (the entry in _mysql.user_).

_set password = password(STRING)_ fails for accounts that use a role. The only way to change a password for an entry in mysql.mapped_user is by an insert statement.

show processlist with roles displays the role for connections from mapped users rather than the mapped user name. show processlist displays the value from mysql.mapped_user.

show user_statistics with roles displays statistics aggregated by role for connections from mapped users. show user_statistics displays values aggregated by the value from mysql.mapped_user.

Mapped users can be created by inserting into mysql.mapped_user and then running FLUSH PRIVILEGES. They are also created by the _create mapped user_ command. An example is create mapped user mapped_readonly identified by 'password' role readonly.

Mapped users can be dropped by deleting from mysql.mapped_user and then running FLUSH PRIVILEGES. They are also dropped by the _drop mapped user_ command. An example is *drop mapped user foo*. This drops all entries from mysql.mapped_user with that user name. A delete statement must be used to drop an entry matching either (username, role) or (username, role, password).

select user() displays the value of the mapped user name when connected as a mapped user. select current_user() displays the value of the role when connected as a mapped user. This is done because current_user() is defined to return the name of the account used for access control.

make user delayed is done on the value of the account name. It does not matter whether the account is listed in mysql.user or mysql.mapped_user.

mysql.mapped_user does not have columns for resource limits such as max connections and max QPS. Limits are enforced per role.

This feature is only supported when the configuration variable mapped_users is used (add to /etc/my.cnf). This feature is disabled by default. Also, the mysql.mapped_user table must exist. This table does not exist in our current deployment. It must be created before the feature is enabled. The scripts provided by MySQL to create the system databases will create the table, but we do not use those scripts frequently.

The value of the mysql.user.Host column applies to any mapped users trying to create a connection. This can be used to restrict clients to connect from prod or corp hosts.

Open Requests
  • Add a unique index on (User, Password)
  • Add an email column to mysql.mapped_user
  • Inherit limits (hostname/IP address from which connections are allowed, connection limits, max queries per minute limit) from the mysql.user table.
  • Implement support for SSL -- the mysql.mapped_user table has columns for SSL authentication. Code has not been added to the server to handle them.


Historical - Make User Delayed

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

I added support to rate limit DBMS accounts that were too busy. It wasn't successful in production for the obvious reason that it just shifts the convoy from the database to the app server -- the problem still exists. The better solution is to fix the application or improve DBMS capacity but that takes time.

This describes SQL commands added to rate limit queries per account and per client IP.

Per account rate limiting

Per-account query delays use new SQL commands to set a query delay for an account. The delay is the number of milliseconds to sleep before running a SQL statement for the account. These values are transient and all reset to zero delay on server restart. The values are set by the command MAKE USER 'user' DELAYED 100 where the literals user and 100 are the account and number of milliseconds to sleep. There is no delay when the value is 0. The values are displayed by the command SHOW DELAYED USER.

MySQL had a feature to limit the number of queries per hour for an account. This is done by setting the _user.max_questions_ column for the account. We have changed this to be the max queries per minute so that when an account reaches the limit, it doesn't have to wait for an hour for the reset.

These don't change the behavior for existing connections. There must be a reconnect to get the new values.

Per client IP rate limiting

Per-client rate limiting is done by the command MAKE CLIENT 'IP-address' DELAYED 100 where the literal IP-address is the exact match for the client IP that should be delayed and 100 is the number of milliseconds to delay each statement. The delays are displayed by the command SHOW DELAYED CLIENT.
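
A minimal sketch of the bookkeeping behind both commands, with illustrative names: a map from account name or client IP to a delay in milliseconds, consulted before each statement runs.

#include <chrono>
#include <map>
#include <string>
#include <thread>

struct StatementDelays {
  std::map<std::string, int> delay_ms;   // keyed by account name or client IP

  // MAKE USER 'user' DELAYED 100 or MAKE CLIENT 'IP' DELAYED 100
  void set_delay(const std::string& key, int ms) { delay_ms[key] = ms; }

  // Called before running a SQL statement for this account or client IP.
  void maybe_sleep(const std::string& key) const {
    const auto it = delay_ms.find(key);
    if (it != delay_ms.end() && it->second > 0)
      std::this_thread::sleep_for(std::chrono::milliseconds(it->second));
  }
};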

Historical - Patch for MySQL 5.0

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

This describes the patch for MySQL 5.0 provided by my team at Google. The early patches from us were difficult for others to use because they tended to include too many diffs. I didn't have time to do better.

Introduction

The code has been changed to make MySQL more manageable, available and scalable. Many problems remain to be solved to improve SMP performance. This is a good start. The v3 patch and all future patches will be published with a BSD license which applies to code we have added and changed. The original MySQL sources have a GPL license.

I am not sure if the patches were lost after the googlecode shutdown.

These have the same functionality as the MySQL 4 patches. There are several patch sets:
  • v1 patch published in 2007
  • v2 patch with all of our changes for MySQL 5.0.37
  • v3 patch with all of our changes for MySQL 5.0.37 as of May 6, 2009. This adds global transaction IDs, row-change logging and more InnoDB SMP performance fixes.
  • v4 patch [http://google-mysql-tools.googlecode.com/svn/trunk/old/mysql-as of June 1, 2009
  • semisync v1 patch published in 2007
  • mutexstats patch MySQL 5.1.26
  • SMP perf patch for MySQL 5.0.67. This has two changes:
    • use atomic memory instructions for the InnoDB mutex and rw-mutex. This is only done for x86 platforms that use a recent (>= 4.1) version of GCC.
    • disable the InnoDB memory heap. This is done for all platforms
  • SMP plugin for the InnoDB 1.0.1 plugin in MySQL 5.1
  • Patch to enable/disable IO to InnoDB files for MySQL 5.0.37
  • Patch to use pthread_mutex_t instead of mutex_t for rw_lock_struct::mutex in InnoDB
  • Patch for global transaction IDs and binlog event checksums - a stand-alone patch extracted out of the big V3 patch and ported to mysql-5.0.68

Feedback, Problems and Comments

Use the deprecated Google group.

Disclaimer

We have changed a lot of code. Not all of the changes are described here and some of the changes to default behavior from new my.cnf variables can break your applications. Unless your name rhymes with Domas, it might be better to take pieces of the patch rather than try to use all of it.

The code has been tested on 32-bit and 64-bit Linux x86. We may have broken the build for other platforms.

The embedded server, *--with-embedded-server*, cannot be built with these changes. We have broken the build for it.

Many of the Makefile.in and Makefile.am files have been changed in the big patch because we changed InnoDB to use the top-level configure.

If you try to install the big patch, treat it like installing from a source tree (http://dev.mysql.com/doc/refman/5.0/en/installing-source-tree.html).

Authors

Many people contributed to this:
  • Wei Li
  • Gene Pang
  • Eric Rollins
  • Ben Handy
  • Justin Tolmer
  • Larry Zhou
  • Yuan Wei
  • Robert Banz
  • Chip Turner
  • Steve Gunn
  • Mark Callaghan

The v2 patch

This has many new features and a few non-features. Embedded MySQL will not work with this patch.
  • SqlChanges
  • SemiSyncReplication
  • InnodbSmp
  • NewShowStatus
  • NewShowInnodbStatus
  • NewConfiguration
  • UserTableMonitoring
  • TransactionalReplication
  • MysqlRoles
  • MysqlRateLimiting
  • MoreLogging
  • InnodbAsyncIo
  • FastMasterPromotion
  • MirroredBinlogs
  • InnodbSampling
  • NewSqlFunctions
  • InnodbStatus
  • LosslessFloatDump
  • MysqlHttp
  • InnodbIoTuning
  • MutexContentionStats
  • FastMutexes
  • InnodbFreeze

The v3 patch

This has many new features and a few non-features. Embedded MySQL will not work with this patch. Also, I generated the patch after running 'make distclean' so there are some files that must be regenerated after this patch is applied, including sql_yacc.cc and sql_yacc.h. By doing this, the patch diff is smaller but maybe a bit confusing. Also, I did not update any of the files in libmysqld/ that are copied from sql/.
  • GlobalTransactionIds
  • OnlineDataDrift
  • BatchKeyAccess
  • InnodbMutexContention2
  • BinlogEventChecksums

The v4 patch

This makes InnoDB much faster on IO bound workloads and fixes bugs in new features.
  • InnodbIoPerformance

Not yet released
  • MysqlThreadPool

Historical - InnoDB IO Performance

This post was shared on code.google.com many years ago but code.google.com has been shut down. It describes work done by my team at Google. I am interested in the history of technology and with some spare time have been able to republish it.

This is a collection from several posts about InnoDB IO performance

Max dirty pages

InnoDB provides a my.cnf variable, innodb_max_dirty_pages_pct, to set the maximum percentage of buffer pool pages that should be dirty. It then appears to ignore said variable for IO bound workloads (see this post from DimitriK). It doesn't really ignore the value. The problem is that it does not try hard enough to flush dirty pages even when there is available IO capacity. Specific problems include:
  • one thread uses synchronous IO to write pages to disk. When write latency is significant (because O_DIRECT is used, SATA write cache is disabled, network attached storage is used, ext2 is used) then this thread becomes a bottleneck. This was fixed in the v2 Google patch and is also fixed in MySQL 5.4 and Percona builds.
  • rate limits are too small. InnoDB has a background thread that schedules writes for 100 dirty pages at a time when there are too many dirty pages. The limit of 100 is reasonable for a single-disk server. It must be larger for a high IOPs server. The v2 Google patch, MySQL 5.4 (maybe) and Percona branches use the innodb_io_capacity my.cnf variable to determine the number of pages per second that should be written for this case and the amount of IO that should be done in other cases. All work is expressed as a fraction of this variable, rather than as a fixed number of IO operations.
  • request arrays are too small. On my servers, each array has 256 slots. For a server that can do 1000 IOPs, this is too small. The v4 patch makes the size of the array a function of the value of innodb_io_capacity.
  • user sessions are not constrained. InnoDB delays user sessions when the purge thread gets too far behind. Otherwise, not much is done to delay a user session. The v4 patch adds code to force user sessions to stop and flush dirty pages when the maximum number of dirty pages has been exceeded. Hopefully, this code does nothing as the background thread is more likely to keep up given other changes in the v4 patch.

IO Performance

This provides performance results for work done to improve InnoDB IO performance. TODO - fix the links:

It is one thing to publish performance results. It is another to understand them. The results here need more analysis and the code needs to be tested by others in the community.

This describes work to make InnoDB faster on IO bound workloads. The goal is to make it easy to use InnoDB on a server that can do 1000 to 10000 IOPs. Many problems must be fixed for that to be possible, but this is a big step towards that goal. These changes improve performance by 20% to more than 400% on several benchmarks. At a high level, these changes make InnoDB:
  • more efficient when processing IO requests
  • more likely to use available IO capacity
  • better at balancing different IO tasks
  • easier to monitor

One day, Heikki will write the Complete Guide to InnoDB (edit - Jeremy Cole did a lot to explain the internals); until then you need to consult multiple sources to understand them. It also helps to read the source code.

Features

  • Changes the computation of the percentage of dirty buffer pool pages. Before this change the percentage excluded pages borrowed from the buffer pool for other uses. While that may be more accurate, it also requires the caller to lock/unlock a hot mutex. It also made the percentage vary a bit too much as the insert buffer grew and shrank. The v4 patch doesn't exclude the borrowed pages. As most of the borrowed pages should be used in the insert buffer and the insert buffer should be smaller (thanks to ibuf_max_pct_of_buffer), this is probably a good thing.
  • (edit removed many links to other project pages)

Background IO

InnoDB starts a thread, the main background IO thread, to perform background IO operations. This has operations that run once per second, once per 10 seconds and only when the server is idle. This is implemented with a for loop that iterates 10 times. Each time through the loop, the thread sleeps for 1 second unless too much work was done on the previous iteration of the loop. At the end of 10 iterations, the once per 10 seconds tasks are run.

It is hard to understand the behavior of this loop because the sleep is optional, depending on the amount of work done on the previous iteration of the loop. And there are costs from this complexity. For example, one of the 1 second tasks is to flush the transaction log to disk to match the expected behavior of innodb_flush_log_at_trx_commit=2. However, when the 1 second loop runs much more frequently than once per second there will be many more fsync calls than expected.

In the v4 patch, the sleep is not optional. Other changes to the main background IO thread make it possible for each loop iteration to do enough work that there is no need to skip the sleep.
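
A simplified sketch of the loop with the v4 behavior (the sleep is unconditional); the task functions are placeholders for the real per-second and per-10-second work.

#include <chrono>
#include <thread>

static void once_per_second_tasks()      { /* flush log, schedule some page writes, ... */ }
static void once_per_ten_seconds_tasks() { /* checkpoint work, purge, ... */ }

void master_background_thread_loop() {
  for (;;) {
    for (int i = 0; i < 10; ++i) {
      std::this_thread::sleep_for(std::chrono::seconds(1));   // not optional in v4
      once_per_second_tasks();
    }
    once_per_ten_seconds_tasks();
  }
}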

In the v4 patch all of the code that submits a large number of async IO requests makes sure that the number of requests does not exceed the number of free slots in the array. Otherwise, the async IO requests block until there are free slots.

CPU overhead from IO

There are several factors that consume CPU time during IO processing:
  • checksum computation and verification - the v4 patch does not make this faster. Using -O3 rather than -O2 with gcc makes this faster. On a server that does 10,000 IOPs, this will consume a lot of CPU time. Domas wrote about this. We may need to consider alternative checksum algorithms and machine-specific optimizations.
  • request array iteration - InnoDB maintains requests for IO in an array. It frequently iterates over the array and used to call a function to get the next element; that has been changed to use pointer arithmetic. This makes a big difference when the array is large.
  • request merging - InnoDB merges requests for adjacent blocks so that one large IO operation is done instead of several page size operations. Up to 64 page requests can be merged into one large (1MB) request. The merging algorithm was O(N*N) on the size of the request array and has been changed to be O(N). This will merge fewer requests but use much less CPU. A better change might be to replace each array with two lists: one that maintains requests in file order and the other in arrival order. But that must wait for another day.
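
Here is a one-pass sketch of the O(N) merge; the types are illustrative and only the 64-page (1MB) cap comes from the description above. Only requests that happen to be adjacent to the current run get merged, which merges fewer requests than the old quadratic search but costs far less CPU.

#include <cstdint>
#include <vector>

struct PageRequest { uint32_t file_id; uint64_t page_no; };
struct MergedIo    { uint32_t file_id; uint64_t first_page; uint32_t n_pages; };

std::vector<MergedIo> merge_requests_one_pass(const std::vector<PageRequest>& reqs) {
  std::vector<MergedIo> out;
  for (const PageRequest& r : reqs) {
    if (!out.empty() && out.back().file_id == r.file_id &&
        out.back().first_page + out.back().n_pages == r.page_no &&
        out.back().n_pages < 64) {
      ++out.back().n_pages;                        // extend the current large IO
    } else {
      out.push_back({r.file_id, r.page_no, 1});    // start a new IO
    }
  }
  return out;
}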

    my.cnf options for IO performance

    These InnoDB my.cnf variables are new in the Google patches:
    • innodb_max_merged_io - maximum number of IO requests merged to issue large IO from background IO threads
    • innodb_read_io_threads - number of background read I/O threads in InnoDB
    • innodb_write_io_threads - number of background write I/O threads in InnoDB
    • innodb_adaptive_checkpoint - makes the background IO thread flush dirty pages when there are old pages that will delay a checkpoint. OFF provides traditional behavior
    • innodb_check_max_dirty_foreground - makes user sessions flush some dirty pages when innodb_max_dirty_pages_pct has been exceeded. OFF provides traditional behavior
    • innodb_file_aio_stats - compute and export per-file IO statistics for InnoDB
    • innodb_flush_adjacent_background - when background IO threads flush dirty pages, flush adjacent dirty pages from the same extent. ON provides traditional behavior.
    • innodb_flush_adjacent_foreground - when user sessions flush dirty pages, flush adjacent dirty pages from the same extent. ON provides traditional behavior
    • innodb_ibuf_flush_pct - percent of innodb_io_capacity that should be used for prefetch reads used to merge insert buffer entries
    • innodb_ibuf_max_pct_of_buffer - soft limit for the percent of buffer cache pages that can be used for the insert buffer. When this is exceeded background IO threads work harder to merge insert buffer entries. The hard limit is 50%. The traditional value is 50%.
    • innodb_ibuf_reads_sync - use sync IO to read blocks for insert buffer merges. ON provides traditional behavior. 
    • innodb_io_capacity - maximum number of concurrent IO requests that should be done to flush dirty buffer pool pages. CAUTION -- setting this too high will use a lot of CPU to schedule IO requests and more than 1000 might be too high. The traditional value is 100.
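
    A hypothetical my.cnf fragment combining some of these options is shown below. The values are illustrative assumptions only (not recommendations), and the exact boolean syntax (ON/OFF versus skip_* style options) may differ in the actual patch:

    [mysqld]
    # background IO concurrency
    innodb_read_io_threads=4
    innodb_write_io_threads=4
    innodb_max_merged_io=64
    # dirty page flushing and checkpointing
    innodb_io_capacity=1000
    innodb_adaptive_checkpoint=ON
    innodb_check_max_dirty_foreground=ON
    # insert buffer limits
    innodb_ibuf_max_pct_of_buffer=25
    innodb_ibuf_flush_pct=10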

    Insert Buffer Improvements

    InnoDB performance on many IO bound workloads is much better than expected because of the insert buffer. Unfortunately, InnoDB does not try hard enough to keep the insert buffer from getting full. And when it gets full, performance suffers because the insert buffer continues to use memory from the buffer pool but can no longer be used to defer IO for secondary index maintenance.

    The v4 patch has several changes to fix this:
    • the my.cnf variable innodb_ibuf_max_pct_of_buffer specifies a soft limit on the size of the insert buffer as a percentage of the buffer pool. The hard limit is 50%. When the hard limit is reached no more inserts are done to the insert buffer. When the soft limit is reached, the main background IO thread aggressively requests prefetch reads to merge insert buffer records.
    • the my.cnf variable innodb_ibuf_flush_pct specifies the number of prefetch reads that can be submitted at a time as a percentage of innodb_io_capacity. Prior to the v4 patch, InnoDB did 5 prefetch read requests at a time and this was usually done once per second.
    • the my.cnf variable innodb_ibuf_reads_sync determines whether async IO is used for the prefetch reads. Prior to the v4 patch, sync IO was used for the prefetch reads done to merge insert buffer records. This variable was added for testing as the default value (skip_innodb_ibuf_reads_sync) should be used in production.
    • code is added to delay user sessions and make them merge insert buffer records when the size of the insert buffer exceeds the soft limit.

    Freeze InnoDB IO

    This feature wasn't useful in production. It added the commands:
    • set global innodb_disallow_writes=ON
    • set global innodb_disallow_writes=OFF

    These enable and disable all Innodb file system activity except for reads. If you want to take a database backup without stopping the server and you don't use LVM, ZFS or some other storage software that provides snapshots, then you can use this to halt all destructive file system activity from InnoDB and then backup the InnoDB data files. Note that it is not sufficient to run FLUSH TABLES WITH READ LOCK as there are background IO threads used by InnoDB that may still do IO.
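
    A minimal sketch of the backup flow described above; the file copy step is a placeholder for whatever copy tool is used outside of MySQL:

    SET GLOBAL innodb_disallow_writes=ON;   -- halt all destructive InnoDB file system activity; reads continue
    -- ... copy the InnoDB data files (ibdata*, ib_logfile*, *.ibd) from the OS shell ...
    SET GLOBAL innodb_disallow_writes=OFF;  -- resume normal write activity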

    Async IO for InnoDB

    InnoDB supports asynchronous IO for Windows. For Linux, it uses 4 threads to perform background IO tasks and each thread uses synchronous IO. There is one thread for each of:
    • insert buffer merging
    • log IO
    • read prefetch requests
    • writing dirty buffer cache pages
    InnoDB issues prefetch requests when it detects locality in random IO and when it detects a sequential scan. However, it only uses one thread to execute these requests. Multi-disk servers are best utilized when more IO requests can be issued concurrently.

    For deployments that use buffered IO rather than direct IO or some type of remote disk (SAN, NFS, NAS), there is not much of a need for more write threads because writes complete quickly into the OS buffer cache. However, as servers with many GB of RAM are used, it is frequently better to use direct IO.

    We changed InnoDB to support a configurable number of background IO threads for read and write requests. This is controlled by the parameters:
    • innodb_max_merged_io - Max number of IO requests merged to issue large IO from background IO threads
    • innodb_read_io_threads - the number of background IO threads for read prefetch requests
    • innodb_write_io_threads - the number of background IO threads for writing dirty pages from the buffer cache


    Historical - summary of the Google MySQL effort

    This summarizes work we did on MySQL at Google. These posts used to be shared at code.google.com but it was shut down. After reformatting most of these (it was a fun day for me, but sorry for the spam) I remembered that someone had already done that in 2015. Thank you.

    My reformatted posts:

    Posts from the git wiki pages via my fork of upstream:

    Implementing the Internet of Things with MySQL


    Author: Robert Agar

    The Internet of Things (IoT) has grown from an interesting concept to a paradigm that is changing the way individuals and businesses operate in the 21st Century. It is based on connecting IP-capable devices so they can communicate with each other in a variety of ways. They range from automated industrial assembly lines to smart appliances that promise to make life easier and more convenient for consumers.

    A common aspect of all IoT implementations is that they make use of large amounts of data collected from network-connected devices. As with most data-centric applications, IoT systems rely on databases to store and process the accumulated information that drives them. MySQL is a valid choice of database platform when you are designing a system that interacts with the IoT.

    Fundamental Aspects of IoT Systems

    The point of the Internet of Things is to gather or exchange data to facilitate physical processes or gain insight into complex abstract or tangible systems. IoT implementations accomplish this by deploying network-connected devices that have the ability to measure and collect relevant information.

    Attaching devices to a network is nothing new. Factors that differentiate IoT systems from traditional networks are the intelligence and capabilities that are built into the connected devices. Complicated tasks can be carried out without direct human intervention through communication between IoT sensors, monitors, and actuators in industrial applications. Consumers can talk to their home and have it respond to them in a variety of ways. Marketers obtain data that helps them fine-tune their offerings and increase the efficiency of their sales initiatives.

    Behind the enhanced functionality that the IoT affords when compared to previous networks is an incredible amount of data. The concepts of Big Data and the IoT are intertwined. Storing and manipulating the tremendous amounts of data generated by IoT devices effectively is vitally important to the productivity of the system. The whole point of an IoT implementation is to generate and make use of information gathered from its sensors.

    Why MySQL Makes Sense as Your Platform of Choice

    The selection of the backend database for IoT systems is critical to their eventual success. You need a platform that can handle the demands of processing large amounts of data efficiently. Most IoT implementations also require high availability and excellent performance. Choosing the wrong database can have detrimental effects on the whole system.

    MySQL is one of the most popular database platforms in the world and is used in a wide variety of business and scientific solutions. It has several advantages that make it a viable solution for IoT implementations. Here is a list of features that make it attractive:

    • Data security is critical for many types of applications including IoT systems. MySQL is well known as an exceptionally secure database platform that is used by many popular websites.

    • Scalability is another factor that is important when designing an IoT system. As the system evolves it may need to expand in unexpected directions. MySQL has a proven track record of handling the large amounts of data generated by the IoT.

    • High performance and availability are hallmarks of MySQL that make it a perfect fit as the backend of an IoT system. It provides excellent performance for demanding applications and can easily meet the data processing demands of the IoT. Maintaining availability in industrial IoT systems is of paramount importance. One of the reasons for the popularity of MySQL is its high availability solutions.

    • Open source flexibility means that MySQL can be tailored in whatever ways are needed to address the requirements of an IoT system. Its wide adoption throughout the IT world also ensures that skilled technical professionals are available to support the implementation.

    A Tool for Managing Your MySQL Instances

    SQL Diagnostic Manager for MySQL is the perfect tool for managing the MySQL databases powering your IoT systems. It provides the capability to perform real-time monitoring that helps you quickly pinpoint issues and resolve them before they become serious problems. The tool comes with over 600 pre-built monitors that continuously check the health of your MySQL servers and send alerts when defined thresholds are exceeded. Customizable dashboards and charts provide visibility into your systems that allow you to take proactive measures to avoid running low on capacity.

    If you choose MySQL as the database engine for your IoT system, you should strongly consider SQL Diagnostic Manager for MySQL as your management solution. It’s a valuable addition to your DBA’s toolbox and will quickly become their favorite tool when working with MySQL databases.

    The post Implementing the Internet of Things with MySQL appeared first on Monyog Blog.

    Still web scale

    I am excited to start a new job next week working on performance at MongoDB. I have been a fan of the people and product for years and I look forward to contributing from the inside. The reasons I have been a fan include the rate at which the product has improved, WiredTiger and their contribution to MongoRocks.

    I look forward to learning the modern performance analysis tool chain courtesy of Brendan Gregg. His BPF book should be ready soon and there is much content on his web site. When I had to understand off-cpu stalls from IO and mutex contention there wasn't much available 10 years ago, thus PMP was born. While it served me well it is time to move on.

    I will continue to blog including performance comparisons between database engines. I look forward to writing about MongoDB, especially WiredTiger internals, but I will exclude MongoDB from the performance comparisons on my blog.

    I left FB in January and spent most of the year reading applied math books and being a caregiver. I appreciate the opportunity I had there and the people who helped me -- both managers and many awesome technical contributors. I wouldn't have made it here without Domas.

    FB asked one trick question in my job interview -- do I want to be hands on or an architect. I gave the correct answer -- hands on. Small & fast-growing companies need people who are hands-on. More architects will be needed post-IPO when the company gets larger. I interviewed at FB because the MySQL team at Google was ended and I had to find a new project. That was a source of stress but it turned out OK as MySQL at FB was my new project.

    MyRocks and RocksDB have been in capable hands for a long time, so I expect more good things from those teams.

    The MySQL community has been wonderful to me. My career wouldn't have been possible without the contributions of so many people who made MySQL better. Fortunately, clever people arrive to replace the people who leave and MySQL will continue to improve. Perhaps one day upstream will get a write-optimized storage engine. Dare to dream.

    Slides for talks I have given on MySQL, MongoDB and RocksDB

    It all started for me with the Google patch for MySQL in April 2007. The Register summary of that included a Cringely story repeated by Nick Carr that Google might have shared the patch as part of a plan to dominate IT via cloud computing. I thought that was ridiculous. AWS said hold my beer and brought us Aurora.

    I donated to the Wayback Machine to offset my HW consumption for those links.

    A list of talks from the RocksDB team is here.

    This is an incomplete list of slide decks and videos from me:

    Quick hack for GTID_OWN lack


    One of the benefits of MySQL GTIDs is that each server remembers all GTID entries ever executed. Normally these would be ranges, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242 or multi-ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.

    One of the common problems in asynchronous replication is the issue of consistent reads. I've just written to the master. Is the data available on a replica yet? We have iterated on this: from reading on the master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and we have now settled, for some parts of our apps, on using GTIDs.

    GTIDs are reliable as any replica can give you a definitive answer to the question: have you applied a given transaction or not? Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query for the following on a replica:

    mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
    +-------------------+
    | transaction_found |
    +-------------------+
    |                 1 |
    +-------------------+
    
    mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
    +-------------------+
    | transaction_found |
    +-------------------+
    |                 0 |
    +-------------------+
    

    Getting OWN_GTID

    This is all well, but, given some INSERT or UPDATE on the master, how can I tell what the GTID associated with that transaction is? There's good news and bad news.

    • Good news is, you may SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction.
    • Bad news is, this isn't a standard SQL response, and the common MySQL drivers offer you no way to get that information!

    At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you'd extract LAST_INSERT_ID. But how does one solve that without modifying the drivers? Here's a poor person's solution which gives you inexact, but good enough, information. Following a write (insert, delete, create, ...), run:

    select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
    

    The idea is to "clean" the executed GTID set from irrelevant entries, by filtering out all ranges that do not belong to the server you've just written to (the master). The number 1000000000000000 stands for "high enough value that will never be reached in practice" - set to your own preferred value, but this value should take you beyond 300 years assuming 100,000 transactions per second.

    The value you get is the range on the master itself. e.g.:

    mysql> select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
    +-------------------------------------------------+
    | master_generated_gtid                           |
    +-------------------------------------------------+
    | dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176 |
    +-------------------------------------------------+
    

    You may further parse the above to extract dc103953-1598-11ea-82a7-008cfa5440e4:35807176 if you want to hold on to the latest GTID entry. Now, this entry isn't necessarily your own. Between the time of your write and the time of your GTID query, other writes will have taken place. But the entry you get is either your own or a later one. If you can find that entry on a replica, that means your write is included on the replica.
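
    Here is a hedged sketch of that parsing, using only built-in string functions and assuming the master's entry is a single contiguous range of the form uuid:1-N, as in the output above:

    -- grab the master-generated range, as in the previous query
    SET @own_range := gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'),
        gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed));
    -- keep the uuid, drop the '1-' prefix: uuid:1-35807176 becomes uuid:35807176
    SELECT CONCAT(SUBSTRING_INDEX(@own_range, ':', 1), ':',
                  SUBSTRING_INDEX(@own_range, '-', -1)) AS latest_master_gtid;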

    One may wonder, why do we need to extract the value at all? Why not just select @@global.gtid_executed? Why filter only the master's UUID? Logically, the answer is the same if you do that. But in practice, your query may be unfortunate enough to return something like this:

    select @@global.gtid_executed \G
    
    e71f0cdb-b8ef-11e9-9361-008cfa542442:1-83331,
    e742d87f-dea7-11e9-be6d-008cfa542c9e:1-18485,
    e7880c0e-ac54-11e9-865a-008cfa544064:1-7331973,
    e82043c6-c7d9-11e9-9413-008cfa5440e4:1-61692,
    e902678b-b046-11e9-a281-008cfa542c9e:1-83108,
    e90d7ff9-e35e-11e9-a9a0-008cfa544064:1-18468,
    e929a635-bb40-11e9-9c0d-008cfa5440e4:1-139348,
    e9351610-ef1b-11e9-9db4-008cfa5440e4:1-33460918,
    e938578d-dc41-11e9-9696-008cfa542442:1-18232,
    e947f165-cd53-11e9-b7a1-008cfa5440e4:1-18480,
    e9733f37-d537-11e9-8604-008cfa5440e4:1-18396,
    e97a0659-e423-11e9-8433-008cfa542442:1-18237,
    e98dc1f7-e0f8-11e9-9bbd-008cfa542c9e:1-18482,
    ea16027a-d20e-11e9-9845-008cfa542442:1-18098,
    ea1e1aa6-e74a-11e9-a7f2-008cfa544064:1-18450,
    ea8bc1bd-dd06-11e9-a10c-008cfa542442:1-18203,
    eae8c750-aaca-11e9-b17c-008cfa544064:1-85990,
    eb1e41e9-af81-11e9-9ceb-008cfa544064:1-86220,
    eb3c9b3b-b698-11e9-b67a-008cfa544064:1-18687,
    ec6daf7e-b297-11e9-a8a0-008cfa542c9e:1-80652,
    eca4af92-c965-11e9-a1f3-008cfa542c9e:1-18333,
    ecd110b9-9647-11e9-a48f-008cfa544064:1-24213,
    ed26890e-b10b-11e9-a79d-008cfa542c9e:1-83450,
    ed92b3bf-c8a0-11e9-8612-008cfa542442:1-18223,
    eeb60c82-9a3d-11e9-9ea5-008cfa544064:1-1943152,
    eee43e06-c25d-11e9-ba23-008cfa542442:1-105102,
    eef4a7fb-b438-11e9-8d4b-008cfa5440e4:1-74717,
    eefdbd3b-95b3-11e9-833d-008cfa544064:1-39415,
    ef087062-ba7b-11e9-92de-008cfa5440e4:1-9726172,
    ef507ff0-98b3-11e9-8b15-008cfa5440e4:1-928030,
    ef662471-9a3b-11e9-bd2e-008cfa542c9e:1-954800,
    f002e9f7-97ee-11e9-bed0-008cfa542c9e:1-5180743,
    f0233228-e9a1-11e9-a142-008cfa542c9e:1-18583,
    f04780c4-a864-11e9-9f28-008cfa542c9e:1-83609,
    f048acd9-b1d2-11e9-a0b6-008cfa544064:1-70663,
    f0573d8c-9978-11e9-9f73-008cfa542c9e:1-85642135,
    f0b0a37c-c89c-11e9-804c-008cfa5440e4:1-18488,
    f0cfe1ac-e5af-11e9-bc09-008cfa542c9e:1-18552,
    f0e4997c-cbc9-11e9-9179-008cfa542442:1-1655552,
    f24e481c-b5c4-11e9-aff0-008cfa5440e4:1-83015,
    f4578c4b-be6d-11e9-982e-008cfa5440e4:1-132701,
    f48bce80-e99f-11e9-94f4-a0369f9432f4:1-18460,
    f491adf1-9b04-11e9-bc71-008cfa542c9e:1-962823,
    f5d3db74-a929-11e9-90e8-008cfa5440e4:1-75379,
    f6696ba7-b750-11e9-b458-008cfa542c9e:1-83096,
    f714cb4c-dab7-11e9-adb9-008cfa544064:1-18413,
    f7b781a9-cbbd-11e9-affb-008cfa542442:1-18169,
    f81f7729-b10d-11e9-b29b-008cfa542442:1-86820,
    f88a3298-e903-11e9-88d0-a0369f9432f4:1-18548,
    f9467b29-d78c-11e9-b1a2-008cfa5440e4:1-18492,
    f9c08f5c-e4ea-11e9-a76c-008cfa544064:1-1667611,
    fa633abf-cee3-11e9-9346-008cfa542442:1-18361,
    fa8b0e64-bb42-11e9-9913-008cfa542442:1-140089,
    fa92234c-cc90-11e9-b337-008cfa544064:1-18324,
    fa9755eb-e425-11e9-907d-008cfa542c9e:1-1668270,
    fb7843d5-eb38-11e9-a1ff-a0369f9432f4:1-1668957,
    fb8ceae5-dd08-11e9-9ed3-008cfa5440e4:1-18526,
    fbf9970e-bc07-11e9-9e4f-008cfa5440e4:1-136157,
    fc0ffaee-98b1-11e9-8574-008cfa542c9e:1-940999,
    fc9bf1e4-ee54-11e9-9ce9-008cfa542c9e:1-18189,
    fca4672f-ac56-11e9-8a83-008cfa542442:1-82014,
    fcebaa05-dab5-11e9-8356-008cfa542c9e:1-18490,
    fd0c88b1-ad1b-11e9-bf3a-008cfa5440e4:1-75167,
    fd394feb-e4e4-11e9-bd09-008cfa5440e4:1-18574,
    fd687577-b048-11e9-b429-008cfa542442:1-83479,
    fdb18995-a79f-11e9-a28d-008cfa542442:1-82351,
    fdc72b7f-b696-11e9-ade9-008cfa544064:1-57674,
    ff1f3b6b-c967-11e9-ae04-008cfa544064:1-18503,
    ff6fe7dc-c186-11e9-9bb4-008cfa5440e4:1-103192,
    fff9dd94-ed95-11e9-90b7-008cfa544064:1-911039
    

    This can happen when you fail over to a new master, multiple times; it happens when you don't recycle UUIDs, when you provision new hosts and let MySQL pick their UUID. Returning this amount of data per query is an excessive overhead, which is why we extract only the master's UUID ranges, which are guaranteed to be limited in size.

    NDB Parallel Query, part 5

    In this part we are going to analyze a bit more complex query than before.
    This query is a 6-way join.

    The query is:
    SELECT
            supp_nation,
            cust_nation,
            l_year,
            SUM(volume) AS revenue
    FROM
            (
                    SELECT
                            n1.n_name AS supp_nation,
                            n2.n_name AS cust_nation,
                            extract(year FROM l_shipdate) as l_year,
                            l_extendedprice * (1 - l_discount) AS volume
                    FROM
                            supplier,
                            lineitem,
                            orders,
                            customer,
                            nation n1,
                            nation n2
                    WHERE
                            s_suppkey = l_suppkey
                            AND o_orderkey = l_orderkey
                            AND c_custkey = o_custkey
                            AND s_nationkey = n1.n_nationkey
                            AND c_nationkey = n2.n_nationkey
                            AND (
                                    (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
                                    OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
                            )
                            AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'
            ) AS shipping
    GROUP BY
            supp_nation,
            cust_nation,
            l_year
    ORDER BY
            supp_nation,
            cust_nation,
            l_year;

    It is the inner SELECT that is the 6-way join. The outer part only deals with the
    GROUP BY aggregation and ORDER BY of the result set from the inner
    SELECT. As mentioned before the GROUP BY aggregation and ORDER BY
    parts are handled by the MySQL Server. So the NDB join pushdown only deals
    with the inner select.

    In the previous queries we analysed, the join order was pretty obvious. In this
    case it isn't that obvious. But the selection of join order is still fairly
    straightforward. The selected join order is
    n1 -> supplier -> lineitem -> orders -> customer -> n2.

    Query analysis

    The query starts by reading 2 rows from the nation table; these 2 rows come either
    from the same TC thread or from separate TC threads. These rows drive a new scan on
    the supplier table. The supplier table will return 798 rows that are used in the scan
    against the lineitem table. This assumes scale factor 1.

    This represents a new thing to discuss. If this query had been executed in the
    MySQL Server we would only be able to handle one row from the supplier table at a
    time. There have been some improvements in the storage engine API to handle this
    using the read multi range API. This means a lot of
    communication back and forth and starting up new scans. With the NDB join
    processing we will send a multi-range scan to the lineitem table. This means that we
    will send one scan message that contains many different ranges. There will still be a
    new walk through the index tree for each range, but there is no need to send the
    scan messages again and again.

    Creation of these multi-ranges is handled as part of the join processing in the
    DBSPJ module.

    The join between supplier table and the lineitem contains one more interesting
    aspect. Here we join towards the column l_orderkey in the lineitem table. In many
    queries in TPC-H the join against the lineitem table uses the order key as the join
    column. The order key is the first part of the primary key and is thus a candidate to
    use as partition key. The TPC-H queries definitely improves by using the order key as
    partition key instead of the primary key. This means that the orders and all lineitems
    for the order are stored in the same LDM thread.
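
    A hypothetical sketch of that choice, assuming the standard TPC-H lineitem definition where l_orderkey is the first column of the primary key:

    -- repartition lineitem on the order key so that an order and all of its
    -- line items end up in the same LDM thread
    ALTER TABLE lineitem PARTITION BY KEY (l_orderkey);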

    The scan on the lineitem table will produce 145.703 rows to join with the orders table. The rest of
    the joins are joined through the primary key. Thus we will perform 145.703 key lookups
    in the orders table, there will be 145.703 key lookups in the customer table and finally
    there will be 145.703 lookups against the nations table. The only filtering here will be
    on the last table that will decrease the amount of result rows to the MySQL Server,
    the end result will be 5.924 rows.

    This gives another new point that it would be possible to increase parallelism in this
    query by storing the result rows in the DBSPJ. However this would increase the
    overhead, so it would improve parallelism at the cost of efficiency.

    Scalability impact

    If we make sure that the lineitem table is partitioned on the order key this query will
    scale nicely. There will be fairly small impact with more partitions since only the scan
    against the supplier table will be more costly in a larger cluster.

    One thing that will make the query cost more is when the primary key lookups are
    distributed instead of local. One table that definitely will be a good idea to use
    FULLY REPLICATED for is the nations table. This means that all those 145.703 key
    lookups will be handled inside a data node instead of over the network.

    The supplier table has only 10.000 rows compared to the lineitem table that has
    6M rows. Thus it should definitely be possible to use FULLY REPLICATED also
    for this table. The customer table has 150.000 rows and is another candidate to use
    for FULLY REPLICATED.
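
    A hedged sketch of how one of these tables could be declared fully replicated, assuming the NDB 8.0 comment syntax:

    -- keep a full copy of the small nation table on every data node so the
    -- key lookups described above stay local instead of crossing the network
    ALTER TABLE nation COMMENT='NDB_TABLE=FULLY_REPLICATED=1';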

    Since the MySQL Server will have to handle more than 300.000 rows in this query,
    this will be the main bottleneck for parallelism. This means that the query will have a
    parallelism of about 5. This is also the speed up we see compared to single threaded
    storage engine for this query. This bottleneck will be about the same even with
    larger clusters.

    Next Part

    I will take a break in this sequence of blogs for now and come back later with a
    description of some more involved queries and how NDB handles pushing down
    subqueries and parts of join queries.

    MySQL GTID: restore a master from a replica’s backup


    To avoid infinite replication loops MySQL doesn’t allow you to have log_slave_updates and replicate-same-server-id.

    When using GTIDs, that may lead to unexpected behavior you may not be aware of.

    In this scenario, we have 2 MySQL servers using GTID. The server uuid part of the GTID has been modified in the illustration to make it more clear. Both servers have log_slave_updates enabled too:

    So far nothing unusual. So let’s write data on the master (MySQL A):

    We can see that this first transaction is identified by its GTID where the uuid matches MySQL A and the sequence number is 1. Let’s write some data again:

    All good.

    Now let’s take a backup on the replica (MySQL B):

    Backup is consistent and matches the data on both servers.

    Now let’s write again some data:

    Of course the backup that was taken earlier does not change.

    All suddenly, MySQL A crashes and goes away… We promote MySQL B as new writer and we use it to write again some data:

    We can notice that the GTID changed to use the uuid of MySQL B. (This information is contained in the variable gtid_executed).

    It’s time to restore our backup on MySQL A:

    And we configure MySQL A to become replica of MySQL B:

    Wow! All the transactions that happened after the backup have been ignored by MySQL A!

    In fact, this is again there to protect the user from problems like circular replication and infinite loops.

    The way to be able to replicate the missing transactions is to change the server_id to a new unique value after the restore and before starting replication. Then replication will work as expected:
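
    Here is a minimal, hedged sketch of that step. The host name, credentials and the new server_id are placeholders, and SET PERSIST needs MySQL 8.0; on older versions change my.cnf and restart instead:

    -- on the restored MySQL A, before starting replication
    SET PERSIST server_id = 3;        -- any value not used elsewhere in the topology
    CHANGE MASTER TO
      MASTER_HOST = 'mysql-b',
      MASTER_USER = 'repl',
      MASTER_PASSWORD = 'repl_password',
      MASTER_AUTO_POSITION = 1;
    START SLAVE;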

    In summary, if you have enabled log_slave_updates and you want to recreate a master from a backup taken on a replica, you must change the server_id. Even if you use GTIDs, server_id still needs to be taken into consideration for the moment.

    If you plan to write back on MySQL A, it’s always safer to also change its server_uuid to avoid any split-brain situation.


    MySQL 8.0 and Magento


    In my road trip of the Open Source projects using MySQL, after having tested WordPress, Drupal and Joomla, let’s try to install Magento using MySQL 8.0 !

    In Magento’s manual, we can see that the project requires MySQL 5.6 and supports 5.7.x since version 2.1.2.

    In my test, I will use Magento 2.3.3, the latest stable when writing this article.

    The manual stipulates that we should use ROW based replication but not GTID because Magento 2 is using CREATE TEMPORARY TABLE inside transactions. In fact, this limitation doesn’t exist anymore since MySQL 8.0.13.

    From MySQL 8.0.13, when binlog_format is set to ROW or MIXED, CREATE TEMPORARY TABLE and DROP TEMPORARY TABLE statements are allowed inside a transaction, procedure, function, or trigger when GTIDs are in use. The statements are not written to the binary log and are therefore not replicated to slaves. The use of row-based replication means that the slaves remain in sync without the need to replicate temporary tables. If the removal of these statements from a transaction results in an empty transaction, the transaction is not written to the binary log. See the MySQL Manual.

    So this is not a limitation anymore. And for this installation I will of course use the latest version of MySQL 8.0 available right now, MySQL 8.0.18.

    Setup the database

    So MySQL 8.0.18 is installed and now we need to create the database/schema and setup the credentials to install Magento:

    mysql> CREATE DATABASE magento2;
    
    mysql> CREATE USER magento2 IDENTIFIED BY 'magento2';
    
    mysql> GRANT ALL ON magento2.* TO magento2;

    We can see that we are using MySQL 8.0 and the default authentication plugin:

    mysql> SELECT Host, User, plugin, @@version FROM mysql.user 
           WHERE user='magento2';
     +------+----------+-----------------------+-----------+
     | Host | User     | plugin                | @@version |
     +------+----------+-----------------------+-----------+
     | %    | magento2 | caching_sha2_password | 8.0.18    |
     +------+----------+-----------------------+-----------+

    As we may use some kind of replication in the future, we will also enable GTIDs:

    mysql> SET persist enforce_gtid_consistency=on;
     
    mysql> SET persist_only gtid_mode=on;
    
    mysql> RESTART;

    Magento Installation

    We can now start the installation wizard of Magento:

    The first checks are not related to MySQL and all necessary packages are installed. FYI, this is Oracle Linux 8:

    Now it’s time to insert the MySQL information:

    And when we press on Next, we have our first small issue:

    This is exactly what the error message is telling us: the new authentication plugin is not supported by the client. We can also see that in the php error log:

    [11-Dec-2019 20:30:49 UTC] PHP Fatal error:  Uncaught PDOException: PDO::__construct(): 
    The server requested authentication method unknown to the client [caching_sha2_password]
    in /var/www/html/vendor/magento/zendframework1/library/Zend/Db/Adapter/Pdo/Abstract.php:128

    The PHP version is 7.2.11 which doesn’t support MySQL 8’s new default secure authentication plugin: caching_sha2_password.

    It was supported in 7.2.8 but removed in 7.2.11; for more info check:

    So now, we just need to modify the authentication method of our user:

    mysql> ALTER USER magento2 IDENTIFIED 
           WITH 'mysql_native_password' BY 'magento2';

    And we can press Next again in Magento’s installation wizard and fill all the next Steps until Step 6:

    Let’s go an press Install Now !

    The installation process seems to hang at 58%… and when we check the Console Log we can see the following:

    We followed the manual, but it seems our user needs more privileges. But let’s try to not provide that and try the second option:

    mysql> SET PERSIST log_bin_trust_function_creators=1;

    We can refresh the page and click again on Install Now.

    The process continues to 91% and then another strange error:

    I checked the status of the table and nothing seemed wrong, so I refreshed the table and restarted the process, which resumed and ran to the end successfully:

    And of course, we can now access the Site we just deployed:

    Summary

    In summary, Magento works fine with MySQL 8.0; you have very few changes to perform, and they are related to the user Magento uses to connect to MySQL and to one global variable:

    mysql> CREATE USER magento2 IDENTIFIED 
           WITH 'mysql_native_password' BY 'magento2';
    mysql> GRANT ALL ON magento2.* TO magento2;
    mysql> SET PERSIST log_bin_trust_function_creators=1;
    

    And of course, there is no more reason to not use GTIDs !

    MySQL 8.0 & PHP


    MySQL and PHP is a love story that started a long time ago. However the love story with MySQL 8.0 was a bit slower to start… but don’t worry, it rules now !

    The support of MySQL 8.0’s new default authentication method in PHP took some time and was added in PHP 7.2.8 but removed in PHP 7.2.11.

    Now it’s fully supported in PHP 7.4 !

    If you have installed PHP 7.4, you can see that the new plugin auth_plugin_caching_sha2_password is now available:

    # php -i | grep -E "Loaded plugins|PHP Version " | tail -n2
    PHP Warning:  Module 'mysql_xdevapi' already loaded in Unknown on line 0
    PHP Version => 7.4.0
    Loaded plugins => mysqlnd,debug_trace,auth_plugin_mysql_native_password,
                      auth_plugin_mysql_clear_password,
                      auth_plugin_caching_sha2_password,
                      auth_plugin_sha256_password

    So no need to create a user with mysql_native_password as the authentication method in MySQL 8.0 as explained in the following posts:

    In summary, if you want to use a more secure method to connect to your MySQL 8.0 from your PHP application, make sure you upgrade to PHP 7.4.
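
    For example, with PHP 7.4 a user created with the MySQL 8.0 defaults works out of the box, so the WITH 'mysql_native_password' clause is no longer needed. Here is a hedged sketch with hypothetical user and schema names:

    mysql> CREATE USER 'app'@'%' IDENTIFIED BY 'app_password';  -- defaults to caching_sha2_password
    mysql> GRANT ALL ON app_db.* TO 'app'@'%';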

    A beginner’s guide to SQL CROSS JOIN


    Introduction In this article, we are going to see how a CROSS JOIN works, and we will also make use of this SQL join type to build a poker card game. Database table model For our poker card game application, we have created the ranks and suits database tables: The ranks table defines the ranking of cards, as well as the name and symbol used for each card rank: The suits table describes the four possible categories used by the French playing cards: Cartesian product In the set theory, the Cartesian product... Read More
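
    As a quick illustration of the idea, here is a hedged sketch; the table and column names are assumptions, not the article's actual schema. A CROSS JOIN pairs every rank with every suit, producing the full deck as the Cartesian product of the two tables:

    SELECT r.name AS card_rank, s.name AS card_suit, s.symbol
    FROM ranks r
    CROSS JOIN suits s
    ORDER BY s.name, r.name;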

    The post A beginner’s guide to SQL CROSS JOIN appeared first on Vlad Mihalcea.

    Percona Server for MySQL 8.0 – New Data Masking Feature


    Data Masking in Percona Server for MySQL 8.0.17

    Database administrators are responsible for maintaining the privacy and integrity of data. When the data contains confidential information, your company has a legal obligation to ensure that privacy is maintained. Even so, being able to access the information contained in that dataset, for example for testing or reporting purposes, has great value so what to do? MySQL Enterprise Edition offers data masking and de-identification, so I decided to contribute similar functionality to Percona Server for MySQL. In this post, I provide some background context and information on how to use these new functions in practice.

    Some context

    One of the most important assets of any company is data. Having good data allows engineers to build better systems and user experiences.

    Even through our most trivial activities, we continuously generate and share great volumes of data. I’m walking down the street and if I take a look at my phone it’s quite straightforward to get recommendations for a place to have lunch. The platform knows that it’s almost lunch time and that I have visited this nearby restaurant, or a similar one, a few times in the past. Sounds cool, right?

    But this process could be more manual than we might think at first. Even if the system has implemented things like AI or Machine Learning, a human will have validated the results; they might have taken a peek to ensure that everything is fine; or perhaps they are developing some new cool feature that must be tested… And this means that someone, somewhere has the ability to access my data. Or your data.

    Now, that is not so great, is it?

    In the last decade or so, governments around the world have taken this challenge quite seriously. They have enforced a series of rules to guarantee that the data is not only safely stored, but also safely used. I’m sure you will have heard terms like PCI, GDPR or HIPAA. They contain mandatory guidelines for how our data can be used, for primary or secondary purposes, and if it can be used at all.

    Data masking and de-identification

    One of the most basic safeguarding rules is that if the data is to be used for secondary purposes – such as for data analytics – it has to be de-identified in a way that makes it impossible to identify the original individual.

    Let’s say that the company ACME is storing employee data.

    We will use the example database of employees that’s freely available.

    • Employee number
    • First name
    • Last name
    • Birth date
    • Gender
    • Hire date
    • Gross salary
    • Salary from date
    • Salary to date

    We can clearly see that all those fields can be classified as private information. Some of these directly identify the original individual, like employee number or first + last name. Others could be used for indirect identification: I could ask my co-workers their birthday and guess the owner of that data using birth date.

    So, here is where de-identification and data-masking come into play. But what are the differences?

    De-identification transforms the original data into something different that could look more or less real. For example, I could de-identify birth date and get a different date.

    However, this method would make that information unusable if I want to see the relationship between salary and employee’s age.

    On the other hand, data-masking transforms the original data leaving some part untouched. I could mask birth date by replacing the month and day with January first. That way, the year would be retained and that would allow us to identify that salary–employee’s age relationship.

    Of course, if the dataset I’m working with is not big enough, certain methods of data-masking would be inappropriate as I could still deduce who the data belonged to.

    MySQL data masking

    Oracle’s MySQL Enterprise Edition offers a de-identification and data-masking solution for MySQL, using a flexible set of functions that cover most of our needs.

    Percona Server for MySQL 8.0.17 introduces that functionality as an open source plugin, and is compatible with Oracle’s implementation. You no longer need to code slow and complicated stored procedures to achieve data masking, and you can migrate the processes that were written for the MySQL Enterprise Edition to Percona Server for MySQL. Go grab a cup of coffee and contribute something cool to the community with all that time you have got back. ☺

    In the lab

    Put on your thinking cap and let’s see how it works.

    First we need an instance of Percona MySQL Server 8.0.17 or newer. I think containers are the most flexible way to test new stuff so I will be using that, but you could use a virtual server or just a traditional setup. Let’s download the latest version of Percona MySQL Server in a ready to run container:

    docker pull percona:8.0.17-1

    Eventually that command should work but sadly, Percona hadn’t built this version of the docker image when this article was written. Doing it yourself is quite simple, though, and by the time you read this it will likely be already there.

    Once in place, running an instance of Percona MySQL Server has never been so easy:

    docker run --name ps -e MYSQL_ROOT_PASSWORD=secret -d percona:8.0.17-8

    We’ll logon to the new container:

    docker exec -ti ps mysql -u root -p

    Now is the time to download the test database employees from GitHub and load it into our Percona Server. You can follow the official instructions in the project page.

    Next step is to enable the data de-identification and masking feature. Installing the data masking module in Percona MySQL Server is easier than in Oracle.

    mysql> INSTALL PLUGIN data_masking SONAME 'data_masking.so';
    Query OK, 0 rows affected (0.06 sec)

    This automatically defines a set of global functions in our MySQL instance, so we don’t need to do anything else.
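
    As a quick smoke test that the functions are available (the value returned by gen_range is random, so your output will differ):

    mysql> SELECT gen_range(1, 10) AS random_value;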

    A new concept: Dictionaries

    Sometimes we would like to generate new data by selecting values from a predefined collection. For example, we might want first name values that are really first names and not random alphanumerics. This makes our masked data look real, and it’s perfect for creating demo or QA environments.

    For this task we have dictionaries. They are nothing more than text files containing a value per line that are loaded into MySQL memory. You need to be aware that the contents of the file are fully loaded into memory and that the dictionary only exists while MySQL is running. So keep this in mind before loading any huge file or after restarting the instance.

    For our lab we will load two dictionaries holding first and last names. You can use these files or create different ones: first names and last names

    Store the files in a folder of your database server (or container) readable by the mysqld process.
    wget https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/first_names.all.txt
    docker cp first_names.all.txt ps:/tmp/
    wget https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/last_names.all.txt
    docker cp last_names.all.txt ps:/tmp/

    Once the files are in our server we can map them as MySQL dictionaries.

    mysql> select gen_dictionary_load('/tmp/first_names.all.txt', 'first_names');
    +----------------------------------------------------------------+
    | gen_dictionary_load('/tmp/first_names.all.txt', 'first_names') |
    +----------------------------------------------------------------+
    | Dictionary load success                                        |
    +----------------------------------------------------------------+
    1 row in set (0.04 sec)
    
    mysql> select gen_dictionary_load('/tmp/last_names.all.txt', 'last_names');
    +--------------------------------------------------------------+
    | gen_dictionary_load('/tmp/last_names.all.txt', 'last_names') |
    +--------------------------------------------------------------+
    | Dictionary load success                                      |
    +--------------------------------------------------------------+
    1 row in set (0.03 sec)

    Masking some data

    Now let’s take another look at our employees table:
    mysql> show columns from employees;
    +------------+---------------+------+-----+---------+-------+
    | Field      | Type          | Null | Key | Default | Extra |
    +------------+---------------+------+-----+---------+-------+
    | emp_no     | int(11)       | NO   | PRI | NULL    |       |
    | birth_date | date          | NO   |     | NULL    |       |
    | first_name | varchar(14)   | NO   |     | NULL    |       |
    | last_name  | varchar(16)   | NO   |     | NULL    |       |
    | gender     | enum('M','F') | NO   |     | NULL    |       |
    | hire_date  | date          | NO   |     | NULL    |       |
    +------------+---------------+------+-----+---------+-------+

    Ok, it’s very likely we will want to de-identify everything in this table. You can apply different methods to achieve your security requirements, but I will create a view with the following transformations:

    • emp_no: get a random value from 900.000.000 to 999.999.999
    • birth_date: set it to January 1st of the original year
    • first_name: set a random first name from a list of names that we have in a text file
    • last_name: set a random last name from a list of names that we have in a text file
    • gender: no transformation
    • hire_date: set it to January 1st of the original year

    CREATE VIEW deidentified_employees
    AS
    SELECT
      gen_range(900000000, 999999999) as emp_no,
      makedate(year(birth_date), 1) as birth_date,
      gen_dictionary('first_names') as first_name,
      gen_dictionary('last_names') as last_name,
      gender,
      makedate(year(hire_date), 1) as hire_date
    FROM employees;

    Let’s check how the data looks in our de-identified view.

    mysql> SELECT * FROM employees LIMIT 10;
    +--------+------------+------------+-----------+--------+------------+
    | emp_no | birth_date | first_name | last_name | gender | hire_date  |
    +--------+------------+------------+-----------+--------+------------+
    |  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |
    |  10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21 |
    |  10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28 |
    |  10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01 |
    |  10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12 |
    |  10006 | 1953-04-20 | Anneke     | Preusig   | F      | 1989-06-02 |
    |  10007 | 1957-05-23 | Tzvetan    | Zielinski | F      | 1989-02-10 |
    |  10008 | 1958-02-19 | Saniya     | Kalloufi  | M      | 1994-09-15 |
    |  10009 | 1952-04-19 | Sumant     | Peac      | F      | 1985-02-18 |
    |  10010 | 1963-06-01 | Duangkaew  | Piveteau  | F      | 1989-08-24 |
    +--------+------------+------------+-----------+--------+------------+
    10 rows in set (0.00 sec)
    
    mysql> SELECT * FROM deidentified_employees LIMIT 10;
    +-----------+------------+------------+---------------+--------+------------+
    | emp_no    | birth_date | first_name | last_name     | gender | hire_date  |
    +-----------+------------+------------+---------------+--------+------------+
    | 930277580 | 1953-01-01 | skaidrīte  | molash        | M      | 1986-01-01 |
    | 999241458 | 1964-01-01 | grasen     | cessna        | F      | 1985-01-01 |
    | 951699030 | 1959-01-01 | imelda     | josephpauline | M      | 1986-01-01 |
    | 985905688 | 1954-01-01 | dunc       | burkhardt     | M      | 1986-01-01 |
    | 923987335 | 1955-01-01 | karel      | wanamaker     | M      | 1989-01-01 |
    | 917751275 | 1953-01-01 | mikrut     | allee         | F      | 1989-01-01 |
    | 992344830 | 1957-01-01 | troyvon    | muma          | F      | 1989-01-01 |
    | 980277046 | 1958-01-01 | aliziah    | tiwnkal       | M      | 1994-01-01 |
    | 964622691 | 1952-01-01 | dominiq    | legnon        | F      | 1985-01-01 |
    | 948247243 | 1963-01-01 | sedale     | tunby         | F      | 1989-01-01 |
    +-----------+------------+------------+---------------+--------+------------+
    10 rows in set (0.01 sec)

    The data looks quite different, but remains good enough to apply some analytics and get meaningful results.

    Let’s de-identify the salaries table this time.
    mysql> show columns from salaries;
    +-----------+---------+------+-----+---------+-------+
    | Field     | Type    | Null | Key | Default | Extra |
    +-----------+---------+------+-----+---------+-------+
    | emp_no    | int(11) | NO   | PRI | NULL    |       |
    | salary    | int(11) | NO   |     | NULL    |       |
    | from_date | date    | NO   | PRI | NULL    |       |
    | to_date   | date    | NO   |     | NULL    |       |
    +-----------+---------+------+-----+---------+-------+

    We could use something like this:

    CREATE VIEW deidentified_salaries
    AS
    SELECT
    gen_range(900000000, 999999999) as emp_no,
    gen_range(40000, 80000) as salary,
    mask_inner(date_format(from_date, '%Y-%m-%d'), 4, 0) as from_date,
    mask_outer(date_format(to_date, '%Y-%m-%d'), 4, 2, '0') as to_date
    FROM salaries;

    We are again using the gen_range function. For the dates, this time we are using the very flexible functions mask_inner and mask_outer, which replace some characters in the original string.
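
    To get a feel for what these two functions do to a single value, here is a standalone example that is not part of the original post, built on the same date that appears in the data below. mask_inner masks the middle of the string and keeps the requested number of leading and trailing characters, while mask_outer masks the ends and keeps the middle, here using '0' as the masking character:

    mysql> SELECT mask_inner('1986-06-26', 4, 0) AS inner_masked,
                  mask_outer('1986-06-26', 4, 2, '0') AS outer_masked;
    +--------------+--------------+
    | inner_masked | outer_masked |
    +--------------+--------------+
    | 1986XXXXXX   | 0000-06-00   |
    +--------------+--------------+

    Let’s see how the data looks now.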

    In a real life exercise we would like to have the same values for emp_no across all the tables to keep referential integrity. This is where I think the original MySQL data-masking plugin falls short, as we don’t have deterministic functions using the original value as seed.

    mysql> SELECT * FROM salaries LIMIT 10;
    +--------+--------+------------+------------+
    | emp_no | salary | from_date  | to_date    |
    +--------+--------+------------+------------+
    |  10001 |  60117 | 1986-06-26 | 1987-06-26 |
    |  10001 |  62102 | 1987-06-26 | 1988-06-25 |
    |  10001 |  66074 | 1988-06-25 | 1989-06-25 |
    |  10001 |  66596 | 1989-06-25 | 1990-06-25 |
    |  10001 |  66961 | 1990-06-25 | 1991-06-25 |
    |  10001 |  71046 | 1991-06-25 | 1992-06-24 |
    |  10001 |  74333 | 1992-06-24 | 1993-06-24 |
    |  10001 |  75286 | 1993-06-24 | 1994-06-24 |
    |  10001 |  75994 | 1994-06-24 | 1995-06-24 |
    |  10001 |  76884 | 1995-06-24 | 1996-06-23 |
    +--------+--------+------------+------------+
    10 rows in set (0.00 sec)
    
    mysql> SELECT * FROM deidentified_salaries LIMIT 10;
    +-----------+--------+------------+------------+
    | emp_no    | salary | from_date  | to_date    |
    +-----------+--------+------------+------------+
    | 929824695 | 61543  | 1986XXXXXX | 0000-06-00 |
    | 954275265 | 63138  | 1987XXXXXX | 0000-06-00 |
    | 948145700 | 53448  | 1988XXXXXX | 0000-06-00 |
    | 937927997 | 54704  | 1989XXXXXX | 0000-06-00 |
    | 978459605 | 78179  | 1990XXXXXX | 0000-06-00 |
    | 993464164 | 75526  | 1991XXXXXX | 0000-06-00 |
    | 946692434 | 51788  | 1992XXXXXX | 0000-06-00 |
    | 979870243 | 54807  | 1993XXXXXX | 0000-06-00 |
    | 958708118 | 70647  | 1994XXXXXX | 0000-06-00 |
    | 945701146 | 76056  | 1995XXXXXX | 0000-06-00 |
    +-----------+--------+------------+------------+
    10 rows in set (0.00 sec)

    Clean-up

    Remember that when you’re done, you can free up memory by removing the dictionaries. Restarting the instance will also remove the dictionaries.

    mysql> SELECT gen_dictionary_drop('first_names');
    +------------------------------------+
    | gen_dictionary_drop('first_names') |
    +------------------------------------+
    | Dictionary removed                 |
    +------------------------------------+
    1 row in set (0.01 sec)
    
    mysql> SELECT gen_dictionary_drop('last_names');
    +-----------------------------------+
    | gen_dictionary_drop('last_names') |
    +-----------------------------------+
    | Dictionary removed                |
    +-----------------------------------+
    1 row in set (0.01 sec)

    If you use the MySQL data-masking plugin to define different levels of access to the data, remember that you will need to load the dictionaries each time the instance is restarted. With this usage, for example, you could control the data that someone in support has access to, very much like a bargain-basement virtual private database solution. (I’m not proposing this for production systems!)

    Other de-identification and masking functions

    Percona Server for MySQL Data-Masking includes more functions than the ones we’ve seen here.

    We have specialized functions for Primary Account Numbers (PAN), Social Security Numbers (SSN), phone numbers, e-Mail addresses… And also generic functions that will allow us to de-identify types without a specialized method.

    Being an open source plugin, it should be quite easy to implement any additional methods and contribute them to the broader community.

    Next Steps

    Using these functions we can de-identify and mask any existing dataset. But if you are populating a lower level environment using production data you would want to store the transformed data only. To achieve this you could choose between various options.

    • Small volumes of data: use “de-identified” views to export the data and load into a new database using mysqldump or mysqlpump.
    • Medium volumes of data: Clone the original database and de-identify locally the data using updates.
    • Large volumes of data option one: using replication, create a master -> slave chain with STATEMENT binlog format and define triggers de-identifying the data on the slave. Your master can be a slave to the master (using log_slave_updates), so you don’t need to run your primary master in STATEMENT mode.
    • Large volumes of data option two: using multiplexing in ProxySQL, configure ProxySQL to send writes to a clone server where you have defined triggers to de-identify the data.

    Future developments

    While de-identifying complex schemas we could find that, for example, the name of a person is stored in multiple tables (de-normalized tables). In this case, these functions would generate different names and the resulting data would look broken. You can solve this using a variant of the dictionary functions that obtains the value based on the original value, passed as a parameter:

    gen_dictionary_deterministic('Francisco', 'first_names')

    This not-yet-available function would always return the same value using that dictionary file, but in such a way that the de-identification cannot be reversed.

    Oracle doesn’t currently support this, so we will expand the Percona Data-Masking plugin to introduce this as a unique feature. However, that will be in another contribution, so stay tuned for more exciting changes to Percona Server for MySQL Data Masking.


    Image: Photo by Finan Akbar on Unsplash

    The content in this blog is provided in good faith by members of the open source community. Percona has not edited or tested the technical content (although in this case, of course, we have tested the data masking feature incorporated into Percona Server for MySQL 8.0.17, just not the examples in this blog). Views expressed are the authors’ own. When using the advice from this or any other online resource, test ideas before applying them to your production systems, and always secure a working backup.

    The post Percona Server for MySQL 8.0 – New Data Masking Feature appeared first on Percona Community Blog.

    MySQL/MariaDB: Using views to grant or deny row-level privileges

    Relational DBMSs allow you to grant users permissions on certain tables or columns. Here we'll discuss how to restrict access to a certain set of rows.
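
    A hedged sketch of the idea with hypothetical object names: expose only the rows a user is allowed to see through a view, grant privileges on the view, and withhold grants on the base table.

    CREATE SQL SECURITY DEFINER VIEW shop.emea_orders AS
      SELECT * FROM shop.orders WHERE region = 'EMEA';
    -- the analyst gets access to the view only, not to shop.orders itself
    GRANT SELECT ON shop.emea_orders TO 'emea_analyst'@'%';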