Planet MySQL

JDBC Driver Connection URL strings


Introduction: Ever wanted to connect to a relational database using Java and didn’t know the URL connection string? Then this article is surely going to help you from now on. Oracle: The JDBC connection properties look as follows:

  • JDBC Driver: oracle.jdbc.OracleDriver
  • JDBC URL: jdbc:oracle:thin:@localhost:1521/orclpdb1
  • Hibernate Dialect: org.hibernate.dialect.Oracle12cDialect

And, if you want to connect using a … Continue reading JDBC Driver Connection URL strings

The post JDBC Driver Connection URL strings appeared first on Vlad Mihalcea's Blog.


Convert Galera Node to Async Slave And Vice-versa With Galera Cluster

Nilnandan Joshi | Wed, 02/14/2018 - 08:12

Recently, I was working with one of our customers who wanted to automate the process of converting a Galera node to an async slave and converting an async slave back to a Galera node, without shutting down any servers. This blog post provides step-by-step instructions on how to accomplish this. For testing purposes, I've used a sandbox and installed a 3-node Galera cluster on the same server with different ports.

The following steps convert a Galera node to an async slave.

Step 1: Detach the Galera node by setting wsrep_on=0 and wsrep_cluster_address='dummy://'.

MariaDB [nil]> SET GLOBAL wsrep_on=0; SET GLOBAL wsrep_cluster_address='dummy://';

Step 2: Collect the value of wsrep_last_committed, which is the XID.

MariaDB [nil]> show global status like '%wsrep_last_committed%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_last_committed | 40455 |
+----------------------+-------+

Step 3: Using that XID, find the binlog file and end log position.

[nil@centos68 data]$ mysqlbinlog --base64-output=decode-rows --verbose mysql-bin.000012  | grep -i "Xid = 40455"
#180113  5:35:49 server id 112  end_log_pos 803         Xid = 40455
[nil@centos68 data]$

Step 4: Start replication from the Galera cluster using that binlog file and position.

CHANGE MASTER TO MASTER_HOST='127.0.0.1',
MASTER_PORT=19223,
MASTER_USER='repl_user' ,
MASTER_PASSWORD='replica123' ,
MASTER_LOG_FILE='mysql-bin.000012',
MASTER_LOG_POS=803;
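
After the CHANGE MASTER TO statement, starting and verifying replication is the natural next step; a minimal sketch, assuming the default (unnamed) replication channel:

START SLAVE;
SHOW SLAVE STATUS\G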

DO NOT FORGET to edit my.cnf so that these dynamic settings also take permanent effect, i.e.:

[mysqld]
wsrep_on=0
wsrep_cluster_address='dummy://'

For the reverse process, follow these steps to convert an async slave back into a Galera node.

Step 1: Stop slave, collect Master_Log_File and Exec_Master_Log_Pos.

MariaDB [nil]> stop slave;
Query OK, 0 rows affected (0.01 sec)
MariaDB [nil]> show slave status \G
...
Master_Log_File: mysql-bin.000013
Exec_Master_Log_Pos: 683

Step 2: Using that information, find the XID in the binlog.

[nil@centos68 data]$ mysqlbinlog --base64-output=decode-rows --verbose mysql-bin.000013 | grep -i "683"
#180113  5:38:06 server id 112  end_log_pos 683         Xid = 40457
[nil@centos68 data]$

Step 3: Combine wsrep_cluster_state_uuid with that XID.

wsrep_cluster_state_uuid     | afdac6cb-f7ee-11e7-b1c5-9e96fe6fb1e1

so wsrep_start_position = 'afdac6cb-f7ee-11e7-b1c5-9e96fe6fb1e1:40457'

Step 4: Set it as wsrep_start_position and add the server back as a node of the Galera cluster.

MariaDB [nil]> set global wsrep_start_position='afdac6cb-f7ee-11e7-b1c5-9e96fe6fb1e1:40457';
Query OK, 0 rows affected (0.00 sec)
MariaDB [nil]> SET GLOBAL wsrep_on=1; SET GLOBAL wsrep_cluster_address='gcomm://127.0.0.1:4030,127.0.0.1:5030';
Query OK, 0 rows affected (0.00 sec)

DO NOT FORGET to edit my.cnf so that these dynamic settings also take permanent effect, i.e.:

[mysqld]
wsrep_on=1
wsrep_cluster_address='gcomm://127.0.0.1:4030,127.0.0.1:5030'

In case of heavy load on the server or slave lag, you may need to speed up this process.

For a full step-by-step guide, you can check out my original blog post here.

This blog covered the process of converting a Galera node to an async slave, and an async slave back to a Galera node, without shutting down any servers.


Amazon Aurora MySQL Monitoring with Percona Monitoring and Management (PMM)


In this blog post, we’ll review additional Amazon Aurora MySQL monitoring capabilities we’ve added in Percona Monitoring and Management (PMM) 1.7.0. You can see them in action in the MySQL Amazon Aurora Metrics dashboard.

Amazon Aurora MySQL Transaction Commits

This graph looks at the number of commits the Amazon Aurora engine performed, as well as the average commit latency. As you can see from this graph, latency does not always correlate with the number of commits performed and can be quite high in certain situations.

Amazon Aurora MySQL Load

In Percona Monitoring and Management, we often use the concept of “Load” – which roughly corresponds to the number of operations of a type in progress. This graph shows us what statements contribute the most load on the system, as well as what load corresponds to the Amazon Aurora transaction commits (which we observed in the graph before).

Amazon Aurora MySQL Memory Usage


This graph is pretty self-explanatory. It shows how much memory is used by the Amazon Aurora lock manager, as well as the amount of memory used by Amazon Aurora to store Data Dictionary.

Amazon Aurora MySQL Statement Latency


This graph shows the average latency for the most important types of statements. Latency spikes, as shown in this example, are often indicative of instance overload.

Amazon Aurora MySQL Special Command Counters


Amazon Aurora MySQL allows a number of commands that are not available in standard MySQL. This graph shows the usage of such commands. Regular “unit_test” calls can be seen in the default Amazon Aurora install, and the rest depends on your workload.

Amazon Aurora MySQL Problems


This graph is where you want to see a flat line. It shows different kinds of internal Amazon Aurora MySQL problems, which in normal operation should generally be zero.

I hope you find these Amazon Aurora MySQL monitoring improvements useful. Let us know if there is any other Amazon Aurora information that would be helpful to display!

Update on Percona Platform Lifecycle for Ubuntu “Stable” Versions


This blog post highlights changes to the Percona Platform Lifecycle for Ubuntu “Stable” versions.

We have recently made some changes to our Percona Platform and Software Lifecycle policy in an effort to more strongly align with upstream Linux distributions. As part of this, we’ve set our timeframe for providing supported builds for Ubuntu “Stable” (non-LTS) releases to nine (9) months. This matches the current Ubuntu distribution upstream policy.

In the future, we will continue to shift as necessary to match the upstream policy specified by Canonical. Along with this, as we did with Debian 9 before, we will only produce 64-bit builds for this platform going forward. It has been our intention for some time to slowly phase out 32-bit builds, as they are rarely downloaded and largely unnecessary nowadays.

If you have any questions or concerns, please feel free to contact Percona Support or post on our Community Forums.

Preview: Top MySQL 8 Features


Although there is no official software release for MySQL 8.0 as of yet, most insiders believe that it’s likely to arrive sometime in 2018.  In the meantime, Oracle has officially announced a tantalizing list of over two hundred new features!   We recently covered Replication Performance Enhancements.  Today’s blog will cover some of the other exciting enhancements we can expect when the production release of MySQL 8 hits the market.

New Database Roles

A role is a named collection of privileges that defines what a user can and cannot do within a database. Roles play a vital part in database security by limiting who can connect to the server, access the database, or even access individual database objects and data.

Although MySQL did provide a set of privileges and administrative roles prior to version 8, the upcoming release will also support a set of flexible and properly architected roles, allowing DBAs to:

  • Create and Drop Roles, Grant to Roles
  • Grant Roles to Roles, Grant Roles to Users
  • Limit Hosts that can use roles, Define Default Roles
  • Decide what roles are applicable during a session
  • And even visualize Roles with SQL function ROLES_GRAPHML()

Since each role packs multiple privileges, DBAs don’t have to remember exactly which permissions a user requires.  Roles are also very easy to set up:

  • Creating a new role:

    CREATE ROLE 'app_developer', 'app_read', 'app_write';
  • Assigning privileges to roles:

    GRANT SELECT ON app_db.* TO 'app_read';
  • Assigning the role to a user:

    GRANT 'app_read' TO 'read_user1'@'localhost', 'read_user2'@'localhost';
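
The "define default roles" and "decide what roles are applicable during a session" items above can be sketched with the documented MySQL 8 syntax as well (reusing the role and account names from the previous examples):

    SET DEFAULT ROLE 'app_read' TO 'read_user1'@'localhost';

    -- inside a session, the user can switch the active roles and check them:
    SET ROLE 'app_read';
    SELECT CURRENT_ROLE();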

Index Hiding, a.k.a “Invisible” Indexes

Hidden indexes are similar to disabled indexes, except that, in the case of the former, the index information remains fully up to date and maintained by Data Manipulation Language (DML); it's just invisible to the MySQL optimizer. This feature is useful for hiding an index you suspect you don't need, without actually dropping it. Once an index is marked invisible, the MySQL optimizer will no longer use it. You can then monitor your server and query performance to decide whether to delete it or re-activate it, if it turns out that the index does provide improved performance.

This feature has two main uses:

Soft Delete

This is the situation described above, where you don't think an index is used any more. In this case, rendering the index invisible is akin to throwing it in the recycle bin: in that state, it's still possible to restore it.

First you would render the index invisible:

ALTER TABLE Country ALTER INDEX c INVISIBLE;

You can revert it – i.e. make it visible again – if need be:

ALTER TABLE Country ALTER INDEX c VISIBLE;

If it is safe to drop the index:

ALTER TABLE Country DROP INDEX c;
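
If you want to confirm an index's current state before dropping it, MySQL 8 also exposes the visibility flag in the data dictionary; a quick sketch:

SELECT INDEX_NAME, IS_VISIBLE
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Country'
  AND INDEX_NAME = 'c';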

Staged Rollout

Adding a new index not only changes existing execution plans; like all changes, it also introduces the risk of regression, where your database becomes unstable due to multiple changes and additions that may not have been fully tested as a whole.

Invisible indexes allow you to stage all changes by putting the database in a “prepared” state.

You can add an index invisibly at an opportune time:

ALTER TABLE Country ADD INDEX c (Continent) INVISIBLE;

Then activate the index after testing the changes to everyone’s satisfaction:

ALTER TABLE Country ALTER INDEX c VISIBLE;

Improved JSON and Document Support

MySQL 5.7 introduced JSON support in order to compete with NoSQL databases that use JSON natively. That included the introduction of a JSON data type, virtual columns and a set of approximately 20 SQL functions that allow you to manipulate and search JSON data on the server side.  MySQL 8 continues to build on 5.7’s foundation by improving performance, as well as by adding:

  • functions to perform search operations on JSON values to extract data from them, report whether data exists at a location within them, or report the path to data within them
  • aggregation functions that let MySQL-native structured data and semi-structured JSON data be merged in a query
  • document-store abilities

Searching JSON Data

Searching through JSON data is now easier thanks to the JSON_EXTRACT() function. It returns data from a JSON document (the first argument), selected from the parts of the document matched by subsequent path arguments. For example, here is a query that fetches the second element from a JSON-formatted array:

mysql> SELECT JSON_EXTRACT('[10, 20, [30, 40]]', '$[1]');

+--------------------------------------------+
| JSON_EXTRACT('[10, 20, [30, 40]]', '$[1]') |
+--------------------------------------------+
| 20                                         |
+--------------------------------------------+

Aggregation functions

The MySQL 8.0 lab release added the JSON_ARRAYAGG() and JSON_OBJECTAGG() aggregation functions that can be utilized to combine data into JSON arrays/objects.

Consider the following table:

+------+-------+-----+
| key  | group | val |
+------+-------+-----+
| key1 | g1    | v1  |
| key2 | g2    | v1  |
| key3 | g3    | v2  |
+------+-------+-----+

The following query selects the keys as a JSON array:

mysql> SELECT JSON_ARRAYAGG(`key`) AS `keys` FROM t1;

+--------------------------+
| keys                     |
+--------------------------+
| ["key1", "key2", "key3"] |
+--------------------------+
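
JSON_OBJECTAGG() works the same way but takes a key column and a value column and returns a single JSON object; a sketch against the same table, which would return {"key1": "v1", "key2": "v1", "key3": "v2"}:

mysql> SELECT JSON_OBJECTAGG(`key`, val) AS key_vals FROM t1;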

Document-store Abilities

Shortly after the JSON data type emerged came the MySQL Document Store feature. It was designed for developers who are not well versed in SQL but want to enjoy the many benefits that a relational database provides.  In MySQL 8, reads and writes to the document store use transactions, so that changes to JSON data may be rolled back. Moreover, documents may be stored in the open GeoJSON format for geospatial data so that they can be indexed and searched according to proximity.

In order to function as a document store, MySQL employs the X Plugin and the MySQL Shell interface. Clients communicate with the MySQL server using the X Protocol via the X DevAPI, a modern programming interface that provides support for established industry-standard concepts such as CRUD operations. It is implemented in several programming languages, including Java, JavaScript, Node.js, Python, and C++, with more on the way.

Say that you added the following JSON data to the document store:

{
    GNP: .6,
    IndepYear: 1967,
    Name: "Sealand",
    _id: "SEA",
    demographics: {
        LifeExpectancy: 79,
        Population: 27
    },
    geography: {
        Continent: "Europe",
        Region: "British Islands",
        SurfaceArea: 193
    }
}

You could then retrieve the document by ID (the _id field) using the find() method.  Here is the call using the JavaScript shell:

mysql-js> db.countryinfo.find("_id = 'SEA'")
[
    {
        "GNP": 351182,
...
             SurfaceArea: 193
        }
    }
]

Configuration Persistence

Changing configuration during MySQL runtime is commonly done using SET GLOBAL. The disadvantage of this technique is that the changes will not survive a server restart. As of MySQL 8, configuration changes applied via the SET PERSIST command will survive a MySQL server restart. For instance:

SET PERSIST max_connections = 500;

SET PERSIST works with dynamic configuration variables, including offline_mode, read_only, and so on.
One of the best things about SET PERSIST is that it does not require you to edit configuration files on disk, making it particularly useful when you don't have filesystem access to the server.
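
Under the hood, persisted settings are written to a mysqld-auto.cnf file in the data directory, and they can be inspected and removed from SQL as well; for example:

SELECT * FROM performance_schema.persisted_variables;

RESET PERSIST max_connections;   -- remove a single persisted setting
RESET PERSIST;                   -- remove all persisted settings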

Unicode UTF-8 Encoding

With the precipitous rise of UTF-8 encoding in recent years, it has emerged as the dominating character encoding for the Web and modern applications.  UTF-8’s dominance has been partially driven by “adopted words” from foreign languages, but more likely the main factor has been its support for emojis.

In a move that will no doubt make life easier for the vast majority of MySQL users, version 8 no longer uses latin1 as the default encoding, discouraging new users from choosing a problematic legacy option. The recommended default character set for MySQL 8 is now utf8mb4, which is intended to be faster than the now-deprecated utf8mb3 character set and also to support more flexible collations and case sensitivity.

 


Common Table Expressions

Derived tables have existed in MySQL for a while now (since version 4.1, in fact). So what is a derived table, you may ask? A derived table is a subquery in the FROM clause:

SELECT … FROM (SELECT …) AS derived_table;

You could think of Common Table Expressions (CTEs) as improved derived tables – at least in their non-recursive form. CTEs can be recursive as well, but that's getting a bit ahead of ourselves.
The purpose of CTEs is to simplify the writing of complex SQL. You can always recognize them by the WITH keyword at the start of the SQL statement. For instance:

WITH t1 AS (SELECT * FROM tbl_a WHERE a='b')
SELECT * FROM t1;

Here’s the same query rewritten using a derived table:

SELECT *
FROM (SELECT * FROM tbl_a) AS t1
WHERE t1.a='b';

Recursive CTEs

A recursive CTE is a set of rows that is built iteratively like a programming loop. An initial set of rows is fed into the process, each time producing more rows until the process ceases to produce any additional rows.  Syntactically, a recursive CTE refers to itself in a subquery; the “seed” SELECT is executed once to create the initial data subset, then, the recursive SELECT is repeatedly executed to return subsets of data until the complete result set is obtained.

Similar to Oracle’s CONNECT BY, Recursive CTEs are useful to dig in hierarchies such as parent/child and part/subpart relationships.

Recursive CTEs typically take this form:

WITH RECURSIVE cte_name AS
(
  SELECT ...      <-- specifies initial set
  UNION ALL
  SELECT ...      <-- specifies how to derive new rows
)

Here’s a simple example that outputs 1 to 10:

WITH RECURSIVE qn AS
( SELECT 1 AS a
  UNION ALL
  SELECT 1+a FROM qn WHERE a<10
)
SELECT * FROM qn;

+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
|    7 |
|    8 |
|    9 |
|   10 |
+------+

CTEs can also be utilized within SELECT, INSERT, UPDATE, and DELETE statements. For example, taking our 1-to-10 example, we can name the column using the my_cte(n) syntax and use the result of my_cte to populate a table called "numbers":

INSERT INTO numbers
WITH RECURSIVE my_cte(n) AS
(
  SELECT 1
  UNION ALL
  SELECT 1+n FROM my_cte WHERE n<10
)
SELECT * FROM my_cte;

Querying the numbers table confirms that it contains numbers from 1 to 10:

SELECT * FROM numbers;

+------+
| n    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
|    7 |
|    8 |
|    9 |
|   10 |
+------+

Window Functions

An extremely useful feature, window functions have enjoyed support on many other database products for some time now.  A window function performs a calculation across a set of rows that are related to the current row, similar to an aggregate function.  However, unlike aggregate functions, a window function does not cause rows to become grouped into a single output row.  This allows you to perform aggregate calculations across multiple rows while still having access to individual rows “in the vicinity” of the current row.  

The currently supported functions include:

Name            Description
CUME_DIST()     Cumulative distribution value
DENSE_RANK()    Rank of current row within its partition, without gaps
FIRST_VALUE()   Value of argument from the first row of window frame
LAG()           Value of argument from row lagging current row within partition
LAST_VALUE()    Value of argument from the last row of window frame
LEAD()          Value of argument from row leading current row within partition
NTH_VALUE()     Value of argument from N-th row of window frame
NTILE()         Bucket number of the current row within its partition
PERCENT_RANK()  Percentage rank value
RANK()          Rank of current row within its partition, with gaps
ROW_NUMBER()    Number of current row within its partition

For example, suppose we have a table that contains sales figures. We can aggregate total profit by country:

SELECT country, SUM(profit) AS country_profit
FROM sales
GROUP BY country
ORDER BY country;

+---------+----------------+
| country | country_profit |
+---------+----------------+
| Finland |           1610 |
| India   |           1350 |
| USA     |           4575 |
+---------+----------------+

By contrast, window operations do not collapse groups of query rows to a single output row. Instead, they produce a result for each row. Like the preceding queries, the following query uses SUM(), but this time as a window function:

SELECT year, country, product, profit,
       SUM(profit) OVER() AS total_profit,
       SUM(profit) OVER(PARTITION BY country) AS country_profit
FROM sales
ORDER BY country, year, product, profit;

+------+---------+------------+--------+--------------+----------------+
| year | country | product    | profit | total_profit | country_profit |
+------+---------+------------+--------+--------------+----------------+
| 2000 | Finland | Computer   |   1500 |         7535 |           1610 |
| 2000 | Finland | Phone      |    100 |         7535 |           1610 |
| 2001 | Finland | Phone      |     10 |         7535 |           1610 |
| 2000 | India   | Calculator |     75 |         7535 |           1350 |
| 2000 | India   | Calculator |     75 |         7535 |           1350 |
| 2000 | India   | Computer   |   1200 |         7535 |           1350 |
| 2000 | USA     | Calculator |     75 |         7535 |           4575 |
| 2000 | USA     | Computer   |   1500 |         7535 |           4575 |
| 2001 | USA     | Calculator |     50 |         7535 |           4575 |
| 2001 | USA     | Computer   |   1200 |         7535 |           4575 |
| 2001 | USA     | Computer   |   1500 |         7535 |           4575 |
| 2001 | USA     | TV         |    100 |         7535 |           4575 |
| 2001 | USA     | TV         |    150 |         7535 |           4575 |
+------+---------+------------+--------+--------------+----------------+

The key part of this query is SUM(profit) OVER (…), the window function calls. PARTITION BY divides rows into groups, while SUM() tallies the profit figures for the specified group (country).

Conclusion

From new database roles and index hiding to Recursive Common Table Expressions and Window Functions, MySQL 8 contains many long-awaited features and bug fixes.  As to when it will be released, we can only guess at this point, since the production release was originally scheduled for October of 2017.  There is one thing that we can say for sure: it will almost certainly have been worth the wait!

The post Preview: Top MySQL 8 Features appeared first on Monyog Blog.

How to Install Liferay CMS on Debian 9

Liferay is free and open-source content management software written in Java that uses MySQL to store its data. Liferay is a web-based application portal that can be used to build websites and portals as an assembly of themes, pages, and a common navigation. In this tutorial, we will show you how to install Liferay on a Debian 9 server.

Troubleshooting MySQL Crashes Webinar: Q&A


In this blog, I will provide answers to the Q&A for the Troubleshooting MySQL Crashes webinar.

First, I want to thank everybody for attending our January 25, 2018, webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I was unable to answer fully during the webinar.

Q: I have the 600 seconds “Long semaphore wait” assertion failure / crashing issue following DDL queries, sometimes on the master, sometimes just the slaves. Any hints for troubleshooting these? How can I understand what semaphore holding threads are doing?

A: These are the hardest errors to troubleshoot, especially because in some cases (such as long-running CHECK TABLE commands) long semaphore waits can be expected and appropriate behavior. If you see long semaphore waits when performing DDL operations, it makes sense to consider using the pt-online-schema-change or gh-ost utilities. Also, check the list of supported online DDL operations in the MySQL Reference Manual.
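
For reference, a typical pt-online-schema-change invocation for such a DDL might look like the sketch below (the database, table, and index names are placeholders; test with --dry-run before using --execute):

pt-online-schema-change \
  --alter "ADD INDEX idx_col1 (col1)" \
  D=mydb,t=mytable \
  --execute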

But if you want to know how to analyze such messages, let’s check the output from page #17 in the slide deck used in the webinar:

2018-01-19T20:38:43.381127Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
a writer (thread id 139965530261248) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 3454
Last time write locked in file /mnt/workspace/percona-server-5.7-binaries-release/label_exp/
debian-wheezy-x64/percona-server-5.7.14-8/storage/innobase/btr/btr0btr.cc line 177
2018-01-19T20:38:43.381143Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139965135804160 has waited at buf0buf.cc line 4196 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4f257d33c0 created in file hash0hash.cc line 353
a writer (thread id 139965345621760) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.cc line 4196
Last time write locked in file ...

The line

--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:

shows that some transaction was waiting for a semaphore. The code responsible for this wait is located on line 3454 in the file ibuf0ibuf.cc. I received this crash when I ran Percona Server for MySQL version 5.7.14-8. Therefore, to check what this code is doing, I need to use the Percona Server 5.7.14-8 source code:
sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/ibuf/ibuf0ibuf.cc
...
3454 btr_pcur_open(ibuf->index, ibuf_entry, PAGE_CUR_LE, mode, &pcur, &mtr);
...

A few lines above in the same file, we find the function definition and a comment:

3334 /** Buffer an operation in the insert/delete buffer, instead of doing it
3335 directly to the disk page, if this is possible.
3336 @param[in] mode BTR_MODIFY_PREV or BTR_MODIFY_TREE
3337 @param[in] op operation type
3338 @param[in] no_counter TRUE=use 5.0.3 format; FALSE=allow delete
3339 buffering
3340 @param[in] entry index entry to insert
3341 @param[in] entry_size rec_get_converted_size(index, entry)
3342 @param[in,out] index index where to insert; must not be unique
3343 or clustered
3344 @param[in] page_id page id where to insert
3345 @param[in] page_size page size
3346 @param[in,out] thr query thread
3347 @return DB_SUCCESS, DB_STRONG_FAIL or other error */
3348 static MY_ATTRIBUTE((warn_unused_result))
3349 dberr_t
3350 ibuf_insert_low(
3351 ulint mode,
3352 ibuf_op_t op,
3353 ibool no_counter,
3354 const dtuple_t* entry,
3355 ulint entry_size,
3356 dict_index_t* index,
3357 const page_id_t& page_id,
3358 const page_size_t& page_size,
3359 que_thr_t* thr)
3360 {
...

The first line of the comment gives us an idea that InnoDB is trying to insert data into the change buffer.

Now, let’s check the next line from the error log file:

S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/buf/buf0buf.cc
...
1446 /* If PFS_SKIP_BUFFER_MUTEX_RWLOCK is defined, skip registration
1447 of buffer block rwlock with performance schema.
1448
1449 If PFS_GROUP_BUFFER_SYNC is defined, skip the registration
1450 since buffer block rwlock will be registered later in
1451 pfs_register_buffer_block(). */
1452
1453 rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
...

And again let’s check what this function is doing:

1402 /********************************************************************//**
1403 Initializes a buffer control block when the buf_pool is created. */
1404 static
1405 void
1406 buf_block_init(

Even without knowledge of how InnoDB works internally, just by reading these comments I can guess that a thread waits for some global InnoDB lock when it tries to insert data into the change buffer. The solution for this issue could be disabling the change buffer, limiting write concurrency, upgrading, or using a software solution that allows you to scale writes.

Q: Regarding the page cleaner messages: when running the app using replication we didn't get them, but after switching to PXC we started getting them. Is there something particular to PXC we should look at to help resolve this?

A: Page cleaner messages can be a symptom of starved IO activity. You need to compare the Percona XtraDB Cluster (PXC) and standalone server installations and check how exactly the write load increased.

Q: Hi, I have one question. We have a query that joins on BLOB or TEXT fields, and it is causing system locks and high CPU alerts. Can you please suggest how we can make it work? Can you please send the answer in text? I missed some information.

A: If you are joining on BLOB or TEXT fields, you most likely aren't using indexes. This means that InnoDB has to perform a full table scan. That increases IO and CPU activity by itself, but it also increases the number of locks that InnoDB has to set to resolve the query. Even if you have partial (prefix) indexes on the BLOB and TEXT columns, mysqld has to compare the full values for the equality check, so it cannot use the index alone to resolve the ON clause. It is a best practice to avoid such kinds of JOINs. You can use surrogate integer keys, for example.
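
To illustrate the surrogate-key idea, here is a sketch (the table and column names t1, t2, id and long_col are hypothetical; generated columns require MySQL 5.7 or later):

ALTER TABLE t1
  ADD COLUMN long_col_hash BINARY(20)
      GENERATED ALWAYS AS (UNHEX(SHA1(long_col))) STORED,
  ADD INDEX idx_long_col_hash (long_col_hash);

-- join on the short, indexed hash instead of comparing the full TEXT values
SELECT t1.id, t2.id
FROM t1
JOIN t2 ON t1.long_col_hash = UNHEX(SHA1(t2.long_col));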

Q: Hi, please notice that “MySQL server has gone away” is the worst one, in my opinion, and there was no mention of it. Can you share some tips on this? Thank you. By the way, neither the Oracle MySQL nor the Percona error log helps with that.

A: The “MySQL server has gone away” error may be the result of a crash. In this case, you need to handle it like any other crash symptom. But in most cases, it is a symptom of network failure. Unfortunately, MySQL doesn't have much information about why connection failures happen, probably because, from mysqld's point of view, a problematic network only means that the client unexpectedly disconnected; the client, still waiting for a response after the timeout, receives “MySQL server has gone away”. I discussed these kinds of errors in my “Troubleshooting hardware resource usage” webinar. A good practice when you see this kind of error often is not to leave idle connections open for a long time.

Q: I see that a lot of hard investigative work goes into figuring out what might be going wrong. Is there anything on the development roadmap to improve the error log output messages? Can you comment on that?

A: Percona Engineering does a lot for better diagnostics. For example, Percona Server for MySQL has an extended slow log file format, and Percona Server for MySQL 5.7.20 introduced a new innodb_print_lock_wait_timeout_info variable that allows logging information about all InnoDB lock wait timeout errors (manual). More importantly, it logs not only the blocked transaction, but also the locking transaction. This feature was requested at lp:1657737 for one of our Percona Support customers and is now implemented.

The Oracle MySQL Engineering team also does a lot for better error logging. These improvements started in version 5.7.2, when the log_error_verbosity variable was introduced. Version 8.0.4 added much better tuning control. You can read about it in the Release Notes.

Q: Hello, do you use strace to find out exactly which table has problems when there is no clear information in the MySQL error log?

A: I am not a big fan of strace when debugging mysqld crashes, but Percona Support certainly uses this tool. I myself prefer to work with strace when debugging client issues, such as trying to identify why Percona XtraBackup behaves incorrectly.

Thanks everybody for attending the webinar. You can find the slides and recording of the webinar at the Troubleshooting MySQL Crashes web page.

See You on the Road at 2018's Shows and Events!


Calling all developers, DBAs, engineers, SREs, and database aficionados: we’re hitting the road and hope to see you along the way. It's always a highlight of our year when we get to meet up with our users and friends face-to-face, and we'll be attending events across the country (and abroad!) for the rest of 2018. If you see our booth, come say hello. 


Below is a list of the places you'll be able to find us over the next nine months, plus descriptions of how we'll be involved at each event (and, in some cases, special ticket discount codes).

Event                     Start        End          Location
DevOps Days Charlotte     2/22/2018    2/23/2018    Charlotte, NC
Strata Conference         3/5/2018     3/8/2018     San Jose, CA
DevOps Days Baltimore     3/21/2018    3/22/2018    Baltimore, MD
SRECon                    3/27/2018    3/29/2018    Santa Clara, CA
PostgresConf US 2018      4/16/2018    4/20/2018    Jersey City, NJ
Percona Live Santa Clara  4/23/2018    4/25/2018    Santa Clara, CA
PHP[tek]                  5/31/2018    6/1/2018     Atlanta, GA
Velocity San Jose         6/11/2018    6/14/2018    San Jose, CA
MongoDB World             6/26/2018    6/27/2018    New York, NY
AWS re:Invent             11/26/2018   11/30/2018   Las Vegas, NV


The Lineup

DevOps Days Charlotte  We're kicking off our 2018 tradeshow schedule by heading to DevOpsDays Charlotte. VividCortex engineer Preetam Jinka will be speaking on February 23rd at 3:00pm at the event; check out his session here. DevOpsDays Charlotte will bring 200+ development, infrastructure, operations, information security, management and leadership professionals together to discuss the culture, processes and tools to enable better organizations and innovative products.

Strata Conf — Every year thousands of top data scientists, analysts, engineers, and executives converge at Strata Data Conference—the largest gathering of its kind. Baron Schwartz will be presenting, "Why nobody cares about your anomaly detection" during the event.  

DevOps Days Baltimore   Join us for another DevOpsDays, this time in Baltimore! Baltimore will bring hundreds of development, infrastructure, operations, information security, management and leadership professionals together to discuss the latest tools and process improvements.

SRECon   SREcon18 Americas is a gathering of engineers who care deeply about engineering resilience, reliability, and performance into complex distributed systems, and the scalability of products, services, and infrastructure within their organizations. Baron Schwartz will be presenting; be sure to check out his session here.

PostgresConf US  We will start off spring at PostgresConf US in Jersey City, NJ. Baron Schwartz will be delivering a keynote on April 20th at 1:50pm. Check out the session here. Stop by our booth to see Matt Culmone and Jon Stumpf for a free hat and a live product demo.

Percona Live Santa Clara — We're sponsoring Percona Live Santa Clara again this year, and once again Baron Schwartz will be headlining! Percona Live is where our community comes together to build the latest database technologies, and it's where the open source and commercial worlds meet. Much more exciting news and details to come...stay tuned! Use the discount code VividCortex20 at registration to save 20% off the ticket price.

PHP [TEK] — Considered the premier PHP conference and annual homecoming for the PHP Community. Be sure to catch Baron's sessions here.

Velocity SJ We’re heading back to the Bay Area on June 20th for Velocity. It's where practitioners fearlessly share their stories so that we can learn from their successes, failures, and new ideas. We’ll have some fun giveaways at our booth too, so stop by to grab some swag, see a product demo, or just to say hello!

MongoDB World Join us in New York  in the heart of midtown for MongoDB World. This is our second year sponsoring the event, and we're pumped to be part of the MongoDB crowd and its Giant Ideas.

AWS re:Invent — We'll wrap up our  2018 tradeshow schedule at AWS re:Invent on November 27 through December 1. More details to come! 


Releasing ProxySQL 1.4.6


Proudly announcing the latest stable release of ProxySQL, version 1.4.6, released on the 1st of February 2018.

ProxySQL is a high performance, high availability, protocol aware proxy for MySQL.
It can be downloaded here and is freely usable and accessible according to the GPL license.

ProxySQL 1.4.6 includes a number of important improvements and bug fixes including:

  • SET statements could lead to crash as reported in #1342
  • Greatly reduced locking contention on the SQLite database for the ProxySQL Admin and Statistics tables

In addition, this release also contains the improvements and fixes from ProxySQL 1.4.5, which were not announced; they are included below for the sake of completeness:

  • Missing locks in Prepared Statements Manager caused crashes #1307
  • Unnecessary attempts to purge PS cache could lead to high CPU usage and slowdown #1312
  • Mirroring could cause crash #1305
  • ProxySQL Cluster caused crash if mysql_replication_hostgroups.comment is NULL #1304
  • SHOW MYSQL STATUS becomes very slow with millions of PS in cache #1333
  • The stats_mysql_query_rules.hits could have an integer overflow

IMPORTANT NOTE: If you are considering upgrading from a version prior to 1.4.5 it is recommended to install version 1.4.6 directly.

A special thanks to all the people that report bugs: this makes each version of ProxySQL better than the previous one.

Please report any bugs or feature requests on the GitHub issue tracker.

A Tale of Three Computer Conferences, Two Communities

Three conferences in three weeks! FOSDEM, SunshinePHP, and PHP UK are three excellent conferences that this year are back to back to back. 

FOSDEM is to the computer world what Renaissance Fairs are to those who have their own maces and armor. FOSDEM is held on the campus of the Free University of Brussels; there is no registration -- pre, onsite, or post -- and the organizers estimate attendance by the MAC addresses of devices that connect to the network. No tickets, no badges, and no reserved seats, and FOSDEM is free to attend. Rooms are requested by various groups, including my MySQL Community Team partner LeFred for the MySQL ecosystem. The MySQL and Friends Devroom was packed from early morning to evening with engaging 30-minute presentations from a number of companies. This show has in the last few years become one of the most important technical shows on the MySQL Community Team schedule. LeFred and the presenters did a tremendous job of putting together amazing talks for the MySQL Community.

SunshinePHP is held in Miami and organized by the amazing Adam Culp. He and his team have an amazing knack for pulling fantastic talks together into a great show. Be advised that this is a show where you can go from the airport to the hotel for the conference and then return to the airport at the end without ever leaving the venue. I spoke on MySQL 8 and received a lot of feedback that I used to update my presentation for the next show.

And the next show is PHP UK.  The PHP Community is very strong, supportive, and radiant in new advancements in the PHP 7 series.  As with SunshinePHP, the PHP folks are warm, supportive, and invigorated.  The organizers of the London show have also assembled a talented group of presenters and I seem to be the only carryover from the previous show with my talk on MySQL 8.

A Comparison of the Communities


The MySQL and PHP communities are both roughly the same age. Both are now confident twenty-year-olds with plenty of recent self-improvement. PHP 7 is light years ahead of the four and five series in speed and capabilities. MySQL is about to take a giant step with MySQL 8. Both had version sixes that never quite made it into production, but the subsequent engineering has produced much stronger products. Both face competition from newer products but still dominate what is the modern implementation of the LAMP stack. And the two products have strong communities working hard to improve them.

The PHP community is much better than its counterpart at aiding novices, mentoring, and stressing the basics of good coding style. Many members have had to add JavaScript skills of one order or another in recent years but still try to keep PHP as their core tool. And there are many more local PHP user groups than MySQL user groups.

Next Up

I will be talking to the San Diego PHP User Group before heading to the Southern California Linux Expo. More on those shows later.

Increasing functional testing velocity with pt-query-digest


Whenever we do upgrades for our clients from one major version of MySQL to another, we strongly recommend testing in two forms.

First, run a performance test between the old version and the new version to make sure there aren't going to be any unexpected issues with query processing rates. Second, run a functional test to ensure all queries that are running on the old version will not have syntax errors or problems with reserved words in the new version that we're upgrading to.

If a client doesn’t have an appropriate testing platform to perform these types of tests, we will leverage available tools to test to the best of our ability. More often than not this includes using pt-upgrade after capturing slow logs with long_query_time set to 0 in order to catch everything that’s running on the server for a period of time.

One of the issues you can run into with this sort of test is that it has to run the queries one at a time. If you have a query that takes much longer to run in the new version, this can slow things down considerably. This also gets a little frustrating if you have that long-running query listed thousands of times in your slow query log.

If your objective is to run a functional test and just ensure that you're not going to run into a syntax error in the new version, it makes no sense to run a query more than once. If it ran okay the first time, it should run okay every time, assuming that the literals in the query are also properly enclosed. So instead of replaying the entire log against the target server, we can first use pt-query-digest to create a slow log that contains one of each type of query.

Let’s take a look at the example below where I created a slow log with 5 identical write queries and 5 identical read queries.

[root@cent5 slowlog]# cat ./testslow.log
......
use ptupgrade;
......
# Time: 180207 12:55:05
# User@Host: root[root] @ localhost []
# Query_time: 0.000134 Lock_time: 0.000051 Rows_sent: 0 Rows_examined: 0
SET timestamp=1518026105;
insert into t1 (c1) values (1);
# Time: 180207 12:55:06
# User@Host: root[root] @ localhost []
# Query_time: 0.000126 Lock_time: 0.000049 Rows_sent: 0 Rows_examined: 0
SET timestamp=1518026106;
insert into t1 (c1) values (2);
# Time: 180207 12:55:08
# User@Host: root[root] @ localhost []
# Query_time: 0.000125 Lock_time: 0.000051 Rows_sent: 0 Rows_examined: 0
SET timestamp=1518026108;
insert into t1 (c1) values (3);
# Time: 180207 12:55:10
# User@Host: root[root] @ localhost []
# Query_time: 0.000130 Lock_time: 0.000052 Rows_sent: 0 Rows_examined: 0
SET timestamp=1518026110;
insert into t1 (c1) values (4);
# Time: 180207 12:55:12
# User@Host: root[root] @ localhost []
# Query_time: 0.000126 Lock_time: 0.000050 Rows_sent: 0 Rows_examined: 0
SET timestamp=1518026112;
insert into t1 (c1) values (5);
# Time: 180207 12:55:17
# User@Host: root[root] @ localhost []
# Query_time: 0.000134 Lock_time: 0.000055 Rows_sent: 2 Rows_examined: 10
SET timestamp=1518026117;
select c1 from t1 where c1 = 1;
# Time: 180207 12:55:19
# User@Host: root[root] @ localhost []
# Query_time: 0.000121 Lock_time: 0.000053 Rows_sent: 2 Rows_examined: 10
SET timestamp=1518026119;
select c1 from t1 where c1 = 2;
# Time: 180207 12:55:20
# User@Host: root[root] @ localhost []
# Query_time: 0.000118 Lock_time: 0.000052 Rows_sent: 2 Rows_examined: 10
SET timestamp=1518026120;
select c1 from t1 where c1 = 3;
# Time: 180207 12:55:22
# User@Host: root[root] @ localhost []
# Query_time: 0.000164 Lock_time: 0.000074 Rows_sent: 2 Rows_examined: 10
SET timestamp=1518026122;
select c1 from t1 where c1 = 4;
# Time: 180207 12:55:24
# User@Host: root[root] @ localhost []
# Query_time: 0.000121 Lock_time: 0.000052 Rows_sent: 2 Rows_examined: 10
SET timestamp=1518026124;
select c1 from t1 where c1 = 5;

I then used pt-query-digest to create a new version of this slow query log with only 1 of each type of query.

[root@cent5 slowlog]# pt-query-digest --limit=100% --sample 1 --no-report --output slowlog ./testslow.log
# Time: 180207 12:55:05
# User@Host: root[root] @ localhost []
# Query_time: 0.000134 Lock_time: 0.000051 Rows_sent: 0 Rows_examined: 0
use ptupgrade;
insert into t1 (c1) values (1);
# Time: 180207 12:55:17
# User@Host: root[root] @ localhost []
# Query_time: 0.000134 Lock_time: 0.000055 Rows_sent: 2 Rows_examined: 10
use ptupgrade;
select c1 from t1 where c1 = 1;

You'll notice that not only did we get one query of each type, pt-query-digest also added a use statement before each query so MySQL knows which schema to run the query against when it is replayed.

You can now take this new slow log and run it via pt-upgrade against your target servers.
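
The pt-upgrade run itself then looks roughly like the sketch below, assuming the pt-query-digest output above was redirected into a file named slow-digest.log (the hostnames and credentials are placeholders; check the pt-upgrade documentation for the options available in your version of Percona Toolkit):

pt-upgrade slow-digest.log \
  h=old-server,P=3306,u=percona,p=secret \
  h=new-server,P=3306,u=percona,p=secret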

Conclusion

If you have a large slow query log file that you are trying to test against your server using a log replay tool like pt-upgrade, you can make your life a lot simpler by getting one sample of each query using pt-query-digest. In the field we've seen this reduce log file sizes from hundreds of gigs to less than a meg, and reduce log replay times from weeks to minutes.

Please note that this is mainly something you'll want to consider for functional testing, as you may want to have a lot of variety in your literals when doing a performance test.

MySQL 5.7 Multi-threads replication operation tips


With support for multi-threaded replication starting from MySQL 5.7, operations on the slave are slightly different from single-threaded replication. Here is a list of operational tips for convenience:

1. Skip a statement for a specific channel.

Sometimes we might find that one of the channels stops replication due to some error, and we may want to skip the offending statement for that channel so that we can restart the slave for it. We need to be very careful not to skip a statement from the other channels, since the command SET GLOBAL sql_slave_skip_counter = N is global. How can we make sure the global sql_slave_skip_counter is applied to a specific channel and not to the other channels? Here are the steps:

1.1: Stop all slaves by: stop slave;

stop slave;

1.2: Set the number of statements to skip with: SET GLOBAL sql_slave_skip_counter = N;

SET GLOBAL sql_slave_skip_counter = 1;

1.3: Start the slave only on the channel we want to skip the statement on. This command will use the global sql_slave_skip_counter = 1 setting to skip one statement and start the slave on that channel (for example 'main'): start slave for channel 'channel-name';

start slave for channel 'main';

1.4: Start slave on all the other channels by: start slave;

start slave;

2. Check the status of replication, with detailed error messages, by selecting from the performance_schema.replication_applier_status_by_worker table:

mysql> select * from performance_schema.replication_applier_status_by_worker;
| CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
| metrics | 1 | 1784802 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |
| accounting | 1 | 1851760 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |
| main | 1 | NULL | OFF | ANONYMOUS | 1051 | Worker 0 failed executing transaction 'ANONYMOUS' at master log mysql-bin.019567, end_log_pos 163723076; Error 'Unknown table 'example.accounts'' on query. Default database: 'pythian'. Query: 'DROP TABLE `example`.`accounts` /* generated by server */' | 2018-02-14 23:57:52 |
| log | 1 | 1784811 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |

A second run, after the problem on the 'main' channel has been resolved, shows all workers back ON:

mysql> select * from performance_schema.replication_applier_status_by_worker;
| CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
| metrics | 1 | 1965646 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |
| accounting | 1 | 1965649 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |
| main | 1 | 1965633 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |
| log | 1 | 1965652 | ON | ANONYMOUS | 0 | | 0000-00-00 00:00:00 |

3. Check the status for a specific channel with: show slave status for channel 'channel-name'\G

mysql> show slave status for channel 'main'\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db-test-01.int.example.com
Master_User: replicator
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.019567
Read_Master_Log_Pos: 869255591
Relay_Log_File: db-test-02-relay-bin-example.000572
Relay_Log_Pos: 45525401
Relay_Master_Log_File: mysql-bin.019567
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table: test.sessions,test.metrics
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 869255591
Relay_Log_Space: 869256195
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 4118338212
Master_UUID: b8cee5b1-3161-11e7-8109-3ca82a217b08
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name: insight
Master_TLS_Version:

I hope this short list of tips helps you enjoy multi-threaded replication.

One year of MySQL Replication Contributions


Since January 2017, the MySQL Replication Team has been involved in processing many community contributions!

We are really happy to receive contributions (and not only in the replication team), but this also implies a lot of work from our engineers: beyond resolving a bug or developing a new feature, code contributions need to be analyzed, and the code needs to be understood and validated.

TOP 10 MySQL 8.0 features for DBAs & OPS


Today, let's have a look at the top 10 new features in MySQL 8.0 that will improve a DBA's life.

Shrinking the list to only 10 items wasn't an easy task, but here is the top 10:

  1. Temporary Tables Improvements
  2. Persistent global variables
  3. No more MyISAM System Tables
  4. Reclaim UNDO space from large transactions
  5. UTF8 performance
  6. Removing Query Cache
  7. Atomic DDLs
  8. Faster & More Complete Performance Schema (Histograms, Indexes, …) and Information Schema
  9. ROLES
  10. REDO & UNDO logs encrypted if tablespace is encrypted

Temporary Tables Improvements

Since 5.7, all internal temporary tables are created in a single shared tablespace called "ibtmp1".

Additionally, the metadata for temporary tables is now stored in memory (no longer in .frm files).

In MySQL 8.0, the TempTable storage engine replaces MEMORY as the default engine for internal temporary tables (those created by the optimizer during JOIN, UNION, etc.). This new engine provides more efficient storage for VARCHAR and VARBINARY columns (with MEMORY, the full maximum size is always allocated).
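
The engine used for in-memory internal temporary tables is controlled by the internal_tmp_mem_storage_engine variable (TempTable by default in 8.0), so the behavior can be checked and, if really needed, switched back; a quick sketch:

SHOW VARIABLES LIKE 'internal_tmp_mem_storage_engine';
SET GLOBAL internal_tmp_mem_storage_engine = MEMORY;   -- or TempTable, the 8.0 default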

Persistent Global Variables

With MySQL 8.0 it is now also possible to set variables and make the change persist across server reboots. I've written a dedicated blog post that you can check for more information.

Combining this syntax with the new RESTART command makes it very easy to configure MySQL from its own shell. This is a cloud-friendly feature!
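
For instance, a tuning change plus a reboot can be issued entirely from the client; a sketch (the variables and values are only examples):

SET PERSIST max_connections = 500;                    -- dynamic variable: applies now and survives restarts
SET PERSIST_ONLY innodb_log_file_size = 1073741824;   -- read-only variable: written for the next start
RESTART;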

No more MyISAM System Tables

With the new native data dictionary, we won't need MyISAM system tables anymore! Those tables and the data dictionary tables are now created in a single InnoDB tablespace file named mysql.ibd in the data directory. This means that if you don't explicitly use MyISAM tables (which is totally inadvisable if you care about your data), you can have a MySQL instance without any MyISAM table.

Reclaim UNDO space from large transactions

In MySQL 5.7, we already added the possibility to truncate undo spaces (innodb_undo_log_truncate, disabled by default). In MySQL 8, we changed the undo disk format to support a huge number of rollback segments per undo tablespace. Also, by default, the rollback segments are now created in two separate undo tablespaces instead of the InnoDB system tablespace (2 is now the minimum, and this setting is now dynamic). We also deprecated the variable to set that value (innodb_undo_tablespaces), as we will provide SQL commands giving DBAs a real interface to interact with undo tablespaces too.

Automatic truncation of undo tablespaces is also now enabled by default.
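
For reference, the truncation behavior is driven by variables that remain tunable at runtime; a sketch (the size threshold is just an example):

SET GLOBAL innodb_undo_log_truncate = ON;
SET GLOBAL innodb_max_undo_log_size = 1073741824;  -- mark undo tablespaces bigger than 1GB for truncation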

UTF8 Performance

The default character set has changed from latin1 to utf8mb4, as UTF-8 handling is now much faster, up to 1800% faster on specific queries! Emojis are everywhere now, and MySQL supports them without problems! 🐬

Removing Query Cache

The first thing I always advised during a performance audit was to disable the query cache, as it didn't scale by design. The MySQL query cache was creating more issues than it solved. We decided to simply remove it in MySQL 8.0, as nobody should use it. If your workload requires a query cache, then you should have a look at ProxySQL as a query cache.

Atomic DDLs

With the new data dictionary, MySQL 8.0 now supports atomic Data Definition Statements (atomic DDLs). This means that when a DDL is performed, the data dictionary updates, the storage engine operation, and the writes to the binary log are combined into a single atomic transaction that is either fully executed or not at all. This provides better reliability, as unfinished DDLs don't leave any incomplete data behind.

Faster & More Complete Performance Schema (Histograms, Indexes, …) and Information Schema

Many improvements were made to Performance Schema, such as fake indexes and histograms.

With the contribution of fake indexes, queries like SELECT * FROM sys.session became 30x faster. Table scans are now avoided as much as possible, and the use of indexes greatly improves execution time. Additionally, Performance Schema now provides histograms of statement latency. The optimizer can also benefit from these new histograms.

Information Schema has also been improved by the use of the data dictionary. No more .frm files are needed to know a table's definition. This also allows scaling to more than 1,000,000 tables!

ROLES

SQL roles have been added to MySQL 8.0. A role is a named collection of privileges. Like user accounts, roles can have privileges granted to and revoked from them. Roles can be applied by default or per session, and roles can also be made mandatory.
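
Making a role mandatory for every account, as mentioned above, goes through the mandatory_roles variable; a minimal sketch (the role name is just an example and must exist):

CREATE ROLE 'app_read';
SET PERSIST mandatory_roles = 'app_read';
-- within a session, activate the granted (and mandatory) roles:
SET ROLE ALL;
SELECT CURRENT_ROLE();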

REDO & UNDO logs encrypted if tablespace is encrypted

In MySQL 5.7, it was possible to encrypt an InnoDB tablespace for tables stored in file-per-table tablespaces. In MySQL 8.0 we completed this feature by adding encryption for the UNDO and REDO logs too.

And once again, the list of improvements doesn't end here. There are many other nice features. I would like to list below some other important ones (even if they are all important, of course 😉):

  1. persistent auto increment
  2. InnoDB self tuning
  3. JSON performance
  4. Invisible Indexes
  5. new lock for backup
  6. Resource Groups
  7. additional metadata into binary logs
  8. OpenSSL for Community Edition too

Please check the online manual to have more information about all these new features.

Why ZFS Affects MySQL Performance


In this blog post, we’ll look at how ZFS affects MySQL performance when used in conjunction.

ZFS and MySQL have a lot in common, since they are both transactional software. Both have properties that, by default, favor consistency over performance. By doubling the complexity layers for getting committed data from the application to a persistent disk, we are logically doubling the amount of work within the whole system and reducing the output. Within the ZFS layer, where does the bulk of the work really come from?

Consider the comparative test below, from a bare metal server. It has a reasonably tuned config (discussed in a separate post; results and scripts here). These numbers are from sysbench tests on hardware with six SAS drives behind a RAID controller with a write-backed cache. ext4 was configured as RAID10 softraid, while ZFS was laid out equivalently (three striped pairs of mirrored vdevs).

There are a few obvious observations here, one being that the ZFS results have a high variance between the median and the 95th percentile. This indicates a regular sharp drop in performance. However, the most glaring thing is that with write-only workloads such as update-index, overall performance could drop to 50%:

[charts: sysbench results, ext4 vs. ZFS]

Looking further into the IO metrics for the update-index tests (95th percentile from /proc/diskstats), ZFS’s behavior tells us a few more things.

[chart: 95th percentile IO metrics for the update-index tests]

  1. ZFS batches writes better, with minimal increases in latency with larger IO size per operation.
  2. ZFS reads are heavily scattered and random: the high response times and low read IOPS and throughput mean significantly more disk seeks.

If we focus on observation #2, there are a number of possible sources of random reads:

  • InnoDB pages that are not in the buffer pool
  • When ZFS records are updated, metadata also has to be read and updated

This means that for updates on cold InnoDB records, multiple random reads are involved that are not present with filesystems like ext4. While ZFS has some tunables for improving synchronous reads, tuning them can be touch and go when trying to fit specific workloads. For this reason, ZFS introduced the L2ARC, where faster drives are used to cache frequently accessed data and serve reads at low latency.

In upcoming posts, we'll look in more detail at how ZFS affects MySQL, the tests above and the configuration behind them, and how we can further improve performance from here.


This Week in Data with Colin Charles 28: Percona Live, MongoDB Transactions and Spectre/Meltdown Rumble On


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

In case you missed last week’s column, don’t forget to read the fairly lengthy FOSDEM MySQL & Friends DevRoom summary.

From a Percona Live Santa Clara 2018 standpoint, beyond the tutorials getting picked and scheduled, the talks have also been picked and scheduled (so you were very likely getting acceptance emails from the Hubb.me system by Tuesday). The rejections have not gone out yet but will follow soon. I expect the schedule to go live either today (end of week) or early next week. Cheapest tickets end March 4, so don’t wait to register!

Amazon Relational Database Service has had a lot of improvements in 2017, and the excellent summary from Jeff Barr is worth a read: Amazon Relational Database Service – Looking Back at 2017. Plenty of improvements for the MySQL, MariaDB Server, PostgreSQL and Aurora worlds.

Spectre/Meltdown and its impact are still being discovered. You need to read Brendan Gregg’s amazing post: KPTI/KAISER Meltdown Initial Performance Regressions. And if you visit Percona Live, you’ll see an amazing keynote from him too! Are you still using MyISAM? MyISAM and KPTI – Performance Implications From The Meltdown Fix suggests switching to Aria or InnoDB.

Probably the biggest news this week though? Transactions are coming to MongoDB 4.0. From the site, “MongoDB 4.0 will add support for multi-document transactions, making it the only database to combine the speed, flexibility, and power of the document model with ACID guarantees. Through snapshot isolation, transactions will provide a globally consistent view of data, and enforce all-or-nothing execution to maintain data integrity.”. You want to read the blog post, MongoDB Drops ACID (the title works if you’re an English native speaker, but maybe not quite if you aren’t). The summary diagram was a highlight for me because you can see the building blocks, plus future plans for MongoDB 4.2.

Releases

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018
  • FOSSASIA 2018 – Singapore – March 22-25 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Sharded replica sets - MySQL and MongoDB

MongoDB used to have a great story for sharded replica sets. But the storage engine, sharding and replica management code had significant room for improvement. Over the last few releases they made remarkable progress on that and the code is starting to match the story. I continue to be impressed by the rate at which they paid off their tech debt and transactions coming to MongoDB 4.0 is one more example.

It is time for us to do the same in the MySQL community.

I used to be skeptical about the market for sharded replica sets with MySQL. This is popular with the web-scale crowd but that is a small market. Today I am less skeptical and assume the market extends far beyond web-scale. This can be true even if the market for replica sets, without sharding, is so much larger.


The market for replica sets is huge. For most users, if you need one instance of MySQL then you also need HA and disaster recovery. So you must manage failover and for a long time (before crash-proof slaves and GTID) that was a lousy experience. It is better today thanks to cloud providers and DIY solutions even if some assembly is required. Upstream is finally putting a solution together with MySQL Group Replication and other pieces.


But sharded replica sets are much harder, and even more so if you want to do cross-shard queries and transactions. While there have been many attempts at sharding solutions for the MySQL community, it is difficult to provide something that works across customers. Fortunately Vitess has shown this can be done and already has many customers in production.

ProxySQL and Orchestrator might also be vital pieces of this stack. I am curious to see how the traditional vendors (MySQL, MariaDB, Percona) respond to this progress.

Updates:

I think binlog server should be part of the solution. But for that to happen we need a GPLv2 binlog server and that has yet to be published.

Building a Twitter art bot with Python, AWS, and socialist realism art


TLDR: I built a Twitter bot that tweets paintings from the WikiArt socialist realism category every 6 hours using Python and AWS Lambdas.

The post outlines why I decided to do that, architecture decisions I made, technical details on how the bot works, and my next steps for the bot.

Check out its website and code here.

Table of Contents

Why build an art bot?

Often when you’re starting out as a data scientist or developer, people will give you the well-intentioned advice of “just picking a project and doing it” as a way of learning the skills you need.

That advice can be hard and vague, particularly when you don’t have a lot of experience to draw from to figure out what’s even feasible given how much you know, and how that whole process should work.

By writing out my process in detail, I’m hoping it helps more people understand:

1) The steps of a software project from beginning to end.

2) The process of putting out a minimum viable project that's "good enough" and iterating over your existing code to add features.

3) Picking a project that you’re going to enjoy working on.

4) The joy of socialist realism art.

Technical Goals

I’ve been doing more software development as part of my data science workflows lately, and I’ve found that:

1) I really enjoy doing both the analytical and development pieces of a data science project. 2) The more development skills a data scientist is familiar with, the more valuable they are because it ultimately means they can prototype production workflows, and push their models into production quicker than having to wait for a data engineer.

A goal I’ve had recently is being able to take a full software development project from end-to-end, focusing on understanding modern production best practices, particularly in the cloud.


Personal Goals

But, a project that’s just about “cloud architecture delivery” is really boring. In fact, I fell asleep just reading that last sentence. When I do a project, it has to have an interesting, concrete goal.

To that end, I’ve been extremely interested in Twitter as a development platform. I wrote recently that one of the most important ways we can fix the internet is to get off Twitter.

Easier said than done, because Twitter is still one of my favorite places on the internet. It’s where I get most of my news, where I find out about new blog posts, engage in discussions about data science, and a place where I’ve made a lot of friends that I’ve met in real life.

But, Twitter is extremely noisy, lately to the point of being toxic. There are systemic ways that Twitter can take care of this problem, but I decided to try to tackle this problem on my own by starting #devart, a hashtag where people post classical works of art with their own tech-related captions to break up stressful content.

There's something extremely cathartic about being able to state a problem in technology well enough to ascribe a visual metaphor to it, then sharing it with other people who also appreciate that visual metaphor and find it funny and relatable.

And, sometimes you just want to break up the angry monotony of text with art that moves you. Turns out I’m not the only one.

As I posted more #devart, I realized that I enjoyed looking at the source art almost as much as figuring out a caption, and that I enjoyed accounts like Archillect, Rabih Almeddine’s, and Soviet Visuals, who all tweet a lot of beautiful visual content with at least some level of explanation.

I decided I wanted to build a bot that tweets out paintings. Particularly, I was interested in socialist realism artworks.

Why Socialist Realism

Socialist realism is an artform that was developed after the Russian Revolution. As the Russian monarchy fell, social boundaries dissolved, and people began experimenting with all kinds of new art forms, including futurism and abstractionism. I've previously written about this shift here.

As the Bolsheviks consolidated power, they established Narkompros, a body to control the education and cultural values of what they deemed acceptable under the new regime, and the government laid out the new criteria for what was acceptable Soviet art.

Socialist realism as a genre had four explicit criteria, developed by the highest government officials, including Stalin himself. It was to be:

  • Proletarian: art relevant to the workers and understandable to them.
  • Typical: scenes of everyday life of the people.
  • Realistic: in the representational sense.
  • Partisan: supportive of the aims of the State and the Party.

In looking at socialist realism art, it’s obvious that the underlying goal is to promote communism. But, just because the works are blatant propaganda doesn’t discount what I love about the genre, which is that it is indeed representative of what real people do in real life.

These are people working, sleeping, laughing, frowning, arguing, and showing real emotion we don’t often see in art. They are relatable and humane, and reflect our humanity back to us. What I also strongly love about this genre of art is that women are depicted doing things other than sitting still to meet the artist’s gaze.

So, what I decided is that I’d make a Twitter bot that tweets out one socialist realism work every couple of hours.

Here’s the final result:

There are several steps in traditional software development:

  1. Requirements
  2. Design
  3. Development
  4. Testing
  5. Deployment
  6. Maintenance

Breaking a Project into Chunks

This is a LOT to take in. When I first started, I made a list of everything that needed to be done: setting up AWS credentials, roles, and permissions, version control, writing the actual code, learning how to download images with requests, how to make the bot tweet on a schedule, and more.

When you look at it from the top-down, it's overwhelming. But in "Bird by Bird," one of my absolute favorite books about the writing process (but really about any creative process), Anne Lamott writes,

Thirty years ago my older brother, who was ten years old at the time, was trying to get a report on birds written that he’d had three months to write, which was due the next day. We were out at our family cabin in Bolinas, and he was at the kitchen table close to tears, surrounded by binder paper and pencils and unopened books on birds, immobilized by the hugeness of the task ahead. Then my father sat down beside him, put his arm around my brother’s shoulder, and said, “Bird by bird, buddy. Just take it bird by bird.”

And that’s how I view software development, too. One thing at a time, until you finish that, and then move on to the next piece. So, with that in mind, I decided I’d use a mix of the steps above from the traditional waterfall approach and mix them with the agile concept of making a lot of small, quick cycles of those steps to get closer to the end result.

Requirements and Design: High-Level Bot Architecture

I started building the app by working backwards from my requirements:

a bot on Twitter, pulling painting images and metadata from some kind of database, on a timed schedule, either cron or something similar.

This helped me figure out the design. Since I would be posting to Twitter as my last step, it made sense to have the data already some place in the cloud. I also knew I’d eventually want to incorporate AWS because I didn’t want the code and data to be dependent on my local machine being on.

I knew that I'd also need version control and continuous integration to make sure the bot was stable both on my local machine as I was developing it, and on AWS as I pushed my code through, and so that I didn't have to manually put the code in the AWS console.

Finally, I knew I’d be using Python, because I like Python, and also because it has good hooks into Twitter through the Twython API (thanks to Timo for pointing me to Twython over Tweepy, which is deprecated) and AWS through the Boto library. I’d start by getting the paintings and metadata about the paintings from a website that had a lot of good socialist realism paintings not bound by copyright. Then, I’d do something to those paintings to get both the name, the painter, and title so I could tweet all of that out. Then, I’d do the rest of the work in AWS.

So my high-level flow went something like this:

[diagram: high-level flow]

Eventually, I’d refactor out the dependency on my local machine entirely and push everything to S3, but I didn’t want to spend any money in AWS before I figured out what kind of metadata the JSON returned.

Beyond that, I didn’t have a specific idea of the tools I’d need, and made design and architecture choices as my intermediate goals became clearer to me.

Development: Pulling Paintings from WikiArt

Now, the development work began.

WikiArt has an amazing, well-catalogued collection of artworks in every genre you can think of. It’s so well-done that some researchers use the catalog for their papers on deep learning, as well.

Some days, I go just to browse what’s new and get lost in some art. (Please donate to them if you enjoy them.)

WikiArt also has two aspects that were important to the project:

1) They have an explicit category for socialist realism art with a good number of works: around 500, which was not a large amount (if I wanted to tweet more than one image a day), but good enough to start with.

2) Every work has an image, title, artist, and year, which would be important for properly crediting it on Twitter.

My first step was to see if there was a way to access the site through an API, the most common way to pull any kind of content from websites programmatically these days. The problem with WikiArt is that it technically doesn't have a readily-available public API, so people have resorted to really creative ways of scraping the site.

But, I really, really didn’t want to scrape, especially because the site has infinite scroll Javascript elements, which are annoying to pick up in BeautifulSoup, the tool most people use for scraping in Python.

So I did some sleuthing, and found that WikiArt does have an API, even if it’s not official and, at this point, somewhat out of date.

It had some important information on API rate limits, which tell us how often you can access the API without the site getting angry and kicking you out:

API calls: 10 requests per 2.5 seconds

Images downloading: 20 requests per second

and, even more importantly, on how to access a specific category through JSON-based query parameters. The documentation they had, though, was mostly at the artist level:


so I had to do some trial and error to figure out the correct link I wanted, which was: 

https://www.wikiart.org/en/paintings-by-style/socialist-realism?json=2&page=1


And with that, I was ready to pull the data.

I started by using the Python Requests library (http://docs.python-requests.org/en/master/) to connect to the site and pull two things (a rough sketch of the metadata pull follows below):

1) A JSON file that has all the metadata
2) All of the actual paintings as png/jpg/jpeg files
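
As a sketch only (the helper name and the pagination cutoff are mine, not the repo's exact code; the URL and the json/page parameters are the ones shown above), pulling the category JSON while staying under the rate limits mentioned earlier looks something like this:

import json
import time

import requests

CATEGORY_URL = "https://www.wikiart.org/en/paintings-by-style/socialist-realism"

def fetch_category_metadata(max_pages=20):
    """Pull the socialist realism category JSON page by page.

    Stops when a page comes back with an empty Paintings list, and sleeps
    between requests to stay well under the 10-requests-per-2.5-seconds limit.
    """
    paintings = []
    for page in range(1, max_pages + 1):
        response = requests.get(CATEGORY_URL, params={"json": 2, "page": page})
        response.raise_for_status()
        payload = response.json()
        if not payload.get("Paintings"):
            break
        paintings.extend(payload["Paintings"])
        time.sleep(0.3)  # roughly 3 requests per second at most
    return paintings

if __name__ == "__main__":
    metadata = fetch_category_metadata()
    with open("socialist_realism.json", "w") as f:
        json.dump(metadata, f)
    print(len(metadata), "paintings pulled")

Given the size of the category, this is only a handful of requests in total.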

Development: Processing Paintings and Metadata Locally

The JSON I got back looked like this: 

{
ArtistsHtml: null,
CanLoadMoreArtists: false,
Paintings: [],
Artists: null,
AllArtistsCount: 0,
PaintingsHtml: null,
PaintingsHtmlBeta: null,
AllPaintingsCount: 512,
PageSize: 60,
TimeLog: null
}

Within the paintings array, each painting looked like this:

{
id: "577271cfedc2cb3880c2de61",
title: "Winter in Kursk",
year: "1916",
width: 634,
height: 750,
artistName: "Aleksandr Deyneka",
image: "https://use2-uploads8.wikiart.org/images/aleksandr-deyneka/winter-in-kursk-1916.jpg",
map: "0123**67*",
paintingUrl: "/en/aleksandr-deyneka/winter-in-kursk-1916",
artistUrl: "/en/aleksandr-deyneka",
albums: null,
flags: 2,
images: null
}

I also downloaded all the image files by requesting the image links from the JSON, returning response.raw, and writing each file to disk with the shutil.copyfileobj method.
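
Roughly, that download step looks like the sketch below (the helper name and target directory are placeholders, not the exact code in the repo):

import shutil
from pathlib import Path

import requests

def download_image(image_url, target_dir="images"):
    """Stream one painting to disk via response.raw and shutil.copyfileobj."""
    Path(target_dir).mkdir(exist_ok=True)
    # Keep only the filename at the end of the URL, e.g. winter-in-kursk-1916.jpg
    file_name = image_url.rsplit("/", 1)[-1]
    target_path = Path(target_dir) / file_name

    # stream=True leaves the body unread so response.raw can be copied directly
    response = requests.get(image_url, stream=True)
    response.raise_for_status()
    with open(target_path, "wb") as out_file:
        shutil.copyfileobj(response.raw, out_file)
    return target_path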

I decided not to do any more processing locally, since my goal was to eventually move everything to the cloud anyway, but I now had the files available to me for testing so that I didn't need to hit WikiArt and overload the website anymore.

I then uploaded both the JSON and the image files to the same S3 bucket with the boto client, which lets you write:

def upload_images_to_s3(directory):
    """
    Upload images to S3 bucket if they end with png or jpg
    :param directory:
    :return: null
    """

    for f in directory.iterdir():
        if str(f).endswith(('.png', '.jpg', '.jpeg')):
            full_file_path = str(f.parent) + "/" + str(f.name)
            file_name = str(f.name)
            s3_client.upload_file(full_file_path, settings.BASE_BUCKET, file_name)
            print(f,"put")

As an aside, the .iterdir() method here is from the pretty great pathlib library, new to Python 3, which handles file operations better than os. Check out more about it here.
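
For a quick illustration of the difference (the directory name here is just a placeholder), the same listing in both styles:

import os
from pathlib import Path

# pathlib: path objects with methods and operator-based joining
image_dir = Path("images")
jpgs = [f for f in image_dir.iterdir() if f.suffix in (".png", ".jpg", ".jpeg")]

# os / os.path: the same listing with string plumbing
jpgs_os = [
    os.path.join("images", name)
    for name in os.listdir("images")
    if os.path.splitext(name)[1] in (".png", ".jpg", ".jpeg")
]

print(jpgs, jpgs_os)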

Development: Using S3 and Lambdas

Now that I had my files in S3, I needed some way for Twitter to read them. To do that at a regular time interval, I decided on an AWS Lambda function (not to be confused with Python lambda functions, a completely different animal). Because I was already familiar with Lambdas and their capabilities (see my previous post on AWS), they were a tool I could use without a lot of ramp-up time, which is a key component of architectural decisions.

Lambdas are snippets of code that you can run without needing to know anything about the machine that runs them. They're triggered by other events firing in the AWS ecosystem, or they can be run on a cron-like schedule, which was exactly what I needed to make the bot post at a regular interval.

Lambdas look like this in Python:

def handler_name(event, context): 
    ...
    return some_value

The event is what you decide to do to trigger the function and the context sets up all the runtime information needed to interact with AWS and run the function.

Because I wanted my bot to tweet both the artwork and some context around it, I needed a way to tweet the picture and the metadata together, by matching the picture with its metadata.

To do this, I’d need to create key-value pairs, a common programming data model, where the key was the filename part of the image attribute, and the value was the title, year, and artistName, so that I could match the two, like this:

[diagram: mapping image filenames to title, year, and artist metadata]

So, all in all, I wanted my lambda function to do several things. All of the code I wrote for that section is here.

1) Open the S3 bucket object and inspect the contents of the metadata file

Opening an S3 bucket within a lambda usually looks something like this:

import uuid
import boto3

s3_client = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), key)
        s3_client.download_file(bucket, key, download_path)

where the event is the JSON payload that Lambda passes in to signify that a trigger has occurred. Since our trigger is a timed event, that payload doesn't carry any information about a specific event or bucket, so we can ignore it and write a function that simply opens a given bucket and key:

    try:
        data = s3.get_object(Bucket=bucket_name, Key=metadata)
        json_data = json.loads(data['Body'].read().decode('utf-8'))
    except Exception as e:
        print(e)
        raise e

2) Pull out the metadata and put it into a dictionary with the filename as the key and the metadata as the value. We pull it into a defaultdict (all dictionaries keep insertion order as of 3.6, but we're still playing it safe here).

    indexed_json = defaultdict()

    for value in json_data:
        artist = value['artistName']
        title = value['title']
        year = value['year']
        values = [artist, title, year]

        # return only image name at end of URL
        find_index = value['image'].rfind('/')
        img_suffix = value['image'][find_index + 1:]
        img_link = img_suffix

        try:
            indexed_json[img_link].append(values)
        except KeyError:
            indexed_json[img_link] = (values)

(By the way, a neat Python string utility that I didn't know about before, and which really helped with the filename parsing, is rsplit: http://python-reference.readthedocs.io/en/latest/docs/str/rsplit.html)

3) Pick a random filename to tweet (single_image_metadata = random.choice(list(indexed_json.items())))

4) Tweet the image and associated metadata

There are a couple of Python libraries in use for Twitter. I initially started using Tweepy, but much to my sadness, I found out it was no longer being maintained. (Thanks for the tip, Timo.)

So I switched to Twython, which is a tad more convoluted, but is up-to-date.

The final piece of code that actually ended up sending out the tweet is here:

    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)

    try:

        tmp_dir = tempfile.gettempdir()
        #clears out lambda dir from previous attempt, in case testing lambdas keeps previous lambda state
        call('rm -rf /tmp/*', shell=True) 
        path = os.path.join(tmp_dir, url)
        print(path)

        s3_resource.Bucket(bucket_name).download_file(url, path)
        print("file moved to /tmp")
        print(os.listdir(tmp_dir))

        with open(path, 'rb') as img:
            print("Path", path)
            twit_resp = twitter.upload_media(media=img)
            twitter.update_status(status="\"%s\"\n%s, %s" % (title, painter, year), media_ids=twit_resp['media_id'])

    except TwythonError as e:
        print(e)

What this does is take advantage of a Lambda’s temp space:

Pulls the file from S3 into the Lambda’s /tmp/ folder, and matches it by filename with the metadata, which at this point is in key-value format.

The twitter.upload_media method uploads the image and gets back a media id that is then passed into the update_status method with the twit_resp['media_id'].

And that’s it. The image and text are posted.

Development: Scheduling the Lambda

The second part was configuring the function to run on a schedule. Lambdas can be triggered by two things:

  1. An event occurring
  2. A timed schedule.

Events can be anything from a file landing in an S3 bucket, to polling a Kinesis stream.

Scheduled events can be written either in cron syntax or at a fixed rate. I started out writing cron rules, but since my bot didn't have any specific requirements beyond posting every six hours, the fixed rate turned out to be enough for what I needed:

[screenshot: the fixed-rate schedule rule for the Lambda trigger]
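
I set the schedule up through the console, but the same fixed-rate rule can also be created with boto3. Here's a hedged sketch (the rule name, function name, and the account/region in the ARN are made-up placeholders, not values from my setup):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_NAME = "soviet-art-bot"  # placeholder
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:soviet-art-bot"  # placeholder

# Fire every six hours, the same as the console's fixed-rate schedule
rule = events.put_rule(
    Name="soviet-art-bot-every-6-hours",
    ScheduleExpression="rate(6 hours)",
    State="ENABLED",
)

# Let CloudWatch Events invoke the function
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="allow-cloudwatch-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the Lambda
events.put_targets(
    Rule="soviet-art-bot-every-6-hours",
    Targets=[{"Id": "soviet-art-bot-target", "Arn": FUNCTION_ARN}],
)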

Finally, I needed to package the lambda for distribution. Lambdas run on Linux machines which don't have a lot of Python libraries pre-installed (other than boto3, the Amazon Python client library I used previously that connects the Lambda to other parts of the AWS ecosystem, and json).

In my script, I have a lot of library imports. Of these, Twython is an external library that needs to be packaged with the lambda and uploaded.

from twython import Twython, TwythonError

Deployment: Bot Tweets!

So I packaged the Lambda based on those instructions, manually the first time, by uploading a zip file to the Lambda console.

And, that’s it! My two one-off scripts were ready, and my bot was up and running.

[screenshots: example tweets from the running bot]

And here's the final flow I ended up with: [architecture diagram]

Where to Next?

There’s a lot I still want to get to with Soviet Art Bot.

The most important first step is tweaking the code so that no painting repeats more than once a week. That seems like the right amount of time for Twitter followers to not get annoyed.
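
One possible approach (purely a sketch of something I haven't implemented yet) is to keep a small record of recently tweeted filenames, for example as a JSON object in the same S3 bucket, and filter them out of the random choice:

import random
import time

ONE_WEEK = 7 * 24 * 60 * 60  # seconds

def pick_unseen_painting(indexed_json, recently_tweeted):
    """Pick a random painting that has not been tweeted in the last week.

    recently_tweeted maps filename -> unix timestamp of its last tweet and
    would be loaded from (and saved back to) a small JSON file in S3.
    """
    cutoff = time.time() - ONE_WEEK
    candidates = {
        name: meta
        for name, meta in indexed_json.items()
        if recently_tweeted.get(name, 0) < cutoff
    }
    if not candidates:  # everything was tweeted recently; fall back to the full set
        candidates = indexed_json
    name, meta = random.choice(list(candidates.items()))
    recently_tweeted[name] = time.time()
    return name, meta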

In parallel, I want to focus on testing and maintenance.

Testing and Maintenance

The first time I worked through the entire flow, I did everything in a local Python project I had started in PyCharm and version-controlled on GitHub (https://github.com/veekaybee/soviet-art-bot).

So, when I made changes to any part of the process, my execution flow would be:

  1. Run Wikiart download functionality locally
  2. Test the lambda “locally” with python-lambda-local
  3. Zip up the lambda and upload to Lambda
  4. Make mistakes in the Lambda code
  5. Zip up the lambda and run again.

This was not really an ideal workflow for me, because I didn't want to have to manually re-upload the lambda every time, so I decided to use Travis CI, which integrates with GitHub really well. The problem is that there's a lot of setup involved: virtualenvs, syncing AWS credentials, setting up IAM roles and profiles that allow Travis to access the lambda, setting up a test Twitter and AWS environment to test the Travis integration, and more.

For now, the bot is working in production, and while it works, I'm going to continue to automate more and more parts of deployment in my dev branch. (This post was particularly helpful in zipping up a lambda, and my deploy script is here.)

After these two are complete, I want to:

1) Refactor lambda code to take advantage of pathlib instead of OS so my code is standardized (should be a pretty small change)

2) Source more paintings. WikiArt is fantastic, but has only 500-ish paintings available in the socialist realism category. I'd like to find more sources with high-quality metadata and a significant collection of artworks. Then, I'd like to

3) Create a front-end where anyone can upload a work of socialist realism for the bot to tweet out. This would probably be easier than customizing a scraper and would allow me to crowdsource data. As part of this process, I’d need a way to screen content before it got to my final S3 bucket.

Which leads to:

4) Go through the current collection and make sure all artwork is relevant and SFW. See if there's a way I can do that programmatically.

And:

5) Machine learning and deep learning possibilities: look for a classifier to filter out artworks with nudity/questionable content and figure out how to decide what "questionable" means. Potentially with AWS Rekognition, or by building my own CNN.

Other machine learning opportunities:

  • Mash with #devart to see if the bot can create fun headlines for paintings based on painting content

  • Extract colors from artworks by genre and see how they differ between genres and decades

Conclusion

Software development can be a long, exhausting process with a lot of moving parts and decision-making involved, but it becomes much easier and more interesting if you break up a project into byte-sized chunks that you can continuously work on to stop yourself from getting overwhelmed by the entire task at hand. The other part, of course, is that it has to be fun and interesting for you, so that you make it through all of the craziness with a fun, finished product at the end.

Percona Server for MySQL 5.7.21-20 Is Now Available


Percona announces the GA release of Percona Server for MySQL 5.7.21-20 on February 19, 2018. Download the latest version from the Percona web site or the Percona Software Repositories. You can also run Docker containers from the images in the Docker Hub repository.

Based on MySQL 5.7.21, including all the bug fixes in it, Percona Server for MySQL 5.7.21-20 is the current GA release in the Percona Server for MySQL 5.7 series. Percona provides completely open-source and free software.

New Features:
  • A new string variable version_suffix allows changing the suffix for the Percona Server version string returned by the read-only version variable. Also, version_comment has been converted from a global read-only to a global read-write variable.
  • A new keyring_vault_timeout variable allows setting the number of seconds for the Vault server connection timeout. Bug fixed #298.
Bugs Fixed:
  • The mysqld startup script was unable to detect the jemalloc library location for preloading, which prevented starting Percona Server on systemd-based machines. Bugs fixed #3784 and #3791.
  • There was a problem with fulltext search, which could find a word with punctuation marks in natural language mode only, but not in boolean mode. Bugs fixed #258, #2501 (upstream #86164).
  • Build errors were present on FreeBSD (caused by fixing the bug #255 in Percona Server 5.6.38-83.0) and on MacOS (caused by fixing the bug #264 in Percona Server 5.7.20-19). Bugs fixed #2284 and #2286.
  • A bunch of fixes was introduced to remove GCC 7 compilation warnings for the Percona Server build. Bugs fixed #3780 (upstream #89420, #89421, and #89422).
  • CMake error took place at compilation with bundled zlib. Bug fixed #302.
  • A GCC 7 warning fix introduced a regression in Percona Server that led to a wrong SQL query being built to access the remote server when the Federated storage engine was used. Bug fixed #1134.
  • It was possible to enable encrypt_binlog with no binary or relay logging enabled. Bug fixed #287.
  • Long buffer wait times were occurring on busy servers in case of the IMPORT TABLESPACE command. Bug fixed #276.
  • Server queries that contained JSON special characters and were logged by Audit Log Plugin in JSON format caused invalid output due to lack of escaping. Bug fixed #1115.
  • Percona Server now uses Travis CI for additional tests. Bug fixed #3777.

Other bugs fixed: #257, #264, #1090 (upstream #78048), #1109, #1127, #2204, #2414, #2415, #3767, #3794, and #3804 (upstream #89598).

 This release also contains fixes for the following CVE issues: CVE-2018-2565, CVE-2018-2573, CVE-2018-2576, CVE-2018-2583, CVE-2018-2586, CVE-2018-2590, CVE-2018-2612, CVE-2018-2600, CVE-2018-2622, CVE-2018-2640, CVE-2018-2645, CVE-2018-2646, CVE-2018-2647, CVE-2018-2665, CVE-2018-2667, CVE-2018-2668, CVE-2018-2696, CVE-2018-2703, CVE-2017-3737.
MyRocks Changes:
  • A new behavior makes Percona Server fail to restart on detected data corruption;  rocksdb_allow_to_start_after_corruption variable can be passed to mysqld as a command line parameter to switch off this restart failure.
  • A new cmake option ALLOW_NO_SSE42 was introduced to allow MyRocks build on hosts not supporting SSE 4.2 instructions set, which makes MyRocks usable without FastCRC32-capable hardware. Bug fixed MYR-207.
  • rocksdb_bytes_per_sync  and rocksdb_wal_bytes_per_sync  variables were turned into dynamic ones.
  • rocksdb_flush_memtable_on_analyze variable has been removed.
  • rocksdb_concurrent_prepare is now deprecated, as it has been renamed in upstream to  rocksdb_two_write_queues.
  • rocksdb_row_lock_deadlocks and rocksdb_row_lock_wait_timeouts global status counters were added to track the number of deadlocks and the number of row lock wait timeouts.
  • Creating a table with a string indexed column using a non-binary collation now generates a warning about using an inefficient collation instead of an error. Bug fixed MYR-223.
TokuDB Changes:
  • A memory leak was fixed in the PerconaFT library, caused by not destroying PFS key objects on shutdown. Bug fixed TDB-98.
  • A clang-format configuration was added to PerconaFT and TokuDB. Bug fixed TDB-104.
  • A data race was fixed in minicron utility of the PerconaFT. Bug fixed TDB-107.
  • Row count and cardinality could decrease to zero after a long-running REPLACE load.

Other bugs fixed: TDB-48, TDB-78, TDB-93, and TDB-99.

The release notes for Percona Server for MySQL 5.7.21-20 are available in the online documentation. Please report any bugs on the project bug tracking system.

MySQL 8.0 new features in real life applications: roles and recursive CTEs


I am happy that the MySQL team has, over the last few years, been blogging about each major feature that MySQL Server is getting; for example, the series on Recursive Common Table Expressions. Being extremely busy myself, I appreciate them taking the time to share details from an insider's point of view.

However, first-party guides and examples can at times seem not terribly useful at first, as they are normally built on synthetic, artificial examples and not real-life ones. I want to share with you two examples of how two of the upcoming features of MySQL 8.0 will be useful to us, with examples that we already use or plan to use for the Wikimedia Foundation databases: roles and recursive CTEs.

Roles

Giuseppe "data charmer" Maxia presented recently at FOSDEM 2018 how the newly introduced roles are going to work (you can watch and read his presentation at the FOSDEM website), and he seemed quite skeptical about the operational side of it. I have to agree that some of the details are not as straightforward as how users and roles work in other environments, but I have to say we have been using database roles with MariaDB for the last year without any problems.

User roles were introduced in MariaDB 10.0, although for us they were unusable until MariaDB 10.1, and their implementation seems to be similar, if not exactly the same as, that of MySQL 8.0 (side note: I do not know why that is. Is it an SQL standard? Was the implementation inspired by Oracle or another well-known database? Does it have a common origin? Please tell me in the comments if you know the answer).

If you use MySQL for a single large web-like application, roles may not be very useful to you. For example, for the Mediawiki installation that supports Wikipedia, only a few accounts are set up per database server: one or a few for the application, plus those needed for monitoring, backups and administration (please do not use a single root account for everything in yours!).

labsdb replication diagram
LabsDB databases provided a vital service to the community by providing direct access to community-accessible database replicas of production data.
However, we also provide free hosting and other IT services for all developers (including volunteers) to create websites, applications and gadgets that could be useful to the full Wikimedia community. Among those services we have a data service, where we provide a replica of a sanitized database with most of the data of the production Mediawiki installation, plus additional database accounts for developers to store their own application data. This requires a highly multi-tiered MySQL/MariaDB installation, with thousands of accounts per server.

In the past, each account had to be managed separately, with its own grants and account limits. This was a nightmare, and the poor accountability could easily lead to security issues. Wildcards were used to assign grants, which was a really bad idea: wildcards grant access not only to current databases but also to any future databases that match the pattern, and that is very dangerous, as the wrong data could accidentally end up on the wrong servers and all users would automatically get access to it. Also, every time there was some kind of maintenance where a certain grant had to be added to or revoked from all users (not frequent, but it could happen), a script had to be run to do that in a loop for each account. Finally, there are other accounts aside from user accounts (the administration and monitoring ones), but apart from a specific string pattern, there was no way to differentiate user accounts from administration or monitoring accounts.

Our cloud databases were the first ones we upgraded to MariaDB 10.1, exclusively to get a transparent role implementation (SET DEFAULT ROLE). We created a template role (e.g. 'labsdbuser'), which is granted to all normal users (you can see the actual code in our puppet repo):

GRANT USAGE ON *.* TO <newuseraccount> ...;
GRANT labsdbuser TO <newuseraccount>;
SET DEFAULT ROLE labsdbuser FOR <newuseraccount>;

, and because it is so compact, we can give a detailed, non-wildcard-based selection of grants to the labsdbuser generic role.

If we have to add a new grant to all users, we just have to run:

GRANT <new grants and objects> TO labsdbuser;

only once per server, and all users affected will get it automatically (well, they have to log off and on, but that was true of any grant).

Could the syntax be simpler, or are the underlying tables confusing? Sure, but in practice (at least for us) we rarely have to handle role changes; they don't tend to be very dynamic. Roles did, however, greatly simplify the creation, accountability, monitoring and administration of user accounts for us.

Recursive CTEs

CTEs are another large piece of syntactic sugar added in 8.0 (they were also added in MariaDB 10.2). Unlike roles, however, this is something that application developers will benefit from; roles are mostly helpful for Database Administrators.

As I mentioned before, we have upgraded our servers at most to MariaDB 10.1 (stability is very important for us), so I have not yet been able to play with them in production. Also, because Mediawiki has to support older applications, it might take years to see them being used on web requests for wikis. I see them, however, being interesting first for analytics and for the cloud community applications I mentioned previously.

Mediawiki has some hierarchical data stored in its tables; probably the most infamous example is the category tree. As of 2018 (this may change in the future), the main way to group Wiki pages and files is by using categories. However, unlike the concept of tags, categories may contain other, more specialized categories.

Diagram of category relationships
Mediawiki/Wikipedia category system can become quite complex.
Image License: CC-BY-SA-3.0 Author: Gurch at English Wikipedia

For example, the [[MySQL]] article has the following categories:

(You can obtain a real time list with a Mediawiki API call, too)

However, if you go to the category Database management systems, you will not find it there, because you have to browse first through Database management systems > Relational database management systems > MySQL, or maybe Database management systems > Database-related software for Linux > RDBMS software for Linux > MySQL.

The relationships between pages and categories are handled in the classical "(parent, child) table" fashion by the categorylinks table, which has the following structure:

CREATE TABLE /*_*/categorylinks (
  -- Key to page_id of the page defined as a category member.
  cl_from int unsigned NOT NULL default 0,

  -- Name of the category.
  -- This is also the page_title of the category's description page;
  -- all such pages are in namespace 14 (NS_CATEGORY).
  cl_to varchar(255) binary NOT NULL default '',
...
  PRIMARY KEY (cl_from,cl_to)
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey,cl_from);

The table has been simplified; we will not get into collation issues in this post, as that would be material for a separate discussion. Also ignore the fact that cl_to links to a binary string rather than an id; there are reasons for all of that, but we cannot go into the details now.

If you want to play with the real thing, you can download it from here, it is, as of February 2018, only 2GB compressed: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz

This table structure works nicely to answer questions like:

  • “Which articles and subcategories are directly on a given category (Database_management_systems)?”
    SELECT IF(page_namespace = 14, 'category', 'article') as type, page_title 
    FROM categorylinks 
    JOIN page 
    ON page_id = cl_from 
    WHERE cl_to = 'Database_management_systems'
    You can see the live results of this query.
  • “Which categories does a particular article (MySQL) contain?”
    SELECT cl_to
    FROM categorylinks
    JOIN page
    ON page_id = cl_from
    WHERE page_namespace = 0 AND page_title = 'MySQL';
    You can see the live results of this query.

However, if you want to answer the question "Give me all articles in a particular category or its subcategories", this structure is not well suited. Or at least it didn't use to be, with older versions of MySQL and MariaDB. Alternative table designs have been proposed in the past (both for Mediawiki and for the general problem) to solve this commonly found structure, but not all will be applicable, as the Wiki category system is very flexible and not even strictly a tree: it can contain, for example, loops.

Up to now, custom solutions, such as this answer on stackoverflow, or specially-built applications, such as Petscan, had to be developed to provide lists of articles queried recursively.

Again, enter recursive CTEs. Instead of having to use external code to apply recursion, or an awkward stored procedure, we can just send the following SQL:

WITH RECURSIVE cte (cl_from, cl_type) AS
(
    SELECT cl_from, cl_type FROM categorylinks WHERE cl_to = 'Database_management_systems' -- starting category
    UNION
    SELECT categorylinks.cl_from, categorylinks.cl_type FROM cte JOIN page ON cl_from = page_id JOIN categorylinks ON page_title = cl_to WHERE cte.cl_type = 'subcat' -- subcat addition on each iteration
)
SELECT page_title FROM cte JOIN page ON cl_from = page_id WHERE page_namespace = 0 ORDER BY page_title; -- printing only articles in the end, ordered by title

With results:

+-------------------------------------------------------------+
| page_title                                                  |
+-------------------------------------------------------------+
| 4th_Dimension_(software)                                    |
| A+_(programming_language)                                   |
| ABC_(programming_language)                                  |
| ACID                                                        |
| ADABAS                                                      |
| ADO.NET                                                     |
| ADOdb                                                       |
| ADOdb_Lite                                                  |
| ADSO                                                        |
| ANSI-SPARC_Architecture                                     |
| Adabas_D                                                    |
| Adaptive_Server_Enterprise                                  |
...
| Yellowfin_Business_Intelligence                             |
| Zope_Object_Database                                        |
+-------------------------------------------------------------+
511 rows in set (0.02 sec)

It looks more complicated than it should, because we need to join with the page table in order to get page ids from titles; but otherwise the idea is simple: get a list of pages; if they are subcategories, query them recursively; otherwise, add them to the list. By using UNION instead of UNION ALL we make sure we do not explore the same subcategory twice. Note how we end up with a list larger than the 143 items that are directly in the category.

If we fall into an infinite loop, or a very deep chain of relationships, cte_max_recursion_depth, the variable limiting the number of iterations, can be tuned. By default the query fails if it reaches 1001 iterations.

I don't intend to provide an in-depth guide about CTEs; check the 8.0 manual and the above-mentioned MySQL Team blog posts for more information.

So tell me, do you think these or other recently added features will be as useful for you as they seem to be for us? What is your favourite feature recently added to MySQL or MariaDB? Send me a comment or a message to @jynus.
