
Log Buffer #436: A Carnival of the Vanities for DBAs


This Log Buffer Edition covers the top blog posts of the week from the Oracle, SQL Server and MySQL arenas.

Oracle:

  • Momentum and activity regarding the Data Act is gathering steam, and off to a great start too. The Data Act directs the Office of Management and Budget (OMB) and the Department of the Treasury (Treasury) to establish government-wide financial reporting data standards by May 2015.
  • RMS has a number of async queues for processing new item location, store add, warehouse add, item and po induction. We have seen rows stuck in the queues and needed to release the stuck AQ Jobs.
  • We have a number of updates to partitioned tables that are run from within pl/sql blocks which have either an execute immediate ‘alter session enable parallel dml’ or execute immediate ‘alter session force parallel dml’ in the same pl/sql block. It appears that the alter session is not having any effect as we are ending up with non-parallel plans.
  • Commerce Cloud, a new flexible and scalable SaaS solution built for the Oracle Public Cloud, adds a key new piece to the rich Oracle Customer Experience (CX) applications portfolio. Built with the latest commerce technology, Oracle Commerce Cloud is designed to ignite business innovation and rapid growth, while simplifying IT management and reducing costs.
  • Have you used R12: Master Data Fix Diagnostic to Validate Data Related to Purchase Orders and Requisitions?

SQL Server:

  • SQL Server 2016 Community Technology Preview 2.2 is available
  • What is Database Lifecycle Management (DLM)?
  • SSIS Catalog – Path to backup file could not be determined
  • SQL SERVER – Unable to Bring SQL Cluster Resource Online – Online Pending and then Failed
  • Snapshot Isolation Level and Concurrent Modification Collisions – On Disk and In Memory OLTP

MySQL:

  • A Better Approach to all MySQL Regression, Stress & Feature Testing: Random Coverage Testing & SQL Interleaving.
  • What is MySQL Package Verification? Package verification (Pkgver for short) refers to black box testing of MySQL packages across all supported platforms and across different MySQL versions. In Pkgver, packages are tested in order to ensure that the basic user experience is as it should be, focusing on installation, initial startup and rudimentary functionality.
  • With the rise of agile development methodologies, more and more systems and applications are built in series of iterations. This is true for the database schema as well, as it has to evolve together with the application. Unfortunately, schema changes and databases do not play well together.
  • MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having them copied automatically from a master to a slave database.
  • In Case You Missed It – Breaking Databases – Keeping your Ruby on Rails ORM under Control.

The post Log Buffer #436: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.



Ease of use or consistency

I have been working on the "New features in Performance Schema 5.7 in action" tutorial for Percona Live Amsterdam for quite a while now, probably since version 5.7.3, when instrumentation for metadata locks was introduced. I presented that instrumentation as a teaser in my combined "General MySQL Troubleshooting" and "Troubleshooting MySQL Performance" seminar for Oracle University in South Korea (which covered 5.6 at the time).

In version 5.7.6, instrumentation for variables and status variables was introduced. It supports session, global and user variables. I was very happy to see this addition, especially because originally there was no duplication between the session_variables and global_variables tables: you could simply run a query like SELECT * FROM session_status WHERE variable_value>0; to see the list of all status variables that had changed during the session. Amazing, isn't it?

But in version 5.7.8 it was fixed. Release notes contain:

When the Performance Schema session variable tables produced output,
they included no rows for global-only variables and thus did not fully
reflect all variable values in effect for the current session. This has
been corrected so that each table has a row for each session variable,
and a row for each global variable that has no session counterpart. This
change applies to the session_variables and session_status tables.

And yes, now the session_status table contains statuses such as Innodb_buffer_pool_pages_flushed, and session_variables contains innodb_buffer_pool_size, as well as all the other global-only statuses and variables that make no sense in a session context:

mysql> select * from session_status where variable_name='innodb_buffer_pool_pages_flushed';
+----------------------------------+----------------+
| VARIABLE_NAME                    | VARIABLE_VALUE |
+----------------------------------+----------------+
| Innodb_buffer_pool_pages_flushed | 45             |
+----------------------------------+----------------+
1 row in set (0.00 sec)


mysql> select * from session_variables where variable_name='innodb_buffer_pool_size';
+-------------------------+----------------+
| VARIABLE_NAME           | VARIABLE_VALUE |
+-------------------------+----------------+
| innodb_buffer_pool_size | 25165824       |
+-------------------------+----------------+
1 row in set, 1 warning (0.00 sec)


Honestly, I don't know what to think about this change. I am unhappy enough that I created a bug report, but I am not sure whether it will be fixed, or even whether fixing it makes sense.

Of course there is a workaround for my one-liner to see all status changes in a session:

select ss.variable_name, ss.variable_value
from session_status ss
left join global_status gs using(variable_name)
where ss.variable_value != gs.variable_value
   or gs.variable_value is null and ss.variable_value>0;

It is simply longer and harder to write. It would probably be a good idea for the sys schema to have views exposing session-only and global-only statuses and variables, something along the lines of the sketch below.
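A minimal sketch of what such a view could look like (the view name and its placement in sys are hypothetical; it assumes the 5.7.8+ behavior where session_status also lists global-only statuses, and it reads the intent of the one-liner as "show statuses whose session value differs from the global one"):

-- Hypothetical sys-style view: statuses whose session value differs
-- from, or does not exist in, the global scope (MySQL 5.7.8+).
CREATE VIEW sys.session_only_status AS
SELECT ss.variable_name, ss.variable_value
  FROM performance_schema.session_status ss
  LEFT JOIN performance_schema.global_status gs USING (variable_name)
 WHERE (ss.variable_value <> gs.variable_value OR gs.variable_value IS NULL)
   AND ss.variable_value > 0;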

What do you think?

Election Money and Data


So far, the early drama of the 2016 presidential race has been more silly than substantial. Donald Trump has been a (successful?!) one-man show, Hillary has been playing coy, some candidates have been getting trolled (much to Reddit’s delight), and, this week, Fox News has announced its first Republican debate line-up, which somehow seems more akin to the guest list for a popular 7th grader’s slumber party than presidential grooming. But there’s been one topic of conversation that’s been deadly serious and that is sure to stay on people’s minds: campaign finance.


Incredibly vast sums of money are now flowing in anticipation of next November, and it’s the entire country’s concern where that cash comes from, where it’s going, how it’s being spent, and how it’s being tracked. Data fans, now’s your time to heed Jack’s advice: help your fellow citizens stay afloat in this ocean of information. At VividCortex, we understand how suffocating a vast data field can seem if an analyst doesn’t have the proper tools for navigation. And this, a presidential election, is a time when it’s absolutely vital to the public’s interest that data be both tracked and clearly understood.

In 2010, the Supreme Court’s decision in Citizens United v. FEC codified new rules for how donors can contribute to campaigns: basically, Super PACs are free to grow, without limit. Now, Americans’ eyes are on the funds. And that means watching data and databases. The New York Times has been publishing extensive reports and timelines that make some of that information visual and legible for the average reader.


But that’s hardly cutting edge. For instance, this 2009 post from R-bloggers used MySQL and R to track and visualize the progression of contributions to Obama’s first campaign.

For more current, granular reports, any user can go directly to the source: the Federal Election Commission’s website, where all campaign finance info is made public. Below are snippets of Clinton and Bush’s individual campaign finance report cards. (Or you can try a search yourself.)

Clinton’s Finance Report Card Summary


Bush’s Finance Report Card Summary


The info on the Super PACs themselves is more complicated. Here’s the list of “Independent Expenditure-Only Committees.” They’re technically independent of any individual candidate, but the nuances of that technicality are another controversial topic. Regardless, they hold most of the money. Want to see for yourself? Take a look at the mid-year report for Right to Rise, the PAC that supports Bush, the candidate currently backed by the most funds. The report is 1,656 pages long. Everything really is bigger in Texas – even the data.

This is why we consider it crucial that data not just be collected, but also managed, streamlined, and presented in a way that it can be understood. We Americans might consider it self-evident that all people are created equal, but here at VividCortex it’s also obvious that not all data are created that way: some pieces of information are more valuable than others. Sometimes, within 1,656 pages, all you’re really looking for is a paragraph, a sentence, a word, or a single number next to a dollar sign. And at times like this, DBAs aren’t the only people who need to understand what big data means.

Be on the lookout for more VividCortex posts on Databases and Democracy and candidate finances in the future – the campaign season is just getting started.



Facebook's Charity Majors Says VividCortex Is Impressive


A few months ago, we featured Charity Majors, the production engineering manager for Parse at Facebook, on Brainiac Corner. We are featuring Charity and her expertise once again. This time, though, she is reviewing VividCortex: from installation to problem solving to a feature wishlist.

One of our favorite takeaways: “And VividCortex is a DB monitoring system built by database experts. They know what information you are going to need to diagnose problems, whether you know it or not. It’s like having a half a DBA on your team.” And without further ado…

Parse review of VividCortex

Many years ago, when I was but a wee lass trying to upgrade mysql and having a terrible time with performance regressions, Baron and the newly-formed Percona team helped me figure my shit out. The Percona toolset (formerly known as Maatkit) changed my life. It helped me understand what was going on under the hood in my database for the very first time, and basically I’ve been playing with data ever since. (Thanks, I think?)

I’ve been out of the mysql world for a while now, mostly doing Mongo, Redis, and Cassandra these days. So when I heard that Baron’s latest startup VividCortex was entering the NoSQL monitoring space, I was intrigued.

To be perfectly clear, I don’t need VividCortex at the moment, and do not use it for my day-to-day work. Parse was acquired by Facebook two years ago, and the first thing we did was pipeline all of our metrics into the sophisticated Facebook monitoring systems. Facebook’s powerful tools work insanely well for what we need to do. That said, I was eager to take VividCortex for a spin.

Parse workload

First, a little bit of background on Parse. We are a complete framework for building mobile apps. You can use our APIs and SDKs to build beautiful, fully featured apps with core storage, analytics, push notifications, cloud code etc without needing to build your own backend. We currently host over half a million apps, and all mobile application data is stored in MongoDB using the RocksDB storage engine.

We face some particular challenges with our MongoDB storage layer. We have millions of collections and tens of millions of indexes, which is not your traditional Mongo use case. Indexes are intelligently auto-generated for apps based on real query patterns and the cardinality of their data. Parse is a platform, which means we have very little control over the types of queries that enter our systems. We often have to do things like transparently scale or optimize apps that have just been featured on the front page of the iTunes store, or handle spiky events, or figure out complex query planner caching issues.

Basically, Parse is a DBA’s worst nightmare or most delicious fantasy, depending on how you feel about tracking down crazy problems and figuring out how to solve them naively for the entire world.

SO.

VividCortex. I was really curious to see if it could tell me anything new about our systems, given that we have already extensively instrumented them using the sophisticated monitoring platforms at Facebook.

Setup

The setup flow for VividCortex is a delight. It took less than two minutes from generating an account to capturing all metrics for a few machines (the trial period lets you monitor 5 nodes for 14 days). Signup is fun, too: you get a cute little message from the VividCortex team, a tutorial video, and a nudge for how to get live chat support.

I chose to install the agent on each node. You have the option of installing locally or remotely, but you have to install one agent process per monitored node. I sorta wish I could install just one agent, or one per replica set with autodetection for all nodes in the replica set, but as a storage-agnostic monitoring layer this decision makes sense. If I was running this in production, I would probably consider making this part of the chef bootstrap process. It has a supervisor process that restarts the agent if it dies, and the agent polls the VividCortex API to detect any server-side instructions or configuration changes.

I had to input the DB auth credentials, but it automatically detected what type of DB I was running and enabled all the right monitoring plugins — nice touch.

Agent

The agent works by capturing pcaps off the network, reconstructing queries or transactions, and also frequently running “SHOW INNODB STATUS” or “db.serverStatus()” or whatever the equivalent is for that db.

The awesome thing about monitoring over the network is that this gives VividCortex second-level granularity for metrics gathering, and it has less potential impact on your production systems. At Parse we do all our monitoring by tailing logs, reprocessing the logs into a structured format, and aggregating the metrics after that (whether via ganglia or FB systems). This means we have minute-level granularity and often a delay of a couple of minutes before logs are fully processed and stored. On the one hand this means we can use the same unified storage systems for all of our structured logs and metrics, but on the other hand it takes a lot more work upfront to parse the logs, structure the data, and ship it off for storage.

Second-level granularity isn’t a thing that I’ve often longed to have, but it could be that this is just because I’ve never had it before. Also: log files can lie to you. There’s a long-standing bug in MongoDB where the query time logged to disk doesn’t include the time spent waiting to acquire the lock. If you were timing this yourself over the wire, you wouldn’t have this problem. Log files also incur a performance penalty that can be substantial.

Query families

The most impressive feature of VividCortex is really the query family normalization and “top queries” dashboard. As a scrappy startup with limited engineering cycles, this is the most important thing for you to pay attention to. It’s not particularly easy to implement, and every company past a certain scale ends up reinventing the same wheel again. We built something very similar to this at Parse a while back. Before we had it we spent a lot of time tailing and sorting logs, looking for slow queries, running mongotop, sorting by scanned documents and read/write lock time held, and other annoying firefighting techniques.
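As a point of reference for MySQL readers (this is not how VividCortex builds its query families, just the same idea expressed with stock tools), performance_schema's statement digest table normalizes literals away and aggregates per query family; a rough "top queries by total latency" sketch:

-- Rough "top queries" report from the statement digest table
-- (MySQL 5.6+; timer columns are in picoseconds).
SELECT digest_text,
       count_star AS executions,
       ROUND(sum_timer_wait/1e12, 3) AS total_latency_sec,
       sum_rows_examined AS rows_examined
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY sum_timer_wait DESC
 LIMIT 10;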

With the top queries dashboard, you can just watch the list or generate a daily report. Or better yet, train your developers to check it themselves after they ship a change. :)

VividCortex also has a really neat “compare queries” feature, which lets you compare the same query over two different time ranges. This is definitely something we don’t have now, although we can kinda fake it. The “adaptive fault detection” also looks basically like magic, although not yet implemented for MongoDB (it’s a patent-pending method that VividCortex has developed for detecting database stalls).

Live support

Ok, I don’t usually use this kind of thing, but I actually love VividCortex’s built-in live support chat. The techs were incredibly friendly and responsive. We ran into some strange edge cases due to the weirdness of our traffic characteristics, which caused some hiccups getting started. The people manning the chat systems were clearly technical contributors with deep knowledge of the systems, and they were very straightforward about what was happening on the backend, what they had to do to fix it, and when we could expect to get back up and running. Love it.

Things I wish it had

  • I wish it was easier to group query families, queue lengths and load by replica set, not just by single nodes. If you’re sending a lot of queries to secondaries, you need those aggregate counts. You can get around this by creating an “environment” by hand for each replica set (thanks @dbsmasher!), but that’s gonna get painful if you have more than a few replica sets, and it won’t dynamically adjust the host list for a RS when it changes.
  • Comments attached to query families. It’s really nice to be able to attach a comment to the query in the application layer, for example with the line number of the code that’s issuing the query.
  • Some sort of integration with in-house monitoring systems. Like, maybe a REST API that a Nagios check could query for alerting on critical thresholds. This is obviously a pretty complicated request, but my heart longs for it.

Summary

This might be a good time to mention that I’ve always been fairly prejudiced against outsourcing my monitoring and metrics. I hate being paged by multiple systems or having to correlate connected incidents across disparate sources of truth. I still think monitoring sprawl and source-of-truth proliferation is a serious issue for anyone who decides to outsource any or all of their monitoring infrastructure.

But you know what? I’m getting really tired of building monitoring systems over and over again. If I never have to build out another ganglia or graphite system I will be pretty damn happy. Especially since the acquisition, I’ve come to see how wonderful it is when you can let experts do their thing so you don’t have to. And VividCortex is a DB monitoring system built by database experts. They know what information you are going to need to diagnose problems, whether you know it or not. It’s like having a half a DBA on your team.

Monitoring, for me, is starting to cross over that line between “key competency that you should always own in-house” to “commodity service that you should outsource to other companies that are better at it so you can concentrate on your own core product.” In a couple of years, I think we’re all going to look at building our own monitoring pipelines the same way we now look at running our own mail systems and spam filters: mildly insane.

I do still think there’s a lot of efficiencies to aggregating all metrics in the same space. For that reason, I would love to see more crossover and interoperability between deep specialists like VividCortex, and more generalized offerings like Interana and Data Dog, or even on-prem solutions like syncing VividCortex data back to crappy local ganglia instances.

But if I were to go off and do a new startup today? VividCortex would be a really useful tool to have, no question.

Thanks, Charity, for the thoughtful, flattering, and constructive review! See for yourself how VividCortex can revolutionize your monitoring with a free trial.



In Case You Missed It - Breaking Databases - Keeping your Ruby on Rails ORM under Control


Object-relational mapping is common in most modern web frameworks such as Ruby on Rails. For developers, the ORM’s APIs provide simplified interaction with the database and a productivity boost. However, the layer of abstraction the ORM provides can hide how the database is being queried. If you’re not paying attention, these generated queries can have a negative effect on your database’s health and performance.


In this webinar, Owen Zanzal discussed ways common Rails ORMs can abuse various databases and how VividCortex can discover them. Themes covered include N+1 Queries, Missing Indexes, and Caching.
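As a concrete illustration of the N+1 pattern (table and column names here are made up for illustration, not taken from the webinar), this is the shape of the SQL an ORM typically emits, versus the single query that eager loading produces:

-- N+1: one query for the parent rows...
SELECT * FROM posts WHERE published = 1;
-- ...then one query per post for its author (N extra round trips):
SELECT * FROM users WHERE id = 17;
SELECT * FROM users WHERE id = 42;
SELECT * FROM users WHERE id = 99;
-- Eager loading (e.g. Rails includes(:user)) collapses these into one:
SELECT * FROM users WHERE id IN (17, 42, 99);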

If you did not have a chance to join the webinar live, you can register for a recording here.




Brainiac Corner with Camille Fournier


The Brainiac Corner is a format where we talk with some of the smartest minds in the system, database, devops, and IT world. If you have opinions on pirates, or anything else related, please don’t hesitate to contact us.


Today we interview Camille Fournier, the current CTO of Rent the Runway. Follow her on Twitter @skamille.

How did you get from stork to brainiac (i.e. what do you do today and how did you get there)?

I’m currently the CTO of Rent the Runway, a company that rents designer dresses and accessories. My journey into tech is a familiar one; enjoyed computers as a kid, decided that computer science would be a smart area to go into based on the growth of personal computing in the 80s and early 90s, and been happy with it ever since. I ended up at Rent the Runway after a long period at Goldman Sachs doing various software engineering for internal distributed systems. I came to Rent the Runway because I wanted a change, I wanted to try out the startup world and I wanted the opportunity to get into more of a leadership role, and of course I thought that the business had huge potential to change the fashion world. 4 years and 4X team growth later, all of that has happened, and it’s been a rollercoaster and an amazing learning experience.

What is in your group’s technology stack?

We’re relatively conservative. Java (micro)services, MySQL, MongoDB, Redis, RabbitMQ, Ruby that does not touch the database directly, Memcache, JavaScript (Backbone+React), and of course the Objective-C/Swift stuff for our app. We also have Vertica, Scala, and Python for our data processing layer.

Who would win in a fight between ninjas and pirates? Why?

Probably Ninjas unless it’s a ship of Ninjas vs a ship of Pirates, in which case I’m going to bet on the Pirates.

Which is a more accurate state of the world, #monitoringsucks or #monitoringlove?

We’re trending towards #monitoringlove but not there yet. I think people want magical monitoring tools that will eliminate their need to think, and that will never happen, but at least we’re able to get more useful insights now than we have in the recent past.

In six words or less, what are the biggest challenges your organization faces?

Move fast, don’t break too much.

What’s the best piece of advice you’ve ever received?

When you go after what you want, you get what’s in the way.

What principles guide your expertise in your given domain?

In the domain of technical architecture: don’t overbuild too early, write unit tests, think about the nature of the data and functionality you’re working with, and scale on the axes that make sense for the evolution of your business.

In the domain of management and leadership: Be brave, be kind, sometimes the kindest thing is to be brave and tell people the hard truth, remember that people are fellow human beings and everyone is living their own story so try not to overlay your own story on top of them.

In both: Spend the time to get really clear with yourself about what you want, write it down, say it a few different ways. A narrative is needed both for leading people and for leading technology, and the clearer your narrative is the faster you can move and the better the outcome will be.

What is your vision for system administration in 5 years?

It will still exist, and more companies will realize the value of hiring people who actually have expertise in that area. But the people with expertise will also learn that they really do have to meet the rest of the team at least halfway, or the developers will just do a bad job without their input.



The 8 Best Ways To Lose Your DBA


As we all know, good DBAs are a dime a dozen. They’re easy to replace and the cost of replacing them in terms of lost productivity, downtime, recruiting, training, etc is minimal. You may even suspect that your DBA(s) aren’t very good since there is occasional downtime and people complain about the systems running too slowly. Firing people is icky so we’ve identified 8 great ways to encourage your DBA to leave.

8. Specialize Their Role

Nothing puts more pressure on a DBA to perform than being a specialist. A specialist is the only person who has access or knowledge to do something, which means everyone else is going to be coerced into learned helplessness and apathy. Oh, and the bystander effect will run rampant when something goes wrong. “I’m sure the DBA is working on that.”

Yep. You definitely want the DBA’s role to be specialized so they’re properly isolated and all the blame falls on them when anything goes bad. Certainly don’t want developers and other operations staff to be competent with the database!

7. Institute Change Control

Since you’ve created a specialized DBA role in which all database responsibility rests on the DBA(s), you might as well take the next step and institute strong change control. Since the developers have no responsibility for database performance problems they create, they’ll write code recklessly and figure it’s the DBA’s responsibility to fix it. To solve for that, all code changes must be reviewed by the DBA before shipping to production. No changes can happen during business hours. And there will be no changes during critical times like the Super Bowl ads or the holiday shopping season, period.


We’re fully confident this’ll solve all the outages, but as a delicious side effect of this, we’ll also rub the DBA’s nose in a bunch of menial, thankless reviews of code and applications they don’t understand, which should incent them to leave right away.

6. Mismatched Control And Responsibility

Nothing punishes a DBA better than being responsible for systems they can’t control. Naturally, item #7 is designed to create the illusion of control, so when they protest, we can point to that and say “what do you mean you have no control over what queries are running in production?” The DBA is not only wholly responsible for database performance, but also for delays in front-end development and feature roll-out.

5. Make Them A SPOF

If you only have one DBA by instituting #8, 7 and 6 above you’ve done a great job of creating a single point of failure. Even with multiple DBAs you’ve created a team of SPOFs. You can add insult to this injury through promotions. The smartest management move I ever saw was when an overworked DBA (let’s call him Atlas, because he held the world on his shoulders) was promoted. I mean, the man just wouldn’t quit. He was in the office at 2am every week doing the things that management insisted couldn’t be done during work hours, he never got to leave or turn off his cellphone from October through January, and this had gone on for years. Clearly a promotion to DBA Manager was the only way to make him quit. Did it work? Sure did, it only took a week.

4. Give Them Great Tools To Do Their Job

As the VP of Technology, it’s clearly your job to tell the DBA what tools they need to do their job. Make sure you do that. Remember, any production MySQL issue can be properly diagnosed by staring at thousands or tens of thousands of time series charts of SHOW STATUS counters in five-minute resolution, so Cacti or Graphite ought to do the job just fine. If they insist on more than that, you can pretend you’re bending over backwards by giving them Nagios or statsd. These create an illusion of database performance monitoring by creating mountains of false alarms tied to ratios that don’t really mean much.

3. Make Sure Developers Can’t Self-Service

Whatever you do, don’t let the developers get their work done by themselves. The DBA can’t truly be a SPOF if the developers can get stuff done without them. You need developers to go to the DBA with every little database-related request. This will impress upon the DBA their essential role in the organization and how they’re failing to live up to it and need to leave. Coincidentally, using Cacti or Graphite for monitoring will help ensure all DB-related questions can only be answered by the DBA.

2. Insist On Root Cause Analysis

There is always a single root cause. Five whys. It’s a human error problem. Who is the human error? The DBA is. The DBA’s very existence is an error. If there are outages, downtime, sluggish performance, delays in code release the root cause has to be database performance and that is the DBA’s responsibility 100%. Creating a revolving door DBA position will guarantee that the people responsible for the database don’t know much about the system because they just got here. Not that that’s an acceptable excuse.

1. Work-Life Balance Is Overrated

You get the most out of your people by driving them hard. No one ever got good results on the battlefield by handing out Kleenex. No matter how many developers you have, 1 DBA is plenty; in fact try to make it a side responsibility for one of your systems admin folks. If they whine about their burdens, tell them to just work harder. Your DBA should be online or in the office after hours, and if they’re not they’re slackers and should be replaced anyway. Stress, guilt, all encompassing responsibility, shame, and failure are powerful motivators, too.

Conclusions

Remember: as an IT Manager/Director/VP you need to have a scapegoat, and your DBA should be that scapegoat. By placing the DBA in an impossible situation, giving them full responsibility for keeping the systems up and running, and keeping them from having collaborative tools that allow developers to self-service and take responsibility for being the first line of defense against bad queries, you’ll always be able to tell your boss that the reason for the problem is poor database administration.

The alternative to using the DBA as your scapegoat is to have that responsibility fall on you! You might have to take responsibility for building or licensing collaboration tools that allow the whole team to function more efficiently. You might have to build a culture of shared responsibility and teamwork. And, while doing so might improve speed, innovation and help attract and keep top drawer developers, it requires change and change is hard.

Much easier to just churn through DBAs.



Own Your Data, Own Your Management


The following passage is excerpted from the VividCortex eBook The Strategic IT Manager’s Guide to Building a Scalable DBA Team, by Baron Schwartz. This eBook offers Baron’s expert insights and opinions on how top-performing companies manage vast amounts of data, while keeping it secure, available, and performant. The highlighted section examines how a company can look to its DBAs for clues about its greater IT systems, and what these clues might mean about the company in a larger sense. Consider why “DBAs are the canaries in the coal mine” and what you should do with the very valuable information their experiences yield.

To read more from this eBook, you can download a free copy here.

Why DBAs are Canaries in the Coal Mine


If yours is a data-driven organization, your DBAs probably face significant challenges to executing effectively. These typically include the following:

  • The scale and growth of the data. You know that data is big and growing fast, but you might not realize that DBAs are often expected to handle it without an increase in resources. In other words, the data-to-human ratio is growing rapidly.

  • Emerging technologies, which often lack the mature tools DBAs rely on for productivity.

  • The diversity and complexity of databases and application architectures. Polyglot persistence means that DBAs can’t manage a single set of technologies with a single set of tools. And modern applications are almost always distributed with clustering and replication across large numbers of machines.

  • Distributed and outsourced team structures. The pressures of working with remote teams challenge a DBA’s schedule, add friction to interpersonal communications, and complicate office politics.

Because organizations usually view IT as a cost center, outsourcing and “taking away parts of the job” is seen as a smart decision, but can backfire. The truth is that as long as companies own their data, they need to own the management of it too, at least in large part. This is because the DBA role is a critical interface between vital IT teams, and seeking to minimize or eliminate this role can be counter-productive. I will explore this theme throughout the book. DBAs, in fact, can end up being the canaries in the coal mine for IT as a whole. But when the canary is dying, instead of recognizing that there’s an environmental problem, the instinctive response is sometimes “get the canary out of here!”

Here are some of the reasons that trouble with DBAs should be seen as a symptom, rather than a cause, of overall IT dysfunction:

  • Failure to recognize their strategic importance means they aren’t being hired, trained, and managed correctly.

  • They occupy multiple positions of handoff, interaction, and information sharing between different teams.

  • Their duties and knowledge are specialized, leading to a temptation to centralize the burden on them instead of sharing or offloading it.

Key takeaways from this section:

  • IT’s challenge with data management isn’t just a problem to be minimized. It’s an opportunity.

  • IT productivity is a gauge of how effectively you’re managing the interactions between roles in IT.


Again, to read more from The Strategic IT Manager’s Guide to Building a Scalable DBA Team, you can download a free copy here.




MySQL replication in action - Part 3: all-masters P2P topology


Previous episodes:


In the previous article, we saw the basics of establishing replication from multiple origins to the same destination. By extending that concept, we can deploy more complex topologies, such as the point-to-point (P2P) all-masters topology, a robust and fast way of moving data.

Introduction to P2P all-masters topology

A P2P (Point-to-point) topology is a kind of deployment where replication happens in a single step from the producer to the consumers. For example, in a master/slave topology, replication from the master (producer) reaches every slave (consumer) in one step. This is simple P2P replication. If we use a hierarchical deployment, where every slave that is connected to the master is also replicating to one or more slaves, we will have a 2-step replication (Figure 1). Similarly, in circular replication, we have as many steps as the number of nodes minus one (Figure 2.)


Figure 1 - Hierarchical replication depth of processing


Figure 2 - Circular replication depth of processing

Why is this important? The number of steps affects performance, resilience, and, potentially, accuracy.

  • Performance depends on the number of steps. Before the final leaf of the topology graph gets the data, the data is replicated N times, once for each step. In Figure 1, host4 is updated twice as slowly as host2. In Figure 2, host4 is three times slower than host2, as the data has to traverse two additional steps before it reaches its tables.
  • Resilience, or the capacity to withstand failures, also depends on the number of intermediate steps. Intermediate masters are single points of failure (SPOF) that can break a branch of the topology graph, or the whole deployment. In this context, a master/slave deployment has one SPOF; the topology in Figure 1 has 2, and the circular replication has 4 of them.
  • Accuracy can differ depending on whether the data goes from master to slave directly or passes through one or more intermediaries. If data is applied and then extracted again, its chances of reaching the final destination unchanged depend on the intermediate masters having exactly the same configuration as their predecessors in the chain.

With multi-source replication, we can overcome the limitations of circular topologies and create a functionally equivalent deployment that has no SPOF and is, by virtue of its direct connections, faster and potentially more accurate than its predecessors.


Figure 3 - All-masters P2P replication

An all-masters P2P topology is a lot like a fan-in topology, but with as many masters and slaves as there are nodes. If every node is a fan-in slave and also a master, each node can receive data from all the others and send its own data at the same time.


Figure 4 - All-masters P2P replication depth of processing

In an all-masters P2P topology, each node replicates to every other node. Compared to circular replication, this deployment requires more connections per node (it's a small price to pay) but the data flows faster and more cleanly, as the origin of each transaction is easier to track.

Deploying a P2P all-masters topology in MySQL 5.7

The procedure is the same as the one we saw for fan-in replication, with a few differences:

  • Every node needs to be a master, and therefore it must have binary logs configured;
  • The procedure for connecting to the other nodes needs to be repeated for each node. In an N-node deployment, you will end up with N-1 slave channels per node.

We will repeat the installation that we used for FAN-IN, running the same script from mysql-replication-samples. The difference in invocation will be that we ask for ALL-MASTERS:

$ ./multi_source.sh 5.7.8 mysql ALL-MASTERS
installing node 1
installing node 2
installing node 3
installing node 4
group directory installed in $HOME/sandboxes/multi_msb_5_7_8
# server: 1:
# server: 2:
# server: 3:
# server: 4:
# option 'master-info-repository=table' added to node1 configuration file
# option 'relay-log-info-repository=table' added to node1 configuration file
# option 'gtid_mode=ON' added to node1 configuration file
# option 'enforce-gtid-consistency' added to node1 configuration file
# option 'master-info-repository=table' added to node2 configuration file
# option 'relay-log-info-repository=table' added to node2 configuration file
# option 'gtid_mode=ON' added to node2 configuration file
# option 'enforce-gtid-consistency' added to node2 configuration file
# option 'master-info-repository=table' added to node3 configuration file
# option 'relay-log-info-repository=table' added to node3 configuration file
# option 'gtid_mode=ON' added to node3 configuration file
# option 'enforce-gtid-consistency' added to node3 configuration file
# option 'master-info-repository=table' added to node4 configuration file
# option 'relay-log-info-repository=table' added to node4 configuration file
# option 'gtid_mode=ON' added to node4 configuration file
# option 'enforce-gtid-consistency' added to node4 configuration file
# executing "stop" on $HOME/sandboxes/multi_msb_5_7_8
executing "stop" on node 1
executing "stop" on node 2
executing "stop" on node 3
executing "stop" on node 4
# executing "start" on $HOME/sandboxes/multi_msb_5_7_8
executing "start" on node 1
. sandbox server started
executing "start" on node 2
. sandbox server started
executing "start" on node 3
. sandbox server started
executing "start" on node 4
. sandbox server started
# Setting topology ALL-MASTERS
# node node1
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8380, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node2'
START SLAVE for channel 'node2'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8381, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node3'
START SLAVE for channel 'node3'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8382, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node4'
START SLAVE for channel 'node4'
--------------

# node node2
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8379, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node1'
START SLAVE for channel 'node1'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8381, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node3'
START SLAVE for channel 'node3'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8382, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node4'
START SLAVE for channel 'node4'
--------------

# node node3
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8379, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node1'
START SLAVE for channel 'node1'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8380, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node2'
START SLAVE for channel 'node2'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8382, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node4'
START SLAVE for channel 'node4'
--------------

# node node4
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8379, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node1'
START SLAVE for channel 'node1'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8380, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node2'
START SLAVE for channel 'node2'
--------------
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=8381, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_AUTO_POSITION=1 for channel 'node3'
START SLAVE for channel 'node3'
--------------

$HOME/git/mysql-replication-samples/test_all_masters_replication.sh -> $HOME/sandboxes/multi_msb_5_7_8/test_all_masters_replication.sh

The procedure is similar, but since we are connecting all nodes instead of just one, the list of operations is longer. You can see that we have enabled GTID and crash-safe tables, as we did for FAN-IN, and we have executed a grand total of 12 'CHANGE MASTER TO' statements. At the end of the installation, we have a test script that will tell us if replication is working. This script creates one table in each node, and then checks that each node has got 4 tables.

$ ./test_all_masters_replication.sh
# NODE node1 created table test_node1
# NODE node2 created table test_node2
# NODE node3 created table test_node3
# NODE node4 created table test_node4
# Data in all nodes
101
1 101 8379 node1 2015-08-12 19:40:35
1 102 8380 node2 2015-08-12 19:40:35
1 103 8381 node3 2015-08-12 19:40:35
1 104 8382 node4 2015-08-12 19:40:35
102
1 101 8379 node1 2015-08-12 19:40:35
1 102 8380 node2 2015-08-12 19:40:35
1 103 8381 node3 2015-08-12 19:40:35
1 104 8382 node4 2015-08-12 19:40:35
103
1 101 8379 node1 2015-08-12 19:40:35
1 102 8380 node2 2015-08-12 19:40:35
1 103 8381 node3 2015-08-12 19:40:35
1 104 8382 node4 2015-08-12 19:40:35
104
1 101 8379 node1 2015-08-12 19:40:35
1 102 8380 node2 2015-08-12 19:40:35
1 103 8381 node3 2015-08-12 19:40:35
1 104 8382 node4 2015-08-12 19:40:35

The output shows that each node has got 4 tables. Replication is working as expected. We can have a look at the monitoring options, to see how useful and clear they are in this topology.

As we did for fan-in topologies, we load the Sakila database in one of the nodes, to get some differences, and then look at the GTID situation:

$ for N in 1 2 3 4 ; do ./n$N -e 'select @@server_id, @@server_uuid; select @@global.gtid_executed\G'; done
+-------------+--------------------------------------+
| @@server_id | @@server_uuid |
+-------------+--------------------------------------+
| 101 | 18fd3be0-4119-11e5-97cd-24acf2bbd1e4 |
+-------------+--------------------------------------+
*************************** 1. row ***************************
@@global.gtid_executed: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
+-------------+--------------------------------------+
| @@server_id | @@server_uuid |
+-------------+--------------------------------------+
| 102 | 1e629814-4119-11e5-85cf-aac6e218d3d8 |
+-------------+--------------------------------------+
*************************** 1. row ***************************
@@global.gtid_executed: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
+-------------+--------------------------------------+
| @@server_id | @@server_uuid |
+-------------+--------------------------------------+
| 103 | 226e3350-4119-11e5-8242-de985f123dfc |
+-------------+--------------------------------------+
*************************** 1. row ***************************
@@global.gtid_executed: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
+-------------+--------------------------------------+
| @@server_id | @@server_uuid |
+-------------+--------------------------------------+
| 104 | 270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c |
+-------------+--------------------------------------+
*************************** 1. row ***************************
@@global.gtid_executed: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3

It's not a pretty sight. It's what we saw for fan-in, but multiplied by 4. Now we know that the price to pay for this efficient topology is an increase in monitoring complexity.

Let's have a look inside:

node1 [localhost] {msandbox} ((none)) > SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 8380
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 1164948
Relay_Log_File: mysql-relay-node2.000002
Relay_Log_Pos: 1165161
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1164948
Relay_Log_Space: 1165370
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 102
Master_UUID: 1e629814-4119-11e5-85cf-aac6e218d3d8
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 1e629814-4119-11e5-85cf-aac6e218d3d8:1-119
Executed_Gtid_Set: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: node2
*************************** 2. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 8381
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 891
Relay_Log_File: mysql-relay-node3.000002
Relay_Log_Pos: 1104
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 891
Relay_Log_Space: 1313
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 103
Master_UUID: 226e3350-4119-11e5-8242-de985f123dfc
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 226e3350-4119-11e5-8242-de985f123dfc:1-3
Executed_Gtid_Set: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: node3
*************************** 3. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 8382
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 891
Relay_Log_File: mysql-relay-node4.000002
Relay_Log_Pos: 1104
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 891
Relay_Log_Space: 1313
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 104
Master_UUID: 270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
Executed_Gtid_Set: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: node4
3 rows in set (0.00 sec)

This is a partial view of replication in this deployment. It only applies to node #1, where we see the status of its slave channels. We need to run the same command in all nodes to make sure that replication is healthy everywhere. As mentioned before, we have 12 channels to monitor. Looking at one node only will give us a possibly misleading picture.
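One small mitigation: rather than paging through the full multi-row output, MySQL 5.7 lets us address a single channel at a time, so each of the twelve channels can be inspected (or stopped and restarted) on its own:

-- Per-channel variants of the usual commands (MySQL 5.7):
SHOW SLAVE STATUS FOR CHANNEL 'node2'\G
STOP SLAVE FOR CHANNEL 'node2';
START SLAVE FOR CHANNEL 'node2';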

And here we can see once more why it was a bad decision not to have a table for master status:

node1 [localhost] {msandbox} (mysql) > show master status\G
*************************** 1. row ***************************
File: mysql-bin.000002
Position: 891
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: 18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,
1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,
226e3350-4119-11e5-8242-de985f123dfc:1-3,
270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3
1 row in set (0.00 sec)

If we want to match GTID positions in master and slave, we need to get the value of Executed_Gtid_Set from SHOW MASTER STATUS, or the same information from @@global.gtid_executed, then find the GTID set belonging to this master's server_uuid within that long string, and finally extract the GTID sequence (a SQL sketch of this extraction follows the list below).

  1. Get the raw info: "18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3,1e629814-4119-11e5-85cf-aac6e218d3d8:1-119,226e3350-4119-11e5-8242-de985f123dfc:1-3,270c0ebe-4119-11e5-a1c9-b7fbc4e42c2c:1-3"
  2. Find the server UUID: "18fd3be0-4119-11e5-97cd-24acf2bbd1e4"
  3. Find the relevant GTID: "18fd3be0-4119-11e5-97cd-24acf2bbd1e4:1-3"
  4. Extract the GTID: "3"
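
A minimal sketch of those four steps as a single SQL statement, run on the master itself (pure string manipulation on @@global.gtid_executed; in practice you would more likely do this in your monitoring script):

-- Strip whitespace, isolate this server's own GTID range, keep the last sequence number.
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           SUBSTRING_INDEX(
             REPLACE(REPLACE(@@global.gtid_executed, '\n', ''), ' ', ''),
             CONCAT(@@server_uuid, ':'), -1),
           ',', 1),
         '-', -1) AS own_gtid_sequence;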

The information in mysql.slave_relay_log_info and performance_schema.replication_* tables will not help us to simplify the task of monitoring replication. All the shortcomings that we have noticed for fan-in are also present for all-masters topologies. The main difference is that the information in SHOW MASTER STATUS and SHOW SLAVE STATUS is more crowded.
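
For completeness, this is the kind of per-channel summary those performance_schema tables do give us (a sketch using the 5.7 table and column names; note that they report the received transaction set, not the executed one):

SELECT CHANNEL_NAME, SERVICE_STATE, RECEIVED_TRANSACTION_SET
  FROM performance_schema.replication_connection_status;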

Deploying a P2P all-masters topology in MariaDB 10

The installation is fairly similar to MySQL 5.7. We only see the same syntax differences already noted for fan-in topologies.

$ ./multi_source.sh ma10.0.20 mariadb ALL-MASTERS
installing node 1
installing node 2
installing node 3
installing node 4
group directory installed in $HOME/sandboxes/multi_msb_ma10_0_20
# server: 1:
# server: 2:
# server: 3:
# server: 4:
# server: 1:
# server: 2:
# server: 3:
# server: 4:
# Setting topology ALL-MASTERS
# node node1
--------------
CHANGE MASTER 'node2' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19022, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node2'
--------------
CHANGE MASTER 'node3' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19023, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node3'
--------------
CHANGE MASTER 'node4' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19024, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node4'
--------------

# node node2
--------------
CHANGE MASTER 'node1' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19021, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node1'
--------------
CHANGE MASTER 'node3' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19023, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node3'
--------------
CHANGE MASTER 'node4' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19024, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node4'
--------------

# node node3
--------------
CHANGE MASTER 'node1' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19021, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node1'
--------------
CHANGE MASTER 'node2' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19022, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node2'
--------------
CHANGE MASTER 'node4' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19024, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node4'
--------------

# node node4
--------------
CHANGE MASTER 'node1' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19021, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node1'
--------------
CHANGE MASTER 'node2' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19022, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node2'
--------------
CHANGE MASTER 'node3' TO MASTER_HOST='127.0.0.1', MASTER_PORT=19023, MASTER_USER='rsandbox', MASTER_PASSWORD='rsandbox', MASTER_USE_GTID=current_pos
START SLAVE 'node3'
--------------

And the test script produces similar results:

$ ./test_all_masters_replication.sh
# NODE node1 created table test_node1
# NODE node2 created table test_node2
# NODE node3 created table test_node3
# NODE node4 created table test_node4
# Data in all nodes
101
1 101 19021 node1 2015-08-12 20:20:46
1 102 19021 node2 2015-08-12 20:20:47
1 103 19021 node3 2015-08-12 20:20:47
1 104 19021 node4 2015-08-12 20:20:47
102
1 101 19022 node1 2015-08-12 20:20:46
1 102 19022 node2 2015-08-12 20:20:47
1 103 19022 node3 2015-08-12 20:20:47
1 104 19022 node4 2015-08-12 20:20:47
103
1 101 19023 node1 2015-08-12 20:20:46
1 102 19023 node2 2015-08-12 20:20:47
1 103 19023 node3 2015-08-12 20:20:47
1 104 19023 node4 2015-08-12 20:20:47
104
1 101 19024 node1 2015-08-12 20:20:46
1 102 19024 node2 2015-08-12 20:20:47
1 103 19024 node3 2015-08-12 20:20:47
1 104 19024 node4 2015-08-12 20:20:47

After loading the Sakila database into node #2, we see a familiar pattern, already noted for fan-in. The GTID is shown as a comma-delimited list of all the data streams that have converged on each server.

$ for N in 1 2 3 4; do ./n$N -e 'select @@server_id; select @@global.gtid_current_pos\G' ; done
+-------------+
| @@server_id |
+-------------+
| 101 |
+-------------+
*************************** 1. row ***************************
@@global.gtid_current_pos: 1020-102-119,1040-104-3,1030-103-3,1010-101-3
+-------------+
| @@server_id |
+-------------+
| 102 |
+-------------+
*************************** 1. row ***************************
@@global.gtid_current_pos: 1010-101-3,1040-104-3,1030-103-3,1020-102-119
+-------------+
| @@server_id |
+-------------+
| 103 |
+-------------+
*************************** 1. row ***************************
@@global.gtid_current_pos: 1010-101-3,1040-104-3,1020-102-119,1030-103-3
+-------------+
| @@server_id |
+-------------+
| 104 |
+-------------+
*************************** 1. row ***************************
@@global.gtid_current_pos: 1010-101-3,1030-103-3,1020-102-119,1040-104-3

Looking at SHOW ALL SLAVES STATUS, there are no surprises. The information that was missing from fan-in (GTID executed) is still missing from the slave status.

node1 [localhost] {msandbox} ((none)) > SHOW ALL SLAVES STATUS\G
*************************** 1. row ***************************
Connection_name: node2
Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19022
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 3230973
Relay_Log_File: mysql-relay-node2.000002
Relay_Log_Pos: 3231260
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 3230973
Relay_Log_Space: 3231559
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 102
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1020-102-119
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 263
Slave_received_heartbeats: 3
Slave_heartbeat_period: 1800.000
Gtid_Slave_Pos: 1020-102-119,1040-104-3,1030-103-3,1010-101-3
*************************** 2. row ***************************
Connection_name: node3
Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19023
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 882
Relay_Log_File: mysql-relay-node3.000002
Relay_Log_Pos: 1169
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 882
Relay_Log_Space: 1468
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 103
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1030-103-3
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 14
Slave_received_heartbeats: 3
Slave_heartbeat_period: 1800.000
Gtid_Slave_Pos: 1020-102-119,1040-104-3,1030-103-3,1010-101-3
*************************** 3. row ***************************
Connection_name: node4
Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19024
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 882
Relay_Log_File: mysql-relay-node4.000002
Relay_Log_Pos: 1169
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 882
Relay_Log_Space: 1468
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 104
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1040-104-3
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 14
Slave_received_heartbeats: 3
Slave_heartbeat_period: 1800.000
Gtid_Slave_Pos: 1020-102-119,1040-104-3,1030-103-3,1010-101-3
3 rows in set (0.00 sec)

The contents of the crash-safe tables do not offer any surprises either. They are the same as what we have seen for fan-in, multiplied by 4.

$ for N in 1 2 3 4; do ./n$N -e 'select @@server_id, @@gtid_domain_id; select * from mysql.gtid_slave_pos' ; done
+-------------+------------------+
| @@server_id | @@gtid_domain_id |
+-------------+------------------+
| 101 | 1010 |
+-------------+------------------+
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
| 1020 | 124 | 102 | 118 |
| 1020 | 125 | 102 | 119 |
| 1030 | 5 | 103 | 2 |
| 1030 | 6 | 103 | 3 |
| 1040 | 8 | 104 | 2 |
| 1040 | 9 | 104 | 3 |
+-----------+--------+-----------+--------+

+-------------+------------------+
| @@server_id | @@gtid_domain_id |
+-------------+------------------+
| 102 | 1020 |
+-------------+------------------+
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
| 1010 | 2 | 101 | 2 |
| 1010 | 3 | 101 | 3 |
| 1030 | 5 | 103 | 2 |
| 1030 | 6 | 103 | 3 |
| 1040 | 8 | 104 | 2 |
| 1040 | 9 | 104 | 3 |
+-----------+--------+-----------+--------+

+-------------+------------------+
| @@server_id | @@gtid_domain_id |
+-------------+------------------+
| 103 | 1030 |
+-------------+------------------+
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
| 1010 | 2 | 101 | 2 |
| 1010 | 3 | 101 | 3 |
| 1020 | 124 | 102 | 118 |
| 1020 | 125 | 102 | 119 |
| 1040 | 8 | 104 | 2 |
| 1040 | 9 | 104 | 3 |
+-----------+--------+-----------+--------+

+-------------+------------------+
| @@server_id | @@gtid_domain_id |
+-------------+------------------+
| 104 | 1040 |
+-------------+------------------+
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
| 1010 | 2 | 101 | 2 |
| 1010 | 3 | 101 | 3 |
| 1020 | 124 | 102 | 118 |
| 1020 | 125 | 102 | 119 |
| 1030 | 8 | 103 | 2 |
| 1030 | 9 | 103 | 3 |
+-----------+--------+-----------+--------+

Summing up

Using the methods already learned for fan-in deployments, an all-masters P2P topology is easy to install, although the procedure is longer and more complex.

Monitoring this topology presents the same hurdles already seen for fan-in, increased by the number of connections. For an N-node deployment, we will need to monitor N*(N-1) channels.

The lack of a table for master status is felt more acutely in this topology, as the current data is more difficult to parse.

What's next

This topology shows that we can deploy a very efficient multi-source replication system, at the expense of having many connections and enduring more complex monitoring data.

We can, however, compromise between the need for many masters and the complexity of the deployment. We will see the star topology, where, by introducing a SPOF into the system, we can deploy a more agile all-masters topology. And we will also see some hybrid deployments, all made possible by the multi-source enhancements in MySQL 5.7 and MariaDB 10.



Become a MySQL DBA blog series - The Query Tuning Process


Query tuning is something that a DBA does on a daily basis - analysing queries and updates, how these interact with the data and schema, and optimizing for performance. This is an extremely important task, as this is where database performance can be significantly improved - sometimes by orders of magnitude.

In the next few posts, we will cover the basics of query tuning - indexing, what types of queries to avoid, optimizer hints, EXPLAIN and execution plans, schema tips, and so on. We will start, though, by discussing the process of query review - how to gather data and which methods are the most efficient.

This is the ninth installment in the 'Become a MySQL DBA' blog series. Our previous posts in the DBA series include Configuration Tuning, Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

Data gathering

There are a couple of ways to grab information about queries that are executed on the database. MySQL itself provides three ways - general log, binlog and slow query log.

General log

The general log is the least popular option, as it causes a significant amount of logging and has a high impact on overall performance (thanks to PavelK for correcting us in the comments section). It does, though, store data about the queries being executed, together with the information needed to assess how long a given query took.

                   40 Connect   root@localhost on sbtest
                   40 Query     set autocommit=0
                   40 Query     set session read_buffer_size=16384
                   40 Query     set global read_buffer_size=16384
                   40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:37:42    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:41:45    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:45:46    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:49:56    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:54:08    40 Quit

Given the additional impact, the general log is not really a feasible way of collecting slow queries, but it still can be a valid source if you have it enabled for some other reason.

Binary log

Binary logs store all modifications that were executed on the database - this is used for replication or for point-in-time recovery. There's no reason, though, why you couldn't use this data to check the performance of DML statements - as long as the query has been logged in its original format. This means that you should be ok for the majority of the writes as long as the binlog format is set to 'mixed'. Even better would be to use the 'statement' format, but it's not recommended due to possible issues with data consistency between the nodes. The main difference between 'statement' and 'mixed' formats is that in 'mixed' format, all queries which might cause inconsistency are logged in the safe 'row' format. This format, though, doesn't preserve the original query statement and so the data cannot be used for a query review.

If these requirements are fulfilled, binary logs will give us enough data to work on - the exact query statement and the time taken to execute it on the master. Note that this is not a very popular way of collecting the data. It has its own uses, though. For example, if we are concerned about the write traffic, using binary logs is a perfectly valid way of getting the data, especially if they are already enabled.
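
As a quick sanity check that statements were indeed logged in a reviewable form, the binary log contents can be inspected straight from the client (the log file name below is just an example):

SHOW BINARY LOGS;                                  -- list the available binary log files
SHOW BINLOG EVENTS IN 'mysql-bin.000002' LIMIT 10; -- statement-format events show the original query text in the Info column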

Slow query log

The slow query log is probably the most common source of information for slow queries. It was designed to log the most important information about the queries - how long they took, how many rows were scanned, how many rows were sent to the client.

The slow query log can be enabled by setting the slow_query_log variable to 1. Its location can be set using the slow_query_log_file variable. Another variable, long_query_time, sets the threshold above which queries are logged. By default it's 10 seconds, which means queries that execute in under 10 seconds will not be logged. This variable is dynamic and you can change it at any time. If you set it to 0, all queries will be logged in the slow log. It is possible to use fractions when setting long_query_time, so settings like 0.1 or 0.0001 are valid.
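
For reference, a minimal set of statements to switch it on at runtime could look like this (the path and threshold are examples only; remember that the long_query_time change only applies to new connections, as explained below):

SET GLOBAL slow_query_log      = 1;
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';  -- example path
SET GLOBAL long_query_time     = 0.5;                        -- log queries slower than 0.5 seconds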

What you need to remember when dealing with long_query_time is that a change on the global level affects only new connections. When changing it in a session, it affects the current session only (as one would expect). If you use some kind of connection pooling, this may become a significant issue. Percona Server has an additional variable, slow_query_log_use_global_control, which eliminates this drawback - it makes long_query_time (and a couple of other slow log related settings introduced in Percona Server) a truly dynamic variable, affecting also currently open sessions.

Let’s take a look at the content of the slow query log:

# Time: 150812  8:25:19
# User@Host: root[root] @ localhost []  Id:    39
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 238.396414  Lock_time: 0.000130  Rows_sent: 1  Rows_examined: 59901000  Rows_affected: 0
# Bytes_sent: 69
SET timestamp=1439367919;
select count(*) from sbtest.sbtest3 where pad like '6%';

In this entry we can see information about the time when the query was logged, the user who executed the query, thread id inside MySQL (something you’d see as Id in your processlist output), current schema, whether it failed with some error code or whether it was killed or not. Then we have the most interesting data: how long it took to execute this query, how much of this time was spent on row level locking, how many rows were sent to the client, how many rows were scanned in MySQL, how many rows were modified by the query. Finally we have info on how many bytes were sent to the client, timestamp of the time when the query was executed and the query itself.

This gives us a pretty good idea of what may be wrong with the query. Query_time is obvious - the longer a query takes to execute, the more impact it will have on the server. But there are also other clues. For example, if we see that the number of rows examined is high compared to the rows sent, it may mean that the query is not indexed properly and is scanning many more rows than it should. A high lock time can be a hint that we are suffering from row-level locking contention. If a query had to wait some time to grab all the locks it needed, something is definitely not right. It could be that some other long-running query already acquired some of the needed locks, or that there are long-running transactions that stay open and do not release their locks. As you can see, even such simple information may be of great value for a DBA.

Percona Server can log some additional information in the slow query log - you can manage what is being logged using the log_slow_verbosity variable. Below is a sample of such data:

# Time: 150812  9:41:32
# User@Host: root[root] @ localhost []  Id:    44
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 239.092995  Lock_time: 0.000085  Rows_sent: 1  Rows_examined: 59901000  Rows_affected: 0
# Bytes_sent: 69  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: 13B08
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
#   InnoDB_IO_r_ops: 820579  InnoDB_IO_r_bytes: 13444366336  InnoDB_IO_r_wait: 206.397731
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 65314
SET timestamp=1439372492;
select count(*) from sbtest.sbtest3 where pad like '6%';

As you can see, we have all the data from the default settings and much more. We can see if a query created temporary tables, how many in total and how many of them were created on disk. We can also see the total size of those tables. This data gives us important insight - are temporary tables an issue or not? Small temporary tables may be fine, larger ones can significantly impact overall performance. Next, we see some characteristics of the query - did it use the query cache, did it make a full table scan? Did it make a join without using indexes? Did it run a sort operation? Did it use disk for the sorting? How many merge passes did the filesort algorithm have to do?

The next section contains information about InnoDB - how many read operations did it have to do? How many bytes of data did it read? How much time did MySQL spend waiting for InnoDB I/O activity, row lock acquisition, or in the queue to start being processed by InnoDB? We can also see the approximate number of unique pages that the query accessed. Finally, we see the timestamp and the query itself.

This additional information is useful to pinpoint the problem with a query. By looking at our example it’s clear that the query does not use any index and it makes a full table scan. By looking at the InnoDB data we can confirm that the query did a lot of I/O - it scanned ~13G of data. We can also confirm that the majority of the query execution time (239 seconds) was spent on InnoDB I/O (InnoDB_IO_r_wait - 206 seconds).
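
If you run Percona Server and want entries like the one above, the extended logging can be enabled at runtime (a sketch; check the Percona Server documentation for the full list of accepted values):

-- Percona Server only; this variable is not available in upstream MySQL.
SET GLOBAL log_slow_verbosity = 'full';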

Impact of the slow log on the performance

The slow log is definitely a great way of collecting data about the performance of queries. Unfortunately, it comes at a price - enabling the slow query log adds extra load on MySQL, load that impacts overall performance. We are talking here about both throughput and stability: we do not want to see drops in performance, as they don't play well with user experience and may cause additional problems such as temporary pileups in queries, scanned rows etc.

We've prepared a very simple and generic benchmark to show you the impact of the different slow log verbosity levels. We've set up an m4.xlarge instance on AWS and we used sysbench to build a single table. The workload is two threads (m4.xlarge has four cores), it's read-only and the data set fits in memory - a very simple CPU-bound workload. Below is the exact sysbench command:

sysbench \
--test=/root/sysbench/sysbench/tests/db/oltp.lua \
--num-threads=2 \
--max-requests=0 \
--max-time=600 \
--mysql-host=localhost \
--mysql-user=sbtest \
--mysql-password=sbtest \
--oltp-tables-count=1 \
--oltp-read-only=on \
--oltp-index-updates=200 \
--oltp-non-index-updates=10 \
--report-interval=1 \
--oltp-table-size=800000 \
run

We used four verbosity stages for the slow log:

  • disabled
  • enabled, no Percona Server features
  • enabled, log_slow_verbosity='full'
  • enabled, log_slow_verbosity='full,profiling_use_getrusage,profiling'

The last one adds profiling information for a query - time spent in each of the states it went through and overall CPU time used for it. Here are the results:

As you can see, throughput-wise the impact is not that bad, except for the most verbose option. Unfortunately, the throughput is not stable - as you can see, there are many periods when no transaction was executed, and this is true for all runs with the slow log enabled. This is a significant drawback of using the slow log to collect the data - if you are ok with some impact, it's a great tool. If not, we need to look for an alternative. Of course, this is a very generic test and your mileage may vary - the impact may depend on many factors: CPU utilization, I/O throughput, number of queries per second, exact query mix etc. If you are interested in checking the impact on your system, you need to perform tests on your own.

Using tcpdump to grab the data

As we have seen, using slow log, while allowing you to collect a great deal of information, significantly impacts the throughput of the server. That’s why yet another way of collecting the data was developed. The idea is simple - MySQL sends the data over the network so all queries are there. If you capture the traffic between the application and the MySQL server, you’ll have all the queries exchanged during that time. You know when a query started, you know when a given query finished - this allows you to calculate the query’s execution time.

Using the following command you can capture the traffic hitting port 3306 on a given host.

tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp.txt

Of course, it still causes some performance impact; let's compare it with a clean server, with no slow log enabled:

As you can see, total throughput is lower and spiky, but it's somewhat more stable than when the slow log is enabled. What's more important, tcpdump can be executed on the MySQL host, but it can also be used on a proxy node (you may need to change the port in some cases) to ease the load on the MySQL node itself. In that case the performance impact will be even lower. Of course, tcpdump can't provide you with as detailed information as the query log does - all you can grab is the query itself and the amount of data sent from the server to the client. There's no info about rows scanned or sent, no info on whether a query created a temporary table or not, nothing - just the execution time and query size.

Given the fact that both main methods, slow log and tcpdump, have their pros and cons, it's common to combine them. You can use long_query_time to filter out most of the queries and log only the slowest ones. You can use tcpdump to collect the data on a regular basis (or even all the time) and use the slow log in particular cases, if you find an intriguing query. You can use data from the slow log only for thorough query reviews that happen a couple of times a year and stick to tcpdump on a daily basis. Those two methods complement each other, and it's up to the DBA to decide how to use them.

Once we have data captured using any of the methods described in this blog post, we need to process it. While data in the log files can easily be read by a human, and it's not rocket science to print and parse data captured by tcpdump, this is not really the way you'd like to approach a query review - it's too hard to get the whole picture and there's too much noise. You need something that will aggregate the information you collected and present you with a nice summary. We'll discuss such a tool in the next post in this series.

 




Proposal to extend binary operators in MySQL


In order to make it easier to work with data stored as binary strings (BINARY/VARBINARY), we are considering extending the &, |, ^, ~, <<, >> operators to accept any-size binary strings and return binary string data as the response. This can be useful for several complex data types not covered by the basic SQL data types (e.g. working with IPv6 addresses, manipulating UUIDs, etc).

Motivation

Let's say we're interested in getting all the networks that contain a given IP address. With IPv4 the common practice is to store the IP addresses as INT and execute:

SELECT inet_ntoa(network) AS network, inet_ntoa(netmask) AS netmask FROM network_table WHERE (inet_aton('192.168.0.30') & netmask) = network;

At the moment you cannot do the same with IPv6, because inet6_aton('2001:0db8:85a3:0000:0000:8a2e:0370:7334') & netmask converts both operands from VARBINARY to BIGINT, resulting in data truncation; when the & operation gets executed, the result is incorrect. But if the & operator could work directly on BINARY/VARBINARY data, this would have been possible:

SELECT inet6_ntoa(network) AS network, inet6_ntoa(netmask) AS netmask FROM network_table WHERE (inet6_aton('2001:0db8:85a3:0000:0000:8a2e:0370:7334') & netmask) = network;

The SQL standard does not define bitwise operations over any-size binary string data but it does define binary string comparison:

All binary string values are comparable. When binary large object string values are compared, they shall have exactly the same length (in octets) to be considered equal. Binary large object string values can be compared only for equality. For binary string values other than binary large object string values, it is implementation-defined whether trailing X'00's are considered significant when comparing two binary string values that are otherwise equivalent.

Thus, the standard allows binary strings to be zero-padded in the least significant part (right side) for comparisons. If you're interpreting binary data as a hexadecimal unsigned integer, you would expect the operand with the smaller size to be zero-padded on the left side. So an easy approach to avoid confusion would be to allow the operators (^, &, |) to work only with same-size operands, thus avoiding any confusion over whether padding occurs in the most or least significant part.

Another aspect worth mentioning is that the old behavior with INT arguments would be preserved. For example, SELECT 23 & 4 would still return a numeric BIGINT result: 4.

MySQL has two ways of representing hexadecimal string literals: x'val' and 0xval (where val contains hexadecimal digits 0-9, a-f, A-F). The difference between the two is that the first is SQL standard and has a constraint: the number of hexadecimal digits must be even; the second version is not standard and does not require an even number of digits (it will be zero-padded on the left side in case of an odd number of digits). But there is one issue: in numeric contexts, hexadecimal values act like integers (64-bit precision), while in string contexts they act like binary strings. So currently, when executing SELECT x'10000001' & x'00000001', the operands get converted from VARBINARY to BIGINT (int64), with loss of any parts beyond 64 bits, and this returns BIGINT; with our change, this would return BINARY(4), breaking existing applications. That's something that can be solved with a new sql_mode named BINARY_BIT_OPS_YIELD_BINARY, off by default for backward compatibility; if turned on, bit operations on binary strings will yield a BINARY result (or an error if the operands are not of the same size).
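
To make the compatibility concern concrete (the second part describes the proposed behavior, which is hypothetical and does not exist in any released version):

-- Today: both operands are converted to BIGINT, so the result is the integer 1.
SELECT x'10000001' & x'00000001';
-- With the proposed BINARY_BIT_OPS_YIELD_BINARY mode enabled, the same expression
-- would instead return a BINARY(4) value, i.e. x'00000001'.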

The alternative would be to introduce new functions, for example bin_and(x,y), bin_or(x,y), bin_xor(x,y), which would take two binary string arguments of the same length and return a binary string. We are also considering other names such as binary_and(x,y), binary_or(x,y), binary_xor(x,y).
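
With such functions, the IPv6 query from the motivation section would be written as follows (hypothetical syntax, since these functions are only a proposal):

SELECT inet6_ntoa(network) AS network, inet6_ntoa(netmask) AS netmask
FROM network_table
WHERE bin_and(inet6_aton('2001:0db8:85a3:0000:0000:8a2e:0370:7334'), netmask) = network;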

Pros:

  • No compatibility issues
  • Bitwise operators (&, |, ^) would remain operators that yield integer results (preserving existing functionality), whereas the new functions would yield binary string results

Cons:

  • It's longer to type bin_and()/binary_and() than &
  • This creates new syntax for users to learn
  • bit_and(x) already exists as an aggregate function; we fear this could confuse users, though fortunately both functions do a similar thing (they AND bits).

Please let us know in a comment below what your opinions are on this:

  • Is it a good idea to implement bitwise operations for binary strings?
  • Is the BINARY_BIT_OPS_YIELD_BINARY sql_mode necessary?
  • Can you think of other use cases where this can be useful (UUID handling is one such case)?

Thanks to Catalin Besleaga on the optimizer team for ghost writing this post.



Transactions with RocksDB

RocksDB has work-in-progress to support transactions via optimistic and pessimistic concurrency control. The features need more documentation but we have shared the API, additional code for pessimistic and optimistic and examples for pessimistic and optimistic. Concurrency control is a complex topic (see these posts) and is becoming popular again for academic research. An awesome PhD thesis on serializable snapshot isolation by Michael Cahill ended up leading to an implementation in PostgreSQL.

We intend to use the pessimistic CC code for MyRocks, the RocksDB storage engine for MySQL. We had many discussions about the repeatable read semantics in InnoDB and PostgreSQL and decided on Postgres-style. That is my preference because the gap locking required by InnoDB is more complex.

MongoRocks uses a simpler implementation of optimistic CC today and a brief discussion on CC semantics for MongoDB is here. AFAIK, write-write conflicts can be raised, but many are caught and retried internally. I think we need more details. This is a recent example of confusion about the current behavior.

Thanks go to Anthony for doing the hard work on this.



Installing MySQL from source – CMAKE issues


Today's topic is related primarily to compiling MySQL from source using CMAKE and the kinds of issues we encounter during this task.
We want to install MySQL 5.6.19 with Debug+Valgrind etc.
On CentOS 7, here is my starting set of dependency packages:

[root@centos-base ~]# yum install zlib zlib-devel openssl openssl-devel valgrind valgrind-devel cmake gcc cpp ncurses ncurses-devel

Here is my CMAKE command:

[root@centos-base mysql-5.6.19]# cmake -DCMAKE_INSTALL_PREFIX=/opt/mysql-5.6.19 -DMYSQL_DATADIR=/var/lib/mysql -DSYSCONFDIR=/opt/mysql-5.6.19 -DWITH_SSL=system -DMYSQL_TCP_PORT=3306 -DMYSQL_UNIX_ADDR=/opt/mysql-5.6.19/mysqld.sock -DDEFAULT_CHARSET=utf8 -DDEFAULT_COLLATION=utf8_general_ci -DWITH_DEBUG=1 -DCOMPILATION_COMMENT="Shahriyar Rzayev's CentOS MySQL-5.6.19" -DOPTIMIZER_TRACE=1 -DWITH_ZLIB=system -DWITH_VALGRIND=1 -DCMAKE_C_FLAGS=-DHAVE_purify -DCMAKE_CXX_FLAGS=-DHAVE_purify

If you try to run this command, the first ERROR will be:

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: Internal CMake error, TryCompile configure of cmake failed
-- Performing Test HAVE_PEERCRED - Failed
-- Library mysqlclient depends on OSLIBS -lpthread;/usr/lib64/libz.so;m;/usr/lib64/libssl.so;/usr/lib64/libcrypto.so;dl
-- Googlemock was not found. gtest-based unit tests will be disabled. You can run cmake . -DENABLE_DOWNLOADS=1 to automatically download and build required components from source.

Googlemock was not found -> to resolve this issue, add -DENABLE_DOWNLOADS=1 to the CMAKE command.

After that you will see that it downloads the necessary package:

-- Library mysqlclient depends on OSLIBS -lpthread;/usr/lib64/libz.so;m;/usr/lib64/libssl.so;/usr/lib64/libcrypto.so;dl
-- Successfully downloaded http://googlemock.googlecode.com/files/gmock-1.6.0.zip to /root/mysql-5.6.19/source_downloads

The second issue you will likely see:

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: Internal CMake error, TryCompile configure of cmake failed

The problem is the missing gcc-c++ package, so install it:

[root@centos-base ~]# yum install gcc-c++

The third one is a warning: Bison executable not found in PATH.
Again, just install the bison package:

[root@centos-base ~]# yum install bison

To sum up, the following packages should be installed on the server prior to compiling MySQL:

[root@centos-base ~]# yum install zlib zlib-devel openssl openssl-devel valgrind valgrind-devel cmake gcc cpp ncurses ncurses-devel bison gcc-c++

After getting the success message:

-- Configuring done
-- Generating done
-- Build files have been written to: /root/mysql-5.6.19

Just run make and then make install.

The post Installing MySQL from source – CMAKE issues appeared first on Azerbaijan MySQL UG.



A followup on show_compatibility_56


Giuseppe and Shlomi both blogged on one of the recent changes introduced to MySQL 5.7.8-rc, where the setting show_compatibility_56 is now set OFF by default.

Both raise very good points. Here is how we plan to address them:

  1. The permissions issue reported by Giuseppe will be fixed.
  2. When selecting from information_schema tables in show_compatibility_56=OFF mode, an error will now be produced:

    mysql> select * from information_schema.global_status;
    ERROR 3167 (HY000): The 'INFORMATION_SCHEMA.GLOBAL_VARIABLES' feature is
    disabled; see the documentation for 'show_compatibility_56'

    (Previously this was a warning + empty set returned)
  3. When show_compatibility_56=ON, users will now be able to select from either information_schema or performance_schema tables. This presents a more viable upgrade path for users that need some period to transition (a short example of the performance_schema equivalents is shown below).
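
    As an example of the performance_schema equivalents (a quick sketch using the 5.7 table names):

    SELECT * FROM performance_schema.global_status LIMIT 5;
    SELECT * FROM performance_schema.global_variables
     WHERE VARIABLE_NAME = 'max_connections';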

The show_compatibility_56 setting itself will remain deprecated, as it was from its introduction. This signifies that we intend to remove INFORMATION_SCHEMA.GLOBAL_VARIABLES in a future release.

Outside of the scope of today's update, my colleague Mark Leith has also volunteered to blog with examples of how the new performance_schema tables now expose variables down to the thread-level. This will add more context as to why we are moving this meta data from information_schema to performance_schema. Thanks Mark!

So thank you again to Giuseppe and Shlomi for helping make a better MySQL. We're delighted to incorporate your feedback!



MariaDB on IBM z Systems with Linux

Tue, 2015-08-18 15:06

The Convergence of Enterprise-Class and Choice – an Open Source Ecosystem Running on a True Enterprise-Class Platform: MariaDB Enterprise on IBM z Systems

While selling free software sounds like a hard job, the growing mainstream adoption of open source makes it one of the most exciting segments in IT. In my daily meetings with customers, there is no ambiguity that open source is the foundation for the hottest trends in technology, including Cloud, Analytics, Mobile and Social. Gartner reports that 70% of all new applications are built using an open source database. (1)

Yet until now, many customers had to make compromises related to leveraging the innovation, high-quality code, lower costs, and choice inherent in open source, versus leveraging highly secure, scalable and power efficient enterprise-class z Systems offered by IBM.

Today at LinuxCon North America that changes. We at MariaDB Corporation are proud to be part of a new ecosystem with IBM that enables MariaDB Enterprise to run on Linux on z Systems as well as on the new IBM LinuxONE. MariaDB on z Systems extends our current enterprise-level open source offerings, which deliver Community-driven innovation and a comprehensive value proposition; combined with performance, high availability and security at web-scale.

Benefits to Our Customers

Choice – Now, with the availability of MariaDB Enterprise on z Systems, clients can have the best of both worlds—an open source ecosystem running on a true enterprise-class platform.

Security – The world-class security of IBM z Systems, when combined with the hyper-secure community-driven innovations within MariaDB, provide unparalleled protection against a growing IT threat environment.

Extensibility – Joining with IBM on z Systems, MariaDB is now able to offer customers the only open source enterprise database platform that runs from x86 and Power to Mainframe and Cloud.

Consolidation – The ability to scale MariaDB with IBM’s industry-leading mainframe virtualization technologies means clients can easily consolidate clusters with many servers on one mainframe. MariaDB on IBM z Systems can host more servers per core than any other system, with high-speed encryption, disaster recovery, and continuous availability capabilities, thereby also saving electricity, cooling and management costs.

Future-Proofing Your Database Investment –The customer’s database investment whether on premises, SaaS or Cloud is protected. Moreover, a true open source development model means that the latest innovations can always be incorporated. MariaDB will support the customer’s needs no matter the direction they choose, all without vendor lock-in.

"As the ONE default database platform for leading Linux distributors, including Red Hat and SUSE, MariaDB is excited to support IBM LinuxONE,” stated Patrik Sallner, CEO of MariaDB. “With Linux on IBM z Systems growing at twice the rate of the Linux market overall, there is clear customer demand for open source solutions on IBM’s highly scalable and secure platform. These qualities align perfectly with MariaDB’s true open source model, which leverages Community innovations to provide best-in-class scalability, improved performance, and security, for on-premise, hybrid and cloud applications.”

MariaDB Enterprise on IBM z Systems (LinuxONE) will be available later this calendar year.

MariaDB Community Server on z Systems

The MariaDB community version runs out of the box on Linux on z Systems. You can install MariaDB 5.x through Yum on RHEL or Zypper on SLES. See the installation guide for MariaDB 10.x (https://github.com/linux-on-ibm-z/docs/wiki/Building-MariaDB-10.0). MariaDB Inc. and IBM are working together to bring MariaDB Enterprise to z Systems.

For more information, go to https://mariadb.com

For more about the IBM partnership, go to http://www-03.ibm.com/press/us/en/pressrelease/47474.wss

1. Gartner, The State of Open-Source RDBMSs, 2015, Donald Feinberg, Merv Adrian, April 21, 2015

About the Author


Dion Cornett is VP Global Sales at MariaDB and was previously VP Sales Strategy at Red Hat.



Manage your Amazon Aurora databases with SQLyog


We are elated to announce the availability of SQLyog for Amazon Aurora Databases. You can now manage Amazon Aurora databases with SQLyog, the most powerful database manager, admin and GUI tool.

About Amazon Aurora:
Amazon Aurora is a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

SQLyog + Amazon Aurora:
It is very easy to connect to an Aurora database using SQLyog: just enter the DNS address of the Aurora instance as the host parameter and the port number from the endpoint as the port parameter, then key in your credentials. You can manage your Aurora databases with powerful tools like Synchronisation, Import External Data, Backup Wizard, Job Scheduler etc.

Not only these, you get all the top features of SQLyog to work with Amazon Aurora databases.

Download a free trial of SQLyog and start managing your Aurora databases now!

If you have any feedback or queries, drop us an email or leave a comment below.

The post Manage your Amazon Aurora databases with SQLyog appeared first on Webyog Blog.



InnoDB Transparent Page Compression


Astute readers will note that InnoDB has had compression since the MySQL 5.1 plugin. We are using the terminology of ‘Page Compression’ to describe the new offering that will ship with MySQL 5.7, and ‘InnoDB Compression’ for the earlier offering.

First a brief history and explanation of the current motivation.

InnoDB Compression

Compressed tables were introduced in 2008 as part of the InnoDB plugin for MySQL 5.1. By adding compression, InnoDB is able to more efficiently make use of modern hardware such as fast SSD devices. Facebook was an early adopter of this technology. They did extensive analysis and made lots of improvements, most of which were later pushed to the upstream MySQL code as well.

The code changes required to get the old InnoDB compression to work properly were extensive and complex. Its tentacles are everywhere—I think that just about every module inside InnoDB has had modifications done to it in order to make it work properly when compression is in use. This complexity has its challenges, both in terms of maintainability and when it comes to improving the feature. We have been debating internally about what we should do about this over the long run. As much as we would like to redesign and rewrite the entire old compression code, it is impractical. Therefore we are only fixing issues in the old compression code reported by customers. We think there are better ways to solve the underlying performance and compression problems around B-Trees. For example by adding support for other types of indexes e.g. LSM tree and/or BW-Tree or some variation of the two. We also want to change our page and row formats so that we have the flexibility to add new indexing algorithms as required by the changing landscape and at the same time maintaining full backward compatibility. As we continue to develop and improve InnoDB, we also think that the write amplification and compression efficacy problems need to be fixed more substantially rather than tinkering around the edges of our old compression codebase.

Introducing InnoDB Page Compression

When FusionIO (now part of SanDisk) floated the idea around transparent page compression, the appeal was in its simplicity—the changes to the code base were very localised and more importantly it complements the existing InnoDB compression as both transparent page compression and the old compression can coexist in the same server instance. Users can then choose the compression scheme that makes more sense for their use case, even on a table-by-table basis.

From a high level, transparent page compression is a simple page transformation:

Write : Page -> Transform -> Write transformed page to disk -> Punch hole

Read  : Page from disk -> Transform -> Original Page

When we put these abstractions in place, it was immediately obvious that we could apply any type of transformation to the page including encryption/decryption and other things as we move forward. It is then trivial to add support for a new compression algorithm, among other things. Also, MySQL 5.7 already had the multiple dedicated dirty page flushing threads feature. This existing 5.7 feature was a natural fit for offloading the “transformation” to a dedicated background thread before writing the page to disk thus parallelizing the compression and  the “hole punch” after the write to disk. By contrast, with the old InnoDB compression the compress/decompress/recompress operations are done in the query threads (mostly), and it would take a mini-series of blog posts to explain how it works, even when assuming that you’re already very familiar with the InnoDB internals.

To use the new Transparent Page Compression feature the operating system and file system must support sparse files and hole punching.

Linux

Several popular Linux file systems already support the hole punching feature. For example: XFS since Linux 2.6.38, ext4 since Linux 3.0, tmpfs (/dev/shm) since Linux 3.5, and Btrfs since Linux 3.7.

Windows

While this feature is supported on Windows, it may not provide much practical value “out of the box”. The issue is in the way NTFS clustering is designed.

On Windows the block size is not used as the allocation unit for sparse files. The underlying infrastructure for sparse files on NTFS is based on NTFS compression. The hole punch is done on a “compression unit” and this compression unit is derived from the cluster size (see the table below). This means that by default you cannot punch a hole if the cluster size >= 8K. Here’s a breakdown for smaller cluster sizes:

Cluster Size    Compression Unit
512 Bytes       8K Bytes
1K Bytes        16K Bytes
2K Bytes        32K Bytes
4K Bytes        64K Bytes

The default NTFS cluster size is 4K bytes, which means that the compression unit size is 64K bytes. Therefore, unless the user has created the file system with a smaller cluster size and used larger InnoDB page sizes, there is little practical benefit from transparent compression “out of the box”.

Using the Feature

In the current release we support two compression algorithms, ZLib and LZ4. We introduced a new table level attribute that is stored in the .frm file. The new attribute is:

COMPRESSION := ( "zlib" | "lz4" | "none")

If you create a new table and you want to enable transparent page compression on it you can use the following syntax:

CREATE TABLE T (C INT) COMPRESSION="lz4";

If later you want to disable the compression you can do that with:

ALTER TABLE T COMPRESSION="none";

Note that the above ALTER TABLE will only result in a meta-data change. This is because inside InnoDB, COMPRESSION is a page level attribute. The implication is that a tablespace can contain pages that use a mix of any of the supported compression algorithms. If you want to force the conversion of every page, then you need to do so by invoking:

OPTIMIZE TABLE T;

 

Tuning

Remember to set innodb_page_cleaners to a suitable value; in my tests I've used a value of 32. Configuring for reads is not that straightforward. InnoDB can do both sync and async reads. Sync reads are done by the query threads. Async reads are done during read ahead; one downside with read ahead is that even if the page is not used it will be decompressed. Since the async read and decompress is done by the background IO read threads, it should not be too much of an issue in practice. Sync read threads will use the page, otherwise they wouldn't be doing a sync read in the first place. One current problem with converting a sync read to an async read is the InnoDB IO completion logic. If there are too many async read requests, it can cause a lot of contention on the buffer pool and page latches.

Monitoring

With hole punching the file size shown by ls -l displays the logical file size and not the actual allocated size on the block device. This is a generic issue with sparse files. Users can now query the INNODB_SYS_TABLESPACES information schema table in order to get both the logical and the actual allocated size.

The following additional columns have been added to that Information Schema view: FS_BLOCK_SIZE, FILE_SIZE, ALLOCATED_SIZE, and COMPRESSION.

  • FS_BLOCK_SIZE is the file system block size
  • FILE_SIZE is the file logical size, the one that you see with ls -l
  • ALLOCATED_SIZE is the actual allocated size on the block device where the filesystem resides
  • COMPRESSION is the current compression algorithm setting, if any

Note: As mentioned earlier, the COMPRESSION value is the current tablespace setting and it doesn’t guarantee that all pages currently in the tablespace have that format.

Here are some examples:

mysql> select * from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------+------------------------+------+-------------+----------------------+-----------+---------------+------------+---------------+-------------+----------------+-------------+
| SPACE | NAME                   | FLAG | FILE_FORMAT | ROW_FORMAT           | PAGE_SIZE | ZIP_PAGE_SIZE | SPACE_TYPE | FS_BLOCK_SIZE | FILE_SIZE   | ALLOCATED_SIZE | COMPRESSION |
+-------+------------------------+------+-------------+----------------------+-----------+---------------+------------+---------------+-------------+----------------+-------------+
|    23 | linkdb/linktable#P#p0  |    0 | Antelope    | Compact or Redundant |     16384 |             0 | Single     |           512 |  4861198336 |     2376154112 | LZ4         |

...

mysql> select name, ((file_size-allocated_size)*100)/file_size as compressed_pct from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+------------------------+----------------+
| name                   | compressed_pct |
+------------------------+----------------+
| linkdb/linktable#P#p0  |        51.1323 |
| linkdb/linktable#P#p1  |        51.1794 |
| linkdb/linktable#P#p2  |        51.5254 |
| linkdb/linktable#P#p3  |        50.9341 |
| linkdb/linktable#P#p4  |        51.6542 |
| linkdb/linktable#P#p5  |        51.2027 |
| linkdb/linktable#P#p6  |        51.3837 |
| linkdb/linktable#P#p7  |        51.6309 |
| linkdb/linktable#P#p8  |        51.8193 |
| linkdb/linktable#P#p9  |        50.6776 |
| linkdb/linktable#P#p10 |        51.2959 |
| linkdb/linktable#P#p11 |        51.7169 |
| linkdb/linktable#P#p12 |        51.0571 |
| linkdb/linktable#P#p13 |        51.4743 |
| linkdb/linktable#P#p14 |        51.4895 |
| linkdb/linktable#P#p15 |        51.2749 |
| linkdb/counttable      |        50.1664 |
| linkdb/nodetable       |        31.2724 |
+------------------------+----------------+
18 rows in set (0.00 sec)

 

Limitations

Nothing is perfect. There are always potential issues and it is best to be aware of them up front.

Usage

Currently you cannot compress shared system tablespaces (for example the system tablespace and the UNDO tablespaces) nor general tablespaces, and it will silently ignore pages that belong to spatial (R-Tree) indexes. The first two are more like artificial limitations from an InnoDB perspective. Inside InnoDB it doesn’t make any difference whether it is a shared tablespace or an UNDO log. It is a page level attribute. The problem is that we store the compression table attributes in a .frm file, and the .frm file infrastructure doesn’t know about InnoDB tablespaces. The spatial index (R-Tree) limitation is because both features use the same 8 bytes within the page header. The first is a limitation due to MySQL legacy issues and difficult to fix. The latter is probably fixable by changing the format of the compressed page for R-Tree indexes, but we decided not to introduce two formats for the same feature. Many of these limitations will go away with the new Data Dictionary and related work.

Space savings

The granularity of the block size on many file systems is 4K by default. This means that even if the data in the InnoDB data block compresses down from 16K to, say, 1K, we cannot punch a hole and release the full 15K of empty space, but only 12K. By contrast, for file systems that use a much smaller default size of 512 bytes, this works extremely well (e.g. DirectFS/NVMFS). There is a trend on the hardware side to move from 512 byte sectors to larger sectors, e.g. 4K, as larger sector sizes usually result in faster writes provided the writes are the same size as the sector size. FusionIO devices provide the ability to set this size during the formatting phase. Perhaps other vendors will follow and allow this flexibility too. Then one can decide on the size vs speed trade-off during capacity planning.

Copying files from one system to another

If you copy files on a host that has hole punching support and the files are compressed using the transparent page compression feature, then on the destination host the files could “expand” and fill the holes. Special care must be taken when copying such files. (rsync generally has good support for copying sparse files, while scp does not.)

ALTER TABLE/OPTIMIZE TABLE

If you use the ALTER TABLE syntax to change the compression algorithm of a tablespace or you disable transparent page compression, this will only result in a meta-data change. The actual pages will be modified when they are next written. To force a change to all of the pages, you can use OPTIMIZE TABLE.

File system fragmentation

The file system can get fragmented due to hole punching releasing blocks back to the file system free list. This has two ramifications:

  1. Sequential scans can actually end up as random IO. This is particularly problematic with HDDs.
  2. It may increase the FS free list management overhead. In our (shorter) tests using XFS I haven’t seen any major problem, but one should do an IO intensive test over the course of several months to be sure.

Implementation Details

We repurposed 8 bytes in the InnoDB page header, which exists on every page in every tablespace file. The repurposed field is the “FLUSH_LSN” (see definition of the page header as a C struct below). This field, prior to the MySQL 5.7 R-Tree implementation being added, was only valid for the system tablespace’s (ID: 0) first page (ID: 0).  For all other pages it was unused.

typedef uint64_t lsn_t;

struct Page _PACKED_ {
    
    struct Header _PACKED_ {
       uint32_t        checksum;       /* page checksum */
       uint32_t        offset;         /* page number within the tablespace */
       uint32_t        prev;           /* previous page in the same list */
       uint32_t        next;           /* next page in the same list */
       lsn_t           lsn;            /* LSN of the last modification to the page */
       uint16_t        page_type;      /* e.g. index, undo log or BLOB page */

       /* Originally only valid for space 0 and page 0. In 5.7 it is now
       used by the R-Tree and the transparent page compression to store
       their relevant meta-data */
       lsn_t           flush_lsn;
       
       uint32_t        space;
    };

    /* PAGE_SIZE is one of: 1K, 2K, 4K, 8K, 16K, 32K and 64K */
    byte               data[PAGE_SIZE - (sizeof(Header) + sizeof(Footer))];

    struct Footer _PACKED_ {
       uint32_t        checksum;

       /* Low order bits from Header::lsn */
       uint32_t        lsn_low;
    };
};

flush_lsn is interpreted as meta_t below. Before writing the page to disk we compress the page and set the various fields in the meta_t structure for later deserialization when we read the page from disk.

/** Compressed page meta-data */
struct meta_t _PACKED_ {
    /** Version number */
    uint8_t         m_version;

    /** Algorithm type */ 
    uint8_t         m_algorithm;

    /** Original page type */
    uint16_t        m_original_type;

    /** Original page size, before compression */
    uint16_t        m_original_size;
 
    /** Size after compression */
    uint16_t        m_compressed_size;
};

Readers will notice that the size type is a uint16_t. This limits the page size to 64K bytes. This is a generic limitation in the current InnoDB page layout, not a limitation introduced  by this new feature. Issues like this are the reason why we want to rethink our long term objectives and fix them by doing more structural changes.

Performance

The first test uses FusionIO hardware along with their NVMFS file system. When I ran my first set of tests using the new compression last week, the results were a little underwhelming, to say the least. I hadn't run any performance tests for over a year and a half, and a lot has changed since then. The numbers were not very good: the compression ratio was roughly the same as with the older compression, and the Linkbench requests per second were only a little better than with the old compression, nothing compelling. I think it could be due to NUMA issues in the NVMFS driver, as the specific hardware that I tested on is one of our newer machines with 4 sockets.

I did some profiling using Linux Perf but nothing stood out; the biggest CPU hog was Linkbench itself. I played around with the Java garbage collector settings, but it made no difference. Then I started looking for any latching overheads that may have been introduced in the final version. I noticed that the hole punching call was invoked under the protection of the AIO mutex. After moving the code out of that mutex, there was only a very small improvement. After a bit of head scratching, I tried to decompose the problem into smaller pieces. We don't really do much in this feature: we compress the page, write the page to disk and then punch a hole to release the unused disk blocks. Next I disabled the hole punching stage and the requests per second went up 500%, which was encouraging. Were we making too many hole punching calls? It turns out we were. We only need to punch a hole if the compressed length is less than the previous length on disk. With this fix (which is not in the current 5.7.8 RC) the numbers were again looking much better.

One observation that I would like to make is that the compression/decompression doesn't seem to be that expensive, about 15% overhead. The hole punching seems to have a bigger impact on the numbers, approximately 25% on NVMFS.

The host specs are:

RAM:                   1 TB 

Disk1:                 FusionIO ioMemory SX300-1600 + NVMFS

Disk2:                 2 x Intel SSDSA2BZ10 in a RAID configuration + EXT4

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Stepping:              2
CPU MHz:               2394.007
BogoMIPS:              4787.84
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-9,40-49
NUMA node1 CPU(s):     10-19,50-59
NUMA node2 CPU(s):     20-29,60-69
NUMA node3 CPU(s):     30-39,70-79

Below are the important my.cnf settings from my tests. I won’t claim that this is a highly tuned configuration, it’s just what I used. As long as it is the same for all three tests, I don’t think it should matter much:

loose_innodb_buffer_pool_size = 50G
loose_innodb_buffer_pool_instances = 16
loose_innodb_log_file_size = 4G
loose_innodb_log_buffer_size = 128M
loose_innodb_flush_log_at_trx_commit = 2
loose_innodb_max_dirty_pages_pct = 50
loose_innodb_lru_scan_depth = 2500
loose_innodb_page_cleaners = 32
loose_innodb_checksums = 0
loose_innodb_flush_neighbors = 0
loose_innodb_read_io_threads = 8
innodb_write_io_threads = 8
loose_innodb_thread_concurrency = 0
loose_innodb_io_capacity = 80000
loose_innodb_io_capacity_max=100000

I then ran three tests (illustrative table definitions are sketched after the list):

  1. No compression with the default page size of 16K.
  2. Old compression with an 8K page size.
  3. New compression with default page size of 16K and COMPRESSION="lz4".
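
For illustration only, these three configurations roughly correspond to table definitions along the following lines (the table and column names are hypothetical, not the actual Linkbench schema):

-- 1. No compression, default 16K page size
CREATE TABLE t_plain (id BIGINT PRIMARY KEY, payload VARCHAR(255))
    ENGINE=InnoDB;

-- 2. Old compression with an 8K compressed block size
CREATE TABLE t_old_compressed (id BIGINT PRIMARY KEY, payload VARCHAR(255))
    ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- 3. New transparent page compression with LZ4, default 16K page size
CREATE TABLE t_page_compressed (id BIGINT PRIMARY KEY, payload VARCHAR(255))
    ENGINE=InnoDB COMPRESSION="lz4";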

The Linkbench commands used to load and test were:

linkbench -D dbid=linkdb -D maxid1=100000001 -c config/MyConfig.properties -l

linkbench -D requesters=64 -Dmaxid1=100000001 -c config/MyConfig.properties -D requests=500000000 -D maxtime=3600 -r


LinkBench Load Times

The loading times when using the new compression were quite similar to the loading times without any compression, while the old compression was quite slow by comparison.

The first test was on NVMFS:

[Chart: LinkBench Load Time in Seconds (NVMFS)]

The next test was on a slower Intel SSD using EXT4:
[Chart: LinkBench Load Time in Seconds (EXT4)]

LinkBench Request Times

The old compression gave the worst result again; it is about 3x slower than the new compression. I also think that since we haven't even really begun to optimise the new compression code yet, we can squeeze some more juice out of it.

This is on NVMFS:

[Chart: LinkBench Requests Per Second (NVMFS)]

This is on a slower Intel SSD using EXT4:

[Chart: LinkBench Requests Per Second (EXT4)]

I was quite surprised with this result and it needs more analysis. The new transparent compression actually performed better than with no compression. I was not expecting this :-). Perhaps the new compression gives better numbers because it writes less data to the disk compared to the uncompressed table. I don't know for sure yet.

Space Savings

So it performs better, but does it save on disk space? The old compression performs a little better here, but only by about 1.5%.

This is using NVMFS:

[Chart: LinkBench Data Files On Disk Size (NVMFS)]

This is with the Intel SSD using EXT4:

[Chart: LinkBench Data Files On Disk Size (EXT4)]

The old compression saves more space in our second test. This is likely because the block size on the EXT4 file system was set at the default of 4K. Next I want to check whether using a smaller block size helps and provides better requests-per-second numbers.
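
One way to see how much space hole punching actually gives back is to compare each tablespace file's apparent size with its allocated size. A minimal sketch, assuming MySQL 5.7.8 or later where INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES exposes these columns ('linkdb' is the Linkbench database used in the load command above):

SELECT NAME, FS_BLOCK_SIZE, FILE_SIZE, ALLOCATED_SIZE
  FROM INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES
 WHERE NAME LIKE 'linkdb/%';

FILE_SIZE is the apparent size of the .ibd file, while ALLOCATED_SIZE is what is actually allocated on disk, so the difference is roughly the space released by hole punching.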

Conclusion

This feature adds more choice for our users. On the right hardware and file system combination you can get up to a 300% performance boost and roughly the same space savings as with the older compression.

Please let us know what you think of this new feature. We’d love to hear your feedback! If you encounter any problems with this new feature, please let us know here in the comments, open a bug report at bugs.mysql.com, or open a support ticket.

Thank you for using MySQL!

 



MariaDB Galera Cluster 10.0.21 and 5.5.45 now available


The MariaDB project is pleased to announce the immediate availability of MariaDB Galera Cluster 10.0.21 and 5.5.45. These are both Stable (GA) releases.

Download MariaDB Galera Cluster 10.0.21

Release Notes | Changelog | What is MariaDB Galera Cluster?

 

Download MariaDB Galera Cluster 5.5.45

Release Notes | Changelog | What is MariaDB Galera Cluster?

MariaDB APT and YUM Repository Configuration Generator

See the Release Notes and Changelogs for detailed information on these releases.

Thanks, and enjoy MariaDB!




MySQL checksum
