NoSQL databases are frequently used for building analytic solutions. The big picture is to use some scalable client code with map-reduce to distribute a full data scan.
This approach is nothing new to the RDBMS world and can be considered an extension of Kimball's ROLAP dimensional modeling, simply allowing more CPU power to be thrown at a single query.
NoSQL and sharding take advantage of:
- Distributed processing
- Fast memory communication
- Per-column data type optimization and reduced deserialization cost (NoSQL)
- Processing in C, where the reduce step would otherwise run in a slower language (NoSQL)
Our existing tools in MySQL for vertical scalability include (see the example after the list):
- MyISAM for minimizing query latency
- MyISAM read-only packed tables to save space and I/O
- Deprecated MERGE tables for partitioning
- Native partitioning
- InnoDB transparent zlib compression
- InnoDB read-ahead and multiple I/O threads
- TokuDB storage engine, with predictable response time as tables grow
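As a quick illustration of the InnoDB compression item, here is a minimal sketch; the table name is made up, and compression uses zlib via the Barracuda file format of the InnoDB plugin (read-only MyISAM tables are packed offline with the myisampack utility):

```sql
-- Enable the Barracuda file format so ROW_FORMAT=COMPRESSED is available
-- (InnoDB plugin / XtraDB era settings).
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = 1;

-- Hypothetical archive fact table stored transparently compressed on 8 KB pages.
CREATE TABLE fact_archive (
  id BIGINT NOT NULL PRIMARY KEY,
  sale_day DATE NOT NULL,
  amount DECIMAL(12,2) NOT NULL
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```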
On the same front, MariaDB is closing the gap with the best commercial RDBMSs, introducing new algorithms based on query-plan optimization (see the snippet after the list for enabling some of them):
- Table elimination (MariaDB 5.1)
- Batched Key Access (MariaDB 5.3)
- Join buffering with outer joins
- Index Condition Pushdown
- Subquery optimizations (MariaDB 5.3)
- NULL-aware materialization
- Materialization-scan for grouping queries
- Predicate caching
- FROM subquery optimizations
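Most of these can be toggled per session. A minimal sketch, assuming MariaDB 5.3, where join_cache_level drives the join-buffering variants referenced in the benchmarks later in this post (flag names as documented for 5.3; defaults vary between releases):

```sql
-- Turn on ICP, MRR and the batched-key-access join cache variants.
SET SESSION optimizer_switch =
  'index_condition_pushdown=on,mrr=on,join_cache_bka=on,join_cache_hashed=on,join_cache_incremental=on';

-- 1 = flat Block Nested Loop only ... 8 = incremental hashed BKA (BKAH).
SET SESSION join_cache_level = 8;
```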
ROLAP, despite being the most common approach, faces some well-known limitations that OLAP and NoSQL try to address:
- A single query is bound to a single server, and for MySQL to a single core. This is good enough in most cases for marketing or management requirements that put limited load on the analytic solution and request data with some predefined time delay.
- It puts the technical team in front of the write-vs-read trade-off: adding more indexes speeds up some queries, but can hardly cover all of them and slows down writes and DDL.
- Over time the schema gets more and more tied to specific requirements, accumulating aggregates and datamarts and forcing the team to maintain a pile of asynchronous ETL code.
On the same track as NoSQL, a toolkit is emerging in MariaDB and Drizzle to overcome those limitations:
- Column-based OLAP storage engines
- ROLAP sharding solutions, with or without transparent map-reduce
What is the best today? Again, there is no rule of thumb, but I can share some findings from my day-to-day job.
The OLAP storage engines InfiniDB, Infobright, and LucidDB all provide very good results when processing a lot of records, depending on the underlying hardware and the type of queries. I tend to credit InfiniDB with getting the most out of the available resources on a single query, and Infobright with better response times on limited hardware.
Based on the top 10 queries of an existing client running an online DWH with 1 billion records on 8-core hardware (RAID 5, 4 × 15,000 rpm SAS, 32 GB RAM):
| | InnoDB | InfiniDB | InfoBright |
|---|---|---|---|
| Avg logical records/s across all query types | 1,598,682 | 12,098,102 | 4,527,652 |
| Minimum response time (s) per million logical records, best case on a 128M-row GROUP BY | 0.6 | 0.02 | 0.06 |
So far so good, OLAP is the big winner here!
Can NoSQL come close to OLAP at a rate of 500,000 gets/s? InfiniDB's best case of 0.02 s per million logical records works out to about 50 million records/s, so it would require 128 servers to come close to such response times, and 12 of them to sustain the average record-read rate. Hopefully NoSQL can use bulk data transfers covered by a distributed SQL layer.
On the other side, ROLAP sharding can put more memory caching behind a single query, introducing home-made logical partitioning that follows the business rules, such as keeping the hottest data on a single node and mixing huge aggregates with small data lookups. This approach already gave me very good results on a 20-billion-record DWH built last year, for a fraction of the price of a commercial DB.
Can ROLAP come close to OLAP, and how many servers would I need to achieve the same performance? According to the previous bench, it is 1 to 1 if we can make all the cores work together, but for the same latency it is more like 1 to 3 or 4. Can we verify this with micro-benchmarking?
Let's assume linear parallel processing on a ROLAP design built with some sharding.
With the Spider storage engine providing transparent horizontal sharding, we end up splitting the data into per-day shards, giving 2M-row fact-table partitions, and distributing the shards across multiple MariaDB instances. Spider makes it possible to do this and to reattach your data while reusing the same SQL statements.
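As a rough illustration, here is what such a per-day Spider layout could look like. Server names, table and column names are all assumptions, and the backend fact tables must already exist on each MariaDB instance:

```sql
-- Declare the backend MariaDB instances holding the shards
-- (host and credentials are illustrative).
CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
OPTIONS (HOST '10.0.0.1', DATABASE 'dwh', USER 'spider', PASSWORD 'secret', PORT 3306);
CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
OPTIONS (HOST '10.0.0.2', DATABASE 'dwh', USER 'spider', PASSWORD 'secret', PORT 3306);

-- Spider table: each per-day partition maps to a remote fact table,
-- so the same SQL statements transparently hit the right shard.
CREATE TABLE fact_sales (
  sale_day DATE NOT NULL,
  customer_id INT NOT NULL,
  amount DECIMAL(12,2) NOT NULL
) ENGINE=SPIDER
COMMENT='wrapper "mysql", table "fact_sales"'
PARTITION BY RANGE (TO_DAYS(sale_day)) (
  PARTITION p20110101 VALUES LESS THAN (TO_DAYS('2011-01-02')) COMMENT='srv "backend1"',
  PARTITION p20110102 VALUES LESS THAN (TO_DAYS('2011-01-03')) COMMENT='srv "backend2"'
);
```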
Let me now present some micro-benchmarks that discouraged me from a brute-force-only approach.
Mounting the partitions on a ramfs, let's capture the time, in seconds, for a single 2-million-row partition scan (a sample timing query follows the table):
| Engine | 2M-row scan time (s) |
|---|---|
| InnoDB | 1.3 |
| PBXT | 0.8 |
| MyISAM packed | 1.32 |
| MyISAM | 0.46 |
| Aria (32K blocks) | 0.86 |
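For reference, a minimal way to reproduce this kind of timing, assuming a shard table named fact_20110101 (illustrative name, same columns as the Spider sketch above) with its files on the ramfs mount:

```sql
-- Aggregate a column to force a real scan: COUNT(*) alone on MyISAM
-- would be answered from table metadata. SQL_NO_CACHE bypasses the
-- query cache so the measured time is the raw per-core scan cost.
SELECT SQL_NO_CACHE COUNT(*), SUM(amount) FROM fact_20110101;
```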
With linear scalability, 8 cores could scan 32M rows/s on MyISAM (2M rows in 0.46 s is about 4.3M rows/s per core, times 8 cores).
Given our previous query aggregating 128M rows in 2 s on InfiniDB, the same result here would have taken 4 seconds, two times slower than InfiniDB, and that abstracts away the network, disk access, and any per-shard consolidation time.
So what is the magic behind OLAP? Could it be per-column sharding?
Our Q10 query touches 3 fields out of the 12 in the given table. That opens the possibility of dividing the full-scan time by 4, at the cost of 3 extra primary-key scans and twice the CPU to join 2 million rows. This could possibly be done transparently with the excellent VP storage engine.
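For illustration, here is a hand-rolled version of the idea that the VP engine automates: the wide table is split into narrow per-column tables sharing the same primary key, and a query touching 3 fields reads only those tables and rejoins them. Table and column names are assumptions:

```sql
-- One narrow table per column group, all keyed on the same id.
CREATE TABLE fact_c1 (id INT NOT NULL PRIMARY KEY, col1 INT) ENGINE=MyISAM;
CREATE TABLE fact_c2 (id INT NOT NULL PRIMARY KEY, col2 INT) ENGINE=MyISAM;
CREATE TABLE fact_c3 (id INT NOT NULL PRIMARY KEY, col3 INT) ENGINE=MyISAM;

-- A Q10-style aggregate now scans 3 narrow tables instead of the
-- 12-column one, at the price of two 2M-row joins.
SELECT c1.col1, SUM(c2.col2)
FROM fact_c1 c1
JOIN fact_c2 c2 ON c2.id = c1.id
JOIN fact_c3 c3 ON c3.id = c1.id
WHERE c3.col3 = 42
GROUP BY c1.col1;
```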
Oops: the join on 2M records adds a minimum of 8 seconds per shard to a full scan. This POC was a big U-turn. Conclusion: per-column sharding plus brute force is not the solution in most cases, given the cost of the algorithms needed to rejoin the multiple pieces.
| Fetch of 2 INT columns over 2M rows (MyISAM) | Join cache level | Time to scan all records (s) |
|---|---|---|
| Full scan, 8-byte record size | 1 | 1.07 |
| Full scan, 108-byte record size | 1 | 1.87 |
| Full join, nested loop | 1 | 17.48 |
| Full join, BKAH | 8 | 15.43 |
| Full join, BKA | 5 | 15.3 |
| Full join, nested loop with covering index | 1 | 8.37 |
OLAP engines take clever advantage of per-shard metadata and pre-aggregated information. Any ROLAP-sharding or NoSQL solution should come to the same conclusion regarding latency and add hybrid mechanisms, such as additional metadata or pre-calculated tables, that limit the cost of brute force.
A great idea would be to build a transparent metadata layer, with MySQL Proxy or Gearman, that injects pre-calculated results per shard, executed asynchronously at insert time. I have no doubt that a map-reduce solution for building such asynchronous metadata is the way to go for ROLAP sharding with near-real-time statistics.
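A minimal sketch of what such per-shard metadata could look like, reusing the illustrative fact_sales table from the Spider sketch above; the summary table and the worker job fired after each insert batch are both assumptions:

```sql
-- Per-shard pre-aggregated statistics, maintained asynchronously.
CREATE TABLE shard_stats (
  shard_day DATE NOT NULL,
  customer_id INT NOT NULL,
  row_count BIGINT NOT NULL,
  amount_sum DECIMAL(18,2) NOT NULL,
  PRIMARY KEY (shard_day, customer_id)
) ENGINE=MyISAM;

-- The job a Gearman worker could replay for one shard after a batch of
-- inserts; queries then consult shard_stats before brute-forcing shards.
REPLACE INTO shard_stats
SELECT sale_day, customer_id, COUNT(*), SUM(amount)
FROM fact_sales
WHERE sale_day = '2011-01-01'
GROUP BY sale_day, customer_id;
```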
For those already convinced, I'd like to point out a not-so-well-known MySQL feature called WITH ROLLUP that helps when building such aggregates:
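GROUP BY ... WITH ROLLUP adds super-aggregate rows (with NULL in the rolled-up columns), so per-day and grand totals come out of a single pass. A short example against the same illustrative fact_sales table:

```sql
-- One row per (day, customer), plus a per-day subtotal row
-- (customer_id IS NULL) and a final grand-total row (both NULL).
SELECT sale_day, customer_id, SUM(amount) AS amount_sum
FROM fact_sales
GROUP BY sale_day, customer_id WITH ROLLUP;
```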