Monday, August 11, 2014

The Ultimate Database: Mesa - Google's New, Near Real-Time Geo-Replicated Data Warehouse

Each paper coming out of Google is all but guaranteed to create a stir, spark discussion, and inspire a slew of startup and open source imitations. It is thus important to keep track of what is going on at Google, as some of the ideas presented there may one day become mainstream (as MapReduce did, by way of Hadoop).

Google will soon present a paper on Mesa, a proprietary Google system that tries to address some of the performance limitations of F1 and Spanner (Google's previous attempts at geo-replicated databases).

According to the paper:
Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. 

F1 and Spanner were already impressive, as they presented a large-scale, globally distributed database. In contrast, current attempts to build globally distributed databases with enterprise database software usually involve messy replication products like Oracle GoldenGate, custom-written conflict resolution routines, and so on (it is difficult to keep databases in sync, especially in multi-node, active-active configurations).

In Mesa, updates are applied in relatively frequent, regular batches (hence the "near real-time" designation), which significantly increases update throughput. ACID properties are supported at petabyte scale, on an elastic base of commodity hardware.
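
To make the batching idea concrete, here is a minimal toy sketch in Python. It is not Mesa's actual design or API: the class name `BatchedStore`, its methods, and the choice of a simple sum as the aggregation are all assumptions made for illustration. Updates are buffered and pre-aggregated, then applied all at once as a new version, so readers only ever see fully committed batches.

```python
from collections import defaultdict


class BatchedStore:
    """Toy store (hypothetical, not Mesa's API): updates are buffered
    and applied atomically, one batch per version."""

    def __init__(self):
        self.version = 0                  # number of batches committed so far
        self.committed = {}               # key -> aggregated value visible to queries
        self.pending = defaultdict(int)   # buffered updates, pre-aggregated per key

    def update(self, key, delta):
        # Buffer the update; it is not visible to queries yet.
        self.pending[key] += delta

    def commit_batch(self):
        # Build the new state off to the side, then swap it in,
        # so a query never observes a half-applied batch.
        new_state = dict(self.committed)
        for key, delta in self.pending.items():
            new_state[key] = new_state.get(key, 0) + delta
        self.committed = new_state
        self.pending = defaultdict(int)
        self.version += 1
        return self.version

    def query(self, key):
        # Reads only see data from committed batches.
        return self.committed.get(key, 0)


store = BatchedStore()
store.update("ad-1", 3)
store.update("ad-1", 2)
before = store.query("ad-1")   # still 0: the batch has not been committed
store.commit_batch()
after = store.query("ad-1")    # 5: both updates became visible together
```

The trade-off in this sketch is that an update stays invisible until its batch is committed, in exchange for cheaper writes (many updates to the same key collapse into one) and reads that always see a consistent version.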

The Mesa paper also implicitly states that MapReduce is still used at Google (contrary to what Google's SVP Urs Hölzle said just recently, at the Google I/O 2014 conference). Mesa uses MapReduce for parallel batch updates and for pre-aggregating data (not for direct queries).
