Wednesday, August 27, 2014

Cloud, Mobile, Big Data and Other Buzzwords - This Too Shall Pass ? Not Really, They are Here, and They are Real

The IT industry's constant drumbeat of Cloud, Mobile and Big Data can sound abstract and far removed from reality to many seasoned, hardened and perhaps cynical IT professionals who have lived through many overhyped non-events ( the CASE revolution, the Y2K scare, OODBMS, Grid Computing, the XYZ Methodology that will change everything ... ).
Yet I think IT pros are well advised not to overlook the latest developments in the Big Data, Cloud and Mobile arenas, as something substantially new is happening.
Here is another example of how converging Cloud, Mobile and Big Data technologies can provide new business insights and increase profits: applications that keep track of shopper behavior inside physical stores. Vendors like RetailNext or Euclid Analytics provide applications that gather location data from your smartphone and other sources ( video ) the moment you enter the store, process the data ( possibly combined with data from other sources like POS, loyalty and payment cards ) in the Cloud ( AWS, for example ) and deliver analytic reports, dashboards and mobile applications. Possible improvements come from staffing, floor plan optimization, improved marketing and customer experience, etc.
This type of application is perfectly suited for the Cloud, as the data lives 100% in the Cloud. It originates in the Cloud ( smartphone, physical stores ), must be transferred to a Cloud data processing service, and the results are delivered to Cloud/Mobile consumers. It is the brick-and-mortar store version of what e-commerce sites like Amazon do to maximize customer experience and profits. It is potentially another milestone event in the Big Data world, since most shopping ( over 90% ) is still traditional, 'walk-in' retail. What we today classify as Big Data computing originated in, and is still somewhat limited to, a relatively small number of purely Internet companies like eBay and Amazon ( in the retail space ), or Google, Facebook and Yahoo ( in the search or web application space ). In-store analytics will change this, as all major retailers will have to embrace it.
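
To make the in-store analytics idea more concrete, here is a minimal Python sketch of one metric such an application might compute: dwell time per store zone, derived from smartphone location pings. The ping format, the zone names and the `dwell_seconds_by_zone` function are illustrative assumptions, not the schema or API of RetailNext, Euclid Analytics or any other vendor.

```python
from collections import defaultdict

# Hypothetical location pings: (device_id, zone, timestamp_seconds).
# Field layout and zone names are assumptions made for this example.
pings = [
    ("dev1", "entrance", 0),
    ("dev1", "shoes", 60),
    ("dev1", "shoes", 300),
    ("dev1", "checkout", 420),
    ("dev2", "entrance", 30),
    ("dev2", "checkout", 90),
]


def dwell_seconds_by_zone(pings):
    """Sum, per zone, the time between each ping and the same device's next ping."""
    by_device = defaultdict(list)
    for device, zone, ts in pings:
        by_device[device].append((ts, zone))

    totals = defaultdict(int)
    for visits in by_device.values():
        visits.sort()
        # Time in a zone is attributed up to the device's next ping;
        # the final ping of a device contributes no dwell time.
        for (ts, zone), (next_ts, _) in zip(visits, visits[1:]):
            totals[zone] += next_ts - ts
    return dict(totals)


report = dwell_seconds_by_zone(pings)
print(report)
```

A real deployment would likely join this kind of aggregate with POS or loyalty data before feeding dashboards, but the core step is the same: turn raw pings into per-zone measures that staffing and floor plan decisions can use.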

Monday, August 11, 2014

The Ultimate Database: Mesa - Google's New, Near Real-Time Geo-Replicated Data Warehouse

Each paper coming from Google is guaranteed to create a stir and discussions, as well as a slew of startup and open source imitation attempts. It is thus important to keep track of what is going on at Google, as some of the ideas presented there might one day become mainstream ( like Hadoop did ).

Google will soon present a paper on Mesa, a proprietary Google software product that tries to resolve some performance limitations of F1 and Spanner ( Google's previous attempts at geo-replicated databases ).

According to the paper:
Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. 

F1/Spanner were impressive already, as they presented a large scale, globally distributed database. In contrast, current enterprise software attempts to provide globally distributed databases usually involve messy replication software like Oracle GoldenGate, custom written conflict resolution routines, etc. ( it is difficult to keep databases in sync, especially in a multi-node, active-active configuration ).

In Mesa, updates are applied in relatively frequent, regular batches ( hence the near real-time designation ), which significantly increases update throughput. ACID properties are supported at petabyte scale, on an elastic, commodity hardware base.
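
The batching idea can be illustrated with a toy Python sketch. This is not Mesa's implementation; the class `BatchedAggregateTable`, its methods and the key name are hypothetical. It only shows the general pattern of buffering updates, pre-aggregating them within a batch, and making the whole batch visible to queries in one step.

```python
from collections import defaultdict


class BatchedAggregateTable:
    """Toy table: updates are buffered, then applied as one versioned batch."""

    def __init__(self):
        self.version = 0
        self.committed = {}               # key -> aggregate visible to queries
        self.pending = defaultdict(int)   # key -> delta accumulated in this batch

    def update(self, key, delta):
        # Pre-aggregate inside the pending batch; committed data is untouched.
        self.pending[key] += delta

    def commit_batch(self):
        # Build the next version off to the side, then swap it in,
        # so a query sees either the whole batch or none of it.
        new_state = dict(self.committed)
        for key, delta in self.pending.items():
            new_state[key] = new_state.get(key, 0) + delta
        self.committed = new_state
        self.pending = defaultdict(int)
        self.version += 1
        return self.version

    def query(self, key):
        return self.committed.get(key, 0)


table = BatchedAggregateTable()
table.update("clicks:page_a", 3)
table.update("clicks:page_a", 2)
before = table.query("clicks:page_a")   # batch not committed yet
table.commit_batch()
after = table.query("clicks:page_a")
print(before, after, table.version)
```

Applying many row updates as one batch trades a little freshness for throughput, which is the trade-off behind the near real-time label.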

The Mesa paper also implicitly states that MapReduce is still used at Google ( contrary to what Google's SVP Urs Hölzle said just recently, at the Google I/O 2014 conference ). Mesa uses MR for parallel batch updates and for pre-aggregating data ( not for direct queries ).

Sunday, August 03, 2014

Big Data Killer App - Usage Based Insurance

Larry Ellison compares the IT and fashion industries, saying that the IT industry's most frequent product is a bunch of synonyms ( the fashion industry often just slaps new names on old products, and so does IT ).
Many IT buzzwords have fizzled, so nobody should be surprised when some good ideas and concepts are doubted. Big Data is definitely one of them, and although some industry analysts think Big Data is going down and out, hardly a day goes by without a related article in the mainstream business press and IT publications.

Perhaps the best way to describe Big Data is by example. I chose the usage based car insurance revolution, which is partially enabled by Big Data. It is now possible to effectively collect and analyze actual driving data and adjust the insurance premium accordingly. Usage Based Insurance ( UBI ) is a relatively old concept dating back to the early 90s, but more recent technological developments have made it a mandatory part of insurer activities. An example of a simple UBI implementation is Progressive Insurance's Snapshot product, which collects basic driving information ( hard braking, distance driven ) via a dongle attached to the car's onboard diagnostics port and sends it to Progressive, so it can more accurately determine the driver's risk profile.

Why is UBI a Big Data application ? The amount of data generated can be staggering - even basic information like time of day, car location, acceleration and speed can easily amount to 10 MB per customer per year, which equals 1 TB of data for just 100,000 customers. This data also arrives at a high rate ( velocity in the Big Data definition ), and advanced ( predictive ) analytics is needed to accurately determine the driver's risk profile. It is also possible to analyze the driver's data in real time, compare current speed with maps containing speed limits and provide immediate feedback to the driver. UBI mandates a rethinking of architecture and tools to be able to effectively collect and analyze driving data.
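
The volume estimate above is simple arithmetic, and the speed-limit feedback is a simple comparison. The short Python sketch below works through both. The two input figures come from the text; the decimal unit conversion and the `speed_feedback` function, with its km/h units and message format, are assumptions made for illustration.

```python
MB_PER_CUSTOMER_PER_YEAR = 10   # figure quoted in the post
CUSTOMERS = 100_000             # figure quoted in the post
MB_PER_TB = 1_000_000           # assumption: decimal units, 1 TB = 10^6 MB

total_mb = MB_PER_CUSTOMER_PER_YEAR * CUSTOMERS
total_tb = total_mb / MB_PER_TB
print(f"{total_mb:,} MB per year = {total_tb} TB per year")


def speed_feedback(speed_kmh, limit_kmh):
    """Hypothetical real-time check of current speed against a mapped limit."""
    if speed_kmh > limit_kmh:
        return f"Over limit by {speed_kmh - limit_kmh} km/h"
    return "OK"


print(speed_feedback(72, 60))
print(speed_feedback(58, 60))
```

The per-customer figure looks small, but it multiplies quickly: an insurer with a few million policyholders would be in the tens of terabytes per year from basic telemetry alone.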

Friday, August 01, 2014

US Visa System Database Crash and Performance Problems - Too Big to Handle for Oracle Database

These days we are witnessing performance and access problems with the US visa system. The culprit is the Consular Consolidated Database - a data warehouse deployed as an Oracle database on the Windows operating system. Current database size is over 100TB ( allegedly one of the largest Oracle databases in the world ), growing by a double-digit percentage per year, storing text and photos, with complex feeds and lookups into external and internal databases. The reason for the crash and the latest performance problems is a patch that was itself applied to resolve performance problems.

We think that there are multiple reasons this problem popped up:
- database size is clearly beyond what Oracle database software can comfortably handle
- mixed data types ( photos ) further exacerbate the problem, as RDBMSs in general, and Oracle in particular, have a problem storing data beyond straightforward text and numbers
- inadequate hardware and operating system ( Wintel )

Possible origins of the problem are: bad analysis of business requirements, bad capacity planning, wrong architecture choice. Probably none of these steps was done correctly.

The Oracle database clearly has a problem scaling out beyond a certain size ( a couple of dozen TB ), and we are doubtful that even the latest software and hardware currently available from Oracle ( Oracle 12c In-Memory with Exadata on M6-32 Database Machine ) would provide adequate performance for future use on this database. Nevertheless, the State Department chose Oracle again for the upgrade ( can't get fired for buying Oracle ? ... that might change, even in government ).

An architecture better fitted for this purpose is one of the many MPP databases available on the market that can comfortably handle large databases ( starting with the venerable Teradata and on to SAP HANA, etc. ), with photos stored outside the database in the file system.
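
The "photos outside the database" part of that suggestion can be sketched in a few lines of Python. This is only an illustration of the pattern: SQLite stands in for whatever database would actually be used, and the `applicant_photo` table, its columns and the helper functions are hypothetical. The database keeps a small row with a file path and checksum, while the image bytes live in the file system.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_photo(conn, photo_dir, applicant_id, photo_bytes):
    """Write the photo to the file system and record only its path and hash."""
    digest = hashlib.sha256(photo_bytes).hexdigest()
    path = Path(photo_dir) / f"{digest}.jpg"
    path.write_bytes(photo_bytes)
    conn.execute(
        "INSERT INTO applicant_photo (applicant_id, photo_path, sha256) "
        "VALUES (?, ?, ?)",
        (applicant_id, str(path), digest),
    )
    return path


def load_photo(conn, applicant_id):
    """Look up the path in the database, then read the bytes from disk."""
    row = conn.execute(
        "SELECT photo_path FROM applicant_photo WHERE applicant_id = ?",
        (applicant_id,),
    ).fetchone()
    return Path(row[0]).read_bytes() if row else None


photo_dir = tempfile.mkdtemp()        # stand-in for a real photo store
conn = sqlite3.connect(":memory:")    # stand-in for the real database
conn.execute(
    "CREATE TABLE applicant_photo ("
    "applicant_id INTEGER PRIMARY KEY, photo_path TEXT, sha256 TEXT)"
)
store_photo(conn, photo_dir, 1, b"fake-jpeg-bytes")
roundtrip = load_photo(conn, 1)
print(roundtrip == b"fake-jpeg-bytes")
```

Keeping only paths and checksums in the database keeps rows small and scans fast, at the cost of having to back up and secure the file store separately.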

Update ( 6/19/2016 - a year later ): another multi-day, worldwide outage ( still ongoing as of this day ). It looks like it is blamed on hardware this time around. Whatever the root cause is, the fact remains that the Oracle based US visa system is brittle, unstable and not in a functional HA configuration.