Wednesday, August 27, 2014

Cloud, Mobile, Big Data and Other Buzzwords - This Too Shall Pass? Not Really, They Are Here, and They Are Real

The IT industry's constant drumbeat of Cloud, Mobile and Big Data can seem abstract and far removed from reality to many seasoned, hardened and perhaps cynical IT professionals who have lived through many overhyped non-events ( CASE revolution, Y2K scare, OODBMS, Grid Computing, XYZ Methodology that will change everything ... ).
Yet I think IT pros are well advised not to overlook the latest developments in the Big Data, Cloud and Mobile arenas, as something substantially new is happening.
Here is another example of how converging Cloud, Mobile and Big Data technologies can provide new business insights and increase profits: applications that keep track of shopper behavior inside physical stores. Vendors like RetailNext or Euclid Analytics provide applications that gather location data from your smartphone and other sources ( video ) the moment you enter the store, process the data ( possibly combined with data from other sources like POS, loyalty and payment cards ) in the Cloud ( AWS, for example ) and deliver analytic reports, dashboards and mobile applications. Possible improvements include staffing, floor plan optimization, better marketing and customer experience, etc.
This type of application is perfectly suited for the Cloud, as the data lives 100% in the Cloud. It originates in the Cloud ( smartphone, physical stores ), must be transferred to a Cloud data processing service, and the results are delivered to Cloud/Mobile consumers. It is the brick-and-mortar version of what e-commerce sites like Amazon do to maximize customer experience and profits. It is potentially another milestone in the Big Data world, since most shopping ( over 90% ) is still traditional, 'walk-in' retail. What we classify today as Big Data computing originated in, and is still somewhat limited to, a relatively small number of purely Internet companies like eBay and Amazon ( in the retail space ), or Google, Facebook and Yahoo ( in the search or web application space ). In-store analytics will change this, as all major retailers will have to embrace it.

Monday, August 11, 2014

The Ultimate Database: Mesa - Google's New, Near Real-Time Geo-Replicated Data Warehouse

Each paper coming out of Google is guaranteed to create a stir and discussion, as well as a slew of startup and open source imitation attempts. It is thus important to keep track of what is going on at Google, as some of the ideas presented there might one day become mainstream ( as the ideas behind Hadoop did ).

Google will soon present a paper on Mesa, a proprietary Google software product that tries to resolve some performance limitations of F1 and Spanner ( Google's previous attempts at geo-replicated databases ).

According to the paper:
Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. 

F1/Spanner were already impressive, as they presented a large scale, globally distributed database. In contrast, current attempts by enterprise database software to provide globally distributed databases usually involve messy replication software like Oracle GoldenGate, custom written conflict resolution routines etc. ( it is difficult to keep databases in sync, especially in a multi-node, active-active configuration ).

In Mesa, updates are applied in relatively frequent, regular batches ( hence the near real-time designation ), which significantly increases update throughput. ACID properties are supported at petabyte scale, on an elastic, commodity hardware base.

The Mesa paper also implicitly states that MapReduce is still used at Google ( contrary to what Google's SVP Urs Hölzle said just recently, at the Google I/O 2014 conference ). Mesa uses MR for parallel batch updates and for pre-aggregating data ( not for direct queries ).
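
To make the idea of batched updates over pre-aggregated data more concrete, here is a minimal Python sketch. It is not Mesa's actual design or API: the function name `apply_batch`, the key layout and the sample data are all assumptions made up for illustration. It only shows why folding many row updates into one batch, and publishing the result as a new version, is cheaper than applying each update on its own.

```python
from collections import defaultdict


def apply_batch(aggregates, batch):
    """Fold one batch of row updates into a pre-aggregated table.

    aggregates: dict mapping key -> running total (the pre-aggregated view)
    batch: iterable of (key, delta) row updates collected since the last batch
    Returns a new dict, so readers of the previous version are not disturbed.
    """
    # Collapse the batch first: many updates to one key become a single delta.
    collapsed = defaultdict(int)
    for key, delta in batch:
        collapsed[key] += delta

    # Publish the result as a new version in one step.
    new_version = dict(aggregates)
    for key, delta in collapsed.items():
        new_version[key] = new_version.get(key, 0) + delta
    return new_version


# Hypothetical data: (campaign, country) -> click count
v0 = {("campaign_a", "US"): 100}
batch = [
    (("campaign_a", "US"), 5),
    (("campaign_a", "US"), 7),
    (("campaign_b", "DE"), 3),
]
v1 = apply_batch(v0, batch)
print(v1)
```

Queries that started against `v0` keep seeing a consistent snapshot while `v1` is being built, which is one simple way to get repeatable answers between batches.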

Sunday, August 03, 2014

Big Data Killer App - Usage Based Insurance

Larry Ellison compares the IT and fashion industries, saying that the IT industry's most frequent product is ... a bunch of synonyms ( the fashion industry often just slaps new names on old products, and so does IT ).
Many IT buzzwords have fizzled, so nobody should be surprised when some good ideas and concepts are doubted as well. Big Data is definitely one of them, and although some industry analysts think Big Data is going down and out, hardly a day goes by without a related article in the mainstream business press and IT publications.

Perhaps the best way to describe Big Data is by example. I chose the usage based car insurance revolution, which is partially enabled by Big Data. It is now possible to effectively collect and analyze actual driving data and adjust the insurance premium accordingly. Usage Based Insurance ( UBI ) is a relatively old concept dating back to the early 90s, but more recent technological developments have made it a mandatory part of insurer activities. An example of a simple UBI implementation is the Progressive Insurance Snapshot product, which collects basic driving information ( hard braking, distance driven ) via a dongle attached to the car's onboard diagnostics port and sends it to Progressive, so it can more accurately determine the driver's risk profile.

Why is UBI a Big Data application? The amount of data generated can be staggering - even basic information like time of day, car location, acceleration and speed can easily amount to 10 MB per customer per year, which equals 1 TB of data for just 100,000 customers. This data also arrives at a high rate ( velocity in the Big Data definition ), and advanced ( predictive ) analytics is needed to accurately determine the driver's risk profile. It is also possible to analyze the driver's data in real time, compare current speed with maps containing speed limits and provide immediate feedback to the driver. UBI mandates a rethinking of architecture and tools to be able to effectively collect and analyze driving data.
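
The volume estimate and the speed-limit feedback above can be sketched in a few lines of Python. This is only an illustration: decimal units ( 1 MB = 10^6 bytes, 1 TB = 10^12 bytes ) are assumed, and the road types, speed limits and function names are hypothetical, not taken from any insurer's product.

```python
MB = 10**6   # decimal megabyte, an assumption of this sketch
TB = 10**12  # decimal terabyte


def yearly_volume_bytes(customers, mb_per_customer_per_year=10):
    """Raw telematics volume per year for a given customer base."""
    return customers * mb_per_customer_per_year * MB


print(yearly_volume_bytes(100_000) / TB)  # 1.0

# Hypothetical speed limits in km/h; a real system would look them up in map data.
SPEED_LIMITS_KMH = {"residential": 50, "highway": 120}


def speeding_feedback(road_type, speed_kmh):
    """Return a warning for the driver, or None when within the limit."""
    limit = SPEED_LIMITS_KMH[road_type]
    if speed_kmh > limit:
        return f"Slow down: {speed_kmh} km/h in a {limit} km/h zone"
    return None


print(speeding_feedback("residential", 62))
```

At a few million customers the same arithmetic lands in the tens of terabytes per year, before any video or high-frequency sensor data is added.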

Friday, August 01, 2014

US Visa System Database Crash and Performance Problems - Too Big to Handle for Oracle Database

These days we are witnessing performance and access problems with the US visa system. The culprit is the Consular Consolidated Database - a data warehouse deployed as an Oracle database on the Windows operating system. Current database size is over 100 TB ( allegedly one of the largest Oracle databases in the world ), growing at a double-digit percentage per year, storing text and photos, with complex feeds and lookups into external and internal databases. The reason for the crash and the latest performance problems is a patch that was applied to resolve earlier performance problems.

We think that there are multiple reasons this problem popped up:
- database size is clearly beyond what Oracle database software can comfortably handle
- mixed data types ( photos ) further exacerbate the problem, as RDBMSs in general, and Oracle in particular, have a problem with storing data beyond straightforward text and numbers
- inadequate hardware and operating system ( Wintel )

Possible origins of the problem are: bad analysis of business requirements, bad capacity planning, wrong architecture choice. Probably none of these steps was done correctly.

Oracle database clearly has a problem scaling out beyond a certain size ( a couple of dozen TB ), and we doubt that even the latest software and hardware currently available from Oracle ( Oracle 12c In-Memory with Exadata on M6-32 Database Machine ) would provide adequate performance for future use of this database. Nevertheless, the State Department chose Oracle again for the upgrade ( can't get fired for buying Oracle? ... that might change, even in government ).

An architecture better suited to this purpose is one of the many MPP databases available on the market that can comfortably handle large databases ( starting with the venerable Teradata and on to SAP HANA, etc. ), with photos stored outside the database in a file system.
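
The "photos outside the database" part of this suggestion can be sketched as follows. This is a minimal illustration only: SQLite stands in for whatever MPP database would actually be used, and the table name `applicant_photo`, the helper functions and the sample bytes are all hypothetical. The point is simply that the database row keeps a path and a checksum, while the image bytes live in the file system.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_photo(conn, photo_dir, applicant_id, photo_bytes):
    """Write the photo to the file system and keep only a reference in the database."""
    digest = hashlib.sha256(photo_bytes).hexdigest()
    path = Path(photo_dir) / f"{digest}.jpg"
    path.write_bytes(photo_bytes)
    conn.execute(
        "INSERT INTO applicant_photo (applicant_id, photo_path, sha256) VALUES (?, ?, ?)",
        (applicant_id, str(path), digest),
    )
    return path


def load_photo(conn, applicant_id):
    """Look up the reference in the database, then read the bytes from disk."""
    row = conn.execute(
        "SELECT photo_path FROM applicant_photo WHERE applicant_id = ?",
        (applicant_id,),
    ).fetchone()
    return Path(row[0]).read_bytes()


# SQLite is only a stand-in for the real database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applicant_photo ("
    "applicant_id INTEGER PRIMARY KEY, photo_path TEXT NOT NULL, sha256 TEXT NOT NULL)"
)
photo_dir = tempfile.mkdtemp()
saved_path = store_photo(conn, photo_dir, 1, b"fake image bytes")
print(load_photo(conn, 1) == b"fake image bytes")
```

The database then stays small and scan-friendly, at the cost of having to keep file system and database backups consistent with each other.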

Update ( 6/19/2015 - a year later ): another multi-day worldwide outage ( still ongoing as of this day ). It looks like it is being blamed on hardware this time around. Whatever the root cause is, the fact remains that the Oracle based US visa system is brittle, unstable and not in a functional HA configuration.

Saturday, July 26, 2014

Oracle 12c Database In-Memory is Out - Hardly Anybody Notices

Oracle 12c with the In-Memory option ( note that even Oracle doesn't dare to call it an In-Memory RDBMS - hence the awkward Database In-Memory designation ) was released last week. Press and media are rightfully silent about it ( aside from a couple of Oracle heads who are excited that something "new" is finally happening in the Oracle database world ). The much hyped initial foray of Oracle's flagship database into the in-memory, columnar world is a storm in a teacup, as no groundbreaking features are offered ( that is, unless your database world view is limited to Oracle only ).

The Oracle In-Memory option is yet another ( column-organized ) cache on top of an old and tired row/disk based database ( est. 1979 ).
Oracle's latest solution is fairly pedestrian - the columnar cache has to be reloaded, and data transformed from row to columnar format, on each database restart or on first use of a particular table. The question this approach raises is how fast the cache will be populated on each database startup, since query performance will surely suffer until the cache is fully reloaded.

IBM BLU and SAP HANA are ahead of Oracle, as they store data on disk natively in columnar format, i.e. once data is written to disk it does not need to be transformed from row to columnar any more.
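
The row-to-columnar transformation discussed above can be illustrated with a small Python sketch. It says nothing about how Oracle, IBM or SAP actually implement their engines; the function `rows_to_columns` and the sample table are hypothetical and exist only to show what "repopulating a columnar cache from a row store" means.

```python
def rows_to_columns(rows, column_names):
    """Pivot row tuples into one list per column (the cache population step)."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


# Hypothetical row-organized table, as it would sit on disk in a row store.
column_names = ("order_id", "region", "amount")
row_store = [
    (1, "EU", 120.0),
    (2, "US", 80.0),
    (3, "EU", 45.5),
]

# A cache built this way has to be rebuilt after every restart;
# a natively columnar store would skip this step entirely.
column_cache = rows_to_columns(row_store, column_names)

# An analytic query now touches only the two columns it needs.
eu_total = sum(
    amount
    for region, amount in zip(column_cache["region"], column_cache["amount"])
    if region == "EU"
)
print(eu_total)  # 165.5
```

The pivot touches every value in the table, which is why the time to repopulate grows with table size and why queries run slower until it completes.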

Oracle is clearly in defensive mode, not in an aggressive, leading edge charge. Their selling point is: no changes, backwards compatibility, all remains the same. You will get spectacular improvements in performance in some specific cases and under specific conditions. Bottom line is: you pay a lot ( In-Memory is a separately priced option ), do little in terms of technical effort ( it is easy to administer this feature and it is transparent to applications ) and experience some performance gains ( improvements will be noticeable in specific cases and situations, not across the board ). Oracle will surely continue to expand on this theme, but at this pace it will take a while to catch up with SAP HANA and IBM BLU. Any possibility of Oracle taking the lead in the database innovation race is out of the question for now.

Friday, July 18, 2014

Big Data/Hadoop and Incumbent RDBMS Vendors - Saga Continues

All major RDBMS vendors ( Oracle, IBM, Teradata ) are developing and executing strategies to cope with the rise of Hadoop, as well as to ride the Big Data wave.
As far as Hadoop is concerned, the initial idea was to use simple RDBMS-Hadoop connectors and loaders and thus contain or reduce Hadoop's role to that of a storage or perhaps ETL platform. We are now witnessing the next stage of Hadoop related strategies - a proliferation of federated query engines like Teradata QueryGrid, SAP HANA Smart Data Access, the recently announced Oracle Big Data SQL etc.
Oracle Big Data SQL, Teradata QueryGrid and other federated query approaches over heterogeneous data ( Composite Software and other data virtualization vendors don't belong to this category ) have Hadoop in their cross-hairs, i.e. they are legacy vendors' attempts to cope with the inevitable rise of Hadoop as the centerpiece of Big Data initiatives.
Federated query engines typically originate queries from the respective legacy vendor's software and/or hardware platforms. Oracle Big Data SQL, for example, runs on custom hardware only for now ( it doesn't, actually - it is still vaporware as of this date, in customary, time-honored Oracle manner ); the Hadoop related part is essentially a port of the Exadata cellsrv software to Hadoop datanodes. Distributed queries are executed across heterogeneous data sources ( Hadoop, databases etc. ) with varying degrees of intelligence ( predicate pushdown, local data processing via smart scan in the case of Oracle ), query optimization and performance.
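
Predicate pushdown, mentioned above, is easiest to see in a toy example. The Python sketch below is not based on any vendor's API: the `RemoteSource` class, the row counter and the sample data are assumptions made purely for illustration. It compares how many rows cross the "network" when the filter runs at the querying engine versus at the remote source.

```python
class RemoteSource:
    """Stand-in for a remote data source such as a Hadoop datanode (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # rows sent back to the querying engine

    def scan(self, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def query_without_pushdown(source, predicate):
    # Every row is shipped, then filtered at the federated engine.
    return [row for row in source.scan() if predicate(row)]


def query_with_pushdown(source, predicate):
    # The filter runs next to the data; only matching rows are shipped.
    return list(source.scan(predicate))


def is_german(row):
    return row["country"] == "DE"


rows = [{"user": i, "country": "DE" if i % 10 == 0 else "US"} for i in range(1000)]
naive = RemoteSource(rows)
pushed = RemoteSource(rows)

same_answer = query_without_pushdown(naive, is_german) == query_with_pushdown(pushed, is_german)
print(same_answer, naive.rows_shipped, pushed.rows_shipped)  # True 1000 100
```

Both queries return the same result, but pushdown moves a tenth of the rows here; on real clusters that difference is a large part of what separates the more intelligent federated engines from the rest.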
The fundamental problem with this approach is that it centers Big Data activities in the wrong place.
Hadoop is synonymous with the Big Data initiative and is the hub around which other data sources will revolve, i.e. Hadoop is the system of record for Big Data. Big Data activities should not be centered around legacy data platforms like Oracle, Teradata etc., which is exactly what the above mentioned products enforce.
Federated query solutions like Oracle Big Data SQL cover only minor Big Data use cases ( even if they deliver on performance, which is itself a tough problem to solve in heterogeneous environments ).
This class of products should be viewed as legacy vendors' attempt to defend and expand their turf by leveraging a large installed base.
Some of these products are fairly advanced, as they build on decades of experience in data management and are backed by the huge financial and other resources of legacy vendors. The old guard can thus innovate on the Hadoop platform quite fast. Hadoop was not initially built for BI or corporate data management. Dedicated Hadoop vendors like Cloudera, relatively inexperienced ( at least in the enterprise database management software development arena ), are tiptoeing around basic DBMS concepts and rediscovering tricks that legacy vendors mastered over decades.
Not surprisingly, Teradata was one of the first vendors to release such a federated query solution ( first called SQL-H, now QueryGrid ) - probably because the potential Hadoop squeeze is felt most strongly in their niche of high end, very large data warehouses, which also happens to be Hadoop's entry point into the DBMS market.
Hadoop and Big Data are new approaches to building a completely new analytic infrastructure and developing a whole new class of applications based on nearly infinite scalability and near zero storage prices. While we can borrow some concepts and technologies from the old world, Big Data folks are also experimenting with newish concepts like schema-on-read that will redefine how we deal with all aspects of the analytics pipeline.
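
For readers who have not met schema-on-read, here is a minimal Python sketch of the idea. The raw events, field names and the `read_with_schema` helper are hypothetical; the only point is that raw data is stored as-is and each consumer applies its own schema at query time, instead of one schema being enforced when the data is loaded.

```python
import json

# Raw events are stored exactly as they arrived; no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.90", "ts": "2014-07-18"}',
    '{"user": "bob", "amount": "5", "coupon": "SUMMER"}',
    '{"user": "cy"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> cast function) while reading.

    Fields missing from a record come back as None; fields outside the
    schema are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
marketing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "coupon": str}))

print(billing_view[0])    # {'user': 'ann', 'amount': 19.9}
print(marketing_view[1])  # {'user': 'bob', 'coupon': 'SUMMER'}
```

A new field such as `coupon` can start appearing in the raw data without any reload or ALTER TABLE; only the consumers that care about it need to change their read-time schema.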