Friday, February 17, 2017

Cloud Spanner, Apache Kudu, TensorFlow; Deep Learning is a Hammer in Search of a Nail ( and Why That is Okay )

It is only February, yet we already have a slew of important announcements in the Big Data arena. Google made its monster Spanner database available to everybody via the Cloud Spanner API. Spanner has the potential to make obsolete the armies of in-house IT staff who maintain primary and DR sites, copy, synchronize, back up and restore data, run GoldenGate replication, and generally engage in futile attempts to serve consistent data across the enterprise.
Spanner is the ultimate database - automatically georeplicated and monstrously scalable, ACID compliant, and relational ( SQL support ). In other words, it is a true cloud relational database. It might finally push hesitating enterprises into the public cloud, as its advantages far outweigh general concerns about security and lack of control over data and infrastructure. Spanner is a generational leap over the offerings of established RDBMS vendors ( Oracle, IBM, Microsoft ).

Another Google-driven development is the release of TensorFlow 1.0. It confirms TensorFlow's traction and fast pace of innovation, as well as Google's commitment to it, with major improvements in performance ( compiler ) and usability ( layers, packaged, prebuilt models, new APIs, the announced integration of Keras with TensorFlow, the Estimator API ).

In another development, Cloudera/Apache announced the Apache Kudu 1.0 release. Kudu fills an important gap in the Hadoop ecosystem - it is a SQL/ACID compliant engine with random read/write capability. Hadoop/HDFS files can only be appended to ( with the exception of the MapR distribution ), which causes problems that ripple through the whole ecosystem ( no updates and no indexes, or only ugly workarounds for them ). HBase has random read/write capability, but it never caught on, partially because of the lack of a SQL interface ( Phoenix tried to solve that, but now we are talking about scaffolding that is too flimsy and fractured, even by Hadoop standards ).
With Kudu we can expect a further Hadoop ecosystem onslaught on Teradata, Netezza and Exadata turf.
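
To make the append-only point concrete, here is a toy Python sketch. It is not HDFS or Kudu code; the two classes and the "acct-1" key are made up for illustration. It only shows why an "update" on append-only storage turns into extra records plus a read-time workaround, while a random-write store simply overwrites the row.

```python
# Toy illustration only: hypothetical classes, not HDFS or Kudu APIs.
class AppendOnlyStore:
    """Mimics an append-only file: records can be added but never changed."""

    def __init__(self):
        self._log = []  # (key, value) records in arrival order

    def put(self, key, value):
        # An "update" is just another appended record.
        self._log.append((key, value))

    def get(self, key):
        # Workaround: scan backwards so the latest record for a key wins.
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def size(self):
        return len(self._log)


class RandomWriteStore:
    """Mimics a store that supports in-place updates by key."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value  # overwrite in place

    def get(self, key):
        return self._rows.get(key)

    def size(self):
        return len(self._rows)


log_store = AppendOnlyStore()
row_store = RandomWriteStore()
for store in (log_store, row_store):
    store.put("acct-1", 100)
    store.put("acct-1", 250)  # update the same key

print(log_store.get("acct-1"), log_store.size())  # stale record still stored
print(row_store.get("acct-1"), row_store.size())  # one row, updated in place
```

Both stores return the latest value, but the append-only one keeps the stale record around until something compacts it, which is the kind of workaround the paragraph above complains about.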

In this little post we are also addressing critiques that Machine / Deep Learning is a solution in search of a problem. Yes, it is, and no, it is not the first time that approach has been taken, with great outcomes. The transistor, the computer and satellites were also invented first and only then found their many applications. Just imagine how a top-down approach ( pick a problem, then find a solution ) would have worked if banks, for example, had said: we have a problem handling all this data; why don't we create a project where we will invent the transistor, create a computer, compilers, languages and databases to help manage our business.
So yes, Machine Learning is a bottom-up approach. ML / DL is a foundational technology that is becoming a solution for many problems ( for example, an RNN/LSTM network can in principle be used as a generic computer that computes anything a conventional computer can compute, and is thus a basic building block ). We have a hammer searching for nails it can successfully hit, and it looks like there are many out there.
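
As a very small taste of the "recurrent network as a computer" idea, here is a hand-wired recurrent network in plain Python that computes the running parity ( XOR ) of a bit sequence. This is only a sketch: the weights and thresholds are set by hand rather than learned, the function names are made up for this post, and it is nowhere near a demonstration of general-purpose computation. It just shows a recurrent state being carried from step to step to compute something a feed-forward pass over a single input cannot.

```python
def step(x):
    """Hard threshold activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def rnn_parity(bits):
    """Run a tiny hand-wired recurrent network over a sequence of 0/1 bits.

    The state unit h carries the parity of the inputs seen so far.
    Two hidden units compute OR and AND of (h, x); the new state is
    "OR and not AND", which is XOR.
    """
    h = 0  # recurrent state, starts at even parity
    for x in bits:
        or_unit = step(h + x - 0.5)         # fires if h or x is 1
        and_unit = step(h + x - 1.5)        # fires only if both are 1
        h = step(or_unit - and_unit - 0.5)  # XOR of previous state and input
    return h


print(rnn_parity([1, 0, 1, 1]))  # three ones, so odd parity: prints 1
```

A trained RNN or LSTM would have to learn such weights from data instead of having them wired in, but the mechanics of state plus a repeated update are the same.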