Friday, February 17, 2017

Cloud Spanner, Apache Kudu, TensorFlow; Deep Learning is a Hammer in Search of a Nail ( and Why That is Okay )

It is only February, yet we already have a slew of important announcements in the Big Data arena. Google made its monster Spanner database available to everybody via the Cloud Spanner API. Spanner has the potential to make obsolete armies of in-house IT staff maintaining primary and DR sites, copying, synchronizing, backing up and restoring, Golden Gating and generally being involved in futile attempts to serve consistent data across the enterprise.
Spanner is the ultimate database - automatically geo-replicated and monstrously scalable, ACID compliant, relational ( SQL support ) - in other words, a true cloud relational database. It might finally push hesitating enterprises into the public cloud arena, as the advantages far outweigh general concerns about security and lack of control over data and infrastructure. Spanner is a generational leap over established RDBMS vendor offerings ( Oracle, IBM, Microsoft ).

Another Google-driven development is the release of TensorFlow 1.0. It confirms TensorFlow's traction and fast pace of innovation, as well as Google's commitment to it, with major performance ( compiler ) and usability improvements ( layers, packaged prebuilt models, new APIs, the announced integration of Keras with TensorFlow, the Estimator API ).

In another development, Cloudera and Apache announced the Apache Kudu 1.0 release. Kudu fills an important gap in the Hadoop ecosystem - it is an SQL/ACID compliant engine with random read/write capability. Hadoop/HDFS files can only be appended to ( with the exception of the MapR distribution ), which ripples through the whole ecosystem and causes problems ( no updates, no indexes, or ugly workarounds for the same ). HBase has the ability for random RW, but never caught on, partially because of the lack of a SQL interface ( Phoenix tried to solve it, but now we are talking about scaffolding that is too flimsy and fractured, even by Hadoop standards ).
With Kudu we can expect a further Hadoop ecosystem onslaught on Teradata, Netezza and Exadata turf.
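To make the random-write point concrete, here is a minimal Python sketch ( using the impyla client against Impala with Kudu storage; the host and table names are made up, and the SQL follows the Impala/Kudu syntax of that era, so details may differ by version ) of row-level upserts and updates that plain HDFS-backed tables simply cannot do:

# Hypothetical sketch: impyla (impala.dbapi) talking to an Impala daemon
# that stores the table in Kudu. Host, port and table are illustrative.
from impala.dbapi import connect

conn = connect(host='impala-host.example.com', port=21050)
cur = conn.cursor()

# A Kudu-backed table supports row-level UPSERT/UPDATE/DELETE via Impala SQL,
# something an append-only HDFS-backed table cannot do.
cur.execute("""
    CREATE TABLE IF NOT EXISTS positions (
        account_id BIGINT,
        symbol STRING,
        quantity DOUBLE,
        PRIMARY KEY (account_id, symbol)
    )
    PARTITION BY HASH (account_id) PARTITIONS 4
    STORED AS KUDU
""")

# Random writes: insert or overwrite individual rows in place.
cur.execute("UPSERT INTO positions VALUES (42, 'IBM', 100.0)")
cur.execute("UPDATE positions SET quantity = 150.0 "
            "WHERE account_id = 42 AND symbol = 'IBM'")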

In this little post we are also addressing critiques that Machine / Deep Learning is a solution in search of a problem. Yes, it is, and no, it is not the first time that approach has been taken, with great outcomes. The transistor, the computer and satellites were also invented first and found their many applications later. Just imagine how the top-down, pick-a-problem-then-find-a-solution approach would work if banks, for example, said: we have a problem handling all this data; why don't we create a project in which we will invent the transistor, then create computers, compilers, languages and databases to help manage our business.
So yes, Machine Learning is a bottom-up approach. ML/DL is a foundational technology that is becoming a solution for many problems ( for example, an RNN/LSTM network can be used as a generic computer that can compute anything a conventional computer can compute, and is thus a basic building block ). We have a hammer searching for nails it can successfully hit, and it looks like there are many out there.

Friday, January 27, 2017

Enterprise Grade Risk Modeling Using Machine Learning

All the pieces of the puzzle are now in place for productive and successful large-scale risk modeling using Machine Learning. A slew of recent software and hardware announcements means that we finally have a full, brand new stack of components for a productive, enterprise-class Machine Learning risk modeling exercise. Google's TensorFlow makes it possible to quickly train, test and run predictive models on a variety of target devices ( CPU, GPU ). The latest TensorFlow releases incorporate the tf.learn library, which makes it much easier to extract features and pass datasets on to the train, test and predict modules. This trend towards ease of use will continue with the announced incorporation of Keras into the TensorFlow build. On the hardware side, IBM just released the PowerAI platform/appliance that, aside from TensorFlow, also incorporates Nvidia hardware and software ( GPUs, CUDA, NVLink ).
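To give a flavor of how little ceremony is involved, here is a minimal sketch of a probability-of-default style classifier using the TF 1.0-era tf.contrib.learn ( tf.learn ) API; the borrower data is synthetic, the network shape is arbitrary, and the call signatures are the ones from that era, so they may differ in later releases:

# A minimal sketch, assuming synthetic borrower data and TF 1.0-era tf.contrib.learn.
import numpy as np
import tensorflow as tf

# Fake training data: 1,000 borrowers, 10 numeric features, binary default flag.
X_train = np.random.rand(1000, 10).astype(np.float32)
y_train = np.random.randint(0, 2, size=1000)

feature_columns = [tf.contrib.layers.real_valued_column("", dimension=10)]

# Prebuilt ( "canned" ) estimator: a 2-class deep net for probability of default.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[32, 16],
    n_classes=2,
    model_dir="/tmp/pd_model")

classifier.fit(x=X_train, y=y_train, steps=500)

# Score new applicants; predict_proba yields per-class probabilities.
X_new = np.random.rand(5, 10).astype(np.float32)
probs = list(classifier.predict_proba(x=X_new))
print([p[1] for p in probs])  # estimated probability of default per applicant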

Monday, January 02, 2017

A Comical Break: Moody's Prefers Intuition or Economic Theory to Machine Learning Because It Is Important to Have Theoretical Underpinnings

Here is Moody's ( 2012 ) take on variable ( feature ) selection in the context of risk analysis ( Methodology for Forecasting and Stress-Testing U.S. Vehicles ABS Deals ):

A key aspect of model development is variable selection - identifying which credit and economic variables best explain the dynamic behavior of the dependent variable in question. Aligned with principles of modern econometrics, we prefer to choose the variables based on a combination of economic theory or intuition, together with a consideration of the statistical properties of the estimated model.
We believe models built using pure data-mining techniques or principles such as machine learning, though they may fit the existing data well, are more likely to fail in a changing external environment because they lack theoretical underpinnings. The best prediction models employ a combination of statistical rigor with a healthy dose of economic principle. Models built this way enjoy the additional benefit of ease of interpretation.


I am not sure how they can claim the above with a straight face. Moody's is one of the agencies that completely failed to predict the 2008 housing-originated crash. There are no known, scientific, or even commonly agreed upon "theoretical principles" or economic theory here. It is now clearer why the agencies have a problem with prediction, which is hard, especially about the future. The FCIC commission found that the agencies' credit ratings were influenced by "flawed computer models, ...". Yet they stick to the same practices. Continuing:


Adding each economic variable helps the model improve predictive power. Generally speaking, the economic variables should be useful in both producing accurate out-of-sample forecasts and providing good in-sample fit. However, we sometimes have to make tradeoff decisions to balance out between these two goals when they are conflicting. If the discrepancy is unavoidable and very significant, we prioritize forecast accuracy rather than in-sample fit, as forecasts are end results of our models.


Translated: but when the above practice fails - we fudge by taking whatever works better - exactly the approach they ( Moody's ) dismissed earlier.


Here they finally convince us it is actually an alchemy approach, based on art and intuition ( which doesn't prevent them from sprinkling some scary-looking math into the document - just for the artistic impression ):

And here Moody's finally leaves no shadow of a doubt that we are dealing with artists, entertainers and illusionists:
Variable selection is more art than science. The criteria mentioned above are not black or white. The bottom line is to build a theoretically sound and empirically workable model and get reasonable and consistent forecasts that are supported by both economic intuition and statistical significance.

To their credit, and unlike many in-house modeling practices, Moody's actually checks how the model performs, but they rarely admit the model is wrong:

The consistency check is the comparison of model performance across different production runs. We keep track of the model performance by comparing the forecast statistics over time. The results of the analysis may suggest revisions to the model. However, differences do not necessarily indicate that the model is in error. We should look into what causes the discrepancy and how this affects the end results. If the statistics get really worse and fall into an unacceptable range, we should modify the original model to accommodate revised performance data and changing economic conditions and make sure that the model reflects the most recent development in the auto ABS market.


Thursday, December 29, 2016

Machine Learning and Risk Modeling

Machine Learning offers an interesting possibility to redefine the way risk modeling is currently done in large financial institutions ( banks, insurance companies ). Regulatory and product-oriented risk calculations are a crucial part of financial organizations' activities ( probability of default for customers, organizations and countries, bank capitalization requirements, credit scores etc. ). Current practices are a mix of alchemy, crossing your fingers and wishing for the best, some math, and a lack of backtesting to check how models are actually performing. Models take a long time to develop and execute, with results everybody pretends to believe in.

Machine Learning has the potential to completely redefine ( for the better ) the way risk modeling is done.


Existing resource- and time-consuming processes could be replaced with a new Machine Learning paradigm where programs/models change far less frequently. Each new model development iteration doesn't necessarily need to imply brand new model development ( or model tweaking ) and deployment. Getting a new, updated model would just require running the existing neural network model with the newly available data ( a minimal sketch of this follows below ). The classic model development lifecycle is replaced with new techniques that must be learned ( developing and testing neural network models, determining hyperparameters, dealing with bias and variance, i.e. preventing under/overfitting, etc. ).
The regulatory aspect also needs to be taken care of, as regulators need to be on board with the proposed model changes ( for capitalization-related risk calculations ).
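Here is a minimal sketch of that refresh-by-retraining idea in TensorFlow 1.x; the checkpoint path, network shape and data are all hypothetical, the point being that the "new" model is just the old graph with its weights updated on freshly arrived data:

# A minimal sketch, assuming a hypothetical checkpoint and synthetic data,
# of "refresh by retraining": restore last period's weights and continue
# training on newly arrived observations instead of rebuilding the model.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

hidden = tf.layers.dense(x, 32, activation=tf.nn.relu)
logit = tf.layers.dense(hidden, 1)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logit))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

saver = tf.train.Saver()
new_x = np.random.rand(500, 10).astype(np.float32)            # newly available data
new_y = np.random.randint(0, 2, (500, 1)).astype(np.float32)

with tf.Session() as sess:
    saver.restore(sess, "/tmp/risk_model/model.ckpt")  # previous model state (hypothetical path)
    for _ in range(100):                               # continue training on fresh data only
        sess.run(train_op, feed_dict={x: new_x, y: new_y})
    saver.save(sess, "/tmp/risk_model/model.ckpt")     # the "new" model is just updated weights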

While the new ( supervised ) Machine Learning based paradigm will not save us from outliers ( Black Swan events ), it will definitely be more accurate and easier to deploy and maintain than the existing one. Thus we think it is high time even for mainstream financial organizations to start establishing a foothold in the self-programming world.

Thursday, December 22, 2016

Forget About AI - Machine Learning Is What Matters

IT, like any other mature industry, shows less and less capacity for true and radical innovation. This is proven yet again by the latest waves of AI noise ( skyrocketing NIPS attendance, mediocre Salesforce Einstein software, crazy prices for companies repackaging old Machine Learning concepts ).

What is a typical ( or major ) financial institution to make of, and do about, the latest craze?

AI is a wide discipline with many typically siloed, i.e. disparate, problem areas - it offers domain-specific solutions to diverse, often unrelated sets of problems ( self-driving cars, recommendation systems, speech recognition ). It is also completely empirical ( result-driven ), with no theoretical foundations or explanations of why, for example, artificial neural networks work the way they do ( "But it works" - G. Hinton, 51:00 ).

We remember quite well the noises, notions and semi-flops of the past ( CASE, OODBMS, Y2K, Hadoop ). Even the Cloud has made limited direct inroads into the standard enterprise landscape. The Cloud didn't become a mainstream replacement for on-premise hardware and software, as the majority of mission-critical corporate systems still keep data and run in house.

Oracle CEO Larry Ellison Bashes 'Cloud Computing' Hype ( please note Ellison has repented since this 2009 cloud-bashing episode ).


Consequently we think that, once the AI smoke clears and the mirrors are gone, all that will be left is good old Machine Learning ( lucratively renamed Deep Learning ), which will hopefully gain some foothold in more forward-thinking ( or more adventurous ) financial establishments.



Saturday, August 20, 2016

Financial Markets and Deep Learning Methods

Financial markets modeling is a quite imprecise, non-scientific discipline, resembling alchemy and other futile human endeavors. We know by now that it is next to impossible to fully model and completely predict how markets will behave. There are many factors affecting security prices and financial markets in general, with human reaction being the biggest unknown and one that is impossible to model and thus predict.
What we can strive to do is perhaps model smaller, relatively limited domains - an approach not completely dissimilar to the split between the classical and quantum physics domains, for example.
Supervised learning gives us predictive power, but is limited because we rely on history to predict the future. That all but invalidates it for this purpose, as financial markets history ( and future ) rhymes, but doesn't repeat.
Unsupervised learning, on the other hand, does not have the training set ( history ) problem, but it provides a limited set of options regarding what we can actually do with it. Aside from running clustering algorithms ( k-means etc. ) to find groupings in the data, there isn't much in the proactive department that can be done with it.
A combination of supervised and unsupervised ( semi-supervised ) methods is one approach that could potentially be successful, along with ensemble learning, i.e. combining multiple models into one. There are dozens of algorithms already implemented in many frameworks and languages that can be combined to form a useful model.
We could, for example, run an unsupervised deep learning algorithm on massive volumes of raw data to automatically discover areas of interest ( clusters of data ), then perform further analysis via targeted supervised learning methods to find out how the data is correlated. This approach would give us a tactical advantage, i.e. useful, actionable information.
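A minimal sketch of this cluster-then-classify idea, with synthetic data standing in for real market features ( scikit-learn here for brevity, but the same flow works with TensorFlow or Spark MLlib ):

# A minimal sketch, with synthetic data: unsupervised k-means proposes groupings,
# then a targeted supervised model is fit within each discovered cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Stand-in for massive raw market data: 10,000 rows, 20 engineered features.
X = np.random.rand(10000, 20)
y = np.random.randint(0, 2, 10000)   # e.g. "instrument moved up next day" label

# Step 1: unsupervised - let k-means discover areas of interest in the data.
clusters = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Step 2: supervised - drill into each cluster with its own classifier.
models = {}
for c in range(5):
    mask = clusters == c
    models[c] = LogisticRegression().fit(X[mask], y[mask])

# New observations are first assigned to a cluster, then scored by that cluster's model.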
The whole process can be automated and performed on massive volumes of data ( sample = ALL ) quite inexpensively. The latest wave of technology makes it possible to store and process EVERYTHING - all indexes, stock prices, derivatives, currencies, commodity prices, as much data as we can get or buy - store it cheaply on a Hadoop cluster, preprocess it using the Spark framework, then model with TensorFlow or Spark MLlib. It can all be done in the cloud, and it is all open source software.
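For example, the "store cheaply on Hadoop, preprocess with Spark" step could look roughly like this ( a PySpark sketch; the HDFS paths and column names are purely illustrative ):

# A minimal sketch, assuming hypothetical HDFS paths and column names,
# of preprocessing raw price files with the PySpark DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("market-data-prep").getOrCreate()

# Raw daily prices landed on the Hadoop cluster as CSV.
prices = spark.read.csv("hdfs:///data/raw/prices/*.csv", header=True, inferSchema=True)

# Typical cleanup/feature step before handing data to TensorFlow or Spark MLlib.
features = (prices
            .dropna(subset=["symbol", "close"])
            .groupBy("symbol")
            .agg(F.avg("close").alias("avg_close"),
                 F.stddev("close").alias("vol_close"),
                 F.count("*").alias("n_days")))

features.write.mode("overwrite").parquet("hdfs:///data/features/prices_summary")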
Amazon AWS even offers GPU instances to boost processing power ( Google Cloud doesn't offer GPUs yet ).
TensorFlow can take advantage of distributed GPUs, or any combination of CPUs and GPUs.
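Pinning work to specific devices takes one line per device in TensorFlow 1.x - a minimal sketch, assuming a single visible GPU ( soft placement falls back to the CPU otherwise ):

# A minimal sketch of explicit CPU/GPU placement in TensorFlow 1.x.
import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.random_normal([1000, 1000])      # input preparation pinned to the CPU

with tf.device('/gpu:0'):
    b = tf.matmul(a, a)                     # heavy math pinned to the GPU

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                       log_device_placement=True)) as sess:
    sess.run(b)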
The final result is an automated system that will react and recommend ( and possibly automatically act ) on new insights.
We are aware that HFT and systematic trading do something similar already, but our point of interest is not short-term arbitrage ( which seems to be running out of steam anyway ) - it is to exploit deeper, longer-lasting knowledge about market direction. This is not a rule-based, hardcoded system that acts on predefined insights. It is a live system that automatically learns, gains insights and changes behavior. We could think of it as an attempt to create an AlphaGo or Deep Blue for financial markets.
Renaissance Technologies, Two Sigma and other hedge fund high fliers probably already utilize or are building similar systems. What is new is the commoditization of such an approach - suddenly almost everybody can do it. The open source nature of the latest advances in AI and Deep Learning, together with the affordability and power of parallel processing, levels the playing field, thus nullifying decades of incumbent advantage. The race is on to transplant the latest relevant Deep Learning advances from Silicon Valley to the active segments of the financial industry. This could further stir the already shaken hedge fund industry.

Thursday, July 21, 2016

Deep Learning, TensorFlow and Hedge Funds

Deep Learning is in these days. It really looks like a late-19th-century electricity gold rush deja vu, with industry and academia, as well as regular businesses, all jumping in.
Some hedge funds are already heavyweight users of computing power and models ( systematic funds like Renaissance, Two Sigma etc. ).

The complexity and size of modern, globalized markets mandate the usage of more advanced, automated data analysis methods. There are already indications ( visible if we observe how hedge funds fared during the Brexit turmoil, for example ) that systematic funds stand a better fighting chance in today's turbo markets.

The promise of Deep Learning is that a critical mass of data and bigger neural networks ( more layers, more computing power ) mean a qualitative change, and that a new level of modeling accuracy is achievable. General Deep Learning algorithms can be applied to domain-specific use cases, relying primarily on massive amounts of data as the fuel for insights ( rather than on particular domain knowledge, i.e. coming up with clever domain-specific algorithms ).

Google's TensorFlow is a recently open sourced library that runs on heterogeneous ( CPU, GPU, FPGA ), distributed platforms. It can be ( and already is ) used for financial markets modeling. If history is any guide, TensorFlow will ignite a whole new industry, and many products will have it as their architectural foundation.

TensorFlow makes it easy to perform complex analysis ( flexibly apply Deep Learning algorithms ) on multi-dimensional data ( tensors ) and come up with relatively reliable predictions on where a market will be based on earlier-closed markets, for example. Naturally, many other ideas and hypotheses can be tested ( models can be trained and inference run ) with great ease - and that is probably one of the most important TensorFlow advantages.
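A minimal sketch of that earlier-closed-markets idea, with synthetic numbers standing in for real quotes and all names purely illustrative ( TensorFlow 1.x graph API ):

# A minimal sketch: regress tomorrow's S&P 500 open on today's closes of
# earlier-closing markets (Nikkei, Hang Seng, FTSE). Data here is random noise.
import numpy as np
import tensorflow as tf

# 500 trading days x 3 earlier-closed markets -> next S&P 500 open (fake data).
closes = np.random.rand(500, 3).astype(np.float32)
sp_open = np.random.rand(500, 1).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])

hidden = tf.layers.dense(x, 8, activation=tf.nn.relu)
prediction = tf.layers.dense(hidden, 1)

loss = tf.reduce_mean(tf.square(prediction - y))
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op, feed_dict={x: closes, y: sp_open})
    # Inference: feed today's closes, read off the predicted open.
    print(sess.run(prediction, feed_dict={x: closes[-1:]}))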

Since hedge funds deal with publicly available data sets, cloud infrastructure ( AWS, Google Cloud ) can be utilized to essentially rent a supercomputer and perform massive calculations on the cheap. TensorFlow can light up such a virtual supercomputer with just a few lines of code.
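Those few lines look roughly like this - a minimal sketch of a TensorFlow 1.x cluster definition, with localhost addresses standing in for the rented cloud instances:

# A minimal sketch of pointing TensorFlow 1.x at a cluster of machines.
# In practice the addresses below would be the rented cloud instances;
# localhost ports are used here so the sketch runs on one box.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],                      # parameter server(s)
    "worker": ["localhost:2223", "localhost:2224"],    # (GPU) worker instances
})

# Each machine in the "virtual supercomputer" runs one of these servers.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are sharded onto the ps job; ops run on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([100, 1]))
    # ... build the rest of the model graph here ...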