Sunday, April 10, 2016

Corporate Analytics and Heterogeneous CPU/GPU Clusters

Several relatively recent developments point to a new computing paradigm that could have an impact on the corporate IT environment:

  • the rise of GPU capabilities and related frameworks (Nvidia and AMD chips; CUDA, cuDNN, and OpenCL software)
  • well-publicised advances in Deep Learning (A. Ng was an early proponent of GPUs in Deep Learning)
  • the release of Google's TensorFlow software (deployment of Deep Learning algorithms across heterogeneous CPU/GPU/mobile environments)

Some of these developments might eventually trickle down to corporate analytics departments as they push the boundaries of what is possible in massive numeric calculation (especially financial modeling, risk and stress testing, instrument pricing, etc.). Financial analytics is well positioned to take advantage of these advances because many risk calculations are numerically intensive and often embarrassingly parallel (matrix operations, scanning large volumes of data).
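To make "embarrassingly parallel" concrete, here is a minimal sketch of scenario revaluation in Python. The portfolio, the shock scenarios, and the function names are all hypothetical and chosen only for illustration. A thread pool stands in for GPU cores or cluster nodes; the point is that each scenario is valued without reference to any other, not that Python threads are fast.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical portfolio: signed position size per instrument.
POSITIONS = [100.0, -50.0, 25.0]

def scenario_pnl(shocks):
    """P&L of the portfolio under one scenario of per-instrument price shocks."""
    return sum(qty * shock for qty, shock in zip(POSITIONS, shocks))

def revalue(scenarios, workers=4):
    """Value every scenario independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scenario_pnl, scenarios))

# Hypothetical price shocks, one row per scenario.
scenarios = [
    [1.0, 0.5, -2.0],
    [-1.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
]

pnl = revalue(scenarios)
worst_case = min(pnl)  # the only step that needs all results together
```

Because no scenario depends on another, the work can be split across any number of workers, and only the final aggregation (here, the worst-case loss) requires gathering the results.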

The GPU's main attraction is its ability to execute the same instruction across many data elements in parallel (SIMD, SIMT). A GPU typically has hundreds or even thousands of cores on a single chip, as opposed to just a few cores on a standard CPU. GPUs are affordable commodity processors produced in the millions for use in gaming computers.
GPUs also have drawbacks: relatively small on-board memory, limited logic (branching/control) capabilities, and the need to move data between CPU and GPU memory in heterogeneous, mixed-workload environments.
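The data movement cost can be illustrated with a back-of-envelope estimate. The sketch below compares the time to copy data to the GPU and compute there against the time to compute in place on the CPU. Every hardware figure is a hypothetical round number, not a measurement, and the model ignores latency and overlapping of transfers with computation.

```python
# All hardware figures are hypothetical round numbers, not measurements.
LINK_BYTES_PER_S = 10e9   # assumed CPU<->GPU bus bandwidth
GPU_FLOPS_PER_S = 5e12    # assumed sustained GPU throughput
CPU_FLOPS_PER_S = 0.5e12  # assumed sustained CPU throughput

def gpu_seconds(bytes_moved, flops):
    """Time to ship the data over the bus plus time to compute on the GPU."""
    return bytes_moved / LINK_BYTES_PER_S + flops / GPU_FLOPS_PER_S

def cpu_seconds(flops):
    """Time to compute in place on the CPU; no transfer needed."""
    return flops / CPU_FLOPS_PER_S

def worth_offloading(bytes_moved, flops):
    """True when the GPU path, transfer included, beats the CPU path."""
    return gpu_seconds(bytes_moved, flops) < cpu_seconds(flops)

# Element-wise add of two vectors of 1e8 doubles:
# three arrays cross the bus, one flop per element.
n = 100_000_000
vector_add = worth_offloading(bytes_moved=3 * 8 * n, flops=n)

# Dense 4000 x 4000 matrix multiply:
# three matrices cross the bus, roughly 2 * m^3 flops.
m = 4000
matmul = worth_offloading(bytes_moved=3 * 8 * m * m, flops=2 * m ** 3)
```

Under these assumed numbers the vector addition is dominated by the transfer and is not worth offloading, while the matrix multiply does enough arithmetic per byte moved to benefit from the GPU.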

Nvidia CUDA is a proprietary GPGPU (General Purpose GPU) API. OpenCL is an open-standard framework for writing programs that execute across heterogeneous CPU/GPU environments.

Some of the intense activity in the Deep Learning area is directly applicable to the computing needs of the financial industry. For example, Deep Learning algorithms often involve executing large matrix operations and iterative numerical optimization such as stochastic gradient descent, both of which are common in financial industry numerical calculations.
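As a reminder of what stochastic gradient descent does, here is a minimal sketch that fits a straight line by updating the parameters one sample at a time. The data, learning rate, and function name are illustrative assumptions; real Deep Learning frameworks apply the same idea to far larger models, typically in mini-batches on a GPU.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by minimising squared error one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_fit(xs, ys)
```

On this synthetic data the estimates converge toward the generating slope and intercept. The same loop structure, with a different model and loss, underlies parameter calibration tasks in financial modeling.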

Deep Learning frameworks such as Caffe (single node), Theano, and Google's TensorFlow (whose single-node version was open sourced in November 2015) can take advantage of both CPUs and GPUs; they rely on CUDA or OpenCL for low-level operations. The Spark ecosystem is developing frameworks such as SparkNet and CaffeOnSpark that make it possible to execute Deep Learning algorithms in heterogeneous CPU/GPU environments.

The core Spark project has announced that it might use OpenCL to take better advantage of GPU capabilities.

Here is the speedup achieved with HeteroSpark on a mixed CPU/GPU cluster:

Last but not least, Facebook released the design for Big Sur, an 8-GPU-card server with configurable PCI paths for intra-node parallelism. Nvidia announced the DGX-1, a 170-teraflop, $130,000 monster: a supercomputer in a box. Such designs might signal a shift toward configurations with a smaller number of more powerful servers.