BLAS-on-Flash: An alternative to large-scale ML training and inference?


Microsoft Research India


Julian Shun
Many large-scale machine learning training and inference tasks are memory-bound rather than compute-bound: on large data sets, the working set of these algorithms does not fit in memory, even for jobs that could otherwise run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm for distributed platforms such as parameter servers and Spark. BLAS-on-flash provides an inexpensive and efficient alternative, based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. Our library supports a BLAS and sparseBLAS interface on large SSD-resident matrices, enabling multi-threaded code to scale to industrial-scale datasets on a single workstation. Using BLAS-on-flash, we are able to process 10x larger models on 10x larger inputs in the same memory envelope in two key production pipelines: training large-scale topic models and inference for extreme multi-label learning. This suggests that our approach could be an efficient alternative to expensive distributed big-data systems for scaling up structurally complex machine learning tasks.
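
To make the idea of linear algebra on disk-resident matrices concrete, here is a minimal sketch of a tiled matrix multiply whose operands live in files rather than in RAM. It is not the BLAS-on-flash API or runtime: the function name `blocked_matmul_on_disk`, the raw row-major file layout, and the tile size are assumptions made for illustration, and NumPy's `memmap` stands in for the library's own SSD I/O layer.

```python
import os
import tempfile

import numpy as np


def blocked_matmul_on_disk(a_path, b_path, c_path, m, k, n,
                           block=256, dtype=np.float32):
    """Compute C = A @ B for row-major matrices stored in raw files.

    Hypothetical helper for illustration only. Only a few tiles of
    size (block x block) are held in memory at any time.
    """
    a = np.memmap(a_path, dtype=dtype, mode="r", shape=(m, k))
    b = np.memmap(b_path, dtype=dtype, mode="r", shape=(k, n))
    c = np.memmap(c_path, dtype=dtype, mode="w+", shape=(m, n))

    for i in range(0, m, block):
        for j in range(0, n, block):
            rows = min(block, m - i)
            cols = min(block, n - j)
            # In-memory accumulator for one output tile.
            acc = np.zeros((rows, cols), dtype=dtype)
            for p in range(0, k, block):
                # np.array copies each tile from the file into RAM.
                a_tile = np.array(a[i:i + block, p:p + block])
                b_tile = np.array(b[p:p + block, j:j + block])
                acc += a_tile @ b_tile
            # Write the finished tile back to the output file.
            c[i:i + block, j:j + block] = acc

    c.flush()
    return c


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    rng = np.random.default_rng(0)
    m, k, n = 300, 200, 100
    A = rng.standard_normal((m, k)).astype(np.float32)
    B = rng.standard_normal((k, n)).astype(np.float32)
    a_path = os.path.join(tmp, "a.bin")
    b_path = os.path.join(tmp, "b.bin")
    c_path = os.path.join(tmp, "c.bin")
    A.tofile(a_path)
    B.tofile(b_path)
    C = blocked_matmul_on_disk(a_path, b_path, c_path, m, k, n, block=64)
    print("matches in-memory result:", np.allclose(np.array(C), A @ B, atol=1e-3))
```

The sketch is single-threaded and relies on the operating system's page cache; a real runtime would also need to schedule tile reads and writes and overlap them with multi-threaded compute, which this example does not attempt.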

In this talk, we will look in detail at the BLAS-on-flash API, the design and implementation of the runtime, and the two case studies mentioned above.

Relevant Paper:
Subramanya, Suhas Jayaram, et al. "BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference?" NSDI 2019.