To obtain scalable Bayesian inference methods, we develop algorithms to create compact “summaries” of large quantities of data. We can then quickly run standard inference algorithms on these summaries without needing to look at the whole dataset.

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. However, standard Bayesian inference algorithms are computationally expensive, making their direct application to large datasets difficult or infeasible. In certain models (known as exponential families) we can use sufficient statistics to summarize arbitrary amounts of data with just a finite set of numbers. Most models, however, do not admit a finite set of sufficient statistics, so all of the data must be retained, and inference becomes very slow as datasets grow large.

In this project we are developing algorithms to approximately summarize a dataset in a model-specific way. These compact data summaries can then be used in place of the full dataset when performing Bayesian inference, leading to substantial gains in computational efficiency while only decreasing the accuracy of inferences by a known amount. We are applying our methods to a range of models, from generalized linear models to Bayesian nonparametric ones.
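To make the sufficient-statistics idea concrete, here is a minimal illustrative sketch (not the project's summarization algorithm): for a Bernoulli model with a conjugate Beta prior, an exponential family, the pair (number of observations, number of successes) is a sufficient statistic, so the posterior computed from this two-number summary is identical to the posterior computed from the full dataset.

```python
# Illustrative sketch: sufficient statistics in an exponential family.
# A Bernoulli likelihood with a Beta prior admits an exact two-number
# summary of arbitrarily much data.
import random

random.seed(0)
# Simulated binary dataset of 100,000 observations.
data = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]

# Compact summary: two numbers, regardless of dataset size.
summary = (len(data), sum(data))

def beta_posterior(alpha, beta, n, s):
    """Conjugate update: Beta(alpha, beta) prior, n trials, s successes."""
    return alpha + s, beta + n - s

# Posterior computed from the summary alone...
post_from_summary = beta_posterior(1.0, 1.0, *summary)
# ...matches the posterior computed with a full pass over the data.
post_from_full = beta_posterior(1.0, 1.0, len(data), sum(data))
assert post_from_summary == post_from_full
```

For most models no such exact finite summary exists, which is precisely the gap that approximate, model-specific summaries aim to fill.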