Sampling Sketches for Concave Sublinear Functions of Frequencies
Speaker
Ofir Geri
Stanford
Host
Govind Ramnarayan, Quanquan Liu, Sitan Chen
We consider massive distributed datasets that consist of elements that are key-value pairs. Our goal is to compute estimates of statistics or aggregates over the data, where the contribution of each key is weighted by a function of its frequency (sum of values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning.
A common approach is to maintain a sample of keys and estimate statistics from the sample. Ideally, to obtain low-variance estimates we sample keys with probabilities proportional to their contributions. One simple way to do so is to first aggregate the raw data to produce a table of keys and their frequencies, apply our function to the frequency values, and then apply a weighted sampling scheme. This aggregation however requires data structures of size proportional to the number of distinct keys and is too costly when the number is very large. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies (including log, the moments x^p for 0
A common approach is to maintain a sample of keys and estimate statistics from the sample. Ideally, to obtain low-variance estimates we sample keys with probabilities proportional to their contributions. One simple way to do so is to first aggregate the raw data to produce a table of keys and their frequencies, apply our function to the frequency values, and then apply a weighted sampling scheme. This aggregation however requires data structures of size proportional to the number of distinct keys and is too costly when the number is very large. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies (including log, the moments x^p for 0
Joint work with Edith Cohen.