Efficiently Searching for Distributions
Speaker: Sandeep Silwal, CSAIL, MIT
Host: Behrooz Tahmasebi, CSAIL, MIT
Abstract: How efficiently can we search distributions? The problem is modeled as follows: we are given knowledge of k discrete distributions v_i for 1 <= i <= k over the domain [n] = {1,...,n}, which we can preprocess. Then we get samples from an unknown discrete distribution p, also over [n]. The goal is to output the closest distribution to p among the v_i's in TV distance (up to some small additive error). State-of-the-art sample-efficient algorithms require Theta(log k) samples and run in near-linear time.
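For concreteness, a minimal Python sketch (ours, not from the talk) of the distance used above: the total variation (TV) distance between two discrete distributions on a common finite domain is half the L1 distance between their probability vectors.

    import numpy as np

    def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
        """TV distance between two discrete distributions on the same
        domain: half the L1 distance between their probability vectors."""
        return 0.5 * float(np.abs(p - q).sum())

    # Example over [n] with n = 3.
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(tv_distance(p, q))  # ~0.1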
We introduce a fresh perspective on the problem and ask if we can output the closest distribution in *sublinear* time. This question is particularly well motivated, as the problem generalizes traditional nearest neighbor search: if we take enough samples, we can learn p explicitly up to low TV distance, and then find the closest v_i in o(k) time using standard nearest neighbor search. However, this approach requires Omega(n) samples.
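A hedged sketch of that naive baseline (the helper names are ours, not the paper's): approximate p by the empirical distribution of the samples, then return the closest v_i. The linear scan below takes O(k) time; the o(k) query time mentioned above would come from a standard nearest neighbor data structure instead.

    import numpy as np

    def empirical_distribution(samples: np.ndarray, n: int) -> np.ndarray:
        """Empirical estimate of p from i.i.d. samples over {0, ..., n-1}
        (zero-indexed here). Driving this estimate to low TV error is
        what costs Omega(n) samples."""
        counts = np.bincount(samples, minlength=n)
        return counts / counts.sum()

    def nearest_in_tv(p_hat: np.ndarray, vs: list) -> int:
        """Index of the v_i closest to p_hat in TV distance, by linear scan."""
        dists = [0.5 * float(np.abs(p_hat - v).sum()) for v in vs]
        return int(np.argmin(dists))

    # Usage: k random distributions over [n]; samples drawn from p = v_3.
    rng = np.random.default_rng(0)
    n, k = 100, 10
    vs = [rng.dirichlet(np.ones(n)) for _ in range(k)]
    samples = rng.choice(n, size=5 * n, p=vs[3])
    print(nearest_in_tv(empirical_distribution(samples, n), vs))  # likely 3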
Thus, it is natural to ask: can we obtain both a sublinear number of samples and sublinear query time? We present progress on this question and uncover an interesting statistical-computational trade-off.
This is joint work with Anders Aamand, Alex Andoni, Justin Chen, Piotr Indyk, Shyam Narayanan, and Haike Xu.
Bio: Sandeep is a final-year PhD student at MIT, advised by Piotr Indyk. His interests are broadly in fast algorithm design. Recently, he has been working at the intersection of machine learning and classical algorithms, both designing provable algorithms for ML settings, such as efficient algorithms for processing large datasets, and using ML to inspire algorithm design.