Efficiently Searching for Distributions
Speaker: Sandeep Silwal, CSAIL, MIT
Host: Behrooz Tahmasebi, CSAIL, MIT
Abstract: How efficiently can we search distributions? The problem is modeled as follows: we are given knowledge of k discrete distributions v_i for 1 <= i <= k over the domain [n] = {1,...,n}, which we can preprocess. Then we get samples from an unknown discrete distribution p, also over [n]. The goal is to output the closest distribution to p among the v_i's in TV distance (up to some small additive error). State-of-the-art sample-efficient algorithms require Theta(log k) samples and run in near-linear time.
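For concreteness, a minimal Python sketch (ours, not from the talk) of the distance used above: the total variation (TV) distance between two discrete distributions on a common finite domain is half the L1 distance between their probability vectors.

    import numpy as np

    def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
        """TV distance between two discrete distributions on the same
        domain: half the L1 distance between their probability vectors."""
        return 0.5 * float(np.abs(p - q).sum())

    # Example over [n] with n = 3.
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(tv_distance(p, q))  # ~0.1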
We introduce a fresh perspective on the problem and ask if we can output the closest distribution in *sublinear* time. This question is particularly well motivated, as the problem generalizes traditional nearest neighbor search: if we take enough samples, we can learn p explicitly up to low TV distance, and then find the closest v_i in o(k) time using standard nearest neighbor search. However, this approach requires Omega(n) samples.
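A hedged sketch of that naive baseline (the helper names are ours, not the paper's): approximate p by the empirical distribution of the samples, then return the closest v_i. The linear scan below takes O(k) time; the o(k) query time mentioned above would come from a standard nearest neighbor data structure instead.

    import numpy as np

    def empirical_distribution(samples: np.ndarray, n: int) -> np.ndarray:
        """Empirical estimate of p from i.i.d. samples over {0, ..., n-1}
        (zero-indexed here). Driving this estimate to low TV error is
        what costs Omega(n) samples."""
        counts = np.bincount(samples, minlength=n)
        return counts / counts.sum()

    def nearest_in_tv(p_hat: np.ndarray, vs: list) -> int:
        """Index of the v_i closest to p_hat in TV distance, by linear scan."""
        dists = [0.5 * float(np.abs(p_hat - v).sum()) for v in vs]
        return int(np.argmin(dists))

    # Usage: k random distributions over [n]; samples drawn from p = v_3.
    rng = np.random.default_rng(0)
    n, k = 100, 10
    vs = [rng.dirichlet(np.ones(n)) for _ in range(k)]
    samples = rng.choice(n, size=5 * n, p=vs[3])
    print(nearest_in_tv(empirical_distribution(samples, n), vs))  # likely 3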
Thus, it is natural to ask: can we obtain both a sublinear number of samples and sublinear query time? We present progress on this question and uncover an interesting statistical-computational trade-off.
This is joint work with Anders Aamand, Alex Andoni, Justin Chen, Piotr Indyk, Shyam Narayanan, and Haike Xu.
Bio: Sandeep is a final-year PhD student at MIT, advised by Piotr Indyk. His interests are broadly in fast algorithm design. Recently, he has been working at the intersection of machine learning and classical algorithms, both designing provable algorithms for ML settings, such as efficient algorithms for processing large datasets, and using ML to inspire algorithm design.