From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Speaker

Ellen Vitercik

Stanford University

Host

Costis Daskalakis

CSAIL MIT

Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We verify these findings both theoretically and empirically.
This is joint work with Vaggos Chatziafratis and Ishani Karmarkar.

Add to Calendar 2024-03-19 16:15:00 2024-03-19 17:15:00 America/New_York From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We verify these findings both theoretically and empirically.This is joint work with Vaggos Chatziafratis and Ishani Karmarkar. 32-G449

Organizer & Contact

Joanne Talbot Hanley

joanne@csail.mit.edu

617-253-6054

Part of

Theory of Computation (ToC) Seminar 2024

From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Speaker

Host

March 19 2024

Location

Organizer & Contact

Part of

May 21

Circuit Complexity from Circuit Analysis Algorithms

May 14

Low-Step Multi-Commodity Flow Emulators

From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Speaker

Host

March 19 2024

Location

Organizer & Contact

Part of

Related Events

May 21

Circuit Complexity from Circuit Analysis Algorithms

May 14

Low-Step Multi-Commodity Flow Emulators