New Problems and Perspectives on Learning, Testing, and Sampling in the Small Data Regime

Speaker

Greg Valiant

Host

Constantinos Daskalakis
MIT CSAIL
Abstract

I will discuss several new problems related to the general challenge of understanding what conclusions can be drawn from a dataset that is small relative to the complexity or dimensionality of the underlying distribution from which it is drawn.

In the first setting, we consider the problem of learning a population of Bernoulli (or multinomial) parameters. This is motivated by the "federated learning" setting, in which data comes from a large number of heterogeneous individuals, each of whom supplies only a very modest amount of it; we ask to what extent the number of data sources can compensate for the scarcity of data from each source.

Second, we will discuss the problem of estimating the "learnability" of a dataset: given too little labeled data to train an accurate model, we show that it is often possible to estimate the extent to which a good model exists. Specifically, given labeled pairs (x, y) drawn from some unknown distribution over such pairs, it is possible to estimate how much of the variance of y can be explained by the best linear function of x, even in the regime where it is impossible to approximate that linear function.

Finally, I will introduce the problem of data "amplification": given n independent draws from a distribution D, to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D? Curiously, we show that nontrivial amplification is often possible in the regime where n is too small to learn D to any nontrivial accuracy. We also discuss connections between this setting and the challenge of interpreting the behavior of GANs and other ML/AI systems.

The talk will also highlight a number of concrete, as well as more conceptual, open directions in all three veins.
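
As a rough illustration of the first setting (a sketch of my own, not the estimator from the talk): if each of N coins has an unknown bias p_i but contributes only t flips, the individual biases cannot be pinned down, yet population-level quantities can. For h ~ Binomial(t, p), the quantity C(h, k)/C(t, k) is an unbiased estimator of p^k (for k ≤ t), so averaging across coins recovers moments of the population of biases. The NumPy sketch below, with illustrative numbers (N, t, and the Beta-distributed population are my assumptions), estimates the first two moments and hence the population mean and variance.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

# Illustrative setting: N coins, each with its own bias p_i, and only t flips per coin.
N, t = 100_000, 5
true_p = rng.beta(2.0, 5.0, size=N)      # heterogeneous population of biases (assumed)
heads = rng.binomial(t, true_p)          # t flips per coin

def moment_estimate(heads, t, k):
    """Unbiased estimate of E[p^k] over the population:
    for h ~ Binomial(t, p), E[C(h, k)] = C(t, k) * p^k."""
    counts = np.array([comb(int(h), k) for h in heads], dtype=float)
    return counts.mean() / comb(t, k)

m1 = moment_estimate(heads, t, 1)        # estimates E[p]
m2 = moment_estimate(heads, t, 2)        # estimates E[p^2]
print("estimated mean of biases:    ", m1)
print("estimated variance of biases:", m2 - m1**2)
print("true mean / variance:        ", true_p.mean(), true_p.var())
```

With only t = 5 flips per coin, each individual p_i is poorly determined, but the population mean and variance come out accurately; recovering the full distribution of biases from such information is the harder question the talk addresses.

For the second setting, here is a similarly minimal sketch (again my own illustration, under the strong simplifying assumption of isotropic Gaussian covariates, and not the estimator from the papers): with y = x·β + noise and E[x xᵀ] = I, the cross-terms (y_i x_i)·(y_j x_j) over pairs of independent samples have expectation ‖β‖², so ‖β‖² can be estimated even when n ≪ d and β itself cannot be approximated; the fraction of variance explained is then ‖β‖² / Var(y). The dimensions d and n and the noise level below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative numbers: dimension d far exceeds sample size n, so the regression
# vector beta cannot be learned, yet its squared norm (and hence the explained
# variance) can still be estimated.
d, n = 5000, 500
beta = rng.normal(size=d) / np.sqrt(d)   # signal with ||beta||^2 close to 1
noise_var = 1.0

X = rng.normal(size=(n, d))              # isotropic covariates (simplifying assumption)
y = X @ beta + rng.normal(scale=np.sqrt(noise_var), size=n)

# Unbiased estimate of ||beta||^2: with u_i = y_i * x_i, E[u_i] = beta, so for
# i != j, E[u_i . u_j] = ||beta||^2. Average the dot products over all ordered pairs.
U = y[:, None] * X                       # row i is u_i = y_i * x_i
total = U.sum(axis=0)
signal = (total @ total - np.einsum('ij,ij->', U, U)) / (n * (n - 1))

print("estimated ||beta||^2:", signal)
print("true ||beta||^2:     ", beta @ beta)
print("estimated fraction of variance explained:", signal / y.var())
print("true fraction:       ", (beta @ beta) / (beta @ beta + noise_var))
```

Ordinary least squares is not even well-defined here (n < d), yet the explainable fraction of variance is recovered; the results discussed in the talk handle far more general covariate distributions and sample regimes.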

This work is based on several papers, joint with Weihao Kong, and with Brian Axelrod, Shivam Garg, and Vatsal Sharan.