Statistical inference with computational constraints

Host: Noah Golowich, MIT
Abstract: The vast amount of digital data we create and collect has revolutionized many scientific fields and industrial sectors. However, despite our success in harnessing the transformative power of data, computational and societal trends emerging from current data-science practice necessitate upgrading our toolkit for data analysis.
In this talk, we discuss how practical considerations such as time and memory limits affect statistical inference tasks, focusing on the problem of estimating the entropy of a distribution by streaming over i.i.d. samples from it. We determine how bounded memory affects the number of samples needed to solve this problem.
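Concretely, for a distribution D over an alphabet of size k, the target quantity is the Shannon entropy

    H(D) = -\sum_{x} D(x) \log D(x),

and after a single pass over i.i.d. samples X_1, X_2, \ldots \sim D using a bounded number of memory words, the algorithm must output an estimate \hat{H} with |\hat{H} - H(D)| \le \epsilon with constant probability (the constant-probability success criterion is the usual convention for this problem and is assumed here rather than stated in the abstract).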
Recent work by Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution D over an alphabet of size k to within additive error \epsilon by streaming over (k/\epsilon^3)⋅polylog(1/\epsilon) i.i.d. samples and using only O(1) words of memory. In this work, we propose a new constant-memory scheme that reduces the sample complexity to (k/\epsilon^2)⋅polylog(1/\epsilon). We conjecture that this sample complexity is optimal up to polylog(1/\epsilon) factors.
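As a rough illustration of the constant-memory setting (this is a toy sketch, not the scheme from the paper or from Acharya et al.), the snippet below exploits the identity H(D) = E_{X~D}[-ln D(X)]: it repeatedly draws a "pivot" sample, estimates its probability from its frequency in a short follow-up window of samples, and averages the resulting -ln estimates, keeping only a few counters regardless of the alphabet size k. The function and parameter names (stream_entropy_estimate, reps, window) are illustrative choices; the estimator is biased, and the actual algorithms control such biases carefully.

    import math
    import random

    def stream_entropy_estimate(sample_stream, reps=2000, window=500):
        """Toy constant-memory entropy estimator (illustrative only).

        For each repetition, draw one pivot sample X, estimate D(X) by the
        empirical frequency of X among the next `window` samples, and average
        -ln of that frequency. Only a handful of counters are kept,
        independent of the alphabet size.
        """
        total, used = 0.0, 0
        for _ in range(reps):
            pivot = next(sample_stream)  # X ~ D
            hits = sum(next(sample_stream) == pivot for _ in range(window))
            if hits > 0:  # skip repetitions where the pivot never recurs
                total += -math.log(hits / window)
                used += 1
        return total / max(used, 1)

    def sampler(probs):
        """Endless stream of i.i.d. samples from the given distribution."""
        population = list(range(len(probs)))
        while True:
            yield random.choices(population, weights=probs)[0]

    probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
    true_entropy = -sum(p * math.log(p) for p in probs)
    estimate = stream_entropy_estimate(sampler(probs))
    print(f"true H = {true_entropy:.3f} nats, estimate = {estimate:.3f} nats")

In this toy version each repetition consumes window + 1 fresh samples, so driving the error down to \epsilon this way is far more sample-hungry than the (k/\epsilon^2)⋅polylog(1/\epsilon) bound above; the point is only to show that the memory footprint can stay at a constant number of words while streaming over samples.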
Based on: https://arxiv.org/abs/2205.09804
Joint work with: Andrew McGregor, Jelani Nelson, Erik Waingarten