Cardinality Estimation using Lp norms on Degree Sequences
Speaker
Dan Olteanu
University of Zurich
Host
Nesime Tatbul
Abstract
Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of any query optimizer, and is often the main culprit when the optimizer chooses a poor plan.
In this talk, I will introduce LpBound, a pessimistic cardinality estimator that computes a guaranteed upper bound on the size of the query output. Besides common statistics used for cardinality estimation, such as the sizes of the input relations, the domain sizes for various columns, histograms, and the most common key values in join columns, LpBound also uses Lp norms on the degree sequences of the join columns. The computed cardinality bound is an optimal solution of a linear program whose constraints encode such statistics and Shannon inequalities.
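To make the key statistic concrete: the degree sequence of a join column lists, for each distinct value, how many tuples carry it, and its Lp norms interpolate between two classic statistics (the L1 norm is the column's tuple count; the L-infinity norm is the maximum frequency). The sketch below is an illustrative computation of these norms, not LpBound's implementation; the function name and interface are assumptions.

```python
from collections import Counter

def lp_norms(column_values, ps=(1, 2, float("inf"))):
    """Illustrative sketch: Lp norms of a join column's degree sequence.

    The degree sequence gives, for each distinct value v, the number of
    tuples with v in this column. ||deg||_1 is the column's tuple count;
    ||deg||_inf is the maximum frequency (the classic max-degree statistic);
    intermediate p values interpolate between the two.
    """
    degrees = Counter(column_values).values()
    norms = {}
    for p in ps:
        if p == float("inf"):
            norms[p] = max(degrees)
        else:
            norms[p] = sum(d ** p for d in degrees) ** (1.0 / p)
    return norms

# Example: column with degrees (3, 1) for values "a" and "b".
print(lp_norms(["a", "a", "a", "b"]))
```

Such norms, one per column and per p, become linear constraints (after taking logarithms) in the linear program whose optimum is the cardinality bound.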
I will also report on an ongoing experimental evaluation of LpBound on a variety of queries from the JOB, STATS, and subgraph matching benchmarks: acyclic and cyclic queries, full and group-by queries, and queries with and without range and equality predicates. LpBound's estimation time and memory usage are very close to those of the traditional estimators used by PostgreSQL, DuckDB, and a commercial database system. Yet the range of LpBound's overestimates is much smaller than the range of the traditional estimates, which can both underestimate and overestimate by orders of magnitude. When fed the LpBound cardinalities, PostgreSQL can evaluate queries faster than when using its own estimates.
This is joint work with Mahmoud Abo Khamis (RelationalAI), Chris Mayer (U Zurich), Dan Suciu (U Washington), and Haozhe Zhang (U Zurich).
Bio
Dan Olteanu is a professor at the University of Zurich, where he leads the Data Systems and Theory group (https://www.ifi.uzh.ch/en/dast.html), and a computer scientist at RelationalAI (https://relational.ai), where he works on incremental view maintenance, cardinality estimation, in-database machine learning, and data science projects.