Amoeba: A shape changing storage system for big data
Data partitioning is crucial to improving query performance, and several workload-based partitioning techniques have been proposed in the database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload a priori. Static workload-based data partitioning techniques are therefore not suitable for such settings. To address this problem, we present a distributed storage system called Amoeba. Amoeba uses adaptive multi-attribute data partitioning to improve the data layout as the workload changes. It efficiently supports ad-hoc as well as recurring queries. Amoeba requires zero set-up and tuning effort, allowing analysts to get the benefits of partitioning without providing an upfront query workload. The key idea is to build and maintain a partitioning tree on top of the dataset. The partitioning tree allows us to answer queries with predicates by reading only a subset of the data. Amoeba adapts the tree over time by incrementally repartitioning subtrees based on user queries. A prototype of Amoeba running on top of Apache Spark improves query performance by up to 7x over full scans and up to 2x over range-based partitioning techniques on TPC-H as well as a real-world workload.
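To make the partitioning-tree idea concrete, here is a minimal Python sketch of how a tree over several attributes can narrow a query with range predicates down to a subset of data blocks. It is an illustration under assumptions, not Amoeba's implementation: the names `Leaf`, `Node`, and `blocks_to_read`, the attribute names, and the cut points are all hypothetical, and the adaptive repartitioning step is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    """A leaf stands for one stored data block (hypothetical layout)."""
    block_id: int


@dataclass
class Node:
    """Internal node: rows with row[attr] <= cut go left, all others go right."""
    attr: str
    cut: float
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]
Range = Tuple[float, float]  # inclusive (lo, hi) bounds on one attribute


def blocks_to_read(tree: Tree, predicates: Dict[str, Range]) -> List[int]:
    """Return the ids of blocks that may contain rows matching all predicates."""
    if isinstance(tree, Leaf):
        return [tree.block_id]
    # An attribute without a predicate is unconstrained, so both sides are visited.
    lo, hi = predicates.get(tree.attr, (float("-inf"), float("inf")))
    result: List[int] = []
    if lo <= tree.cut:   # the query range overlaps the left side
        result += blocks_to_read(tree.left, predicates)
    if hi > tree.cut:    # the query range overlaps the right side
        result += blocks_to_read(tree.right, predicates)
    return result


# A small tree that splits on three different attributes (made-up values).
tree = Node("date", 100,
            Node("price", 50, Leaf(0), Leaf(1)),
            Node("qty", 10, Leaf(2), Leaf(3)))

print(blocks_to_read(tree, {"date": (0, 80)}))                     # [0, 1]
print(blocks_to_read(tree, {"date": (0, 80), "price": (60, 70)}))  # [1]
print(blocks_to_read(tree, {}))                                    # [0, 1, 2, 3]
```

A query with no predicate on any tree attribute falls back to a full scan, while each predicate that matches a split attribute prunes the subtrees it cannot overlap. Because different levels split on different attributes, queries on any of those attributes can skip some blocks.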
Last updated Oct 18 '17