Showcase | CMU 15-721 :: Advanced Database Systems (Spring 2024)

Last Updated: May 09, 2024

optd: Next Generation Query Optimizer

Students: Avery Qi, Benjamin Owad, Ritu Pathak
Source Code: https://github.com/cmu-db/optd

This project implements a standalone Cascades-based query optimizer, currently integrated with Apache Arrow DataFusion. Our team integrated a number of additional rules, including projection pushdown, filter pushdown, and unnesting arbitrary queries. We also made improvements to the core of the optimizer, including partial exploration, a multi-pass architecture that includes heuristic-based rules, and physical property support. Additionally, we improved the existing testing infrastructure.
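As a rough illustration of what a filter pushdown rule does, the sketch below rewrites a filter sitting above an inner join so that it runs on the one join input it actually reads. This is not optd's rule API: the `Plan` enum, the `reads` helper, and `push_filter_down` are hypothetical names invented for this example, and the predicate is reduced to "the table whose columns it reads".

```rust
/// Toy logical plan (hypothetical, not optd's representation).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Scan of a named table.
    Scan(String),
    /// Keeps rows matching a predicate that reads only `table`'s columns.
    Filter { table: String, input: Box<Plan> },
    /// Inner join of two inputs (the rewrite below assumes inner-join semantics).
    Join(Box<Plan>, Box<Plan>),
}

/// Returns true if `plan` produces columns of `table`.
fn reads(plan: &Plan, table: &str) -> bool {
    match plan {
        Plan::Scan(t) => t.as_str() == table,
        Plan::Filter { input, .. } => reads(input, table),
        Plan::Join(l, r) => reads(l, table) || reads(r, table),
    }
}

/// Pushes a filter at the root below inner joins, toward the side it reads.
/// Plans that do not match the pattern are returned unchanged.
pub fn push_filter_down(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { table, input } => match *input {
            // Predicate reads only the left input: filter the left side first.
            Plan::Join(l, r) if reads(&l, &table) && !reads(&r, &table) => {
                let pushed = push_filter_down(Plan::Filter { table, input: l });
                Plan::Join(Box::new(pushed), r)
            }
            // Predicate reads only the right input: filter the right side first.
            Plan::Join(l, r) if reads(&r, &table) && !reads(&l, &table) => {
                let pushed = push_filter_down(Plan::Filter { table, input: r });
                Plan::Join(l, Box::new(pushed))
            }
            // Anything else: leave the filter where it is.
            other => Plan::Filter { table, input: Box::new(other) },
        },
        other => other,
    }
}
```

Applied to a filter on table `a` above a join of `a` and `b`, the function returns a join whose left input is the filter over the scan of `a`. A Cascades optimizer would typically register such a rewrite as a transformation rule and let the search decide when to apply it; the direct recursion here is only for brevity.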

Gungnir: Query Optimizer Cost Model

Students: Patrick Wang, David Guo, Alexis Schlomer
Source Code: https://github.com/cmu-db/optd

Gungnir is an open-source Rust cost model based on high-performance parallel sketches, with advanced features like semantic join correlation and adaptive subplan cardinality caching. Our sketches reach a throughput of 10Gb/s, which is more than 100x faster than Postgres. Our cardinality estimates (without adaptivity) have a median Q-Error that is lower than Postgres's by 3.5x on TPC-H SF1, 2.6x on JOB, and 1.8x on JOB-light.
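For readers unfamiliar with the metric, Q-Error is conventionally the factor by which a cardinality estimate deviates from the true cardinality, in either direction. The formula below is the standard definition rather than anything taken from the Gungnir source; the symbols $\hat{c}$ (estimated cardinality) and $c$ (true cardinality) are notation introduced here, and both are assumed to be positive.

```latex
\[
  \text{Q-Error}(\hat{c}, c) \;=\; \max\!\left(\frac{\hat{c}}{c},\; \frac{c}{\hat{c}}\right) \;\ge\; 1
\]
```

Under this definition a perfect estimate has a Q-Error of 1, and overestimating or underestimating by the same factor yields the same value.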

Parpulse: I/O Service for Modern OLAP Database System

Students: Yuanxin Cao, Kunle Li, Lan Lou
Source Code: https://github.com/cmu-db/15721-s24-cache1

The goal of this project is to develop an I/O service for an Online Analytical Processing (OLAP) database system. This service will facilitate communication between the execution engine and remote storage solutions such as Amazon S3. Additionally, a local cache will be incorporated to store recently accessed data on the local disk, thereby accelerating future data retrievals. The I/O service is designed to manage requests from the execution engine and fetch pertinent data (e.g., Parquet files) from either the local cache or remote storage. It will process the data and return a stream of the decoded data to the execution engine. The initial phase of this project aims to construct a fully functional I/O service following the specifications outlined above. Further enhancements, such as kernel bypass and integration of io_uring, may be considered in the future.
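The request path described above (check the local cache, fall back to remote storage, keep the result for later reads) can be sketched as a read-through cache. This is a minimal illustration under stated assumptions, not Parpulse's implementation: `RemoteStore`, `IoService`, and `CountingRemote` are hypothetical names, the "local disk" is an in-memory map, and eviction, Parquet decoding, and streaming are all left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for remote object storage such as Amazon S3.
pub trait RemoteStore {
    /// Returns the object's bytes, or `None` if the key does not exist.
    fn fetch(&mut self, key: &str) -> Option<Vec<u8>>;
}

/// Read-through cache in front of a remote store.
/// The local cache is an in-memory map here; a real service would use disk.
pub struct IoService<R: RemoteStore> {
    remote: R,
    local: HashMap<String, Vec<u8>>,
}

impl<R: RemoteStore> IoService<R> {
    pub fn new(remote: R) -> Self {
        IoService { remote, local: HashMap::new() }
    }

    /// Serves from the local cache when possible; otherwise fetches from the
    /// remote store and keeps a copy for future reads.
    pub fn read(&mut self, key: &str) -> Option<&[u8]> {
        if !self.local.contains_key(key) {
            let bytes = self.remote.fetch(key)?;
            self.local.insert(key.to_string(), bytes);
        }
        self.local.get(key).map(|v| v.as_slice())
    }

    /// Read-only access to the underlying remote store.
    pub fn remote(&self) -> &R {
        &self.remote
    }
}

/// In-memory remote that counts how often it is contacted (illustration only).
pub struct CountingRemote {
    pub objects: HashMap<String, Vec<u8>>,
    pub fetches: usize,
}

impl RemoteStore for CountingRemote {
    fn fetch(&mut self, key: &str) -> Option<Vec<u8>> {
        self.fetches += 1;
        self.objects.get(key).cloned()
    }
}
```

Reading the same key twice through `IoService::read` contacts the remote store only once; the second read is served from the local copy. Misses are not cached in this sketch, so a nonexistent key triggers a remote fetch on every read.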

ISTZIIO

Students: J-How Huang, Shuning Lin, Xintong(Oscar) Zhou
Source Code: https://github.com/cmu-db/15721-s24-cache2

OLAP (Online Analytical Processing) systems are critical for decision-making processes, where speed and efficiency in data handling directly impact business intelligence and outcomes. A significant challenge arises from the reliance on cloud blob storage, such as Amazon S3, which serves as the primary storage in a shared-disk architecture. Although cloud blob storage offers scalability and durability, it comes with significant latency, which can lead to I/O bottlenecks, especially during the intensive read operations inherent in OLAP systems. To mitigate this latency, the ISTZIIO project aims to introduce an I/O service layer running between the computation nodes (which run execution engines) and cloud blob storage, serving as a file cache that is physically closer to the computation nodes. An analogy for this arrangement is a browser, a web cache, and a web server. The service acts as an intermediary between the computation nodes and the cloud storage, reducing data retrieval time, minimizing I/O bottlenecks, and thereby accelerating query execution. To ensure ease of use for the execution engines, a specialized I/O client library will be provided. This library is designed to integrate seamlessly with the execution engines, facilitating efficient interaction with the cloud blob storage proxy without the need for complex configurations or extensive code modifications.

Eggstrain & Async Buffer Pool Manager

Students: Kyle Booker, Sarvesh Tandon, Connor Tsui
Source Code: https://github.com/cmu-db/15721-s24-ee1

This project is a combination of two systems written in Rust: an asynchronous vectorized push-based execution engine called Eggstrain and an asynchronous buffer pool manager built on top of io_uring. Eggstrain is based heavily on DataFusion, an open-source query engine written in Rust. Eggstrain re-implements a subset of DataFusion's operators to use lightweight tasks (similar to coroutines) for I/O and dataflow, switching to heavyweight OS threads for compute-heavy workloads. By relying on the third-party crates tokio and rayon, which implement a high-performance work-stealing asynchronous runtime and a parallel thread pool respectively, we are able to match the in-memory speed of DataFusion. However, an asynchronous execution engine loses its value if the storage layer or buffer pool manager is synchronous (blocking). Thus the second part of this project is building an asynchronous buffer pool manager. The manager is built on top of io_uring, a modern Linux interface for asynchronous I/O operations. Even though the system has no optimizations yet, it still outperforms RocksDB by around 2x-5x depending on the workload.

Push-based Vectorized Execution Engine compatible with Apache DataFusion

Students: Christos Laspias, Hyoungjoo Kim, Yash Kothari
Source Code: https://github.com/cmu-db/15721-s24-ee2

This project is an execution engine for OLAP queries based on Apache DataFusion and Arrow. It uses a push-based vectorized model and a custom hashing strategy for aggregates and equi-joins.
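To make "push-based vectorized" concrete: in a push-based model the producer drives execution by handing batches of rows to its consumer, rather than the consumer pulling rows one at a time. The sketch below shows that control flow with a scan, a filter, and a sum. It is an illustration only and does not reflect this project's operators or the DataFusion API; `Batch`, `Sink`, `Filter`, `SumSink`, and `run_scan` are hypothetical names, and a batch is simplified to a single column of `i64` values.

```rust
/// A batch of values processed together (a single column, for brevity).
pub type Batch = Vec<i64>;

/// In a push-based model, the producer calls `push` on its consumer.
pub trait Sink {
    fn push(&mut self, batch: Batch);
}

/// Keeps the values that satisfy `predicate` and pushes the survivors downstream.
pub struct Filter<S: Sink> {
    pub predicate: fn(i64) -> bool,
    pub next: S,
}

impl<S: Sink> Sink for Filter<S> {
    fn push(&mut self, batch: Batch) {
        let out: Batch = batch
            .into_iter()
            .filter(|v| (self.predicate)(*v))
            .collect();
        // Skip empty batches so downstream operators do no useless work.
        if !out.is_empty() {
            self.next.push(out);
        }
    }
}

/// Terminal operator that maintains a running sum of everything pushed into it.
#[derive(Default)]
pub struct SumSink {
    pub total: i64,
}

impl Sink for SumSink {
    fn push(&mut self, batch: Batch) {
        self.total += batch.iter().sum::<i64>();
    }
}

/// The source drives execution by pushing fixed-size batches into the pipeline.
/// `batch_size` must be non-zero (`chunks` panics on zero).
pub fn run_scan<S: Sink>(data: &[i64], batch_size: usize, sink: &mut S) {
    for chunk in data.chunks(batch_size) {
        sink.push(chunk.to_vec());
    }
}
```

Building a `Filter` that keeps even values in front of a `SumSink` and scanning the values 1 through 6 in batches of four leaves a total of 12 in the sink.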

OLAP System Catalog Written in Rust

Students: Aditya Ajmera, Simran Makhija
Source Code: https://github.com/cmu-db/15721-s24-catalog1

We developed a metadata catalog to help the query optimizer in modern OLAP systems, which need efficient metadata management for analytic queries. Our Rust application uses RocksDB for metadata storage, exposing a REST API for interaction. The architecture includes a data model, database, service, and controller layers, ensuring seamless metadata management. RocksDB's key-value store, chosen for its concurrency and scalability, outperformed SQLite for our needs. Axum powers our REST API for efficient asynchronous operations. Testing ensures the catalog's correctness and performance, with a design focused on optimizing read performance over frequent updates.

GOLATAC: Iceberg-Compatible OLAP Database Catalog

Students: Zilong Zhou, Yen-Ju Wu, Chien-Yu Liu
Source Code: https://github.com/cmu-db/15721-s24-catalog2

This project implements a catalog service in Rust for an OLAP database management system. The catalog manages metadata and provides a centralized repository for storing information about the structure and organization of data within the OLAP database. The goal is a functional catalog that adheres to the Iceberg catalog specification and is exposed through a REST API.

Chronos Scheduler

Students: Aditya Chanana, George Li, Shivang Dalal
Source Code: https://github.com/cmu-db/15721-s24-scheduler1

Chronos is a portable query scheduler/coordinator for a distributed OLAP database written in Rust. Chronos parses a physical query plan generated by the query optimizer, splitting the query into fragments and orchestrating their execution via the execution engine. It provides inter- and intra-query parallelism and incorporates a priority-based scheduling algorithm based on the query priority, cost, and queueing time.
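One way to combine query priority, cost, and queueing time into a single scheduling decision is a weighted score that is re-evaluated each time the scheduler picks work. The sketch below is a hypothetical illustration and not Chronos's algorithm: the `Fragment` struct, the `score` formula, and its weights are all assumptions made for this example.

```rust
/// Hypothetical description of a runnable query fragment.
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub id: u32,
    /// Priority of the owning query; higher is more important.
    pub priority: u32,
    /// Optimizer-estimated cost in arbitrary units.
    pub cost: u64,
    /// Logical time at which the fragment entered the queue.
    pub enqueued_at: u64,
}

/// Illustrative score: higher priority and longer waiting raise it,
/// higher cost lowers it. The weights are arbitrary assumptions.
pub fn score(f: &Fragment, now: u64) -> i64 {
    let waited = now.saturating_sub(f.enqueued_at) as i64;
    100 * f.priority as i64 + waited - (f.cost / 10) as i64
}

/// Removes and returns the fragment with the highest score at time `now`,
/// or `None` if the queue is empty.
pub fn pick_next(queue: &mut Vec<Fragment>, now: u64) -> Option<Fragment> {
    let best = queue
        .iter()
        .enumerate()
        .max_by_key(|(_, f)| score(f, now))
        .map(|(i, _)| i)?;
    Some(queue.swap_remove(best))
}
```

Because the waiting term grows with `now`, a low-priority fragment's score keeps rising while it sits in the queue, which bounds how long higher-priority work can starve it. A linear scan is used instead of a heap because the scores change over time; a production scheduler would likely use a more efficient structure.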

Query Task Scheduler for Lakehouse Systems

Students: Aidan Smith, Makoto Tomokiyo, Mingkang Li
Source Code: https://github.com/cmu-db/15721-s24-scheduler2/

This project implements a pull-based query scheduler service that communicates via gRPC.
