Showcase - CMU 15-721 :: Advanced Database Systems (Spring 2023)

Fixed-Point Decimals

Students: Yuttapichai (Guide) Kerdcharoen, Taoxi (Ross) Jiang, Jekyeom Jeon
Source Code: https://github.com/pnxguide/pgfixeypointy

This project implements pgfixeypointy to support fixed-point decimal as a Postgres UDT. Also documented and optimized CMU-DB's libfixeypointy project.

Implementing and Flattening Nested LATERAL joins in DuckDB

Students: Sam Arch, Arham Chopra, Mayank Baranwal
Source Code: https://github.com/arhamchopra/duckdb

This project adds support for (and flattens) nested LATERAL joins in DuckDB i.e. LATERAL joins inside subqueries and/or subqueries inside of LATERAL joins. Although, nested LATERAL joins are already supported by popular database systems such as Postgres, these LATERAL joins are not flattened (decorrelated) and hence are evaluated using naive correlated evaluation (similar to nested loops). Therefore, we add support for (and flattened) nested LATERAL joins in DuckDB. To achieve this we modified the binder/rewriter to bind LATERAL joins, maintain depth information and performed recursive top-down flattening of subqueries and LATERAL joins. As a result, DuckDB now supports highly efficient nested LATERAL joins, necessary for high-performance execution of inlined UDFs.

Adaptive Query Optimization in PostgreSQL

Students: Jiayin Zheng, Xinyi Jiang, Aolei Zhou
Source Code: https://github.com/Xinyi7/postgres-private

This project focuses on enhancing query optimization in PostgreSQL through adaptive techniques. We introduce a background process that incorporates a KNN based machine learning model. This model learns better cardinality estimation by real-time execution statistics and historical statistics to generate improved query plans. To facilitate this, we employ an adaptive plan switching mechanism that dynamically adjusts the plan during execution. If the newly generated plan exhibits a lower estimated cost than the previous plan, the mechanism stops the ongoing query execution, reinitializes, and executes the new plan. Our scalability tests on the JOB dataset demonstrate a significant 120% improvement over the Postgres baseline in 36.7% of the queries.

PL/pgSQL UDF Compilation in DuckDB

Students: Yuchen Liu, Arvin Wu, Adrian Abedon
Source Code: https://github.com/hkulyc/15721-final-project

This project adds supports for PL/pgSQL (with no SELECT statement) to DuckDB. We create a transpiler that generates C++ code from User Defined Function and compiles it using compilers like clang. Our compiler will also automatically vectorize control flows like conditionals and loops to take full advantage of DuckDB's vectorization framework.

pgextmgrext: An extension that manages Postgres extensions

Students: Abigale Kim, Chi Zhang, Yuchen Liang
Source Code: https://github.com/cmu-db/pgextmgrext

This project implements a Postgres extension that manages other Postgres extensions. We provide a new API to Postgres extension developers that enables them to write extensions with fewer LoC, along with pg_poop, a Postgres extension that converts TEXT/VARCHAR output to poop emojis, which is a proof-of-concept extension that uses our manager's API. Lastly, we provide some analysis of the Postgres extension environment, surveying around 40 extensions and whether they conflict when installed and loaded together.

Efficiently Executing UDFs via Batching

Students: Kai Franz
Source Code: https://github.com/kai-franz/udf

UDFs are executed as a black box invoked row-by-row in most DBMSs. This is inefficient because each invocation incurs costly context switches and the optimizer can't decorrelate queries inside the UDF. Previously this problem has been approached by inlining the UDF: translating it into SQL and substituting it into the calling query. However, inlining inserts costly lateral join operators which are difficult to decorrelate. This project implements batched UDF execution, where the calling query is pushed into the UDF instead of the UDF being pulled into the calling query. Batching reduces the number of context switches and allows the query optimizer to decorrelate queries inside the UDF without relying on using lateral joins. We compare batching to Froid, a state-of-the-art inlining technique. Batching outperforms Froid by up to 63x on SQL Server.

Autonomous ML Pipeline for Postgres

Students: Lichen Jin
Source Code: https://github.com/bluecat1024/PG_ML_Pipeline

Modern autonomous DBMS automatically choose appropriate actions according to the workload in DB instances, to make queries faster and DB cost lower. To evaluate the benefit of potential actions (Tuning knobs or adding indices) on a certain workload, an accurate estimation of cost is critical. State-of-the-art works indicate that learning based approaches do a good job on predicting the potential query performance, but ML based approaches always consist of complex data collection and feature engineering from the DB instance, and also non-trivial model hyperparameter tuning, which brings quite some difficulty to designing new models. Therefore, this project is motivated to lower the trivial labors in building DB cost models, by breaking the ML pipeline down into several decoupled parts. This system uses existing ML libraries in Python daemon service, and is expected to transform different format of data flow flexibly. As a result, users can add new feature sets and modeling with minimal trivial efforts. In addition, everything in feature engineering and model selection can be configured using an xml file, so comparison experiments can be easily conducted across different approaches. For example, the AutoML feature provided by the ML frameworks can also be configured into this universal framework. Following that, a fair QPP task model comparison can be conducted.

Showcase Last Updated: May 11, 2023