The main portion of each student's grade in this course is the final group project. Students will organize into groups of three and choose a project that (1) is relevant to the material discussed in class and (2) requires a significant programming effort from all team members. The projects will vary in both scope and topic, but they must satisfy these criteria. We will discuss this in more depth during class at the beginning of the semester.
Each project consists of four tasks that are due at different times during the semester:
- Proposal: Each group will provide a proposal of their project topic and present it to the class.
- Project Update #1: Each group will provide a brief update to the class about the current status of their project.
- Project Update #2: Each group will provide a brief update to the class about the current status of their project.
- Final Presentation + Code Drop: Each group presents the final version of their project.
At a high level, each project consists of three implementation tasks. The first is the actual implementation of the proposed idea in the DBMS. The second is the set of unit and regression tests that the group will use to check whether their implementation is correct. The final piece is the evaluation of their implementation to determine how well the DBMS performs with it.
Each group will be assigned a single CMU-DB GitHub repository for all development. Everyone will be provided additional resources for testing (via GitHub Actions) and additional Amazon AWS credits.
A project is not considered complete until the instructor has signed off on the submission.
This new cloud-native OLAP research DBMS aims to serve as a vehicle for research on adaptive query execution and query optimization. Following recent trends, we will build the DBMS with a microservices architecture and disaggregated storage. Specifically, each of the five core building blocks -- catalog, optimizer, scheduler, execution engine, and blob cache -- will be a self-contained microservice. An administrator can deploy the microservices on any hardware with a local disk and scale each microservice independently. The microservices will communicate with each other through a pre-negotiated protocol.
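The protocol itself is left to each group to design. As one hypothetical shape, assuming JSON over the wire, an inter-service request envelope might look like the sketch below. All field, method, and type names are invented for illustration and are not part of any real API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ServiceRequest:
    """Hypothetical envelope for one microservice calling another."""
    service: str   # target microservice, e.g. "catalog" or "scheduler"
    method: str    # operation to invoke, e.g. "load_table"
    payload: dict  # method-specific arguments

    def to_wire(self) -> bytes:
        # JSON keeps the protocol language-agnostic across services.
        return json.dumps(asdict(self)).encode()

    @staticmethod
    def from_wire(data: bytes) -> "ServiceRequest":
        return ServiceRequest(**json.loads(data))

# Round-trip a request the way one service would send it to another.
request = ServiceRequest("catalog", "load_table", {"name": "lineitem"})
roundtrip = ServiceRequest.from_wire(request.to_wire())
```

Because the envelope is plain JSON, each group can layer it over whatever transport (HTTP, gRPC, raw sockets) they prefer.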
While each component can assume access to a local disk for temporary files, actual data is persisted remotely in cloud blob storage (e.g., Amazon S3) as Apache Parquet files. As a simplifying assumption, the DBMS will not support transactions, concurrency control, altering a table's schema online, or modifying data at the granularity of individual tuples. Instead, the DBMS will bulk load data (i.e., through LOAD semantics) or directly install the location of a table's data into the catalog.
Before describing each building block in detail, we briefly describe how they interact. The DBMS builds upon the frontend exposed by Apache DataFusion to parse the input SQL query. The DBMS loads relevant metadata from the catalog to resolve and bind the SQL query into a logical plan. The optimizer, guided by its cost model, performs expression, logical, and physical optimization to produce a DataFusion-compatible physical query plan. The query fragment-based scheduler receives the query plan and orchestrates multiple execution nodes to execute it. An execution node has a local instantiation of the execution engine, which executes a query plan fragment by reading Parquet files from the blob store through the I/O service or by receiving input tuples from another execution node. The root execution node returns the output to the user through the scheduler. The sections below describe each building block in more detail.
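The flow above can be traced end to end with one placeholder function per building block. This is only a toy sketch to fix the order of operations in your head; none of these functions are real DataFusion or optd APIs, and the result set is faked:

```python
def parse_and_bind(sql, catalog):
    # The DataFusion frontend parses the SQL text; catalog metadata
    # resolves table and column references into a logical plan.
    return {"op": "logical_scan", "table": "t",
            "schema": catalog["t"], "filter": "a <= 3"}

def optimize(logical_plan):
    # The optimizer picks a physical implementation for each operator.
    physical = dict(logical_plan)
    physical["op"] = "parquet_scan"
    return physical

def schedule(physical_plan):
    # The scheduler cuts the plan into fragments and assigns each one
    # to an execution node; a single-node plan is a single fragment.
    return [physical_plan]

def execute(fragments):
    # Each execution node runs its fragment, reading Parquet files via
    # the blob cache; here we simply fabricate a result set.
    return [(1,), (2,), (3,)]

catalog = {"t": ["a"]}
result = execute(schedule(optimize(
    parse_and_bind("SELECT a FROM t WHERE a <= 3", catalog))))
```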
Topic #1: Scheduler
The scheduler receives Apache DataFusion query plans as input. Remember that the scheduler orchestrates the execution of query fragments across the available execution nodes and shepherds the final result set back to the user. The scheduler must break the query plan into fragments and rewrite portions of them to instruct the execution engine to read tuples from, or output tuples to, another machine. Alternatively, the scheduler can require the optimizer group to provide already-constructed query fragments, in which case the scheduler only needs to add routing information.
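One way to picture the fragmenting step: cut the plan tree at every exchange boundary and replace the removed subtree with a marker telling the engine to read its tuples from the network. The sketch below uses toy dictionary plan nodes with invented `exchange`/`remote_read` names, not DataFusion's actual operators:

```python
def split_fragment(node, fragments):
    """Recursively copy the plan; each subtree under an 'exchange' node
    becomes its own fragment, replaced by a 'remote_read' marker."""
    new_children = []
    for child in node.get("children", []):
        if child["op"] == "exchange":
            sub_root = split_fragment(child["children"][0], fragments)
            fragments.append(sub_root)
            new_children.append(
                {"op": "remote_read", "fragment": len(fragments) - 1})
        else:
            new_children.append(split_fragment(child, fragments))
    out = dict(node)
    if new_children:
        out["children"] = new_children
    return out

def fragment_plan(root):
    fragments = []
    fragments.append(split_fragment(root, fragments))  # root fragment last
    return fragments

# A join whose probe side arrives over the network from another node.
plan = {"op": "hash_join", "children": [
    {"op": "scan", "table": "small"},
    {"op": "exchange", "children": [{"op": "scan", "table": "big"}]},
]}
fragments = fragment_plan(plan)
```

The scheduler would then attach routing information (which node runs which fragment, and where `remote_read` should connect) before dispatching.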
Topic #2: Execution Engine
The execution engine accepts as input DataFusion-compatible query plans, potentially with some alterations to read tuples from another node, and executes the plan to produce the required output. The engine should fetch all data from the blob store through the blob cache microservice. The execution engine should also implement the ability to spill to disk for any operation whose working set might exceed the instance's available memory.
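Spilling can be illustrated with an external merge sort: when the in-memory buffer fills, write a sorted run to a temporary file on local disk, then k-way merge the runs. A minimal sketch, with the simplifying assumption that `memory_limit` counts values rather than bytes:

```python
import heapq
import os
import tempfile

def spill_run(run):
    """Write one sorted run to a temporary file, one value per line."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for v in run:
            f.write(f"{v}\n")
    return path

def read_run(path):
    """Stream a spilled run back from disk lazily."""
    with open(path) as f:
        for line in f:
            yield int(line)

def external_sort(values, memory_limit=4):
    """Sort more values than fit in (simulated) memory by spilling
    sorted runs to local disk and then merging them."""
    run_files, buf = [], []
    for v in values:
        buf.append(v)
        if len(buf) >= memory_limit:
            run_files.append(spill_run(sorted(buf)))
            buf = []
    if buf:
        run_files.append(spill_run(sorted(buf)))
    merged = list(heapq.merge(*(read_run(p) for p in run_files)))
    for path in run_files:
        os.remove(path)
    return merged

sorted_vals = external_sort([5, 3, 8, 1, 9, 2, 7, 4, 6], memory_limit=3)
```

Hash joins and aggregations spill analogously: partition the input to disk when the hash table outgrows memory, then process partitions one at a time.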
Topic #3: Catalog Service
The DBMS will expose a catalog interface following the Iceberg Catalog REST interface. The REST interface provides a uniform approach for external services to extract metadata from the DBMS and a mechanism for the other building blocks to fetch/store metadata as needed. This project aims to produce a functional catalog that adheres to the Iceberg catalog specification. Note that the catalog should persist across restarts.
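The persistence requirement can be made concrete with a toy catalog that survives restarts by flushing its state to a JSON file. This is only a stand-in: the real project must implement the Iceberg REST catalog endpoints, and `MiniCatalog`, its methods, and the example paths are all invented:

```python
import json
import os
import tempfile

class MiniCatalog:
    """Toy catalog: stores table metadata and survives restarts by
    persisting everything to a JSON file on local disk."""

    def __init__(self, path):
        self.path = path
        self.tables = {}
        if os.path.exists(path):  # reload state after a restart
            with open(path) as f:
                self.tables = json.load(f)

    def create_table(self, name, schema, location):
        self.tables[name] = {"schema": schema, "location": location}
        self._flush()

    def load_table(self, name):
        return self.tables[name]

    def _flush(self):
        # Write-then-rename so a crash mid-write cannot corrupt the file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.tables, f)
        os.replace(tmp, self.path)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "catalog.json")
catalog = MiniCatalog(path)
catalog.create_table("lineitem", {"l_orderkey": "bigint"},
                     "s3://bucket/lineitem/")
restarted = MiniCatalog(path)  # simulate a service restart
meta = restarted.load_table("lineitem")
```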
Topic #4: I/O Service & Distributed Cache
This component provides an interface between the execution engine and storage (e.g., a cloud object store like Amazon S3). The service handles I/O requests made by the execution engine for a specific object. Based on the caching policy, the cache will retain a copy of the data on local disk to service future requests for the same blob.
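As one concrete example of a caching policy, the sketch below implements a blob cache with LRU eviction. For brevity it keeps blobs in memory; the real service would keep them on local disk, and `fetch_remote` here stands in for an S3 GET:

```python
from collections import OrderedDict

class BlobCache:
    """Sketch of an LRU blob cache sitting in front of remote storage."""

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote  # stand-in for an S3 GET
        self.cache = OrderedDict()        # blob key -> bytes
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        data = self.fetch_remote(key)
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return data

remote = {"a": b"1", "b": b"2", "c": b"3"}  # pretend blob store
cache = BlobCache(capacity=2, fetch_remote=remote.__getitem__)
for key in ("a", "b", "a", "c"):  # the repeated "a" is the only hit
    cache.get(key)
```

Groups are free to explore other policies (e.g., LFU or admission-based schemes); LRU is just the simplest reasonable baseline.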
Topic #5: Optimizer
The query optimizer performs a cost-based search to convert a logical query plan into an executable physical query plan. You will build upon the cmu-db/optd framework. The query optimizer receives a logical plan from Apache DataFusion and must output a compatible DataFusion plan. Compatible means that we can take the query plan output by your optimizer and execute it with DataFusion. There are three broad sub-projects within this category. Each group is expected to work on a separate sub-project.
The cost model estimates the cost of executing a given query plan. Historically, cost models have been heuristic-based, relying on magic constants and statistics to estimate the work a given query plan operator will perform. Recent research has pushed toward adaptive cost models that incorporate live statistics from query execution to refine the cost estimates for future invocations. The goals for this sub-project are two-fold: (1) create a functional cost model from statistics in the catalog and (2) incorporate adaptivity from live statistics.
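To make the two goals concrete, the sketch below shows a heuristic scan-cost formula driven by catalog statistics, plus a tiny adaptive refinement that blends a static cardinality estimate with a live observation. All constants and names are invented illustrations, not a prescribed design:

```python
# "Magic constants" of the kind the text mentions, chosen arbitrarily.
SEQ_PAGE_COST = 1.0    # I/O cost to read one page sequentially
CPU_TUPLE_COST = 0.01  # CPU cost to process one tuple

def scan_cost(stats, table):
    """Goal (1): classic heuristic cost from catalog statistics --
    pages read plus per-tuple CPU work."""
    rows, pages = stats[table]
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def refine_estimate(old_estimate, observed, alpha=0.5):
    """Goal (2): adaptivity -- blend the static estimate with a live
    statistic observed during execution (exponential moving average)."""
    return (1 - alpha) * old_estimate + alpha * observed

stats = {"orders": (10_000, 100)}  # table -> (row count, page count)
static_cost = scan_cost(stats, "orders")
refined_rows = refine_estimate(10_000, 2_000)  # execution saw far fewer rows
```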
Expression optimization focuses on optimizing the expressions that appear in the query plan. For instance, a scan node can have a filter `a <= 1 + 2`, which can be optimized to the equivalent filter `a <= 3`. More exotic optimizations can leverage information about the data or a partition of the data to eliminate the predicate. This sub-project aims to extend the optd framework to support evaluating and optimizing expressions.
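The `a <= 1 + 2` example can be captured in a few lines of constant folding over a toy expression tree. Real expression optimization in optd would operate on DataFusion expression nodes; the tuple encoding here is invented for illustration:

```python
def fold(expr):
    """Expressions are ('col', name), ('lit', value), or (op, lhs, rhs).
    Recursively replace any arithmetic operator whose children are both
    literals with the computed literal."""
    if expr[0] in ("col", "lit"):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if lhs[0] == "lit" and rhs[0] == "lit":
        arith = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        if op in arith:
            return ("lit", arith[op](lhs[1], rhs[1]))
    return (op, lhs, rhs)

# The filter a <= 1 + 2 folds to the equivalent filter a <= 3.
pred = ("<=", ("col", "a"), ("+", ("lit", 1), ("lit", 2)))
folded = fold(pred)
```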
Logical and Physical Optimization
Logical optimization focuses on rewriting queries while preserving the query's meaning. These rewrites are primarily rule-based. A simple rule could flip a join operator's left and right children. Another rule can push a filter through a join operator down to a scan operator (i.e., predicate pushdown). A more complex rule can rewrite a subquery into a join operator. Naturally, logical optimization does not produce executable, DataFusion-compatible query plans. Physical optimization focuses on selecting the particular executable implementation for a given logical operator. For instance, an optimizer could implement a join operator with a nested-loop join, a hash join, or a broadcast join. This sub-project aims to augment optd to support a richer class of rewrites and a principled mechanism for applying rules across a large search space.
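Predicate pushdown, as a rule, can be sketched as a pattern match on Filter(Join(left, right)) that moves the filter onto the side its predicate references. The dictionary plan nodes and field names below are invented for illustration, not optd's or DataFusion's representation:

```python
def push_down_filter(plan):
    """Match Filter(Join(left, right)) and push the filter onto the
    child whose table the predicate references; otherwise return the
    plan unchanged."""
    if plan["op"] != "filter":
        return plan
    child = plan["child"]
    if child["op"] != "join":
        return plan
    pred_table = plan["pred"]["table"]
    new_join = dict(child)
    for side in ("left", "right"):
        if child[side]["table"] == pred_table:
            # Wrap only the matching side in the filter.
            new_join[side] = {"op": "filter", "pred": plan["pred"],
                              "child": child[side]}
            return new_join
    return plan

plan = {"op": "filter", "pred": {"table": "t1", "expr": "a <= 3"},
        "child": {"op": "join",
                  "left": {"op": "scan", "table": "t1"},
                  "right": {"op": "scan", "table": "t2"}}}
rewritten = push_down_filter(plan)
```

A real rule engine would apply many such rewrites across a memoized search space rather than a single top-down pass; this only shows the shape of one rule.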
- Everyone has to work in a team of three people for this assignment.
- Groups are allowed, and strongly encouraged, to discuss the details of their projects with others.