Project #1 - Foreign Data Wrapper


The first programming project will teach you how to extend PostgreSQL with support for columnar storage. Specifically, you are tasked with implementing a foreign data wrapper, packaged as a PostgreSQL extension, that can read our custom db721 file format.

Conceptually, there are three components to this project:

  1. Foreign data wrappers: handling access to remote objects from SQL databases
  2. PostgreSQL extensions: extending PostgreSQL with new capabilities
  3. db721: our custom columnar storage format

By the end of this project, you should:

  • Know how to write basic extensions for PostgreSQL.
  • Know how to write simple parsers for a custom columnar storage format.
  • Understand the idea of foreign data as defined by the SQL/MED specification, and know how to write a foreign data wrapper for PostgreSQL.
  • Understand and apply various kinds of OLAP optimizations (e.g., predicate pushdown, projection pushdown), interfacing directly with the PostgreSQL optimizer.
  • Appreciate the performance difference between row-oriented and column-oriented access.

You will be working directly on our fork of the PostgreSQL v15 source code. We provide starter code for C and/or C++17 using PGXS. We are open to the use of other languages (e.g., Rust has pgx), but please give us a heads up and expect minimal support if you decide to go this route. Your final deliverable is a PostgreSQL extension db721_fdw that we can install and run.

This is a project with many moving parts and you will need to write a lot of code. In our opinion, although the code is not conceptually complicated, you will write more code than you did for any of your 15-445/645 projects -- it's comparable to one of the lighter OS projects at CMU. We strongly advise you to start early!

This is a single-person project that will be completed individually (i.e., no groups).

  • Release Date: Jan 25, 2023
  • Due Date: Feb 26, 2023 @ 11:59pm


Conceptually, you should understand the contents of the Storage Models, Data Layout, & System Catalogs lecture before you start this project. The following are some guiding questions that you should consider as you begin the development of this project:

  • What is the difference between DSM and NSM?
  • What is the difference between OLTP and OLAP workloads?
  • What are the advantages and disadvantages of DSM?

PostgreSQL Extensions

The authoritative source on PostgreSQL extensions is the official documentation. You should at least skim 38.10 (defining your own functions in C), 38.17 (packaging and installing extensions), and 38.18 (extension building infrastructure).

Using the PostgreSQL extension ecosystem, you can add new types, operators, functions, and more. For this project, you should understand the hook mechanism that enables some of this extensibility.

PostgreSQL Hooks

Hooks are essentially function pointers strategically placed around the PostgreSQL source code. Extensions can set these hooks at run time. For example, ExecutorStart invokes ExecutorStart_hook if one is installed, falling back to standard_ExecutorStart otherwise:

// /src/include/executor/executor.h
// Hook type defined here:
typedef void (*ExecutorStart_hook_type) (QueryDesc *queryDesc, int eflags);

// /src/backend/executor/execMain.c
// Hook defined here:
ExecutorStart_hook_type ExecutorStart_hook = NULL;

// Hook used here:
void ExecutorStart(QueryDesc *queryDesc, int eflags) {
    // ...
    if (ExecutorStart_hook)
        (*ExecutorStart_hook) (queryDesc, eflags);
    else
        standard_ExecutorStart(queryDesc, eflags);
}

Notice that the design of hooks implies that only one extension can have its hook loaded at a time. The way that people get around this is by "stacking" extensions on top of each other,

  • newest extension on top
  • older extension below it
  • oldest extension at the bottom

where each extension is supposed to play nice and invoke the hook installed by the extension directly below it. As you can imagine, this creates composability and incompatibility challenges (e.g., extension A must be installed before extension B, extension C cannot be installed with extension D, extension E just breaks the chain entirely). You don't need to worry about this for project 1 as your extension will be the only extension that we load.
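The chaining convention can be sketched as follows — shown in Python for brevity (a real extension does this in C, saving the previous hook pointer in _PG_init); names like install_extension are illustrative and are not PostgreSQL API:

```python
# Function-pointer hook, as in ExecutorStart_hook (None == not installed).
executor_start_hook = None

def standard_executor_start(query, log):
    log.append("standard_ExecutorStart")

def executor_start(query, log):
    """Mirrors ExecutorStart(): run the hook if installed, else the default."""
    if executor_start_hook is not None:
        executor_start_hook(query, log)
    else:
        standard_executor_start(query, log)

def install_extension(name):
    """Each extension saves the previous hook and chains to it ("plays nice")."""
    global executor_start_hook
    prev = executor_start_hook
    def my_hook(query, log):
        log.append(name)                         # this extension's work
        if prev is not None:
            prev(query, log)                     # call the next hook down
        else:
            standard_executor_start(query, log)  # bottom of the stack
    executor_start_hook = my_hook

install_extension("ext_a")   # older extension, ends up at the bottom
install_extension("ext_b")   # newest extension, ends up on top
log = []
executor_start("SELECT 1", log)
print(log)  # ['ext_b', 'ext_a', 'standard_ExecutorStart']
```

If any extension in the chain forgets to call the saved previous hook, every extension below it silently stops running — which is exactly the composability hazard described above.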

For this project, you should not need to add new hooks. Instead, you will use the existing hooks for foreign data wrappers. You will probably use hooks extensively in your future group projects. Hooks are relatively undocumented in the official PostgreSQL manual, but third-party resources exist; you are not required to read them.

Example Extensions

Extensions are powerful. You can build a whole new DBMS on top of PostgreSQL with them, for example:

  • TimescaleDB is an open-source SQL database designed for scalable time-series data that is packaged as a PostgreSQL extension.
  • Citus is an open-source distributed, shared-nothing DBMS that is packaged as a PostgreSQL extension.

But for this project, we're keeping things simple. PostgreSQL ships with a variety of extensions that you may find useful as a reference. For this project, some helpful extensions to study are:

  • postgres_fdw
    • Access data stored in external PostgreSQL servers
    • Written in C
    • Ships with PostgreSQL, manual (F38)
    • In a sense, this is the "reference" for writing foreign data wrappers
  • cstore_fdw
    • Columnar storage inspired by the ORC (Optimized Row Columnar) format
    • Written in C
    • Note that Citus has since integrated cstore_fdw into their Citus extension and improved it with more features
  • parquet_fdw
    • An experimental wrapper for reading Parquet files
    • Being experimental, it may have bugs
    • Written in C++

Background: Foreign Data Wrappers

The SQL/MED ("Management of External Data") standard calls for extensions to SQL that allow you to access foreign data that lives outside your DBMS. For example, you may want to query a different file format (e.g., CSV, Parquet), different instance of the same DBMS (e.g., PostgreSQL on a different machine), or access remote data (e.g., S3).

PostgreSQL implements portions of the SQL/MED specification as documented in the manual (5.12). Foreign data is accessed through a foreign data wrapper (FDW). A FDW is a library that abstracts over connecting to the external data and obtaining data from it. A non-exhaustive list of some FDWs can be found in the PostgreSQL wiki.

In this project, we are asking you to write a foreign data wrapper for our custom db721 format. You will need to implement the foreign data wrapper hooks that are described in the manual (59).

Background: db721

db721 is our custom columnar file format. Each table is stored with its metadata in its own .db721 file. The format is simple, consisting of only three components:

[Raw Data] [JSON Metadata] [JSON Metadata Size (4 bytes)]

To read this format, first you read the last 4 bytes to figure out how big the JSON metadata is. Then, you read the JSON metadata, which describes how to access the raw data:

metadata["Table"]: the table name (string)

metadata["Max Values Per Block"]: the maximum number of values in each block (int)

metadata["Columns"]: the table's columns (JSON dict)
    Keys: column names (string)
    Values: column data, see below (JSON dict)

metadata["Columns"]["Column Name"]: column data (JSON dict)
    "type": the column type (str), possible values are:
        "float" | "int" | "str"

    "start_offset": the offset in the file for the first block of this column (int)

    "num_blocks": the number of blocks for this column (int)

    "block_stats": statistics for the 0-indexed fixed-size blocks (JSON dict)
        Keys: block index (string)
        Values: statistics for the corresponding block (JSON dict)
            "num": the number of values in this block (int)
            "min": the minimum value in this block (same type as column)
            "max": the maximum value in this block (same type as column)
            "min_len": only exists for str column; the min length of a string in this block (int)
            "max_len": only exists for str column; the max length of a string in this block (int)

Within [Raw Data], the way values are written depends on their type.
    "float": written as 4-byte floats, [ float1 float2 float3 ... ]
    "int": written as 4-byte integers, [ int1 int2 int3 ... ]
    "str": written as 32-byte fixed-length, null-terminated ASCII strings
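For example, decoding a single "str" cell can be sketched like this (Python for illustration; your FDW will do the equivalent in C/C++):

```python
# A "str" value occupies exactly 32 bytes; the logical string ends at the
# first NUL byte, and any remaining bytes are padding.
def decode_str(cell: bytes) -> str:
    assert len(cell) == 32, "db721 strings are fixed 32-byte cells"
    return cell.split(b"\x00", 1)[0].decode("ascii")

print(decode_str(b"CHICKEN" + b"\x00" * 25))  # CHICKEN
```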

Note that:

  • The metadata is written at the end of the file.
    • (Optional) Parquet also does this. Think about why; what does this optimize for?
  • Column data is stored as a contiguous array of fixed-size blocks.
    • NOTE: for 2023's project 1, every block is guaranteed to pack "Max Values Per Block" values per block, except for the last block which has the remainder.
  • The metadata contains block statistics to help you to decide whether you should read a block.

We encourage you to view a hex dump of the .db721 files using tools like xxd.
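Putting the pieces together, here is a sketch of a reader for the layout above, in Python for illustration. It builds a tiny one-column file in memory and reads it back; little-endian byte order for the 4-byte size and values is an assumption here — verify it against the provided generator:

```python
import io
import json
import struct

# Build a tiny single-column db721 file in memory: an "int" column, 3 values,
# "Max Values Per Block" = 2, so block 0 holds 2 values and block 1 the remainder.
values = [7, 42, 9]
raw = b"".join(struct.pack("<i", v) for v in values)   # 4-byte integers
meta = {
    "Table": "demo",
    "Max Values Per Block": 2,
    "Columns": {
        "id": {
            "type": "int",
            "start_offset": 0,
            "num_blocks": 2,
            "block_stats": {
                "0": {"num": 2, "min": 7, "max": 42},
                "1": {"num": 1, "min": 9, "max": 9},
            },
        }
    },
}
meta_bytes = json.dumps(meta).encode()
f = io.BytesIO(raw + meta_bytes + struct.pack("<I", len(meta_bytes)))

# Read it back, exactly as described above.
f.seek(-4, io.SEEK_END)
(meta_size,) = struct.unpack("<I", f.read(4))          # last 4 bytes: metadata size
f.seek(-4 - meta_size, io.SEEK_END)
metadata = json.loads(f.read(meta_size))               # JSON metadata
col = metadata["Columns"]["id"]
f.seek(col["start_offset"])                            # first block of this column
n = sum(s["num"] for s in col["block_stats"].values())
decoded = [struct.unpack("<i", f.read(4))[0] for _ in range(n)]
print(decoded)  # [7, 42, 9]
```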

Note that this format has many shortcomings, but it is enough to allow for basic optimizations. For example, you can choose to only access the columns that you need, and you can choose to skip reading blocks if the min/max ranges are filtered by predicates. We encourage you to think about tradeoffs and improvements as you write code for this project, and we will ask you to document some of these thoughts in a .md file with your submission.

We strongly encourage you to check out some real-world columnar formats (e.g., Parquet, ORC):


You must use the 2023-S721-P1 branch for this project.

Note: we try to reduce the tedious parts by providing you with helper scripts, but we expect you to be able to read, understand, debug, and modify these scripts.

Development Environment

We expect you to have basic familiarity with git and setting up environments from 15-445/645. Officially, we only support Ubuntu 22.04 (LTS). Use other environments at your own peril; since this is an advanced graduate course, we assume that you know what you're doing. Your code must work in a stock Ubuntu 22.04 (LTS) environment.

Note that PostgreSQL development often involves juggling multiple terminals. Whatever you use, try to make sure that you can have two terminals visible at the same time; it will make your life much easier. We personally like tmux.

You may also want to use the PostgreSQL doxygen to help you locate definitions (if you're using an IDE, the IDE navigation may or may not work).

  1. Fork our PostgreSQL repository on GitHub.
  2. Clone your fork:
    git clone
  3. Make sure that you switch to the 2023-S721-P1 branch:
    git checkout 2023-S721-P1
  4. Install the packages necessary for compiling PostgreSQL. You can use the PostgreSQL manual, PostgreSQL wiki, or try our script:
    sudo ./cmudb/env/
  5. From Terminal 1, compile and run the PostgreSQL server (you should read the output!):
  6. From Terminal 2, run the PostgreSQL client (you should read the output!):
  7. From the client, you should run and see:
    noisepage_db=# SELECT * FROM db721_farm;
    ERROR:  could not devise a query plan for the given query

Once you complete these steps, you are ready to start working on your project!


We suggest proceeding in this order:

  1. Generate db721 files
  2. Write a db721 parser
  3. Implement SELECT * for the FDW (commit when this works!)
  4. Implement support for WHERE clauses for the FDW (commit when this works!)
  5. Implement the required optimizations (commit before every new optimization!)
  6. Write the .md file

Generate db721 files

In ./cmudb/extensions/db721_fdw/, we've provided:

  • A dataset generator for the ChickenFarm benchmark
    • Produces: data-chickens.db721, data-farms.db721, data-chickens.csv, data-farms.csv
  • chicken_farm_schema.sql: helper script to set up the ChickenFarm schema and db721_fdw
    • Regular tables: farm, chicken
    • FDW tables: db721_chicken, db721_farm
    • The contents of the regular and FDW tables should be the same; this setup lets you easily run performance comparisons

You should modify the generator's scenarios to come up with different test cases. Grading will be done with ChickenFarm, but feel free to write your own benchmark for tests.

Required Optimizations

You must implement the following optimizations; otherwise, you will get a zero:

  • Predicate pushdown: skip accessing blocks if the block statistics don't pass the predicate
    • To simplify this project, it is ok if your code only works for predicates of the form "column_name OP constant" and "constant OP column_name", and it is ok if this is a hack
    • Your extension must remove the responsibility of checking these predicates from the PostgreSQL executor (i.e., you must modify the qpquals, and your FDW is responsible for only returning tuples that satisfy these pushed-down predicates)
  • Projection pushdown: only access the columns that the query requests

Your code may be manually reviewed to check that you are doing these optimizations. And of course, feel free to implement more optimizations!
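As a sketch of the block-skipping test behind predicate pushdown (Python for illustration; may_contain_matches is a hypothetical helper, not part of the starter code):

```python
def may_contain_matches(stats, op, constant):
    """Can a block with these min/max stats contain a row where `col OP constant`?"""
    lo, hi = stats["min"], stats["max"]
    if op == "=":
        return lo <= constant <= hi
    if op == "<":
        return lo < constant
    if op == "<=":
        return lo <= constant
    if op == ">":
        return hi > constant
    if op == ">=":
        return hi >= constant
    return True  # unknown operator: be conservative and read the block

block_stats = {
    "0": {"num": 2, "min": 7,  "max": 42},
    "1": {"num": 2, "min": 50, "max": 90},
}
# For the predicate "col > 45", block 0 (max 42) can be skipped entirely.
to_read = [i for i, s in block_stats.items() if may_contain_matches(s, ">", 45)]
print(to_read)  # ['1']
```

Note that passing this test does not mean every value in the block matches — after skipping, you still have to evaluate the predicate against the individual values you return.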


In the root of your submission, please include a .md file that describes:

  • Some strengths of the file format.
  • Some weaknesses of the file format.
  • Suggestions for improving the file format.
  • (Optional) Feedback on this project (complexity, difficulty, duration, suggestions for future iterations).

You don't need to write much. Treat it like a reading review.

Grading Rubric

Each project submission will be graded in two phases.


  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission implement the required optimizations?
  3. Does the submission cause PostgreSQL to leak memory or crash?

Your implementation must pass this portion of the evaluation. If you fail any of these checks, then the submission is scored as zero. No partial credit will be given.

We will evaluate you on the ChickenFarm benchmark with hidden new scenarios.


If your submission satisfies the correctness checks, then your points will further be adjusted based on the difference in performance (latency) between your implementation and the TA's reference implementation.

Specifically, your final grade is based on how much faster/slower you are than our reference implementation:

  Difference    Your Grade
  > 120%        110%
  110-119%      100%
  100-109%      90%
  90-99%        80%
  80-89%        70%
  70-79%        60%
  < 70%         0%

Late Policy

10% will be deducted from the final score of the project for every 24-hour period that the assignment is late.

Only in extreme circumstances (e.g., medical emergencies) will no-penalty extensions be granted. The student is required to provide written documentation (e.g., from their advisor or from University Health Services). Please contact the instructor if you have any questions.


Comment out or remove log statements in your submission; otherwise, Gradescope will crash.

If you are using PGXS:

  1. Go to your extension folder:
    cd cmudb/extensions/db721_fdw
  2. Zip the files (don't forget the .md file):
    zip -r *
  3. Submit that zip to Gradescope.

Otherwise, if you are not using PGXS, you must zip up a Makefile that supports:

  1. make install -- this must install your extension's files into the PostgreSQL installation.
  2. make clean -- this doesn't have to do anything.

Either way, you must submit your source code and any special compilation instructions.

You can submit your zip to Gradescope. There is no submission limit. Your final score will only be adjusted in exceptional cases (e.g., did not do the optimizations).

Important: Use the Gradescope course code announced on Piazza.

There is a Gradescope submission site available to non-CMU students (Entry Code: BBY8VB).

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.