Project #2 - Concurrent Index - CMU 15-721 :: Advanced Database Systems (Spring 2020)

Overview

For this project, you are required to implement an in-memory B+Tree in CMU's yet-to-be-named DBMS. The primary goal is to understand the nuances of concurrency control in a modern, order-preserving indexing data structure.

Here's a list of useful references on in-memory data structures:

IMPORTANT: This is an open-ended assignment. There are several design choices to be made while implementing a concurrent data structure. Do not expect us to guide you in making every one of these design choices.

This is a group project. Each group must have exactly three members unless given prior authorization by the instructor.

Release Date: Feb 17, 2020
Due Date: Mar 15, 2020 @ 11:59pm

Implementation Details

In this assignment, you will need to modify at most these five files:

IMPORTANT: The "Data Structure" files are the actual B+Tree implementation. The "Index" files are a wrapper that implements our DBMS's index API. The index wrapper is mostly implemented for you already. The logic for performing visibility checks in the DataTable, deferring actions for correct visibility, etc. are all adapted from the BwTreeIndex wrapper. All that remains is to make the correct calls into your B+Tree Data Structure's API. See the FIXME comments in BPlusTreeIndex.

The DBMS also contains a Bw-Tree index you can use as a reference to better understand how the different arguments work.

Your B+Tree implementation must abstract away the details of the key data type and associated comparator and equality checker, like this:

template <typename KeyType, 
          typename ValueType,
          typename KeyComparator,
          typenameKeyEqualityChecker,
          typename ValueEqualityChecker>
class BPlusTreeIndex {
};

These classes are already implemented for you:

KeyType: The type of each key in the index. This will either be a GenericKey or CompactIntsKey.
KeyComparator: The class used to compare whether two KeyType instances are less/greater-than each other. These will be included in the KeyType implementation files.
KeyEqualityChecker: The class used to compare whether two KeyType instances are equivalent. These will be included in the KeyType implementation files.
ValueEqualityChecker: The class used to compare whether two ValueType instances are equivalent. This is needed to support non-unique indexes.

Requirements

Your index must optionally support duplicate keys with different values (i.e., this should be a configurable parameter).

You must support both forward and reverse iterators that work correctly with concurrent mutators.

The functionality of your index implementation must match the specification given in the DBMS's index API. More details are in the Doxygen documentation, but at a high level:

The Insert function inserts a new key-value pair into the index, used for non-unique key indexes.
The Delete function doesn't immediately call delete on the underlying index. It registers a commit action in the txn that will eventually register a deferred action for the GC to safely call delete on the index when no more transactions need to access the key.
The InsertUnique function inserts a new key-value pair only if any matching keys have TupleSlots that do not conflict with the calling txn.
The ScanKey function finds all the values associated with the given key in our index.
The ScanAscending function finds all the values between the given keys in our index, sorted in ascending order.
The ScanDescending function finds all the values between the given keys in our index, sorted in descending order.
The ScanLimitDescending function finds the first limit # of values between the given keys in our index, sorted in descending order.
The PerformGarbageCollection function garbage collects deleted index entries. This function is only necessary if you are doing epoch-based GC (i.e., memory pool).
The GetHeapUsage function returns the amount of heap space used by the index.

IMPORTANT: Your B+Tree implementation should mimic the existing Bw-Tree index. If you are not sure about what the correct output or behavior should be for your index, then exercise the BwTree and see what data it returns. Your index implementation should be a drop-in replacement for it.

You should add an utility function to validate the structural integrity of the index data structure. If any of the integrity checks fail, then this function should fix the integrity issue. We cannot provide you with this because we do not know your internal representation. This functionality is similar to what fsck provides for a file system.

You should implement a latch-free memory tracker if doing epoch-based GC. We will discuss different methods next week.

You should add multiple unit test cases in the BPlusTreeTests testing suite to test the functionality of your index data structure.

You must ensure that your branch can pass our repository's coding standards checks. Your branch will be tested with:

make check-format
make check-lint
make check-censored
make check-clang-tidy
doxygen apidoc/Doxyfile.in

There should not be any failures from these checks, so please run them before submitting.

Testing

You can test the concurrent index by making use of the BPlusTreeIndexTests testing suite. The testing suite contains a set of single and multi-threaded test cases designed to validate the functionality and correctness of the BPlusTreeIndex wrapper.

You should write data structure test cases against your API in BPlusTreeTests testing suite. You should exercise both simple and complex data structure access patterns, as well as the required functionality like heap usage and (possibly) garbage collection.

Lastly, the TPC-C unit test has been modified to run using the BPlusTreeIndex instead of the BwTreeIndex and evaluate the correctness. You can really on this as an extra sanity check that your index is working correctly, but will probably be too complex to debug with easily.

For performance testing, there is a simple index wrapper benchmark that you can use to evaluate the performance of your BPlusTreeIndex against the BwTreeIndex. Note that this only profiles point-query lookups, rather than range scans. If you want to fully evaluate the performance of your BPlusTreeIndex, then you will need to write more benchmarks.

Instructions

You must use the 15721-s20-project2 branch instead of the master branch for this project.

We assume that you have already setup the project repository and the development environment following the instructions in the first assignment. We have added new files for this project. You should ensure that the latest changes in the online repository are propagated to your local repository by using these commands:

git remote add upstream git://github.com/cmu-db/terrier.git
git fetch upstream
git rebase -i upstream/15721-s20-project2

Development Hints

To reduce the number of memory allocations, you will want to reuse memory. You can use the DBMS's existing MemoryPool implementation instead of rolling your own.
Instead of using std::cout statements for debugging, use the INDEX_LOG_* macros for logging information like this:
```
INDEX_LOG_INFO("Nested loop join executor -- %d children", num_children);
INDEX_LOG_DEBUG("Advance the right buffer iterator.");
```
More information is available here. There is already an index logger class that can be included in your source files for access to these various debug logging macros.
Test-driven development can be useful while building your data structure since there is already a reference solution for these tests with the BwTreeIndex. It can be helpful to disable all of the relevant tests (since they'll all fail at the start of the project anyway), and then enable them as you target specific features.
For performance and correctness reasons, you should use an aligned memory allocator (not necessarily this one) for node allocations and be mindful of how you reinterpret the buffer if you intend to perform std::atomic operations.

Grading Rubric

Each project submission will be graded based on the following criteria:

Does the submission successfully execute all of the test cases and produce the correct answer?
Does the submission execute without any memory leaks according to ASAN?
Does the submission implement the API correctly?
Does the submission contain functions that perform unit testing and validate structural integrity?
Does the submission use the proper formatting, coding style, and documentation for its source code?
Does the submission perform better than the existing Bw-Tree for Insert, Delete, and ScanKey operations on a variety of workloads and thread counts?

Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.

Bonus points will be given to the top 3 groups with the fastest implementations. The amount of bonus points given to these groups with the best implementations is left up to the discretion of the instructor and contingent on an additional evaluation of the source code.

Late Policy

25% will deducted from the final score of the project for every 24-hour period that the assignment is late.

Only in extreme circumstances (e.g., medical emergencies) no-penalty extensions will be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.

Submission

The project submission is divided into two checkpoints. The first checkpoint is to test whether your index supports basic functionality to insert and retrieve keys, as well as to report the total memory size of the data structure. The second checkpoint will test the same functionality as the first checkpoint, as well as the ability to delete keys, perform range scans, and garbage collection. The final grade of the project will be the cumlative score of the two checkpoints.

Checkpoint #1 — Due Date: Mar 8 @ 11:59pm (Final Grade Portion: 40%)

Supported Functions: Insert, InsertUnique, ScanKey, GetHeapUsage.

Checkpoint #2 — Due Date: Mar 15 @ 11:59pm (Final Grade Portion: 60%)

Supported Functions: Delete, ScanAscending, ScanDescending, ScanLimitDescending, PerformGarbageCollection.

You can submit your implementation of these files to Gradescope (https://www.gradescope.com/courses/81879).

Collaboration Policy

Everyone has to work in a team of three people for this assignment.
Groups are allowed to discuss high-level details about the project with others.
Groups are not allowed to copy the contents of a white-board after a group meeting with other students.
Groups are not allowed to copy the solutions from other colleagues.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.