Overview
For this project, you are required to implement an in-memory B+Tree in CMU's yet-to-be-named DBMS. The primary goal is to understand the nuances of concurrency control in a modern, order-preserving indexing data structure.
Here's a list of useful references on in-memory data structures:
- Modern B-Tree Methods
- Techniques for Implementing Concurrent Data Structures on Modern Multicore Machines
- stx::B+tree (replaced by tlx::B+Tree)
- Making B+Trees Cache Conscious in Main Memory
IMPORTANT: This is an open-ended assignment. There are several design choices to be made while implementing a concurrent data structure. Do not expect us to guide you in making every one of these design choices.
This is a group project. Each group must have exactly three members unless given prior authorization by the instructor.
- Release Date: Feb 17, 2020
- Due Date: Mar 15, 2020 @ 11:59pm
Implementation Details
In this assignment, you will need to modify at most these five files:
- B+Tree Data Structure source file
- B+Tree Data Structure header file
- B+Tree Index source file
- B+Tree Index header file
- B+Tree Data Structure testing suite
IMPORTANT: The "Data Structure" files are the actual B+Tree implementation. The "Index" files are a wrapper that implements our DBMS's index API. The index wrapper is mostly implemented for you already. The logic for performing visibility checks in the DataTable, deferring actions for correct visibility, etc. are all adapted from the BwTreeIndex wrapper. All that remains is to make the correct calls into your B+Tree Data Structure's API. See the FIXME comments in BPlusTreeIndex.
The DBMS also contains a Bw-Tree index you can use as a reference to better understand how the different arguments work.
Your B+Tree implementation must abstract away the details of the key data type and associated comparator and equality checker, like this:
template <typename KeyType, typename ValueType, typename KeyComparator, typenameKeyEqualityChecker, typename ValueEqualityChecker> class BPlusTreeIndex { };
These classes are already implemented for you:
KeyType
: The type of each key in the index. This will either be aGenericKey
orCompactIntsKey
.KeyComparator
: The class used to compare whether twoKeyType
instances are less/greater-than each other. These will be included in theKeyType
implementation files.KeyEqualityChecker
: The class used to compare whether twoKeyType
instances are equivalent. These will be included in theKeyType
implementation files.ValueEqualityChecker
: The class used to compare whether twoValueType
instances are equivalent. This is needed to support non-unique indexes.
Requirements
Your index must optionally support duplicate keys with different values (i.e., this should be a configurable parameter).
You must support both forward and reverse iterators that work correctly with concurrent mutators.
The functionality of your index implementation must match the specification given in the DBMS's index API. More details are in the Doxygen documentation, but at a high level:
- The
Insert
function inserts a new key-value pair into the index, used for non-unique key indexes. - The
Delete
function doesn't immediately call delete on the underlying index. It registers a commit action in the txn that will eventually register a deferred action for the GC to safely call delete on the index when no more transactions need to access the key. - The
InsertUnique
function inserts a new key-value pair only if any matching keys have TupleSlots that do not conflict with the calling txn. - The
ScanKey
function finds all the values associated with the given key in our index. - The
ScanAscending
function finds all the values between the given keys in our index, sorted in ascending order. - The
ScanDescending
function finds all the values between the given keys in our index, sorted in descending order. - The
ScanLimitDescending
function finds the first limit # of values between the given keys in our index, sorted in descending order. - The
PerformGarbageCollection
function garbage collects deleted index entries. This function is only necessary if you are doing epoch-based GC (i.e., memory pool). - The
GetHeapUsage
function returns the amount of heap space used by the index.
IMPORTANT: Your B+Tree implementation should mimic the existing Bw-Tree index. If you are not sure about what the correct output or behavior should be for your index, then exercise the BwTree
and see what data it returns. Your index implementation should be a drop-in replacement for it.
You should add an utility function to validate the structural integrity of the index data structure. If any of the integrity checks fail, then this function should fix the integrity issue. We cannot provide you with this because we do not know your internal representation. This functionality is similar to what fsck
provides for a file system.
You should implement a latch-free memory tracker if doing epoch-based GC. We will discuss different methods next week.
You should add multiple unit test cases in the BPlusTreeTests
testing suite to test the functionality of your index data structure.
You must ensure that your branch can pass our repository's coding standards checks. Your branch will be tested with:
make check-format make check-lint make check-censored make check-clang-tidy doxygen apidoc/Doxyfile.in
There should not be any failures from these checks, so please run them before submitting.
Testing
You can test the concurrent index by making use of the BPlusTreeIndexTests
testing suite. The testing suite contains a set of single and multi-threaded test cases designed to validate the functionality and correctness of the BPlusTreeIndex wrapper.
You should write data structure test cases against your API in BPlusTreeTests
testing suite. You should exercise both simple and complex data structure access patterns, as well as the required functionality like heap usage and (possibly) garbage collection.
Lastly, the TPC-C unit test has been modified to run using the BPlusTreeIndex
instead of the BwTreeIndex
and evaluate the correctness. You can really on this as an extra sanity check that your index is working correctly, but will probably be too complex to debug with easily.
For performance testing, there is a simple index wrapper benchmark that you can use to evaluate the performance of your BPlusTreeIndex
against the BwTreeIndex
. Note that this only profiles point-query lookups, rather than range scans. If you want to fully evaluate the performance of your BPlusTreeIndex
, then you will need to write more benchmarks.
Instructions
You must use the 15721-s20-project2 branch instead of the master branch for this project.
We assume that you have already setup the project repository and the development environment following the instructions in the first assignment. We have added new files for this project. You should ensure that the latest changes in the online repository are propagated to your local repository by using these commands:
git remote add upstream git://github.com/cmu-db/terrier.git git fetch upstream git rebase -i upstream/15721-s20-project2
Development Hints
-
To reduce the number of memory allocations, you will want to reuse memory. You can use the DBMS's existing
MemoryPool
implementation instead of rolling your own. -
Instead of using
std::cout
statements for debugging, use theINDEX_LOG_*
macros for logging information like this:INDEX_LOG_INFO("Nested loop join executor -- %d children", num_children); INDEX_LOG_DEBUG("Advance the right buffer iterator.");
More information is available here. There is already an index logger class that can be included in your source files for access to these various debug logging macros. -
Test-driven development can be useful while building your data structure since there is already a reference solution for these tests with the BwTreeIndex. It can be helpful to disable all of the relevant tests (since they'll all fail at the start of the project anyway), and then enable them as you target specific features.
-
For performance and correctness reasons, you should use an aligned memory allocator (not necessarily this one) for node allocations and be mindful of how you reinterpret the buffer if you intend to perform std::atomic operations.
Grading Rubric
Each project submission will be graded based on the following criteria:
- Does the submission successfully execute all of the test cases and produce the correct answer?
- Does the submission execute without any memory leaks according to ASAN?
- Does the submission implement the API correctly?
- Does the submission contain functions that perform unit testing and validate structural integrity?
- Does the submission use the proper formatting, coding style, and documentation for its source code?
- Does the submission perform better than the existing Bw-Tree for
Insert
,Delete
, andScanKey
operations on a variety of workloads and thread counts?
Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.
Bonus points will be given to the top 3 groups with the fastest implementations. The amount of bonus points given to these groups with the best implementations is left up to the discretion of the instructor and contingent on an additional evaluation of the source code.
Late Policy
25% will deducted from the final score of the project for every 24-hour period that the assignment is late.
Only in extreme circumstances (e.g., medical emergencies) no-penalty extensions will be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.
Submission
The project submission is divided into two checkpoints. The first checkpoint is to test whether your index supports basic functionality to insert and retrieve keys, as well as to report the total memory size of the data structure. The second checkpoint will test the same functionality as the first checkpoint, as well as the ability to delete keys, perform range scans, and garbage collection. The final grade of the project will be the cumlative score of the two checkpoints.
Checkpoint #1 — Due Date: Mar 8 @ 11:59pm (Final Grade Portion: 40%)
- Supported Functions:
Insert
,InsertUnique
,ScanKey
,GetHeapUsage
.
Checkpoint #2 — Due Date: Mar 15 @ 11:59pm (Final Grade Portion: 60%)
- Supported Functions:
Delete
,ScanAscending
,ScanDescending
,ScanLimitDescending
,PerformGarbageCollection
.
You can submit your implementation of these files to Gradescope (https://www.gradescope.com/courses/81879).
Collaboration Policy
- Everyone has to work in a team of three people for this assignment.
- Groups are allowed to discuss high-level details about the project with others.
- Groups are not allowed to copy the contents of a white-board after a group meeting with other students.
- Groups are not allowed to copy the solutions from other colleagues.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.