Project #2 - Concurrent Index - CMU 15-721 :: Advanced Database Systems (Spring 2017)

Overview

For this project, you are required to implement a latch-free Skip List in the Peloton database management system. The primary goal is to understand the nuances of concurrency control in a modern, order-preserving indexing data structure. Go over this style guide for information on writing code in Peloton.

Here's a list of useful references on latch-free data structures:

IMPORTANT: This is an open-ended assignment. There are several design choices to be made while implementing a concurrent data structure. Do not expect us to guide you in making every one of these design choices. In general, try to export an interface in your Skip List that is similar to that of the STX B+tree.

This is a group project. Each group must have exactly three members unless given prior authorization by the instructor.

Release Date: Jan 31, 2017
Due Date: Mar 09, 2017 @ 11:59pm

Instructions

Updating Your Project Repository

We assume that you have already setup the project repository and the development environment following the instructions in the first assignment. We have made a lot of changes. You should ensure that the latest changes in the online repository are propagated to your local repository by using these commands:

git remote add upstream git://github.com/cmu-db/peloton.git
git fetch upstream
git rebase -i upstream/master
git push --force

Implementation

In this assignment, you will need to modify only these five files:

IMPORTANT: The "Data Structure" files are the actual skip list implementation. The "Index" files are a wrapper that implements Peloton's index API.

Peloton also contains a Bw-Tree index you can use as a reference to better understand how the different arguments work.

Your Skip List implementation must abstract away the details of the key data type and associated comparator and equality checker, like this:

template <typename KeyType, 
          typename ValueType,
          typename KeyComparator,
          typenameKeyEqualityChecker,
          typename ValueEqualityChecker>
class SkipListIndex {
};

These classes are already implemented for you:

KeyType: The type of each key in the index. This will either be a GenericKey, CompactIntsKey, or TupleKey
ValueType: The type of each value in the index. This will usually be a 64-bit ItemPointer.
KeyComparator: The class used to compare whether two KeyType instances are less/greater-than each other. These will be included in the KeyType implementation files.
KeyEqualityChecker: The class used to compare whether two KeyType instances are equivalent. These will be included in the KeyType implementation files.
ValueEqualityChecker: The class used to compare whether two ValueType instances are equivalent. This is needed to support non-unique indexes.

Requirements

Your index must optionally support duplicate keys with different values (i.e., this should be a configurable parameter).

You must support both forward and reverse iterators that work correctly with concurrent mutators.

The functionality of your index implementation must match the specification given in Peloton's index API:

The InsertEntry function inserts a key-value pair into the index. If it is configured to not support duplicate keys, then the InsertEntry function should return false when we try to insert a duplicate key into the index. On the other hand, if it is configured to support duplicate keys, then it should not return false when we try to insert a duplicate key into the index.
The DeleteEntry function should erase only the index entry matching the specific <key, value> pair.
The CondInsertEntry function finds all key-value pairs that match the given key and tests whether any of them satisfy the given predicate. If none of the key-value pairs satisfy the predicate, then we insert the given key-value pair into the index, set predicate_satisfied to true, and return true. Otherwise, we set predicate_satisfied to false and return false.
The Scan function scans a range inside the index and returns all elements in the result vector. The type of scan to perform is specified by the given scan predicate, csp_p, and is one of the following: a full scan, a range scan, or a point query. Whether to perform a forward or backward scan is determined by the given scan_direction.
The ScanLimit function has the same behavior as the Scan function except that the scan stops after it reads offset + limit elements. Therefore, maximum of limit elements are returned in the result vector.
The ScanAllKeys function scans all keys in the index and stores the values in the result vector.
The ScanKey function finds all values in the index with the given key and returns them in the result vector. If the index is not configured to store duplicate keys then at most one value will be returned.
The PerformGC function garbage collects deleted index entries.
The NeedGC function returns true returns true if garbage collection is needed on the index.
The GetMemoryFootprint function returns the amount of heap space used by the index.

IMPORTANT: Your skip list implementation should mimic the existing Bw-Tree index. If you are not sure about what the correct output or behavior should be for your index, then execute the Bw-Tree and see what data it returns. Your index implementation should be a drop-in replacement for it.

You should add an utility function to validate the structural integrity of the index data structure. If any of the integrity checks fail, then this function should fix the integrity issue. This functionality is similar to what fsck provides for a file system.

You should implement a latch-free memory tracker (e.g., epoch-based or RCU). We will discuss different methods next week.

You must periodically garbage collect deleted index entries. The way our API works is that a separate thread will invoke NeedGC() to check whether the index needs to be garbage collected. To test your garbage collection implementation, we will compute the memory footprint after inserting a set of entries into the index. Then, we will delete a subset of those entries. Finally, after invoking the PerformGC function on your index, we will recompute the memory footprint. If the memory footprint does not drop as expected, then your submission will be penalized.

Your implementation must maintain proper statistics by invoking the stats methods provided by the index API when appropriate.

You should add multiple unit test cases in the index testing suite to test the functionality of your index data structure.

Testing

You can test the concurrent index by making use of the index test suite. The implementation of these different test cases is found in TestingIndexUtil The testing suite contains a set of multi-threaded test cases designed to validate the functionality and evaluate the performance of the index. To specifically run the join test case, follow these testing instructions. To specifically run the index test suite, follow these testing instructions.

cd build
make skiplist_index_test
valgrind --trace-children=yes \
   --leak-check=full \
   --track-origins=yes \
   --soname-synonyms=somalloc=*jemalloc* \
   --error-exitcode=1 \
   --suppressions=../third_party/valgrind/valgrind.supp \
   ./test/skiplist_index_test

Development Hints

In any large software system, it is critical to follow some conventions to allow everyone to build on top of each other's work. Here's a short list of rules that you should follow while hacking on Peloton. For instance, use class UpperCaseCamelCase for type names, int lower_case_with_underscores for variable/method/function names. Always use descriptive names and comment generously.
We use ClangFormat to ensure that everyone follows the same code formatting style. Before submitting your assignment, ensure that you follow these guidelines by running this command:
```
clang-format-3.6 --style=file ./src/index/skiplist_index.cpp | diff ./src/index/skiplist_index.cpp -
```
ClangFormat supports two ways to provide custom style options: directly specify the style configuration in the -style= command line option or use -style=file and put the style configuration file in the .clang-format file in the project directory. We follow the second approach. The style file is located here in the Peloton repository. To modify the source code file to adhere to the style file, you can use the following command as describe in the manpage:
```
clang-format-3.6 --style=file -i ./src/index/skiplist_index.cpp
```
Instead of using printf statements for debugging, use the LOG_* macros for logging information like this:
```
LOG_INFO("Nested loop join executor -- %d children", num_children);
LOG_DEBUG("Advance the right buffer iterator.");
```
To enable logging in peloton, you will need to reconfigure it like this :
```
cd build
cmake -DCMAKE_BUILD_TYPE=DEBUG ..
make
```
More information is available here. The different logging levels are defined in this header file. After enabling logging, the logging level defaults to LOG_LEVEL_INFO. So, any logging method with a logging level that equal to or higher than LOG_LEVEL_INFO, like LOG_INFO, LOG_WARN, and LOG_ERROR will emit logging information in the peloton debugging log file. The debugging log file is written to stdout.
Note: Our source validator pre-commit hook will throw an error if you try to push code that includes forbidden commands (e.g., printf, cout, cerr). See these instructions on how to set it up.

Grading Rubric

Each project submission will be graded based on the following criteria:

Does the submission successfully execute all of the test cases and produce the correct answer?
Does the submission execute without any memory leaks according to Valgrind?
Does the submission implement the garbage collection API correctly?
Does the submission contain functions that perform unit testing and validate structural integrity?
Does the submission use the proper formatting and documentation for its source code?

Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.

Bonus points will be given to the top 3 groups with the fastest implementations. The amount of bonus points given to these groups with the best implementations is left up to the discretion of the instructor and contingent on an additional evaluation of the source code.

Late Policy

10% will deducted from the final score of the project for every 24 hour period that the assignment is late.

Only in extreme circumstances (e.g., medical emergencies) no-penalty extensions will be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.

Submission

Once you have completed the assignment, compress the five modified files as a tar file named handin.tar, and submit it to Autolab:

https://autolab.andrew.cmu.edu/courses/15721-s17

Let's assume that you have copied the five modified files into a directory named handin. Compress the handin directory into a tar file like this :

tar -cvf handin.tar handin

You can now submit handin.tar file to Autolab. Our grading script will decompress the tar file like this and copy them over into our peloton project repository before running our tests:

tar -xvf handin.tar
cd handin

You may submit your answers as many times as you like. You will receive instant feedback on your autograded assignment. Your score will be sent via email to your andrew account within a few minutes after your submission.

Collaboration Policy

Everyone has to work in a team of three people for this assignment.
Groups are allowed to discuss high-level details about the project with others.
Groups are not allowed to copy the contents of a white-board after a group meeting with other students.
Groups are not allowed to copy the solutions from other colleagues.

WARNING: All of the code for this project must be your own. You may not copy source code from other groups or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.