Project #1 - Hash Join Operator

Overview

The first programming assignment is to implement an in-memory hash join operator within the Peloton DBMS. The primary goal of this assignment is to become familiar with the low-level implementation details of the DBMS and to understand how a SQL query gets executed. All the code in this programming assignment must be written in C++ (specifically C++11). If you have not used C++ before, here's a short tutorial on the language. Even if you are familiar with C++, go over this guide for additional information on writing code in the Peloton DBMS.

Peloton uses a tile-based storage model architecture for all tuples. Read this overview for more information about this architecture.

This is a single-person project that will be completed individually (i.e., no groups).

  • Release Date: Jan 13, 2016
  • Due Date: Feb 08, 2016 @ 11:59pm

Instructions

Creating Your Own Project Repository

First, sign up for a Github Student Account to get free private repositories.

Then go to the Peloton Github page and click the Fork button to create a copy of the repository under your account. Be sure to mark this repository as private so that others can not see your code (SettingsMake this repository private).

You can then clone your private repository on your local machine in the filesystem, say under the ~/git directory. If you have not used git before, here's some information on source code management using git.

Setting Up Your Development Environment

The next step is to setup your development environment for the project on your machine. Peloton currently only be built on 64-bit Linux. We currently only support development environments in Linux and OSX operating systems:

  • If you are using Linux, then follow these instructions.
  • If you are using OSX, then install VirtualBox and then use Vagrant to setup an Ubuntu 14.04 LTS Desktop VM. Detailed instructions are provided here. You can also follow this tutorial on using Vagrant in OSX Yosemite.

Building Peloton

After setting up your development environment, you can now build Peloton by following these installation instructions.

We use the GNU build system (also known as the autotools) in Peloton. If you have never used autotools before, here's a short introduction.

Implementation

  1. In this assignment, you will need to modify only one file (src/backend/executor/hash_join_executor.cpp). A function stub for the hash join executor is provided. You will not need to make any changes to any other file in the system.

  2. We adopt a tile-based architecture in Peloton. Read this overview for more information about this architecture. The storage engine stores tuples in tile groups that are composed of physical tiles. The execution engine operates on logical tiles.

  3. The hash join operator that you are to implement will consume logical tiles from its two child operators in the query plan tree and produce a set of join logical tiles, like this. For instance, the left child operator can be a left table sequential scan operator, while the right child operator is a hash operator. The child operator of this hash operator is in turn a right table sequential scan operator.

  4. Peloton contains an implementation of the nested-loop join operator. All the join operator implementations -- like nested-loop join operator, merge join operator, and hash join operator -- derive from an abstract join operator base class. This class contains several functions for simplifying the implementation of different join algorithms. Make use of these functions in your hash join operator implementation. For instance, the AbstractJoinExecutor class contains functions for building an output join logical tile (BuildOutputLogicalTile), buffering logical tiles (BufferLeftTile, BufferRightTile), constructing position lists (BuildPostitionLists), handling different join types (RecordMatchedRightRow, BuildOuterJoinOutput), etc. Go over the nested-loop join operator to understand how to make use of the functions in the AbstractJoinExecutor class.

  5. You are provided with a hash operator that takes in a stream of logical tiles, for instance from a sequential scan operator, and builds a hash table on top of it.

  6. IMPORTANT: Here's a high-level overview of the expected hash join operator implementation. First, buffer the logical tiles from the left and right child operator. Then, build an output logical tile based on the left and right logical tile. Get the hash table from the hash executor (hash_executor_). Then, go over the left logical tile one tuple at a time. For every tuple in the left logical tile, we look up any matching tuples in the hash table built on top of the right child's table. Next, we record matching left and right rows. Finally, we build an output join logical tile if we have any matching tuples. We make use of BuildOuterJoinOutput when we are done processing logical tiles from the left child or right child. Notice the similarities between the hash join operator's description and the nested-loop join operator's implementation.

  7. Your hash join operator must support INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN join types for any data type or table size. You can assume that there is always sufficient memory available (i.e., you do not need to spill intermediate data to disk). The nested-loop join operator supports all these join types.

Testing

  1. Peloton Testing Framework : You can test the hash join operator by making use of the join test case. The join test case exercises all the join types in the different join operator implementations. Take a look at the join test case to understand how the planner uses the hash plan node as a child node of the hash join plan node.

    Uncomment this line to test the hash join operator. To specifically run the join test case, follow these testing instructions.

    cd build/tests
    make check-build
    valgrind --leak-check=yes --trace-children=yes --track-origins=yes ./join_test
    

  2. PSQL Command-line Interface: We are also providing a SQL script for testing your implementation and also getting used to the Peloton's command line interface. Follow these instructions to launch Peloton's psql terminal, and then load in this SQL script to run it.

    You can cross check your output with the expected output. If you are using the command-line terminal to test the hash join operator, you will need to explicitly disable the other join operators to force the optimizer to pick the hash join operator as shown here.

Development Hints

  1. In any large software system, it is critical to follow some conventions to allow everyone to build on top of each other's work. Here's a short list of rules that you should follow while hacking on Peloton. For instance, use class UpperCaseCamelCase for type names, int lower_case_with_underscores for variable/method/function names. Always use descriptive names and comment generously.

    We use ClangFormat to ensure that everyone follows the same code formatting style. Before submitting your assignment, ensure that you follow these guidelines by running this command:

    clang-format-3.6 --style=file ./src/backend/executor/hash_join_executor.cpp | diff ./src/backend/executor/hash_join_executor.cpp -

    ClangFormat supports two ways to provide custom style options: directly specify the style configuration in the -style= command line option or use -style=file and put the style configuration file in the .clang-format file in the project directory. We follow the second approach. The style file is located here in the Peloton repository. To modify the source code file to adhere to the style file, you can use the following command as describe in the manpage:

    clang-format-3.6 --style=file -i ./src/backend/executor/hash_join_executor.cpp

  2. To understand how Peloton runs an executor tree and emits the result tuples, take a look at the plan executor. Another important place to look at is the design of the abstract executor.

  3. Instead of using printf statements for debugging, use the LOG_* macros for logging information like this :

    LOG_INFO("Nested loop join executor -- %d children", num_children);
    LOG_DEBUG("Advance the right buffer iterator.");

    To enable logging in peloton, you will need to reconfigure it like this :

    ../configure CXXFLAGS="-O0 -g -ggdb" --enable-debug
    make -j4

    More information is available here. The different logging levels are defined in this header file. After enabling logging, the logging level defaults to LOG_LEVEL_INFO. So, any logging method with a logging level that equal to or higher than LOG_LEVEL_INFO, like LOG_INFO, LOG_WARN, and LOG_ERROR will emit logging information in the peloton debugging log file. The debugging log file is, by default, stored inside the data directory -- ~/git/peloton/build/data/pg_log/pg.log.

Grading Rubric

Each project submission will be graded based on the following criteria:

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks according to Valgrind?
  3. Does the submission use the proper formatting for its source code?

Note that we will use additional test cases that are more complex and go beyond the sample test cases that we provide you.

Bonus points will be given to the top 5 students with the fastest implementations. The amount of bonus points given to these students with the best implementations is left up to the discretion of the instructor and contingent on an additional evaluation of the source code.

Late Policy

10% will deducted from the final score of the project for every 24 hour period that the assignment is late.

Only in extreme circumstances (e.g., medical emergencies) no-penalty extensions will be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.

Submission

After completing the assignment, you can submit your implementation of hash_join_executor.cpp (only one file) to Autolab : https://autolab.andrew.cmu.edu/courses/15721-s16. You can submit your answers as many times as you like and get immediate feedback. Your score will be sent via email to your andrew account within a few minutes after your submission.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.