Project #1 - Contention Hotspot

Overview

The first programming project will teach you how to identify and fix a hotspot under concurrent workloads in an in-memory DBMS. The primary goal of this assignment is to become familiar with the low-level implementation details of CMU's yet-to-be-named DBMS and to learn how to use profiling tools like PERF. All the code in this programming assignment must be written in C++ (specifically C++17). If you have not used C++ before, here's a short tutorial on the language. Even if you are familiar with C++, go over this guide for additional information on writing code in the system.

You will be working on the DataTable::Scan function. The DataTable provides table access methods over blocks of memory. The Scan function performs a sequential scan on a table by reading data from blocks of memory. It uses a SlotIterator object that denotes the starting position of the scan (in terms of blocks) and then populates a ProjectedColumns output buffer as it reads data.
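
To make the access pattern concrete, here is a minimal, self-contained C++17 sketch of the shape of such a scan loop. The types below are simplified stand-ins invented for this illustration, not the system's actual DataTable, SlotIterator, and ProjectedColumns classes; consult src/include/storage/data_table.h for the real interfaces.

#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified stand-ins for the real storage classes (assumptions for
// illustration only, not the actual API).
struct ProjectedColumns {
  std::vector<uint64_t> values;  // output buffer populated by the scan
  size_t capacity = 1024;        // maximum number of tuples per batch
};

class DataTable {
 public:
  explicit DataTable(std::vector<uint64_t> data) : data_(std::move(data)) {}

  // Remembers the scan's current position within the table.
  struct SlotIterator {
    size_t pos = 0;
  };

  // Fill out_buffer starting at *start_pos until the buffer is full or the
  // table is exhausted, advancing *start_pos past the consumed slots.
  void Scan(SlotIterator *start_pos, ProjectedColumns *out_buffer) const {
    out_buffer->values.clear();
    while (start_pos->pos < data_.size() &&
           out_buffer->values.size() < out_buffer->capacity) {
      out_buffer->values.push_back(data_[start_pos->pos++]);
    }
  }

 private:
  std::vector<uint64_t> data_;  // table contents, block by block
};

int main() {
  DataTable table(std::vector<uint64_t>(4096, 42));
  DataTable::SlotIterator it;
  ProjectedColumns out;
  table.Scan(&it, &out);  // reads the first batch of up to 1024 tuples
  std::printf("%zu tuples scanned\n", out.values.size());
  return 0;
}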

This is a single-person project that must be completed individually (i.e., no groups).

  • Release Date: Jan 22, 2020
  • Due Date: Feb 16, 2020 @ 11:59pm

Implementation Details

You can refer to this PERF examples documentation or this PERF tutorial for guidance on using PERF to profile the system. You can also apply the thread-local storage concept taught in class to reduce contention in the system.
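
If the thread-local storage pattern is new to you, the following self-contained C++17 sketch illustrates the general idea: instead of every thread contending on one shared counter, each thread accumulates into its own thread_local copy and touches the shared state only once at the end. This is a generic illustration of the lecture concept, not code taken from the DBMS.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<uint64_t> global_sum{0};

void Worker(uint64_t iterations) {
  // Every thread gets its own instance of local_sum, so the hot loop never
  // contends on shared memory.
  thread_local uint64_t local_sum = 0;
  for (uint64_t i = 0; i < iterations; i++) local_sum += i;
  // Publish to the shared counter exactly once per thread.
  global_sum.fetch_add(local_sum, std::memory_order_relaxed);
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 16; i++) threads.emplace_back(Worker, 1000000);
  for (auto &t : threads) t.join();
  std::printf("sum = %llu\n",
              static_cast<unsigned long long>(global_sum.load()));
  return 0;
}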

In this assignment, you will need to modify the following files:

  • src/storage/data_table.cpp
  • src/include/storage/data_table.h

You will not need to make changes to any other file in the system. You may locally modify the test cases and benchmarks already included in the system to verify the correctness/performance of your implementation, but you will not submit those files.

You will also need to write a report describing how you identified the hotspot with PERF profiling and how the profiling results change after your fix.

There are three steps to identify and fix the hotspot for the concurrent read workload in the DBMS:

  1. Analyze the PERF Profiling Results
  2. Reduce the Contention with Per-Thread Data Structures
  3. Test, Profile again, and Evaluate the Performance

Step #1 - Analyze the PERF Profiling Results

The first step is to run PERF against the SlotIteratorBenchmark benchmark:

perf record slot_iterator_benchmark

To execute it with more threads, you can use the TERRIER_BENCHMARK_THREADS environment variable to control the amount of parallel execution:

TERRIER_BENCHMARK_THREADS="16" perf record slot_iterator_benchmark

The benchmark only uses four concurrent threads by default. In order to properly observe the contention bottleneck, you will need a machine with more than 16 cores (see AWS Credit). This particular hotspot generally shows up with more than 8 threads, which means that you are unlikely to observe its effects on your laptop.

The PERF command will generate a result file perf.data in the directory where you run it. Make sure that you use the Release or RelWithDebInfo build type when profiling. Then you need to use PERF again to analyze the profiling results:

perf report

The percentages of the sampled on-CPU functions will show up in the window. You will see two functions in the TransactionManager at the top (aside from the kernel scheduling function), both higher than 10%. You can then drill down into the two hottest functions in the PERF results and examine the annotated code from PERF. The contention comes from some shared data structures in the DataTable that are protected by a latch, but it is your job to identify which data structures they are and which latch it is.

You can also use a different perf record option to record the CPU stack traces for the hot functions to help your analysis.

perf record -F 150 --call-graph dwarf slot_iterator_benchmark

Then you can use a different perf report option to examine the profiling results with stack traces.

perf report --no-children -g graph

HINT: You will need to submit screenshots of your PERF analysis in your report (see Submission).

Step #2 - Reduce the Contention with Per-Thread Data Structures

You next need to fix the contention hotspot on the data structures in the DataTable by replacing them with the same data structures duplicated for each thread. You should come up with your own solution for this assignment.

HINT: Having per-thread data structures does not mean that there cannot be concurrent operations on those data structures; it just reduces the level of contention. When there are concurrent operations on the per-thread data structures, you still need to protect them appropriately. You will not get any points for the programming part of the project if your solution does not guarantee correctness.
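
As a generic illustration of the pattern (the class and member names here are invented for this sketch and are not the DataTable solution), one common approach is to shard a single latched structure into per-thread shards, each protected by its own latch, so that the common path is almost always uncontended:

#include <cstddef>
#include <list>
#include <mutex>
#include <vector>

// One latched shard per worker thread. A thread normally touches only its
// own shard, but the latch must remain because other threads may still read
// or drain a shard concurrently.
class ShardedList {
 public:
  explicit ShardedList(size_t num_threads) : shards_(num_threads) {}

  void Add(size_t thread_id, int value) {
    Shard &shard = shards_[thread_id];
    std::lock_guard<std::mutex> guard(shard.latch);  // rarely contended
    shard.items.push_back(value);
  }

  // Cross-shard readers still take every shard's latch for correctness.
  size_t TotalSize() {
    size_t total = 0;
    for (Shard &shard : shards_) {
      std::lock_guard<std::mutex> guard(shard.latch);
      total += shard.items.size();
    }
    return total;
  }

 private:
  // alignas(64) pads each shard to a cache line to avoid false sharing.
  struct alignas(64) Shard {
    std::mutex latch;
    std::list<int> items;
  };
  std::vector<Shard> shards_;
};

Note that correctness is preserved because every access, local or cross-shard, still goes through a latch; the win is that the common-path latch is private to one thread and therefore almost never contended.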

Step #3 - Test, Profile again, and Evaluate the Performance

You need to make sure that your implementation is correct before proceeding to evaluation. We have implemented some unit tests and basic benchmarks in our system (see Testing). You should also extend the tests by writing your own test cases or scaling up the number of threads in the benchmarks.
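
For example, a local stress test that you might add could follow the shape of this self-contained Google Test sketch (the test name and the std::vector stand-in are placeholders; adapt the body to drive the actual DataTable API):

#include <gtest/gtest.h>

#include <mutex>
#include <thread>
#include <vector>

// Generic shape of a concurrency stress test: hammer a shared structure from
// many threads, then check an invariant that only holds if no update was
// lost to a race.
TEST(ConcurrencyStressTest, NoLostUpdates) {
  constexpr int kThreads = 16;
  constexpr int kOpsPerThread = 10000;

  std::mutex latch;
  std::vector<int> shared;

  std::vector<std::thread> threads;
  threads.reserve(kThreads);
  for (int t = 0; t < kThreads; t++) {
    threads.emplace_back([&] {
      for (int i = 0; i < kOpsPerThread; i++) {
        std::lock_guard<std::mutex> guard(latch);
        shared.push_back(i);
      }
    });
  }
  for (auto &thread : threads) thread.join();

  // Every insert must be visible; a smaller size indicates a lost update.
  EXPECT_EQ(shared.size(), static_cast<size_t>(kThreads) * kOpsPerThread);
}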

Then you need to repeat the profiling process from Step #1 to verify that your implementation has reduced the on-CPU percentages of the hot functions in the DataTable. Finally, you should compare the performance (throughput) numbers of the benchmark before and after your fix; the throughput is printed to the terminal when you execute the benchmark. Make sure that you use the Release build type when evaluating performance.

HINT: You also need to submit screenshots of your new PERF analysis in your report (see Submission).

Instructions

You must use the 15721-s20-project1 branch instead of the master branch for this project.

Creating Your Own Project Repository

Go to the DBMS GitHub page and click the Fork button to create a copy of the repository under your account.

You can then clone your private repository onto your local machine, say under the ~/git directory. If you have not used git before, here's some information on source code management using git.

Make sure that you switch to the 15721-s20-project1 branch:

git checkout 15721-s20-project1

Setting Up Your Development Environment

The next step is to set up your development environment for the project on your machine. We currently only support development environments on the Ubuntu 18.04 (LTS) and macOS 10.13+ operating systems.

Building DBMS

After setting up your development environment, you can build the DBMS by following these installation instructions. We use the CMake build system.

Testing

You can test the correctness of your implementation by following these testing instructions.

For example, you can execute all the unit tests:

cd build
make unittest

You can also execute all the test cases, including both unit tests and benchmarks:

cd build
make test

We will use a more comprehensive testing script to grade your assignment.

Development Hints

  1. In any large software system, it is critical to follow some conventions to allow everyone to build on top of each other's work. Here is a short list of rules that you should follow while hacking on the system. You should also follow the formatting rules, which are generally based on the Google C++ style guide. For instance, use class UpperCaseCamelCase for type names/methods/functions and int lower_case_with_underscores for variable names. Always use descriptive names and comment generously.

    We use ClangFormat to ensure that everyone follows the same code formatting style. Before submitting your assignment, ensure that you follow these guidelines by running this command:

    cd build
    make format
    

    You can also refer to these instructions for additional style checking tools.

  2. To speed up compilation, you can use the -j command-line flag for make to tell it how many threads to use. For example, if you want to use four cores to build the system, you would execute make -j 4.

AWS Credit

Because you will need to profile your implementation on a machine with a large number of cores, we are providing every student with credits for Amazon AWS. You will be able to deploy an instance, install dependencies, and run your experiments.

You will want to use the c5.9xlarge instance type (36 cores). You can also consider using instances with more CPUs when testing the correctness of your implementation and debugging race conditions.

You will need to monitor your credits so that you do not run out of money. Use spot instances to reduce your costs. Set up billing alerts to make sure that you are not charged for resources that you forgot to disable.

WARNING: We will not be able to reimburse you if you use more money than the provided credits.

Grading Rubric

Each project submission will be graded in two phases.

Correctness

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks according to AddressSanitizer?
  3. Does the submission use the proper formatting for its source code?

Your implementation must pass this portion of the evaluation. If you fail any of these checks, then the submission is scored as zero. No partial credit will be given.

IMPORTANT: The Gradescope submission machine is single-threaded. This means that it will not adequately check for race conditions. You will need to run the test suite on a machine with more cores. The immediate result that you get on Gradescope will not be your final grade for this part of the assignment.

Performance

If your submission satisfies the correctness checks, then your points will be further adjusted based on the performance (throughput) difference between your implementation and the TA's implementation evaluated on the slot_iterator_benchmark. More specifically, your final grade is based on how much faster/slower you are than our reference implementation:

Difference    Your Grade
>120%         110%
110-119%      100%
100-109%      90%
90-99%        80%
80-89%        70%
70-79%        60%
<70%          0%

For your reference, we ran the slot_iterator_benchmark on the c5.9xlarge instance 20 times with both the initial project implementation and our reference implementation. The average throughput reported by the Google Benchmark library for the initial implementation in the handout is 473k items/sec with a standard deviation of 305k items/sec (these numbers are preliminary as of 2020-01-24 and will change). The average throughput of the reference implementation is TODO items/sec with a standard deviation of TODO items/sec.

Late Policy

25% will be deducted from the final score of the project for every 24-hour period that the assignment is late.

No-penalty extensions will be granted only in extreme circumstances (e.g., medical emergencies). The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.

Submission

You will submit your implementation of the following files to Gradescope (https://www.gradescope.com/courses/81879):

  • src/storage/data_table.cpp
  • src/include/storage/data_table.h

Important: Use the Gradescope course code announced on Piazza.

You should also include a report.pdf in your submission that contains:

  1. A screenshot of the PERF profiling results that shows the two hottest functions in the TransactionManager before your fix.
  2. A screenshot of the PERF profiling results that shows the bottleneck in either of the above two functions, with the annotated code, before your fix.
  3. A brief analysis of how you identified the hotspot and the data structures under contention with the help of the above profiling results.
  4. A screenshot of the PERF profiling results that shows the new percentages of the on-CPU functions after your fix.
  5. A screenshot of the PERF profiling results that shows the new bottleneck in either of the original two hottest functions, with the annotated code, after your fix.
  6. A brief analysis of how your implementation reduces the contention hotspot in the system, supported by the evidence in the above two screenshots.

You can submit your answers as many times as you like and get immediate feedback on whether your program compiles successfully. Again, since Gradescope is single-threaded, we cannot evaluate the performance of your implementation, nor can we test its robustness (e.g., whether the system crashes) under a high number of threads. You should extend the test cases and conduct those tests yourself on an AWS machine with more CPUs. We will evaluate the correctness and the performance of your implementation offline after the project due date.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a whiteboard after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.