The Carnegie Mellon Database Research Group is writing an on-line encyclopedia of database management systems (both commercial and academic). The key difference between our proposed site and existing encyclopedias (e.g., Wikipedia) is that its content will be semi-structured. Instead of allowing information to be entered in an unstructured (i.e., free-form) format like Wikipedia's mark-up syntax, we will define a taxonomy that is universal across all database systems. Each system will have a fixed set of categories that correspond to the key concepts of the system. For example, there are roughly two dozen concurrency control algorithms, and thus an article author will be able to select which one their system supports from these. Enforcing this taxonomy allows us to avoid having to do entity resolution to make sure that the articles use the same spelling and connotations. It will also enable us to provide search tools on the site that allow us to compare and contrast systems across multiple dimensions. For example, the conventional wisdom is that MVCC has become the predominant concurrency control scheme in the last decade, whereas two-phase locking was more prevalent in the 1980s and 1990s. We will be able to use the site's built-in search tools to view these trends.
Students will have need to collect information about their DBMS from multiple sources, including documentation, academic papers, white papers, and possibly by interviewing the system developers themselves.
Each student can earn an extra 10% for their final grade in the course if they write a professional, well-cited article about one DBMS. This assignment is entirely optional.
- Release Date: Mar 09, 2017
- Due Date: May 15, 2017 @ 11:59pm
Each student will select one system to research for their article. No two students are allowed to choose the same DBMS. Students may not work together on this assignment with each other.
Note that newer and more popular DBMSs will have more material available for you to use, thus if you choose one of these then you will be expected to have a more complete and thorough article. For older and more obscure systems, it may be harder to find the information that you need. In that case, you will want to provide a high-level synopsis of the sources that you checked.
The sign-up form will be posted on Piazza.
Each article is comprised of three sections. All of the free-form text will be written using the Markdown mark-up language.
The first section is free-form text about the DBMS and is comprised of two parts:
- Description: A high-level overview for the main idea of the system. This can include what kind of workload or application it seeks to support, what kind system architecture that it uses, and any other interesting or novel aspects about the system.
- History: A brief description of how the project came about and its current status. This can include information about name changes, the key developers that helped build it if they are notable (e.g., Stonebraker, DeWitt), whether it is a derivative system (i.e., forked from another DBMS), whether it came out of an academic project, and whether the project has since gone defunct.
The second section is the detailed technical information about the DBMS. Each category is called a "feature" and contains two parts. The first part is list of options that can be selected per feature. For example, under the "Concurrency Control" feature, one can select the different algorithms that the DBMS supports. The second part is a Markdown field where you can elaborate or explain the feature selections. For example, if the DBMS uses "Two-Phase Locking" for its concurrency control scheme, you would describe how they handle deadlocks in the feature text field. Each feature includes a field to allow you to provide Bibtex references for where you found this information. The list of features are as follows (note that this list is not exhaustive and additional ones may be added):
- System Architecture: Is it a shared-memory, shared-disk, or shared-nothing DBMS? Does it rely on special hardware (e.g., GPU, FPGA)?
- Query Interface: What language or API does the DBMS support for the application to load data and execute queries.
- Storage Model: What kind of storage models does the DBMS support (e.g., NSM, DSM)?
- Storage Architecture: Is the system a disk-oriented or in-memory DBMS? If the latter, does it support larger-than-memory databases?
- Concurrency Control: Does the DBMS support transactions and if so what concurrency control scheme does it use?
- Isolation Levels: What isolation levels does it support? Which one is the default? How does it implement each one?
- Indexes: What kind of indexes does the DBMS support (e.g., primary key, secondary, derived, partial)? What data structures does the DBMS support? What is the default?
- Foreign Keys: Does the system support foreign key constraints?
- Logging: How does the system support data durability? What kind of logging scheme does it use (e.g., physical, logical, physiological)?
- Checkpoints: How does the DBMS take checkpoints? What kind of checkpoints are they (e.g., fuzzy vs. non-fuzzy)?
- Views: Does the DBMS support views or materialized views? How complex of a query does it support?
- Query Execution: What query processing model does the DBMS support (e.g., iterator vs. vectorized)? What kind of intra-query parallelism does it support?
- Stored Procedures: Does the DBMS support stored procedures? If so, what language(s) can they be written in?
- Joins: What join algorithms does the DBMS support? What is notable or special about them (e.g., low-memory, parallelism)?
- Code Gen: Does the support code generation or JIT optimizations? How does it do this (e.g., LLVM, templates, code gen)?
Lastly, the third section contains additional meta-data that you can provide about the DBMS. This part does not include the supplemental Markdown text field for each meta-data entry like the features in the previous section.
- Website: The URL of the main page for the DBMS. This is different than the URL for the company that develops the DBMS. For example, "www.mongodb.org" is the URL for the DBMS while "www.mongodb.com" is the URL for the company.
- Programming Language: What language(s) was the DBMS written in.
- OS: What operating systems does the DBMS support.
- Project Type: Whether the DBMS an academic or commercial project, or something in between.
- Developer: The name of the company or organization behind the development of the system.
- Start Date: The year when the project was first started or announced.
- End Data: The year when the project was terminated, renamed, or abandoned.
- Derived From: The system(s) that this DBMS was derived from. For example, SAP HANA is a combination of T-REX, MaxDB, and P*TIME. Vertica was derived from Postgres.
- License: Is the DBMS open-source. If so, then what license is it under.
The template for each article is provided as a JSON file. Each student will email the instructor their complete file before the deadline.
This file contains different field types for each system feature listed above (note that not all features have all of the field types):
- Option Field: These are the different options that are available for a particular DBMS feature. You should delete any option that does not apply to your DBMS. You are not allowed to add new options without first asking the instructor.
- Description Field: These are the Markdown fields where you will explain why you selected for particular values for the option field. For example, if a system uses "MVCC" for its concurrency control scheme, you can explain how it is actually implemented.
- Citation Field: You must URLs to the websites where you found the information about a particular system feature. Be sure to include
httpsfor each URL.
You should test the validity of your article file to make sure that it is valid JSON syntax. You can easily do this with the following Python code snippet:
import json json.loads(open("article.json", "r").read())
All articles will be licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
- Every student has to work individually on this assignment.
- Students are not allowed copy material (text or images) from other sources.
WARNING: All of the material for this project must be your own. You may not copy text or images that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.