Project #2 - Database System Report

Overview

The Carnegie Mellon Database Research Group is writing an on-line encyclopedia of database management systems. The key difference between our site and existing encyclopedias (e.g., Wikipedia) is that its content is semi-structured. That is, instead of allowing information to be entered in an unstructured (i.e., free-form) format like Wikipedia's mark-up syntax, we define a taxonomy that is universal across all database systems. Each system will have a fixed set of categories that correspond to the key concepts of the system. For example, there are roughly a dozen concurrency control algorithms, and thus an article author will be able to select which one their system supports from these. Enforcing this taxonomy allows us to avoid having to do entity resolution to make sure that the articles use the same spelling and connotations. It will also enable us to provide search tools on the site that allow us to compare and contrast systems across multiple dimensions. We will be able to use the site's built-in search tools to view historical trends.

Students will have need to collect information about their DBMS from multiple sources, including documentation, academic papers, white papers, and possibly by interviewing the system developers themselves.

  • Release Date: Feb 01, 2023
  • Due Date: May 01, 2023 @ 11:59pm

Instructions

Each student will select one system to research for their article. No two students are allowed to choose the same DBMS. Students may not work together on this assignment with each other.

Note that newer and more popular DBMSs will have more material available for you to use, thus if you choose one of these then you will be expected to have a more complete and thorough article. For older and more obscure systems, it may be harder to find the information that you need. In that case, you will want to provide a high-level synopsis of the sources that you checked.

The sign-up form will be posted on Piazza. Note that there is a list of "banned" systems. These are ones where we already have entry in the encyclopedia and thus you cannot choose it again.

Article

Each article is comprised of two sections. All of the free-form text will be written using the Markdown mark-up language.

The first section is free-form text about the DBMS and is comprised of two parts:

  • Description: A high-level overview for the main idea of the system. This can include what kind of workload or application it seeks to support, what kind system architecture that it uses, and any other interesting or novel aspects about the system.
  • History: A brief description of how the project came about and its current status. This can include information about name changes, the key developers that helped build it if they are notable (e.g., Stonebraker, DeWitt), whether it is a derivative system (i.e., forked from another DBMS), whether it came out of an academic project, and whether the project has since gone defunct.

There are additional meta-data fields. You should fill these in whenever you can:

  • URL: The current URL for the homepage of the DBMS. You can use the Github repository if none exists.
  • Developer: The name of the organization (e.g., company, university, person) that was the original developer of the system. Use a comma-separated list if there are multiple organizations.
  • Source Code URL: The URL of where to find the source code for the DBMS (if available). This can be either a public repository (e.g., Github, Gitlab, Bitbucket) or a web page where you can get a source code tarball. Always prefer the source repository (e.g., some projects use Github as a mirror of the main repository on Gitlab).
  • Tech docs: The URL of where to get the latest documentation for this DBMS. You should link to a general download page (if available) and not a PDF of a specific version. For example, instead of using the URL https://docs.blazingdb.com/v2.0/docs, you should use https://docs.blazingdb.com since this will automatically take you to the latest version.
  • Wikipedia URL: The URL of an Wikipedia page about the DBMS. It should be about the DBMS and not the organization developing it. For example, you want to link to "MongoDB and not "MongoDB Inc".
  • Start Year: The year when development of the DBMS first started. This could be either when the company was formed, when the project started in academia, or the first commit.
  • End Year: The year when the DBMS became defunct (i.e., no longer actively maintained). This could be either when the company closed, the date of the last commit, the last update for the project (e.g., version release, blog article, tweet, website update), or when the project's homepage became unavailable (according to archive.org). A DBMS is considered to be "defunct" if there has not been an update in more than two years.
  • Former Names: The previous names of the DBMS. For example, the DBMS "HeavyDB" was formerly called "OmniSci", which was previously called "MapD". Use a comma-separated list if there are multiple names.
  • Acquired By: If organization supporting the DBMS was acquired by another company, then include the list of the acquiring companies. Use a comma to separate multiple entries.
  • Countries of Origin: The list of countries of where the DBMS company or project started. This should not be where company has offices or headquarters. It is about where the co-founders and/or originally developers were living when they started the DBMS.
  • Licenses: The software licenses that the DBMS uses. This should be for the core DBMS source code and not peripheral software (e.g., drivers, management tools). If the DBMS is not open-source, then use "Proprietary".
  • Operating Systems: What operating systems does the DBMS support. If it is written in Java, then you can use "All OS with Java VM". If the DBMS is only available as a cloud service, then set the OS to "Hosted".
  • Supported Languages: What programming languages does the DBMS support. If it is an embedded DBMS (e.g., SQLite), then this would be what languages do they provide library bindings for. If the DBMS uses a client-server model, then it would be what languages do provide drivers for. Note that this is not asking you what language you would use to write queries on a database with the DBMS (e.g., SQL).
  • Written In: What programming language(s) are the core DBMS components written in. This should only be the code that is specifically written by the organization developing the DBMS and not any libraries that they rely on. For example, CockroachDB is written in Go but it uses RocksDB as its storage engine, which is written in C++. But we would just say that it is written in Go.
  • Systems Derived From: The other systems that this DBMS's is based on. Such a derivation means that it contains source code from another existing DBMS. For example, Vertica is a fork of PostgreSQL and MariaDB is a fork of MySQL.
  • Systems Embedded: The other systems that this DBMS embeds or run on top of. This is either (1) using a DBMS as its back-end storage engine (e.g., CockroachDB uses RocksDB to store data) or (2) using a DBMS's extension API to expand its capabilities (e.g., TimeScale is built on top Postgres).
  • Systems Inspired By: The other systems that this DBMS's design is based on. This should not be a subjective decision on your part. You should only select this option be if the developer's of the DBMS specifically say that they were inspired by another system (they often state this in the README).
  • Systems Compatible With: The other systems that this DBMS strives to be compatible with. Such compatibility could be either supporting the same wire protocol (e.g., MemSQL speaks the MySQL wire protocol) or because the system is derived from that DBMS (e.g., TimeScale is built on top Postgres, so therefore it is compatible with it).

The second section is the detailed technical information about the DBMS. Each category is called a "feature" and contains two parts. The first part is list of options that can be selected per feature. For example, under the "Concurrency Control" feature, one can select the different algorithms that the DBMS supports. The second part is a Markdown field where you can elaborate or explain the feature selections. For example, if the DBMS uses "Two-Phase Locking" for its concurrency control scheme, you would describe how they handle deadlocks in the feature text field. Each feature includes a field to allow you to provide references for where you found this information. The list of features are as follows (note that this list is not exhaustive and additional ones may be added):

  • Checkpoints: How does the DBMS take checkpoints? What kind of checkpoints are they (e.g., fuzzy vs. non-fuzzy)?
  • Compression: Does the DBMS support database compression? What algorithms do they use? Do they have to decompress data first before processing it during query execution?
  • Concurrency Control: Does the DBMS support transactions and if so what concurrency control scheme does it use?
  • Data Model: What data model does the DBMS support? If the system is multi-model (e.g., ArangoDB), select all of the data models that they support.
  • Foreign Keys: Does the system support foreign key constraints?
  • Hardware Acceleration: Does the system rely on specialized hardware (e.g., GPUs, FPGAs) to speed up query execution?
  • Indexes: What index data structures does the DBMS support? What is the default?
  • Isolation Levels: What isolation levels does it support? Which one is the default?
  • Joins: What join algorithms does the DBMS support? What is notable or special about them (e.g., low-memory, parallelism)?
  • Logging: How does the system support data durability? What kind of logging scheme does it use (e.g., physical, logical, physiological)?
  • Parallel Execution: Does the system support executing a single query with multiple workers?
  • Query Compilation: Does the support code generation or JIT optimizations? How does it do this (e.g., LLVM, templates, code gen)?
  • Query Execution: What query processing model does the DBMS support (e.g., iterator vs. vectorized)? What kind of intra-query parallelism does it support?
  • Query Interface: What language or API does the DBMS support for the application to load data and execute queries?
  • Storage Architecture: Is the system a disk-oriented or in-memory DBMS? If the latter, does it support larger-than-memory databases?
  • Storage Format: What data file formats does this DBMS natively support? You can ignore any external tools for a DBMS that convert a format into another format before loading it into the DBMS.
  • Storage Model: What kind of storage models does the DBMS support (e.g., NSM, DSM)?
  • Storage Organization: How is the DBMS's underlying storage manager implemented?
  • Stored Procedures: Does the DBMS support stored procedures? If so, what language(s) can they be written in?
  • System Architecture: Is it a shared-everything, shared-memory, shared-disk, or shared-nothing DBMS? If it runs in-process, then it is an embedded DBMS (e.g., SQLite, DuckDB).
  • Views: Does the DBMS support views or materialized views? How complex of a query does it support?

Important: If your DBMS uses a feature option that is not available, then please contact Andy so that we can decide whether to add it.

Tips for Writing a Good Article

  1. Avoid marketing terminology and language — You are writing a technical article about a complex piece of software. That means all of the statements about the DBMS's internals must be factual. You should avoid subjective terms about the DBMS that are often found in marketing literature. For example, you should avoid phrasing like "System X is fast because it uses multi-version concurrency control". Instead you want to just say that "System X uses multi-version concurrency control". Your article should only contain content in which we have scientific evidence for their truth and correctness.

  2. Focus on latest DBMS Version — The core DBMS architecture will likely remain the same (e.g., concurrency control method), but new features will be added over time. Be sure to base your article on the latest version. It is acceptable to use citations to older versions but always check to see whether there is a newer version of the documentation.

  3. Contact developers if you cannot find the necessary information — If you reach out to the developers (e.g., on IRC, Slack, Glitter) and tell them that you are a CMU student trying to right an article about their DBMS, then they will likely be eager to help you find the right information that you need. This is mostly true for active open-source projects. Ask Andy if you are unable to make contact.

  4. Check Archive.org — If you are writing an article on a defunct DBMS that no longer has a webpage, then you can try to find documentation about it from Archive.org's WayBackMachine.

  5. Make sure you run spellchecker — Your name will forever be attached to this article, so you want to make sure to do a good job.

Grading Rubric

Each submission will be graded based on the following criteria:

  1. Is the article's technical content thorough and accurate?
  2. Does the article only use factual statements about the DBMS's functionality (i.e., it avoids subjective statements about speed and performance)?
  3. Does the article include proper citations for all of the sub-entries?
  4. Does the article use correct English grammar and punctuation?

The quality of the article and extra credit points are left up to the discretion of the instructor.

Late Policy

There are no allowed late days for this extra credit assignment. The task must be completed before the deadline.

Submission

Feedback Phase

Students are encouraged to submit their article for review before the final submission. The instructor and teaching assistants will provide feedback and guidance on how to improve each article. Students will only be able to receive the full extra credit points if they submit their submission for feedback.

Submit the URL for your article to this form by Saturday April 1st @ 11:59pm: https://forms.gle/ZgMoJgpbRYPDdPen6

Final Submission

Submit the URL for your article to this form by May 01, 2023 @ 11:59pm: https://forms.gle/gyKK5U8zznRxrwmw5

Licensing

All articles will be licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are not allowed copy material (text or images) from other sources.
  • These rules apply for both the review submission and the final submission.

WARNING: All of the material for this project must be your own. You may not copy text or images that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.