## Carnegie Mellon University ADVANCED DATABASE SYSTEMS # Vectorized Execution Andy Pavlo // 15-721 // Spring 2023 ## LAST CLASS We discussed how the DBMS will divide up tasks among its workers to execute a query. The DBMS needs to be aware of the location of data to avoid non-local memory access. ## TODAY'S AGENDA Background Implementation Approaches SIMD Fundamentals Vectorized DBMS Algorithms ## **VECTORIZATION** The process of converting an algorithm's scalar implementation that processes a single pair of operands at a time, to a vector implementation that processes one operation on multiple pairs of operands at once. #### WHY THIS MATTERS Say we can parallelize our algorithm over 32 cores. Assume each core has a 4-wide SIMD registers. Potential Speed-up: $32x \times 4x = 128x$ ## SINGLE INSTRUCTION, MULTIPLE DATA A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitecture support SIMD operations. - → **x86**: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512 - → **PowerPC**: Altivec - $\rightarrow$ **ARM**: NEON, <u>SVE</u> - $\rightarrow$ RISC-V: RVV Z **ECMU-DB** 15-721 (Spring 2023) $$X + Y = Z$$ $$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \end{bmatrix}$$ SISD + Z 15-721 (Spring 2023) **ECMU-DB**15-721 (Spring 2023) $$X + Y = Z$$ $$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$$ **ECMU-DB**15-721 (Spring 2023) 15-721 (Spring 2023) #### VECTORIZATION DIRECTION #### Approach #1: Horizontal → Perform operation on all elements together within a single vector. #### Approach #2: Vertical → Perform operation in an elementwise manner on elements of each vector. Source: Przemysław Karpiński **ECMU-DB**15-721 (Spring 2023) ## SIMD INSTRUCTIONS (1) #### **Data Movement** → Moving data in and out of vector registers #### **Arithmetic Operations** - → Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes) - → Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN #### **Logical Instructions** - → Logical operations on multiple data items - → Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS ## SIMD INSTRUCTIONS (2) #### **Comparison Instructions** $\rightarrow$ Comparing multiple data items (==,<,<=,>,>=,!=) #### Shuffle instructions → Move data between SIMD registers #### Miscellaneous - → Conversion: Transform data between x86 and SIMD registers. - → Cache Control: Move data directly from SIMD registers to memory (bypassing CPU cache). ## INTEL SIMD EXTENSIONS | | | Width | Integers | Single-P | Double-P | |------|---------|----------|----------|----------------|---------------| | 1997 | MMX | 64 bits | ✓ | | | | 1999 | SSE | 128 bits | ✓ | <b>√</b> (×4) | | | 2001 | SSE2 | 128 bits | ✓ | <b>√</b> | <b>√</b> (×2) | | 2004 | SSE3 | 128 bits | ✓ | <b>√</b> | <b>√</b> | | 2006 | SSSE 3 | 128 bits | ✓ | <b>√</b> | <b>√</b> | | 2006 | SSE 4.1 | 128 bits | ✓ | <b>√</b> | <b>√</b> | | 2008 | SSE 4.2 | 128 bits | ✓ | <b>√</b> | <b>√</b> | | 2011 | AVX | 256 bits | ✓ | <b>√</b> (×8) | <b>√</b> (×4) | | 2013 | AVX2 | 256 bits | ✓ | <b>√</b> | <b>√</b> | | 2017 | AVX-512 | 512 bits | ✓ | <b>√</b> (×16) | <b>√</b> (×8) | Source: <u>James Reinders</u> **ECMU-DB** 15-721 (Spring 2023) ## SIMD TRADE-OFFS #### Advantages: → Significant performance gains and resource utilization if an algorithm can be vectorized. #### Disadvantages: - → Implementing an algorithm using SIMD is still mostly a manual process. - → SIMD may have restrictions on data alignment. - → Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient. No Longer True in AVX-512! ### AVX-512 Intel's 512-bit extensions to the AVX2 instructions. → Provides new operations to support data conversions, scatter, and permutations. Unlike previous SIMD extensions, Intel split AVX-512 into groups that CPUs can selectively provide (except for "foundation" extension AVX-512F). ### AVX-512 #### Intel's 512-bit extensions to the AVX2 instructions. → Provides new operations to support data conversions, scatter, and permutations. ## AVX-512 #### **IMPLEMENTATION** Choice #1: Automatic Vectorization **Choice #2: Compiler Hints** **Choice #3: Explicit Vectorization** Source: James Reinders SCMU-DB 15-721 (Spring 2023) ## AUTOMATIC VECTORIZATION The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation. Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions. ### AUTOMATIC VECTORIZATION These might point to the same address! This loop is not legal to automatically vectorize. The code is written such that the addition is described sequentially. #### COMPILER HINTS Provide the compiler with additional information about the code to let it know that is safe to vectorize. #### Two approaches: - $\rightarrow$ Give explicit information about memory locations. - $\rightarrow$ Tell the compiler to ignore vector dependencies. #### COMPILER HINTS The **restrict** keyword in C++ tells the compiler that the arrays are distinct locations in memory. #### COMPILER HINTS This pragma tells the compiler to ignore loop dependencies for the vectors. It is up to the DBMS developer to make sure that this is correct. #### EXPLICIT VECTORIZATION Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions. $\rightarrow$ Not portable across CPUs (ISAs / versions). There are libraries that hide the underlying calls to SIMD intrinsics. - → Google Highway - $\rightarrow$ Simd - → Expressive Vector Engine (EVE) - → <u>std::simd</u> (Experimental) ### EXPLICIT VECTORIZATION ``` void add(int *X, int *Y, int *Z) { __mm128i *vecX = (__m128i*)X; __mm128i *vecY = (__m128i*)Y; __mm128i *vecZ = (__m128i*)Z; for (int i=0; i<MAX/4; i++) { _mm_store_si128(vecZ++, $\_mm_add_epi32(*vecX++, $ *vecY++)); ``` Store the vectors in 128-bit SIMD registers. Then invoke the intrinsic to add together the vectors and write them to the output location. ## VECTORIZATION FUNDAMENTALS There are fundamental SIMD operations that the DBMS will use to build more complex functionality: - → Masking - → Permute - → Selective Load/Store - → Compress/Expand - → Selective Gather/Scatter MAKE THE MOST OUT OF YOUR SIMD INVESTMENTS: COUNTER CONTROL FLOW DIVERGENCE IN COMPILED QUERY PIPELINES VLDB JOURNAL 2020 ## SIMD MASKING Almost all AVX-512 operations support **predication** variants whereby the CPU only performs operations on lanes specified by an input bitmask. ## SIMD MASKING Almost all AVX-512 operations support **predication** variants whereby the CPU only performs operations on lanes specified by an input bitmask. ## SIMD MASKING Almost all AVX-512 operations support **predication** variants whereby the CPU only performs operations on lanes specified by an input bitmask. #### **PERMUTE** For each lane, copy values in the **input vector** specified by the offset in the **index vector** into the **destination vector**. Prior to AVX-512, the DBMS had to write data from the SIMD register to memory then back to the SIMD register. #### **Permute** #### **PERMUTE** For each lane, copy values in the **input vector** specified by the offset in the **index vector** into the **destination vector**. Prior to AVX-512, the DBMS had to write data from the SIMD register to memory then back to the SIMD register. #### **Permute** #### **PERMUTE** For each lane, copy values in the **input vector** specified by the offset in the **index vector** into the **destination vector**. Prior to AVX-512, the DBMS had to write data from the SIMD register to memory then back to the SIMD register. #### **Permute** ## SELECTIVE LOAD/STORE #### Selective Load #### Selective Load #### Selective Load #### Selective Store #### Selective Load #### Selective Store #### Selective Load #### Selective Store ### Compress Selective Gather #### Selective Gather #### Selective Gather #### Selective Gather #### Selective Scatter #### Selective Gather #### Selective Scatter #### Selective Gather #### Selective Scatter ### VECTORIZED DBMS ALGORITHMS Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality. - → Favor *vertical* vectorization by processing different input data per lane. - → Maximize lane utilization by executing unique data items per lane subset (i.e., no useless computations). ## VECTORIZED OPERATORS Selection Scans Hash Tables Partitioning / Histograms #### **₩CMU·DB** 15-721 (Spring 2023) ### SELECTION SCANS ### Scalar (Branchless) ``` SELECT * FROM table WHERE key >= $low AND key <= $high</pre> ``` ### SELECTION SCANS #### Vectorized ``` SELECT * FROM table WHERE key >= "0" AND key <= "U"</pre> ``` # SELECTION - ◆ Scalar (Branching) - Scalar (Branchless) MIC (Xeon Phi 7120P – 61 Cores + 4×HT) 75v3 – 4 Cores + 2×HT) Source: Orestis Polychroniou **ECMU-DB** 15-721 (Spring 2023) ### SELECTION SCANS - ◆ Scalar (Branching) - Scalar (Branchless) MIC (Xeon Phi 7120P – 61 Cores + 4×HT) - ▲ Vectorized (Early Mat) - Vectorized (Late Mat) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT) Source: Orestis Polychroniou 15-721 (Spring 2023) ### SELECTION SCANS - ◆ Scalar (Branching) - Scalar (Branchless) - ▲ Vectorized (Early Mat) - Vectorized (Late Mat) Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT) 6.0 billion tuples Memory Bandwidth 2.0 0.0 20 50 100 10 Selectivity (%) Source: Orestis Polychroniou 15-721 (Spring 2023) ### **OBSERVATION** For each batch, the SIMD vectors may contain tuples that are no longer valid (they were disqualified by some previous check). ``` SELECT COUNT(*) FROM table WHERE age > 20 GROUP BY city; ``` ``` agg = dict() for t in table: if t.age > 20: agg[t.city]['count']++ for t in agg: emit(t) ``` ### **OBSERVATION** For each batch, the SIMD vectors may contain tuples that are no longer valid (they were disqualified by some previous check). ### **OBSERVATION** For each batch, the SIMD vectors may contain tuples that are no longer valid (they were disqualified by some previous check). ### RELAXED OPERATOR FUSION Vectorized processing model designed for query compilation execution engines. Decompose pipelines into **stages** that operate on vectors of tuples. - → Each stage may contain multiple operators. - → Communicate through cache-resident buffers. - $\rightarrow$ Stages are granularity of vectorization + fusion. RELAXED OPERATOR FUSION FOR IN-MEMORY DATABASES: MAKING COMPILATION, VECTORIZATION, AND PREFETCHING WORK TOGETHER AT LAST VLDB 2017 SELECT COUNT(\*) FROM table WHERE age > 20 GROUP BY city; ``` SELECT COUNT(*) FROM table WHERE age > 20 GROUP BY city; ``` ``` agg = dict() for v<sub>t</sub> in table step 1024: buffer = simd_cmp_gt(v<sub>t</sub>, 20, 1024) if |buffer| >= MAX: for t in buffer: agg[t.city]['count']++ for t in agg: emit(t) ``` ``` SELECT COUNT(*) FROM table WHERE age > 20 GROUP BY city; ....................... Emit agg = dict() for v_t in table step 1024: Agg Stage #2 buffer = simd_cmp_gt(v_t, 20, 1024) if |buffer| >= MAX: 🧲 Stage Buffer for t in buffer: agg[t.city]['count']++ for t in agg: emit(t) Filter Stage #1 Scan ``` ``` SELECT COUNT(*) FROM table WHERE age > 20 GROUP BY city; ......... Emit agg = dict() for v_t in table step 1024: Agg Stage #2 buffer = simd_cmp_gt(v_t, 20, 1024) if |buffer| >= MAX: 🧲 Stage Buffer for t in buffer: agg[t.city]['count']++ for t in agg: emit(t) Filter Stage #1 Scan ``` ``` SELECT COUNT(*) FROM table WHERE age > 20 GROUP BY city; ....................... Emit agg = dict() for v_t in table step 1024: Stage #2 buffer = simd_cmp_gt(v_t, 20, 1024) if |buffer| >= MAX: 🥃 Stage Buffer for t in buffer: agg[t.city]['count']++ for t in agg: emit(t) Filter Stage #1 Scan ``` ### ROF SOFTWARE PREFETCHING The DBMS can tell the CPU to grab the next vector while it works on the current batch. - → Prefetch-enabled operators define start of new stage. - $\rightarrow$ Hides the cache miss latency. Any prefetching technique is suitable - → Group prefetching, software pipelining, AMAC. - → Group prefetching works and is simple to implement. ### ROF EVALUATION Dual Socket Intel Xeon E5-2630v4 @ 2.20GHz TPC-H 10 GB Database ■ LLVM ■ LLVM + ROF Source: Prashanth Menon **ECMU-DB** 15-721 (Spring 2023) ## ROF EVALUATION - TPC-H Q19 Dual Socket Intel Xeon E5-2630v4 @ 2.20GHz TPC-H 10 GB Database Source: Prashanth Menon **ECMU-DB** 15-721 (Spring 2023) ## VECTORIZED OPERATORS Selection Scans Hash Tables Partitioning / Histograms #### **₩CMU·DB** ### Scalar ### Vectorized (Horizontal) #### Linear Probing Bucketized Hash Table Four Keys Four Values ### Scalar ### Vectorized (Horizontal) ### Vectorized (Vertical) ### Vectorized (Vertical) ### Vectorized (Vertical) #### Linear Probing Hash Table | KEY | PAYLOAD | |-----|---------| | k99 | | | | | | | | | k1 | | | | | | k6 | | | | | | k4 | | | | | | k5 | | | | | | k88 | | ### Vectorized (Vertical) #### Linear Probing Hash Table | KEY | PAYLOAD | |-----|---------| | k99 | | | | | | | | | k1 | | | | | | k6 | | | | | | k4 | | | | | | k5 | | | | | | k88 | | ### Vectorized (Vertical) ▲ Vectorized (Horizontal) Vectorized (Vertical) Hash Table Size Hash Table Size Source: Orestis Polychroniou 15-721 (Spring 2023) ▲ Vectorized (Horizontal) Vectorized (Vertical) Hash Table Size Hash Table Size Source: Orestis Polychroniou 15-721 (Spring 2023) Use scatter and gathers to increment counts. Replicate the histogram to handle collisions. Use scatter and gathers to increment counts. Replicate the histogram to handle collisions. Use scatter and gathers to increment counts. Replicate the histogram to handle collisions. Use scatter and gathers to increment counts. Replicate the histogram to handle collisions. # CAVEAT EMPTOR AVX-512 is **not** always faster than AVX2. H. Lang et al. <sup>2</sup> Please note that throughout our (multi-threaded) experiments, we did not observe any performance penalties through downclocking. Both processors KNL and SKX run stable at 1.4 GHz and 4.0 GHz, respec- this issue, among other things, in the following section. #### 5.3 Discussion and implications The two strategies are not mutually exclusive. Within a single pipeline, both strategies can be applied to individual operators as long as buffering operators are aware of protected lanes (mixed strategy). Moreover, the query compiler might decide to not apply any refill strategy to certain operators. Especially, when a sequence of operators is quite cheap, divergence might be acceptable as long as the costs for refill operations are not amortized. Naturally, this is a physical query optimization problem that we will leave for future work. Nevertheless, we briefly discuss the advantages and disadvantages, as this is the first work in which we present the basic principles of vector-processing in compiled query As mentioned above, consume everything requires additional registers, which increases the register | | Intel<br>Knights landing<br>(KNL) | Intel<br>Skylake-X<br>(SKX) | | | |----------------------------------------------------|----------------------------------------|----------------------------------|--|--| | Model Cores (SMT) SIMD [bit] Max. clock rate [GHz] | Phi 7210<br>64 (× 4)<br>2 × 512<br>1.5 | i9-7900X<br>10 (× 2)<br>2 × 512 | | | | 1 cache<br>2 cache<br>3 cache | 64 KiB<br>I MiB | 4.5<br>32 KiB<br>1 MiB<br>14 MiB | | | table scan and (ii) a hash join. Additionally, we experiment with a more complex operator, an approximate geospatial join. The experiments were conducted on an Intel Skylake-X (SKX) and an Intel Knights Landing (KNL) processor (cf., Table 1). The experiments were implemented in C++ and at optimization level three (-03) set to knl. If not stated othernts in parallel using two threads work in batches to the individual tween 216 and 220 tuples. On the he data in high-bandwidth memexperiments would have been . To measure the throughputs, for at least three seconds, possimultiple times. divergence handling in table algorithms into the AVX-512 ery I of Gubner et al. [6]. Addiintegrated the materialization ion et al. in [16]. live, TPC-H Query 1 (or short query that operates on a sinwith a single scan predicate. at involves several fixed-point arithmetic operations in the aggregation based on the group by clause. In total, five additional attributes are accessed to compute eight aggregated values per group. Almost all tuples survive the selection (i.e., selectivity $\approx$ 0.98). Therefore, in its original form, Q1 does not suffer from control flow divergence. To simulate control flow divergence and the resulting underutilization of SIMD The greater the number of buffers, the greater the number of permute instructions that need to be executed, whereas the number of required buffers depends on (i) the number of attributes passed along the pipeline and optionally on (ii) the number of registers required to save the internal state of the operator (e.g., a pointer to the current tree node). #### 6 Evaluation We evaluate our approach with two major sources of control flow divergence, (i) predicate evaluation as part of a D Springer Please note that throughout our (multi-threaded) experiments, we did not observe any performance penalties through downclocking. Both processors KNL and SKX run stable at 1.4GHz and 4.0GHz, respec- SECMU-DB 15-721 (Spring 2023) tively. ### CAVEAT EMPTOR AVX-512 is **not** always faster than AVX2. Some CPUs downgrade their clockspeed when switching to AVX-512 mode. → Compilers will prefer 256-bit SIMD operations. If only a small portion of the process uses AVX-512, then it is not worth the downclock penalty. There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo<sup>5</sup>, originally associated with AVX and AVX2 instructions<sup>1</sup>. L2 is a lower speed than L1, sometimes called "AVX-512 turbo". The exact speeds for each license also depend on the number of active cores. For up to date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here: | | | | | | | | | | | | | 11 | 12 | 13 | 14 | |---------|-------------------------------------|-------------|-------------|-----------|-------------|-----------|-----------|-----------|-------------|-------------|-----------|-----------|------------------------------|-----------|----------| | | Base<br>2,200 MHz | | uency/Activ | ve Cores | | | | 7 | 8 | 9 | 10 | | 12<br>2,700 MHz<br>2,300 MHz | 2,600 MHz | 2,600 MH | | | 0.00 | Turbo Freq | gency/rem | 3 | 4 | 5 | 6 | | 2 000 MHz | 2,700 MHz | 2,700 MHz | 2,700 MHZ | Z,100 IIII | 2 200 MHz | 2,200 MH | | Mode | Base | 1 | 2 | | 2 000 MHz | 2,900 MHz | 2,900 MHz | 2,900 MHZ | 2,700 11114 | 2 300 MHz | 2,300 MHz | 2,300 MHz | 2,300 MHZ | Z,Zee min | 1 600 MI | | | 2,200 MHz<br>1,800 MHz<br>1,200 MHz | 3,200 MHz | 3,200 MHz | 3,000 MHZ | 3,000 11117 | 2 700 MHz | 2,700 MHz | 2,700 MHz | 2,700 MHZ | 2,500 11117 | 1 600 MHz | 1,600 MHz | 1,600 MHz | 1,600 MHz | 1,000 | | formai | 2,200 | 2 100 MHz | 3,100 MHz | 2,900 MHz | 2,900 MHZ | A COLUMN | 1 000 MHz | 1,900 MHz | 1,900 MHz | 1,600 8812 | | | | | | | AVX2 | 1,800 MHZ | 3,100 11111 | 0.000 MHz | 2.500 MHz | 2,500 MHz | 1,900 MHz | 1,500 | | | | | | | | Nato | | AVX 512 | 1,200 MHz | 2,900 MHZ | 2,900 mm | 4111 | | | | | | | | 2000 | respec | tively. | More | The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for L1 and L2 licenses generally gets worse as the number of cores increase: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated? Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy: | Scalar L0 N/A 128-bit L0 L0 256-bit L0 L1* 512-bit L1 L2* | | |-----------------------------------------------------------|------------------------------------------------| | *soft transition (see below) | and all 128-bit wide instructions <sup>2</sup> | So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions<sup>2</sup> always run at full speed in the L0 license. # **EMPTOR** ster than AVX2. eir clockspeed when SIMD operations. e process uses AVX-512, vnclock penalty. ### PARTING THOUGHTS Vectorization is essential for OLAP queries. We can combine all the intra-query parallelism optimizations we've talked about in a DBMS. - → Multiple threads processing the same query. - → Each thread can execute a compiled plan. - $\rightarrow$ The compiled plan can invoke vectorized operations. ### **NEXT CLASS** Query Compilation Project #3 Topics