Work-in-progress. Code available here.

This is a worklog on optimizing a single-precision generalized matrix-multiply (GEMM) kernel in C to land close to Intel MKL performance. While learning this myself, I found the following sources really helpful. Building on them, this article approaches the design decisions of GEMM at the level of the chip's ISA.

Introduction

Let us start by describing the pointwise operation:

Pointwise GEMM Operation

void gemm_naive(float *C, 
                const float *A, 
                const float *B, 
                int M, 
                int N, 
                int K) {

  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      for (int k = 0; k < K; k++) {
        C[i * N + j] += A[i * K + k] * B[k * N + j];
      }
    }
  }
}

It takes ~1.2 seconds for this kernel to multiply two 1000×1000 matrices. That is absurdly slow for a modern CPU. The Intel MKL library, in comparison, finishes the same operation in 13 ms, about 100x faster. Before we start optimizing this naive kernel, let's take stock of how to reason about the performance of a kernel.

Roofline Analysis

System specs:

We will measure the performance of generalized matrix multiply (GEMM) kernels in GFLOP/s (giga floating-point operations per second). How many operations does GEMM perform? Each element of the result matrix C requires a length-K dot product between a row of A and a column of B.

$$ \text{FLOPs}\left(A^{M \times K} \times B^{K \times N}\right) = M \cdot N \cdot (2K - 1) \approx 2 \cdot MNK $$

We apply a scaling factor of 2 because we count multiplies and adds as separate ops. For equal matrix dimensions, this is roughly $2N^3$ FLOPs. Assuming single-precision floats, each input matrix A, B of size $4 \cdot N^2$ bytes is read once. The output matrix, also $4 \cdot N^2$ bytes, is both read and written back to memory. Therefore any kernel will move at least $4 \cdot (4 \cdot N^2) = 16N^2$ bytes.

It is clear that total math ops grow faster than total memory read/writes. This ratio of floating point ops per byte of data moved is called the arithmetic intensity of an operation. For GEMM, arithmetic intensity $\alpha$ is:

$$ \alpha = \frac{\#\text{ operations}}{\#\text{ bytes transferred}} = \frac{2N^3}{16N^2} = \frac{N}{8} \text{ FLOPs/byte} $$

This means that as matrix sizes grow, the GEMM operation becomes compute-bound. In fact, if we know the compute and memory bandwidths of a machine, we can find the machine's 'ridge point'. Any kernel whose arithmetic intensity falls below the ridge point is said to be memory-bound; above it, compute-bound.

The Intel Golden Cove core has two FMA (fused multiply-add) units that can operate on 256-bit-wide vectors simultaneously. In single precision, this means 8 floats per vector. Each core also has 16 vector registers accessible in a single clock cycle; registers sit at the top of the memory hierarchy. The FMA units have a latency of 4 clock cycles and a combined throughput of 2 IPC (instructions per clock; here it means both FMA units can dispatch in parallel, see ref). Since GEMM is an arithmetically demanding operation, we care only about FMA streaming throughput; the first-dispatch latency is not relevant. This gives us enough information to calculate the compute bandwidth:

$$ 2 \text{ ops } \times 2 \text{ IPC } \times 8 \text{ floats/cycle } \times 2.5 \text{ GHz } = 80 \text{ GFLOP/s } $$

The DRAM bandwidth on our setup is 10 GB/s per thread, measured with a simple mbw benchmark. Therefore, the ridge point $\gamma$ of this CPU across DRAM is:

$$ \gamma = \frac{\text{compute BW}}{\text{memory BW}} = 8 \text{ FLOPs/byte } $$
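As a sanity check, we can place the naive kernel on this roofline (a rough back-of-the-envelope using the ~1.2 s figure for 1000×1000 matrices from above):

$$ \frac{2 \cdot 1000^3 \text{ FLOPs}}{1.2 \text{ s}} \approx 1.7 \text{ GFLOP/s} $$

That is roughly 2% of the 80 GFLOP/s ceiling, even though $\alpha = 1000/8 = 125$ FLOPs/byte places this problem size far into compute-bound territory. The naive kernel's problem is not memory bandwidth in aggregate but how it moves data through the cache hierarchy and feeds the core.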

Memory Hierarchy Roofline Model

In practice, the ridge point depends on cache reuse, branching, and other instruction overheads. For instance, if we manage to keep the entire working set of a GEMM operation within a cache boundary (which we will see soon with a cache-blocked GEMM kernel), the arithmetic intensity necessary to saturate the compute units is considerably lower. Here is a roofline model of the entire memory hierarchy and the corresponding ridge points. The focus of this worklog is on single-threaded operation only.

Memory Layout

As mentioned earlier, our arrays store floats in a row-major order, i.e., elements of a row are laid out consecutively. CPUs fetch contiguous blocks of memory (called a cache line) in the hope that consecutive memory elements will be needed for further processing. If a computation does not utilize all items in a cache line optimally, CPU cycles are wasted.

Row-Major Memory Layout

This gives us a couple of observations:

Spatial Locality

Iterating down the rows of B is the problem. Notice that the nested for-loops can be freely reordered, and C does not depend on the K dimension. Therefore, we can reorder the loops so that the K index varies more slowly than j, making the accesses to B contiguous and hence less costly.

/** Basic loop-reordered, pointwise GEMM kernel. */
void gemm_loop_reorder(float* __restrict C, 
                        const float* __restrict A, 
                        const float* __restrict B, 
                        int M, 
                        int N, 
                        int K) {

  for (int i = 0; i < M; i++) {
    for (int k = 0; k < K; k++) {
      for (int j = 0; j < N; j++) {
        C[i * N + j] += A[i * K + k] * B[k * N + j];
      }
    }
  }
}

By swapping j <-> k, we retain the cache-friendliness of A and C, while B is now traversed along its rows: every float of a fetched cache line is consumed before the line is evicted, and the scalar A[i][k] stays constant (in a register) across all N inner iterations. We still touch the same number of elements; we are simply amortizing the cost of each cache-line fetch by using every element it brings in.

SGEMM Benchmark

On small matrices, this simple tweak boosts our GFLOP/s by 10-25x, with the gain tapering toward the lower end as matrices grow large. What explains this jump? And can the performance be sustained on large matrices?

Implicit Vectorization

Even though our loop-reordered kernel defines scalar operations, improved spatial locality on B allows the compiler (with -O2 flag) to fuse scalar operations into vector FMA instructions. We can see this in the disassembly of the kernel.

.LBB0_19:
    vmovups ymm2, ymmword ptr [rsi + 4*r10]
    vmovups ymm3, ymmword ptr [rsi + 4*r10 + 32]
    vmovups ymm4, ymmword ptr [rsi + 4*r10 + 64]
    vmovups ymm5, ymmword ptr [rsi + 4*r10 + 96]
    vfmadd213ps     ymm2, ymm1, ymmword ptr [r14 + 4*r10 - 64]
    vfmadd213ps     ymm3, ymm1, ymmword ptr [r14 + 4*r10 - 32]
    vfmadd213ps     ymm4, ymm1, ymmword ptr [r14 + 4*r10]
    vfmadd213ps     ymm5, ymm1, ymmword ptr [r14 + 4*r10 + 32]
    vmovups ymmword ptr [r14 + 4*r10 - 64], ymm2
    vmovups ymmword ptr [r14 + 4*r10 - 32], ymm3
    vmovups ymmword ptr [r14 + 4*r10], ymm4
    vmovups ymmword ptr [r14 + 4*r10 + 32], ymm5

This core has vector registers for single-precision floats: 16 ymm registers, each holding 8 floats. (Other instruction sets like SSE and AVX-512 use different register widths, but we focus on ymm here.) These registers sit at the top of the memory hierarchy and are the fastest to access.

With ymm registers and 256-bit vectors, the performance ceiling is set by load/store capacity, as we will analyze in detail shortly. On small matrices, this loop-reordered kernel is an order of magnitude faster because the active blocks fit within the L2 cache. As the matrix size grows, performance plateaus while the active blocks still fit in L3. For even larger matrices, the active blocks exceed the cache boundary and require multiple reads/writes to main memory.

Cache blocking

Cache size is limited. As matrix dimensions grow, older cache lines get 'evicted' to fetch elements for the next iteration. This leads to wasteful loads/stores and lower arithmetic intensity for large matrix sizes. We solve this by slicing each of the three dimensions into 'tiles', and executing smaller, cache-friendly matrix multiplies on those tiles, with tile sizes tuned to the cache hierarchy.

Tiling

/** Cache-blocking across dimensions. */
#define TILE_K 128
#define TILE_N 2048
#define TILE_M 1024

#define min(a, b) ((a) < (b) ? (a) : (b))

void gemm_cache_blocking(float* __restrict C, 
                          const float* __restrict A, 
                          const float* __restrict B, 
                          int M, 
                          int N, 
                          int K) {

  // Tile across each dimension
  for (int i = 0; i < M; i += TILE_M) {
    const int mc = min(TILE_M, M - i);
    for (int k = 0; k < K; k += TILE_K) {
      const int kc = min(TILE_K, K - k);
      for (int j = 0; j < N; j += TILE_N) {
        const int nc = min(TILE_N, N - j);

        // Update partials on each tile
        for (int ir = 0; ir < mc; ir++) {
          for (int p = 0; p < kc; p++) {
            for (int jc = 0; jc < nc; jc++) {
              C[(i + ir) * N + (j + jc)] += 
              A[(i + ir) * K + (k + p)] * B[(k + p) * N + (j + jc)];
            }
          }
        }
      }
    }
  }
}

With cache blocking, performance is consistent across all matrix sizes. The disassembly of this kernel is the same as before. This is expected, because the same instructions now run on 'tiles' of the matrices.

SGEMM Benchmark

Performance ceiling

So our kernel uses 256-bit FMAs and cache blocking to sustain its GFLOP/s. Recall from our roofline analysis that the performance ceiling is 80 GFLOP/s. To understand why we saturate around 40 GFLOP/s, review the disassembly:

U-op count for the GEMM hot loop.

From the Golden Cove microarchitecture, we find the following uOp capacities:

Op       Capacity (per cycle)    Requirement      Cycles
Loads    3 × 256-bit             8 × 256-bit      2.67
Stores   2 × 256-bit             4 × 256-bit      2
FMAs     2                       4                2

Loads take approximately 2.67 cycles. (A 32-bit scalar from A is broadcast to ymm1 and reused for the entire iteration; its load cost is negligible compared to the rest, hence ignored in these calculations.) FMAs execute as soon as their operands are ready, so the load ops 'mask' the 2 cycles consumed by FMAs. Stores take 2 cycles after the FMAs retire. So the fraction of 'useful' multiply-add work is:

$$ \frac{2 \text{ FMA cycles}}{2.67 \text{ load cycles} + 2 \text{ store cycles}} = \frac{2}{4.67} \approx 0.43 $$

Our performance ceiling with this kernel is $0.43 \times 80 = 34.4 \text{ GFLOP/s}$. This matches the GFLOP/s we see in practice.

Outer Product

So far we have been looking at matrix multiplication as repeated dot products between rows of A and columns of B: $$ C_{ij} = \sum_{k=1}^K A_{ik} \cdot B_{kj} $$

Dot products are inefficient on hardware for the following reasons:

Matrix-multiply as an outer product

Matrix multiply can be rewritten as a cumulative outer product between columns of A and rows of B: $$ C = A \times B = \sum_{k=0}^{K-1} A_{:,k} \otimes B_{k,:} $$

Here is an example of a vector outer product with $a^{2 \times 1} \otimes b^{1 \times 2}$: $$ a \otimes b = \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} \begin{bmatrix} b_0 & b_1 \end{bmatrix} = \begin{bmatrix} a_0 b_0 & a_0 b_1 \\ a_1 b_0 & a_1 b_1 \\ \end{bmatrix} $$

In the notation: * $A_{:,k}$ is the $k$-th column of A (an $M \times 1$ vector). * $B_{k,:}$ is the $k$-th row of B (a $1 \times N$ vector).

Their outer product ($\otimes$) produces an $M \times N$ matrix where each element is $A_{i,k} \cdot B_{k,j}$. Summing these over all $k$ gives the full $C$.

This is algebraically identical to the dot-product view but shifts the focus: instead of accumulating inward along $k$ for each fixed $(i,j)$, we are broadcasting outward from each $k$, adding a full "layer" to $C$ at a time.

Outer Product view of A, B, C.

What motivates this reformulation?

Outer Product using Registers

CPUs do not have an intrinsic for vector outer product, which means we need to compute one iteratively using vector FMAs.

Consider loading $\text{MR}$ scalars from $A$ down a column, and $\text{NR}$ scalars from $B$ along a row. (You may rightly wonder that accesses down a column of $A$ are not cache-friendly. In practice, we transpose a tile of A into a buffer, which gets passed into the outer-product microkernel. The transposed A is cache-friendly, and the same tile is reused for K outer products. Check the code for details.)

We iteratively broadcast + FMA each of the scalars from $A$ to vectors of $B$, cumulatively storing the result in an $\text{MR} \times \text{NR}$ register tile of $C$.

Outer Product view of A, B, C.

Here is a pseudocode of the inner loop:

fn micro_gemm(...):
  # m = MR scalars of A
  # n = NR/8 vectors of B
  a = {}
  b[NR/8] = {}
  c[MR][NR/8] = {}

  # Load tile from C
  for m in MR:
    for n in NR/8:
      c[m][n] = load(C[m][n])

  # Loop over inner dimension
  for p in K:
    b[0], ..., b[NR/8 - 1] = load(B[:NR])

    # One iteration (hot FMA loop)
    for m in MR:
      a = broadcast(load(A[m]))
      # Outer product within registers
      for n in NR/8:
        c[m][n] = fma(a, b[n], c[m][n])

    A += MR
    B += NR

  # Store back to C
  for m in MR:
    for n in NR/8:
      store(c[m][n], C[m][n])

Optimal Tile Size

With YMM vector registers, we have a budget of 16. $\text{NR}$ must be a multiple of 8 so that the row sliver we load from $B$ fills whole registers; it occupies $\text{NR}/8$ of them. Each scalar from $A$ uses 1 register, since the scalar is broadcast across an entire vector. The $C$ accumulator fully resides in registers, requiring $\text{MR} \times \text{NR}/8$ of them. Therefore, we need to satisfy the inequality:

$$ \text{MR} \cdot \frac{\text{NR}}{8} + \frac{\text{NR}}{8} + 1 \leq 16 $$

Since $\text{MR} \ge 1$ and $\text{NR} \ge 8$ is necessary, we have the following acceptable combinations:

MR   NR   YMM register ct.   Loads/iter (bytes)   FLOPs/iter   FLOPs/byte
 1   56   15                 228                  112          0.491
 2   40   16                 168                  160          0.952
 4   24   16                 112                  192          1.714
 6   16   15                  88                  192          2.182
14    8   16                  88                  224          2.545

Only the $6 \times 16$ and $14 \times 8$ size micro-kernels are capable of saturating the core within L3 boundary (recall from the roofline plot, $2.16 \text{ FLOPs/byte}$), so we can discard other candidates. Of the two that remain, $14 \times 8$ tile actually ends up being load bound. The disassembly shows a memory broadcast on every FMA; compilers tend to generate memory-source FMAs instead of separating the load and broadcast into registers. As a result, even though the total number of bytes accessed is similar, each scalar requires its own load op during the FMA. This leads to roughly 15 load instructions per iteration (14 scalar loads plus one 256-bit vector load).

By contrast, the $6 \times 16$ micro-kernel performs six scalar loads and two 256-bit vector loads, for a total of eight loads. This produces a much better balance between load throughput and FMA issue rate, allowing the kernel to approach core saturation. This explains the popular choice of $6 \times 16$ in various BLAS libraries using AVX intrinsics.

Outer Product Kernel Benchmark

Cache Blocking (again)

We tile across all three dimensions of the matrices and sequentially compute GEMM on each tile.

#define MR 6
#define NR 16

#define KC (NR * 256)
#define NC 128
#define MC (MR * 256)

void gemm_outer_product_cache_blocking(float * __restrict C, 
                                      const float * __restrict A, 
                                      const float * __restrict B, 
                                      int M, 
                                      int N, 
                                      int K) {
  for (int i = 0; i < M; i += MC) {
    const int mc = min(MC, M - i);
    for (int p = 0; p < K; p += KC) {
      const int kc = min(KC, K - p);
      pack_tileA(blockA, &A[i * K + p], mc, kc, K);
      for (int j = 0; j < N; j += NC) {
        const int nc = min(NC, N - j);
        pack_tileB(blockB, &B[p * N + j], nc, kc, N);

        // Update each tile.
        for (int ir = 0; ir < mc; ir += MR) {
          for (int jr = 0; jr < nc; jr += NR) {
            const int mr = min(MR, mc - ir);
            const int nr = min(NR, nc - jr);
            micro_gemm(&C[(i + ir) * N + (j + jr)], 
                       &blockA[ir * kc], 
                       &blockB[jr * kc], 
                       mr, nr, kc, N);
          }
        }
      }
    }
  }
}

This brings us within ~50% of Intel MKL. Note that because we use 256-bit vectors, half the machine's full 512-bit width, this actually saturates the core at the given width!

Outer Product with Cache Blocking

Wider Registers

We now have a playbook. We rewrite the same kernel with 512-bit intrinsics and compute the ideal values of MR and NR that saturate FLOPs/byte. In my case, MR=8, NR=48 comes within 95% of Intel MKL.