Deceptively Simple SIMD Matrix Multiplication (in Zig)

6/17/2025
5-10 min read

Matrix multiplication is a fundamental operation in many fields, and there are various ways to implement it. I've recently been on a journey of implementing a matrix library that can handle matrix multiplication in Zig, and I wanted to share my experience with creating a library that fulfilled my requirements. Before we begin, let's define the requirements:

  • The library should be more performant than a naive implementation for small matrices (think smaller than 16x16).
  • It should support different data types and matrix sizes.
  • The implementation should be dead simple.

Let's quickly define what we mean by matrix multiplication. Given two matrices $A$ and $B$, the result of multiplying them is a new matrix $C$ such that each element $C_{i,j}$ is the dot product of the $i$-th row of $A$ and the $j$-th column of $B$. Important to note is that the resulting size of $C$ is determined by the number of rows in $A$ and the number of columns in $B$.
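
In symbols, for an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, the result is the $m \times p$ matrix with entries

$$C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}$$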


Before we begin implementing anything, let's think about the implications of a naive implementation of the above definition. For one, we have a lower bound on the time complexity of $O(n^2)$, simply because we have to visit every element in the resulting matrix at least once to compute it. On top of that, we need to compute the dot product of a row vector from $A$ and a column vector from $B$, which is an $O(n)$ operation, and this is done for every element in the resulting matrix, giving a time complexity of $O(n^2 \cdot n) = O(n^3)$. Now, to be clear, this is the time complexity of the naive algorithm, and no application of SIMD can change it. However, using SIMD, we can indeed compute the dot product of 2 vectors in $O(1)$, but only if the length of the vectors does not exceed the SIMD registers available on the hardware we are running on. Admittedly, this is quite a huge caveat; nevertheless, I think it is interesting to see what kind of performance we can achieve doing this.
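
Put differently, under the (big) assumption that every row and column fits in a single SIMD register:

$$\underbrace{n^2}_{\text{elements of } C} \cdot \underbrace{O(n)}_{\text{scalar dot product}} = O(n^3) \quad\longrightarrow\quad \underbrace{n^2}_{\text{elements of } C} \cdot \underbrace{O(1)}_{\text{SIMD dot product}} = O(n^2) \text{ vector operations}$$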


Let's try to create a SIMD implementation of matrix multiplication. We begin by defining the matrix itself.

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        // The whole matrix lives in a single flattened, row-major vector.
        data: @Vector(cols * rows, T) = @splat(0),
        // ...
    };
}
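
As a quick illustration of how this type is meant to be used (the instantiation below is my own example, not from the post):

const Mat3 = Matrix(f32, 3, 3);
var m = Mat3{}; // all elements default to 0
m.data[0 * 3 + 2] = 1.0; // element (row 0, col 2) in the flattened, row-major layout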

One thing that should stand out is that we define the matrix as a 1D vector, essentially a flattened (row-major) representation of the data. Another thing to note is that we have quite a big restriction: the size of the matrix needs to be known at compile time. This is because we use the builtin @Vector type, which requires the size to be known at compile time. For reference, a more common way of defining a matrix would be to use a 2D slice, like so:

pub fn Matrix(comptime T: type) type {
    return struct {
        data: [][]T,
        // Or, keeping @Vector, an array of row vectors. Note that this again
        // requires the size of the matrix to be known at compile time.
        // data: [rows]@Vector(cols, T),
    };
}

Now, let's implement the multiplication function.

pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m2)._cols(), @TypeOf(m1)._rows()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);

    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();

    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();

    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }

    const m1_row_shuffles = comptime rowShuffles(T, m1_cols, m1_rows);
    const m2_col_shuffles = comptime colShuffles(T, m2_cols, m2_rows);

    var res = Matrix(T, m2_cols, m1_rows){};
    var ij: usize = 0;
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            defer ij += 1;
            res.data[ij] = dot(T, m1_row_shuffles[i](m1.data), m2_col_shuffles[j](m2.data));
        }
    }

    return res;
}
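
One detail worth pausing on: mul queries the matrix dimensions through the type via _rows() and _cols(). These helpers aren't shown in the post, but presumably they are trivial comptime functions on the returned struct, along these lines (my assumption):

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        data: @Vector(cols * rows, T) = @splat(0),

        // Assumed helpers: expose the comptime dimensions through the type,
        // so mul() can compute its return type from @TypeOf(m1) / @TypeOf(m2).
        pub fn _cols() comptime_int {
            return cols;
        }

        pub fn _rows() comptime_int {
            return rows;
        }
    };
}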

Let's focus on the important parts of this function: the two lines where we build the shuffle functions at comptime, specifically colShuffles.

fn ShuffleFn(comptime T: type, len: comptime_int, mask_len: comptime_int) type {
    return *const fn (@Vector(len, T)) @Vector(mask_len, T);
}

fn colShuffles(comptime T: type, cols: comptime_int, rows: comptime_int) [cols]ShuffleFn(T, cols * rows, rows) {
    comptime {
        var shuffle_fns: [cols]ShuffleFn(T, cols * rows, rows) = undefined;
        for (0..cols) |col| {
            // Build the shuffle mask that gathers column `col` from the
            // flattened, row-major data: element (row, col) lives at flat
            // index col + row * cols.
            var col_vec: @Vector(rows, i32) = @splat(0);
            for (0..rows) |row| {
                col_vec[row] = col + row * cols;
            }
            const mask = col_vec; // bake an immutable copy into the fn below
            shuffle_fns[col] = (struct {
                fn shuffle(v: @Vector(cols * rows, T)) @Vector(rows, T) {
                    return @shuffle(T, v, undefined, mask);
                }
            }).shuffle;
        }
        return shuffle_fns;
    }
}

Essentially, colShuffles generates an array of vector shuffle functions, where the function at index $i$ can be applied to a flattened matrix of size cols * rows to return the $i$-th column of the matrix. Importantly, we are generating these functions at compile time, which allows us in the mul function to "quickly" access our columns, rather than constructing them at runtime, which would be an $O(\mathit{rows})$ operation per column. The same applies to rowShuffles, which generates functions to access the rows of the matrix. The implementation of rowShuffles is very similar to colShuffles, so I won't go into detail about it. It is important that we use the builtin @shuffle here so this operation becomes vectorized, provided the hardware supports it and the vector is of an appropriate length.
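
To make the masks concrete, here is a small standalone test of the same gathering trick (my example, not from the post). For a 2x2 matrix flattened row-major, column 0 lives at flat indices 0 and 2:

const std = @import("std");

test "gather a column with @shuffle" {
    // The 2x2 matrix { 1 2 ; 3 4 } flattened row-major.
    const m: @Vector(4, i32) = .{ 1, 2, 3, 4 };
    // Mask for column 0: col + row * cols = { 0, 2 }.
    const mask = @Vector(2, i32){ 0, 2 };
    const col0 = @shuffle(i32, m, undefined, mask);
    try std.testing.expect(@reduce(.And, col0 == @Vector(2, i32){ 1, 3 }));
}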


Finally, let's take a look at the dot function.

inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}

Okay, not much to see here; we are just making sure to use @reduce so this operation becomes vectorized.
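
As a tiny sanity check of what dot computes (my example):

const std = @import("std");

test "dot product via @reduce" {
    const a: @Vector(3, i32) = .{ 1, 2, 3 };
    const b: @Vector(3, i32) = .{ 4, 5, 6 };
    // Elementwise multiply, then horizontal add: 1*4 + 2*5 + 3*6 = 32.
    try std.testing.expectEqual(@as(i32, 32), dot(i32, a, b));
}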


And that is actually all for the implementation, so let's do some benchmarking. Specifically, we want to see a very big improvement over a naive implementation that does not utilise SIMD on "small" matrices. Our benchmark consists of 10,000 multiplications of two random matrices, and we compare the total time taken across all 10,000 multiplications for matrices of various sizes.
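
The post doesn't include the harness itself, but a minimal sketch of the kind of loop used could look like this, assuming the flattened-vector Matrix above and a float element type (the PRNG seeding and the dead-code guard are my assumptions):

const std = @import("std");

fn bench(comptime T: type, comptime n: comptime_int, iters: usize) !u64 {
    var prng = std.Random.DefaultPrng.init(0);
    const rand = prng.random();

    // Fill two n x n matrices with random values.
    var m1 = Matrix(T, n, n){};
    var m2 = Matrix(T, n, n){};
    for (0..n * n) |i| {
        m1.data[i] = rand.float(T);
        m2.data[i] = rand.float(T);
    }

    var timer = try std.time.Timer.start();
    var acc: T = 0;
    for (0..iters) |_| {
        const res = mul(T, m1, m2);
        acc += res.data[0]; // keep a result live so the loop isn't optimized away
    }
    std.mem.doNotOptimizeAway(acc);
    return timer.read(); // elapsed nanoseconds
}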

size     naive     simd
2x2      811 µs    0 ns
3x3      2 ms      100 ns
4x4      3 ms      100 ns
5x5      5 ms      11 ms
6x6      8 ms      24 ms
7x7      17 ms     41 ms
8x8      12 ms     16 ms
12x12    39 ms     411 ms
16x16    53 ms     533 ms
24x24    125 ms    15275 ms
32x32    225 ms    11443 ms

From the benchmark results we can see that the SIMD implementation greatly outperforms the naive implementation for very small matrices. This should not be too surprising; what might be surprising, however, is that the SIMD implementation gets drastically worse as the size of the matrices increases. But really this should not be surprising either: what is happening is that the matrices (and thus the vector) grow so large that they no longer fit inside our SIMD registers. Since we made the choice to store the matrix data in a flattened structure as one big vector, this happens rather early, as the size of the vector grows in the order of $n^2$. Thus, the optimizations we did with @shuffle quickly lose their benefit. Let's try to improve our SIMD implementation; as a goal, we should outperform the naive implementation for at least matrices of size 16x16 and smaller.


Since storing the matrix as one flattened vector does not scale, what if we store each row and each column as its own separate vector?

fn vecArr(comptime T: type, arr_len: comptime_int, vec_len: comptime_int) [arr_len]@Vector(vec_len, T) {
    comptime {
        @setEvalBranchQuota(100000); // Needed to support larger matrices.
        var arr: [arr_len]@Vector(vec_len, T) = undefined;
        for (0..arr_len) |i| {
            arr[i] = @splat(0);
        }
        return arr;
    }
}

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        // Every element is stored twice: once in its row vector and once
        // in its column vector.
        rows: [rows]@Vector(cols, T) = vecArr(T, rows, cols),
        cols: [cols]@Vector(rows, T) = vecArr(T, cols, rows),
        // ...
    };
}

One immediate drawback is that we have essentially doubled the space needed to store the matrix in memory, since every matrix element is present once in one of the row vectors and once in one of the column vectors. However, since we are storing much smaller vectors (order of $n$), and we no longer need our shuffling magic to retrieve the column and row vectors during multiplication, we should hopefully see a measurable difference in performance for larger matrices. Below are the new mul and dot implementations.

inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}

pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m2)._cols(), @TypeOf(m1)._rows()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);

    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();

    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();

    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }

    var res = Matrix(T, m2_cols, m1_rows){};
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            res.set(i, j, dot(T, m1.rows[i], m2.cols[j]));
        }
    }

    return res;
}
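
Note that this version calls a set helper that isn't shown here. With the duplicated row/column storage it presumably has to write each element into both copies; a sketch of what it might look like as a method on the returned struct (my assumption):

pub fn set(self: *@This(), row: usize, col: usize, value: T) void {
    // Keep the two redundant representations in sync.
    self.rows[row][col] = value;
    self.cols[col][row] = value;
}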

Let's do another round of benchmarks and see what results we get.

size       naive       simd
2x2        1 ms        41 ns
3x3        3 ms        41 ns
4x4        4 ms        41 ns
5x5        6 ms        0 ns
6x6        10 ms       41 ns
7x7        15 ms       0 ns
8x8        20 ms       0 ns
12x12      62 ms       17 ms
16x16      41 ms       19 ms
24x24      216 ms      70 ms
32x32      255 ms      127 ms
48x48      738 ms      890 ms
64x64      1651 ms     874 ms
96x96      5788 ms     20691 ms
128x128    12015 ms    18866 ms

Looks much better: we now beat the naive implementation at every size up to 32x32, and again at 64x64, which comfortably clears our 16x16 goal; the naive implementation only pulls ahead at 48x48 and from 96x96 onwards. I might make a follow-up post comparing this implementation to the ones in popular libraries, such as numpy.