Deceptively Simple SIMD Matrix Multiplication (in Zig)

6/17/2025
5-10 min read

Matrix multiplication is a fundamental operation in many fields, and there are various ways to implement it. I've recently been on a journey of implementing a matrix library that can handle matrix multiplication in Zig, and I wanted to share my experience with creating a library that fulfilled my requirements. Before we begin, let's define the requirements:

  • The library should be more performant than a naive implementation for small matrices (think smaller than 16x16).
  • It should support different data types and matrix sizes.
  • The implementation should be dead simple.

Let's quickly define what we mean by matrix multiplication. Given two matrices $A$ and $B$, the result of multiplying them is a new matrix $C$ such that each element $C_{i,j}$ is the dot product of the $i$-th row of $A$ and the $j$-th column of $B$. Important to note is that the resulting size of $C$ is determined by the number of rows in $A$ and the number of columns in $B$.
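
In symbols, for an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, the result is the $m \times p$ matrix with entries

$$C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}$$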


Before we begin implementing anything, let's think about the implications of a naive implementation of the above definition. For one, we have a lower bound on the time complexity of $O(n^2)$, simply because we have to visit every element in the resulting matrix at least once to compute it. On top of that, we need to compute the dot product of a row vector from $A$ and a column vector from $B$, which is an $O(n)$ operation, and this is done for every element in the resulting matrix, giving a time complexity of $O(n^2 \cdot n) = O(n^3)$. Now, to be clear, this is the time complexity of the naive algorithm, and no application of SIMD can change it. However, using SIMD, we can indeed compute the dot product of 2 vectors in $O(1)$, but only if the length of the vectors does not exceed the SIMD registers available on the hardware we are running on. Admittedly, this is quite a huge caveat; nevertheless, I think it is interesting to see what kind of performance we can achieve doing this.
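
Put differently, under the (big) assumption that every row and column fits in a single SIMD register:

$$\underbrace{n^2}_{\text{elements of } C} \cdot \underbrace{O(n)}_{\text{scalar dot product}} = O(n^3) \quad\longrightarrow\quad \underbrace{n^2}_{\text{elements of } C} \cdot \underbrace{O(1)}_{\text{SIMD dot product}} = O(n^2) \text{ vector operations}$$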


Let's try to create a SIMD implementation of matrix multiplication. We begin by defining the matrix itself.

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        // The whole matrix lives in a single flattened, row-major vector.
        data: @Vector(cols * rows, T) = @splat(0),
        // ...
    };
}
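
As a quick illustration of how this type is meant to be used (the instantiation below is my own example, not from the post):

const Mat3 = Matrix(f32, 3, 3);
var m = Mat3{}; // all elements default to 0
m.data[0 * 3 + 2] = 1.0; // element (row 0, col 2) in the flattened, row-major layout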

One thing that should stand out is that we define the matrix as a 1D vector, essentially a flattened (row-major) representation of the data. Another thing to note is that we have quite a big restriction: the size of the matrix needs to be known at compile time. This is because we use the builtin @Vector type, which requires the size to be known at compile time. For reference, a more common way of defining a matrix would be to use a 2D slice, like so:

pub fn Matrix(comptime T: type) type {
    return struct {
        data: [][]T,
        // Or, keeping @Vector, an array of row vectors. Note that this again
        // requires the size of the matrix to be known at compile time.
        // data: [rows]@Vector(cols, T),
    };
}

Now, let's implement the multiplication function.

pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m2)._cols(), @TypeOf(m1)._rows()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);

    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();

    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();

    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }

    const m1_row_shuffles = comptime rowShuffles(T, m1_cols, m1_rows);
    const m2_col_shuffles = comptime colShuffles(T, m2_cols, m2_rows);

    var res = Matrix(T, m2_cols, m1_rows){};
    var ij: usize = 0;
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            defer ij += 1;
            res.data[ij] = dot(T, m1_row_shuffles[i](m1.data), m2_col_shuffles[j](m2.data));
        }
    }

    return res;
}
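
One detail worth pausing on: mul queries the matrix dimensions through the type via _rows() and _cols(). These helpers aren't shown in the post, but presumably they are trivial comptime functions on the returned struct, along these lines (my assumption):

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        data: @Vector(cols * rows, T) = @splat(0),

        // Assumed helpers: expose the comptime dimensions through the type,
        // so mul() can compute its return type from @TypeOf(m1) / @TypeOf(m2).
        pub fn _cols() comptime_int {
            return cols;
        }

        pub fn _rows() comptime_int {
            return rows;
        }
    };
}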

Let's focus on the important parts of this function: the two lines where we build the shuffle functions at comptime, specifically colShuffles.

fn ShuffleFn(comptime T: type, len: comptime_int, mask_len: comptime_int) type {
    return *const fn (@Vector(len, T)) @Vector(mask_len, T);
}

fn colShuffles(comptime T: type, cols: comptime_int, rows: comptime_int) [cols]ShuffleFn(T, cols * rows, rows) {
    comptime {
        var shuffle_fns: [cols]ShuffleFn(T, cols * rows, rows) = undefined;
        for (0..cols) |col| {
            // Build the shuffle mask that gathers column `col` from the
            // flattened, row-major data: element (row, col) lives at flat
            // index col + row * cols.
            var col_vec: @Vector(rows, i32) = @splat(0);
            for (0..rows) |row| {
                col_vec[row] = col + row * cols;
            }
            const mask = col_vec; // bake an immutable copy into the fn below
            shuffle_fns[col] = (struct {
                fn shuffle(v: @Vector(cols * rows, T)) @Vector(rows, T) {
                    return @shuffle(T, v, undefined, mask);
                }
            }).shuffle;
        }
        return shuffle_fns;
    }
}

Essentially, colShuffles generates an array of vector shuffle functions, where the function at index $i$ can be applied to a flattened matrix of size cols * rows to return the $i$-th column of the matrix. Importantly, we are generating these functions at compile time, which allows us in the mul function to "quickly" access our columns, rather than constructing them at runtime, which would be an $O(\mathit{rows})$ operation per column. The same applies to rowShuffles, which generates functions to access the rows of the matrix. The implementation of rowShuffles is very similar to colShuffles, so I won't go into detail about it. It is important that we use the builtin @shuffle here so this operation becomes vectorized, provided the hardware supports it and the vector is of an appropriate length.
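
To make the masks concrete, here is a small standalone test of the same gathering trick (my example, not from the post). For a 2x2 matrix flattened row-major, column 0 lives at flat indices 0 and 2:

const std = @import("std");

test "gather a column with @shuffle" {
    // The 2x2 matrix { 1 2 ; 3 4 } flattened row-major.
    const m: @Vector(4, i32) = .{ 1, 2, 3, 4 };
    // Mask for column 0: col + row * cols = { 0, 2 }.
    const mask = @Vector(2, i32){ 0, 2 };
    const col0 = @shuffle(i32, m, undefined, mask);
    try std.testing.expect(@reduce(.And, col0 == @Vector(2, i32){ 1, 3 }));
}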


Finally, let's take a look at the dot function.

inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}

Okay, not much to see here; we are just making sure to use @reduce so this operation becomes vectorized.
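
As a tiny sanity check of what dot computes (my example):

const std = @import("std");

test "dot product via @reduce" {
    const a: @Vector(3, i32) = .{ 1, 2, 3 };
    const b: @Vector(3, i32) = .{ 4, 5, 6 };
    // Elementwise multiply, then horizontal add: 1*4 + 2*5 + 3*6 = 32.
    try std.testing.expectEqual(@as(i32, 32), dot(i32, a, b));
}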


And that is actually all for the implementation, so let's do some benchmarking. Specifically, we want to see a very big improvement over a naive implementation that does not utilise SIMD on "small" matrices. Our benchmark consists of 10,000 multiplications of two random matrices, and we compare the total time taken across all 10,000 multiplications for matrices of various sizes.
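
The post doesn't include the harness itself, but a minimal sketch of the kind of loop used could look like this, assuming the flattened-vector Matrix above and a float element type (the PRNG seeding and the dead-code guard are my assumptions):

const std = @import("std");

fn bench(comptime T: type, comptime n: comptime_int, iters: usize) !u64 {
    var prng = std.Random.DefaultPrng.init(0);
    const rand = prng.random();

    // Fill two n x n matrices with random values.
    var m1 = Matrix(T, n, n){};
    var m2 = Matrix(T, n, n){};
    for (0..n * n) |i| {
        m1.data[i] = rand.float(T);
        m2.data[i] = rand.float(T);
    }

    var timer = try std.time.Timer.start();
    var acc: T = 0;
    for (0..iters) |_| {
        const res = mul(T, m1, m2);
        acc += res.data[0]; // keep a result live so the loop isn't optimized away
    }
    std.mem.doNotOptimizeAway(acc);
    return timer.read(); // elapsed nanoseconds
}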

size     naive     simd
2x2      811 µs    0 ns
3x3      2 ms      100 ns
4x4      3 ms      100 ns
5x5      5 ms      11 ms
6x6      8 ms      24 ms
7x7      17 ms     41 ms
8x8      12 ms     16 ms
12x12    39 ms     411 ms
16x16    53 ms     533 ms
24x24    125 ms    15275 ms
32x32    225 ms    11443 ms

From the benchmark results we can see that the SIMD implementation greatly outperforms the naive implementation for very small matrices. This should not be too surprising; what might be surprising, however, is that the SIMD implementation gets drastically worse as the size of the matrices increases. But really this should not be surprising either: what is happening is that the matrices (and thus the vector) grow so large that they no longer fit inside our SIMD registers. Since we made the choice to store the matrix data in a flattened structure as one big vector, this happens rather early, as the size of the vector grows in the order of $n^2$. Thus, the optimizations we did with @shuffle quickly lose their benefit. Let's try to improve our SIMD implementation; as a goal, we should outperform the naive implementation for at least matrices of size 16x16 and smaller.


Since storing the matrix as one flattened vector does not scale, what if we store each row and each column as its own separate vector?

fn vecArr(comptime T: type, arr_len: comptime_int, vec_len: comptime_int) [arr_len]@Vector(vec_len, T) {
    comptime {
        @setEvalBranchQuota(100000); // Needed to support larger matrices.
        var arr: [arr_len]@Vector(vec_len, T) = undefined;
        for (0..arr_len) |i| {
            arr[i] = @splat(0);
        }
        return arr;
    }
}

pub fn Matrix(comptime T: type, cols: comptime_int, rows: comptime_int) type {
    return struct {
        // Every element is stored twice: once in its row vector and once
        // in its column vector.
        rows: [rows]@Vector(cols, T) = vecArr(T, rows, cols),
        cols: [cols]@Vector(rows, T) = vecArr(T, cols, rows),
        // ...
    };
}

One immediate drawback is that we have essentially doubled the space needed to store the matrix in memory, since every matrix element is present once in one of the row vectors and once in one of the column vectors. However, since we are storing much smaller vectors (order of $n$), and we no longer need our shuffling magic to retrieve the column and row vectors during multiplication, we should hopefully see a measurable difference in performance for larger matrices. Below are the new mul and dot implementations.

inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}

pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m2)._cols(), @TypeOf(m1)._rows()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);

    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();

    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();

    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }

    var res = Matrix(T, m2_cols, m1_rows){};
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            res.set(i, j, dot(T, m1.rows[i], m2.cols[j]));
        }
    }

    return res;
}
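
Note that this version calls a set helper that isn't shown here. With the duplicated row/column storage it presumably has to write each element into both copies; a sketch of what it might look like as a method on the returned struct (my assumption):

pub fn set(self: *@This(), row: usize, col: usize, value: T) void {
    // Keep the two redundant representations in sync.
    self.rows[row][col] = value;
    self.cols[col][row] = value;
}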

Let's do another round of benchmarks and see what results we get.

size       naive       simd
2x2        1 ms        41 ns
3x3        3 ms        41 ns
4x4        4 ms        41 ns
5x5        6 ms        0 ns
6x6        10 ms       41 ns
7x7        15 ms       0 ns
8x8        20 ms       0 ns
12x12      62 ms       17 ms
16x16      41 ms       19 ms
24x24      216 ms      70 ms
32x32      255 ms      127 ms
48x48      738 ms      890 ms
64x64      1651 ms     874 ms
96x96      5788 ms     20691 ms
128x128    12015 ms    18866 ms

Looks much better: we now beat the naive implementation at every size up to 32x32, and again at 64x64, which comfortably clears our 16x16 goal; the naive implementation only pulls ahead at 48x48 and from 96x96 onwards. I might make a follow-up post comparing this implementation to the ones in popular libraries, such as numpy.