Deceptively Simple SIMD Matrix Multiplication (in Zig)
Matrix multiplication is a fundamental operation in many fields, and there are various ways to implement it. I've recently been on a journey of implementing a matrix library in Zig that can handle matrix multiplication, and I wanted to share my experience creating a library that fulfilled my requirements. Before we begin, let's define those requirements:
- The library should be more performant than a naive implementation for small matrices (think smaller than 16x16).
- It should support different data types and matrix sizes.
- The implementation should be dead simple.
Let's quickly define what we mean by matrix multiplication. Given two matrices $A$ and $B$, the result of multiplying them is a new matrix $C = AB$ such that each element $C_{ij}$ is the dot product of the $i$-th row of $A$ and the $j$-th column of $B$, i.e. $C_{ij} = \sum_k A_{ik} B_{kj}$. Important to note is that the resulting size of $C$ is determined by the number of rows in $A$ and the number of columns in $B$.
Before we begin implementing anything, let's think about the implications of a naive implementation of the above definition. For one, we have a lower bound of $\Omega(n^2)$ on the time complexity, simply because we have to visit every element in the resulting matrix at least once to compute it. On top of that, for every element of the result we need to compute the dot product of a row vector from $A$ and a column vector from $B$, which is an $O(n)$ operation, giving us a total time complexity of $O(n^3)$. Now, to be clear, this is the time complexity of the naive algorithm, and no application of SIMD can change this. However, using SIMD we can indeed compute the dot product of two vectors in $O(1)$, but only if the length of the vectors does not exceed the width of the SIMD registers available on the hardware we are running on. Admittedly, this is quite a huge caveat; nevertheless, I think it is interesting to see what kind of performance we can achieve doing this.
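For reference, here is a sketch of what a naive implementation looks like. The post benchmarks against one like this; the exact code isn't shown, so this assumes square matrices stored as plain 2D arrays:

```zig
fn naiveMul(comptime T: type, comptime n: comptime_int, a: [n][n]T, b: [n][n]T) [n][n]T {
    var c: [n][n]T = undefined;
    for (0..n) |i| {
        for (0..n) |j| {
            // O(n) scalar dot product of row i of `a` and column j of `b`,
            // done once per output element: O(n^3) overall.
            var sum: T = 0;
            for (0..n) |k| {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
    return c;
}
```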
Let's try to create a SIMD implementation of matrix multiplication. We begin by defining the matrix itself:
```zig
pub fn Matrix(comptime T: type, rows: comptime_int, cols: comptime_int) type {
    return struct {
        // The matrix is stored as a single flattened, row-major vector.
        data: @Vector(cols * rows, T) = @splat(0),
        // ...
    };
}
```
One thing that should stand out is that we define the matrix as a 1D vector, essentially a flattened representation of the data. Another thing to note is that we have quite a big restriction: the size of the matrix needs to be known at compile time. This is because we use the builtin `@Vector` type, which requires its length to be known at compile time. For reference, a more common way of defining a matrix would be to use a 2D slice, like so:
```zig
pub fn Matrix(comptime T: type) type {
    return struct {
        data: [][]T,
        // Or an array of vectors; note that this again requires the size
        // of the matrix to be known at compile time:
        // data: [rows]@Vector(cols, T),
    };
}
```
Now, let's implement the multiplication function:
```zig
pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m1)._rows(), @TypeOf(m2)._cols()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);
    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();
    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();
    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }
    // Generate the row/column extraction shuffles at compile time.
    const m1_row_shuffles = comptime rowShuffles(T, m1_cols, m1_rows);
    const m2_col_shuffles = comptime colShuffles(T, m2_cols, m2_rows);
    var res = Matrix(T, m1_rows, m2_cols){};
    var ij: usize = 0;
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            defer ij += 1;
            // Each output element is the dot product of row i of m1 and
            // column j of m2, both extracted with a single shuffle.
            res.data[ij] = dot(T, m1_row_shuffles[i](m1.data), m2_col_shuffles[j](m2.data));
        }
    }
    return res;
}
```
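Before digging into the internals, a quick usage sketch (this assumes the elided `_rows()`/`_cols()` helper declarations on the struct, and that an anonymous list literal coerces to the `data` vector):

```zig
// Hypothetical usage: a 2x3 matrix times a 3x2 matrix yields a 2x2 matrix.
const a = Matrix(f32, 2, 3){ .data = .{ 1, 2, 3, 4, 5, 6 } };
const b = Matrix(f32, 3, 2){ .data = .{ 7, 8, 9, 10, 11, 12 } };
const c = mul(f32, a, b); // c is a Matrix(f32, 2, 2)
```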
Let's focus on the important parts of this function: the two lines where we define the shuffle functions, specifically `colShuffles`.
```zig
// A pointer to a function that extracts a `mask_len`-element vector
// (a row or a column) out of the flattened `len`-element matrix vector.
fn ShuffleFn(comptime T: type, len: comptime_int, mask_len: comptime_int) type {
    return *const fn (@Vector(len, T)) @Vector(mask_len, T);
}
```
```zig
fn colShuffles(comptime T: type, cols: comptime_int, rows: comptime_int) [cols]ShuffleFn(T, cols * rows, rows) {
    comptime {
        var shuffle_fns: [cols]ShuffleFn(T, cols * rows, rows) = undefined;
        for (0..cols) |col| {
            // Build the shuffle mask: the indices of column `col` in the
            // flattened row-major data.
            var col_vec: @Vector(rows, i32) = @splat(0);
            for (0..rows) |row| {
                col_vec[row] = @intCast(col + row * cols);
            }
            const mask = col_vec; // The @shuffle mask must be comptime-constant.
            shuffle_fns[col] = (struct {
                fn shuffle(v: @Vector(cols * rows, T)) @Vector(rows, T) {
                    return @shuffle(T, v, undefined, mask);
                }
            }).shuffle;
        }
        return shuffle_fns;
    }
}
```
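As a quick sanity check, something like the following test shows what the generated shuffles return for a 2x3 matrix:

```zig
const std = @import("std");

test "colShuffles extracts columns" {
    // A 2x3 matrix stored row-major: row 0 is { 1, 2, 3 }, row 1 is { 4, 5, 6 }.
    const fns = comptime colShuffles(f32, 3, 2);
    const m: @Vector(6, f32) = .{ 1, 2, 3, 4, 5, 6 };
    // The shuffle at index 1 extracts column 1: { 2, 5 }.
    try std.testing.expect(@reduce(.And, fns[1](m) == @Vector(2, f32){ 2, 5 }));
}
```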
Essentially, `colShuffles` generates an array of vector shuffle functions, where the function at index $i$ can be applied to a matrix of size `cols * rows` to return the $i$-th column of the matrix. Importantly, we are generating these functions at compile time, which allows the `mul` function to "quickly" access our columns, rather than constructing them at runtime, which would be an $O(rows)$ operation. The same applies to `rowShuffles`, which generates functions to access the rows of the matrix. The implementation of `rowShuffles` is very similar to `colShuffles`, so I won't go into detail about it. It is important that we use the builtin `@shuffle` here so this operation becomes vectorized, provided the hardware supports it and the vector is of an appropriate length.
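For completeness, a sketch of what `rowShuffles` could look like, mirroring `colShuffles` above (the actual implementation isn't shown in this post):

```zig
fn rowShuffles(comptime T: type, cols: comptime_int, rows: comptime_int) [rows]ShuffleFn(T, cols * rows, cols) {
    comptime {
        var shuffle_fns: [rows]ShuffleFn(T, cols * rows, cols) = undefined;
        for (0..rows) |row| {
            // A row is contiguous in the flattened data, so the mask is
            // simply the consecutive indices of that row.
            var row_vec: @Vector(cols, i32) = @splat(0);
            for (0..cols) |col| {
                row_vec[col] = @intCast(row * cols + col);
            }
            const mask = row_vec;
            shuffle_fns[row] = (struct {
                fn shuffle(v: @Vector(cols * rows, T)) @Vector(cols, T) {
                    return @shuffle(T, v, undefined, mask);
                }
            }).shuffle;
        }
        return shuffle_fns;
    }
}
```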
Finally, let's take a look at the `dot` function:
```zig
// `v1 * v2` is an element-wise SIMD multiply; @reduce sums the result.
inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}
```
Okay, not much to see here; we are just making sure to use `@reduce` so this operation becomes vectorized.
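Concretely, for two small vectors:

```zig
const v1: @Vector(3, f32) = .{ 1, 2, 3 };
const v2: @Vector(3, f32) = .{ 4, 5, 6 };
// v1 * v2 is a single element-wise SIMD multiply: { 4, 10, 18 }.
// @reduce(.Add, ...) then sums the lanes: 4 + 10 + 18 = 32.
const d = dot(f32, v1, v2); // d == 32
```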
And that is actually all for the implementation, so let's do some benchmarking. Specifically, we want to see a very big improvement over a naive implementation which does not utilise SIMD on "small" matrices. Our benchmark consists of 10,000 matrix multiplications of two random matrices; we compare the total time taken across all 10,000 multiplications, for matrices of various sizes.
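The benchmark harness itself isn't shown here; roughly, it could look something like this sketch, which returns the accumulated multiplication time in nanoseconds (`bench`, its parameters, and the exact timing strategy are illustrative assumptions, not the actual benchmark code):

```zig
const std = @import("std");

fn bench(comptime T: type, comptime n: comptime_int, iterations: usize) !u64 {
    var prng = std.Random.DefaultPrng.init(0);
    const rand = prng.random();
    var timer = try std.time.Timer.start();
    var total: u64 = 0;
    for (0..iterations) |_| {
        // Fill two n x n matrices with random values.
        var a = Matrix(T, n, n){};
        var b = Matrix(T, n, n){};
        for (0..n * n) |i| {
            a.data[i] = rand.float(T);
            b.data[i] = rand.float(T);
        }
        timer.reset();
        const c = mul(T, a, b);
        total += timer.read(); // Accumulate only the multiplication time.
        std.mem.doNotOptimizeAway(c);
    }
    return total;
}
```

The results: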
| size | naive | simd |
|---|---|---|
| 2x2 | 811 µs | 0 ns |
| 3x3 | 2 ms | 100 ns |
| 4x4 | 3 ms | 100 ns |
| 5x5 | 5 ms | 11 ms |
| 6x6 | 8 ms | 24 ms |
| 7x7 | 17 ms | 41 ms |
| 8x8 | 12 ms | 16 ms |
| 12x12 | 39 ms | 411 ms |
| 16x16 | 53 ms | 533 ms |
| 24x24 | 125 ms | 15275 ms |
| 32x32 | 225 ms | 11443 ms |
From the benchmark results we can see that the SIMD implementation greatly outperforms the naive implementation for very small matrices. This should not be too surprising. What might be surprising, however, is that the SIMD implementation gets drastically worse as the size of the matrices increases. But really, this should not be surprising either: what is happening is that the matrices (and thus the vector) grow until they no longer fit inside our SIMD registers. Since we made the choice to store the matrix data in a flattened structure as one big vector, this happens rather early, as the size of the vector grows on the order of $O(n^2)$ for an $n \times n$ matrix. Thus, the optimizations we did with `@shuffle` quickly become redundant. Let's try to improve our SIMD implementation; as a goal, we should outperform the naive implementation for at least matrices of size 16x16 and smaller.
Since we cannot store the matrix as one flattened vector, what if we store each row and column as its own separate vector?
```zig
// Returns an array of `arr_len` zero-initialized vectors of length `vec_len`.
fn vecArr(comptime T: type, arr_len: comptime_int, vec_len: comptime_int) [arr_len]@Vector(vec_len, T) {
    comptime {
        @setEvalBranchQuota(100000); // Needed to support larger matrices.
        var arr: [arr_len]@Vector(vec_len, T) = undefined;
        for (0..arr_len) |i| {
            arr[i] = @splat(0);
        }
        return arr;
    }
}

pub fn Matrix(comptime T: type, rows: comptime_int, cols: comptime_int) type {
    return struct {
        // Every element is stored twice: once in its row vector and once
        // in its column vector.
        rows: [rows]@Vector(cols, T) = vecArr(T, rows, cols),
        cols: [cols]@Vector(rows, T) = vecArr(T, cols, rows),
        // ...
    };
}
```
One immediate drawback is that we have essentially doubled the space needed to store the matrix in memory, since every matrix element is present once in one of the row vectors and once in one of the column vectors. However, since we are now storing much smaller vectors (on the order of $O(n)$ elements each), and we no longer need our shuffling magic to retrieve the column and row vectors during multiplication, we should hopefully see a measurable difference in performance for larger matrices. Below are the new `mul` and (unchanged) `dot` implementations.
```zig
inline fn dot(comptime T: type, v1: anytype, v2: anytype) T {
    return @reduce(.Add, v1 * v2);
}
```
```zig
pub fn mul(comptime T: type, m1: anytype, m2: anytype) Matrix(T, @TypeOf(m1)._rows(), @TypeOf(m2)._cols()) {
    const t_m2 = @TypeOf(m2);
    const t_m1 = @TypeOf(m1);
    const m2_rows = t_m2._rows();
    const m2_cols = t_m2._cols();
    const m1_rows = t_m1._rows();
    const m1_cols = t_m1._cols();
    if (m1_cols != m2_rows) {
        @compileError("m1 columns must equal m2 rows");
    }
    var res = Matrix(T, m1_rows, m2_cols){};
    for (0..m1_rows) |i| {
        for (0..m2_cols) |j| {
            // Rows and columns are now stored directly, so no shuffling
            // is needed to extract them.
            res.set(i, j, dot(T, m1.rows[i], m2.cols[j]));
        }
    }
    return res;
}
```
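The `set` helper is elided above; since each element lives in both a row vector and a column vector, it presumably keeps the two copies in sync, along these lines:

```zig
// Hypothetical sketch of the elided `set`: write the value into both
// the row-wise and the column-wise copy of the element.
pub fn set(self: *@This(), row: usize, col: usize, value: T) void {
    self.rows[row][col] = value;
    self.cols[col][row] = value;
}
```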
Let's do another round of benchmarks and see what results we get:
| size | naive | simd |
|---|---|---|
| 2x2 | 1 ms | 41 ns |
| 3x3 | 3 ms | 41 ns |
| 4x4 | 4 ms | 41 ns |
| 5x5 | 6 ms | 0 ns |
| 6x6 | 10 ms | 41 ns |
| 7x7 | 15 ms | 0 ns |
| 8x8 | 20 ms | 0 ns |
| 12x12 | 62 ms | 17 ms |
| 16x16 | 41 ms | 19 ms |
| 24x24 | 216 ms | 70 ms |
| 32x32 | 255 ms | 127 ms |
| 48x48 | 738 ms | 890 ms |
| 64x64 | 1651 ms | 874 ms |
| 96x96 | 5788 ms | 20691 ms |
| 128x128 | 12015 ms | 18866 ms |
Looks better: we now beat the naive implementation for all sizes up to 64x64, with the crossover somewhere before 96x96. I might make a follow-up post comparing this implementation to the ones in popular libraries, such as numpy.