
I have a sparse matrix with only zeros and ones as entries (for example, with shape 32k x 64k, 0.01% non-zero entries, and no pattern to exploit in where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing roughly 50% ones and 50% zeros. I want this to be efficient; in particular, I'm trying to make use of the fact that the matrix is known at compile time.

Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few MBytes of memory, so directly embedding the matrix into the executable seems like a good idea to me. My first idea was to automatically generate C++ code that assigns each result vector entry to the sum of the corresponding input entries. The generated code looks like this:

constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;

template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
    out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
    out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
    out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
    out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
    // LOTS more of this...
    out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}

This does in fact work (though it takes ages to compile). However, it turns out to be very slow (more than 10x slower than the same sparse matrix-vector multiplication in Julia) and it also blows up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, with no improvement worth mentioning. I also tried replacing the modulo and addition by XOR. In conclusion, this is a terrible idea. I'm not sure why, though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimizations?

I haven't tried any alternatives yet. My next idea is to store the indices as compile-time constant arrays (still giving me huge .cpp files) and loop over them. Initially, I expected the compiler's optimizations to turn this into the same binary as my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on Monday)?

Another idea would be to try storing the input (and maybe also output?) vector as packed bits and performing the calculation like that. I would expect that one can't get around a lot of bit-shifting and AND operations, and that this would end up being slower and worse overall.

Do you have any other ideas on how this might be done?

Adomas Baliuka
  • As a general rule of thumb, if you have to ask how to beat a well-implemented math kernel, you won't beat a well-implemented math kernel. Just use OpenBLAS or MKL. – CJR Apr 25 '21 at 13:54
  • @CJR the reason why I (think that I) want to avoid using such libraries is, first, that I was trying to make use of the fact that the matrix is known at compile time, and second, that I'm considering putting this on a microcontroller eventually, so I would prefer not to have to compile for ARM and link large libraries when I only use one function from them. The answer by Jérôme Richard seems to suggest that there is no way to make use of compile-time knowledge in this case. Would using a numeric library (I don't use floating point!) be viable on an embedded platform and possibly make it faster? – Adomas Baliuka Apr 25 '21 at 15:05

2 Answers


I'm not sure why though - is the sheer codesize slowing it down that much?

The problem is that the executable is big, so the OS will fetch a lot of pages from your storage device. This process is very slow, and the processor will often stall waiting for data to be loaded. Even if the code were already loaded in RAM (OS caching), it would still be inefficient, because the speed of RAM (latency + throughput) is quite bad. The main issue here is that every instruction is executed only once. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need small code with loops iterating over a fairly small amount of data.

Does this kind of code rule out compiler optimization?

This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code in the same way (hence the slow compilation time).

Do you think this is worth trying (I guess I will try anyway on monday)?

Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending positions of the indices for each row can be specified in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and to avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is rather fast), and I think that your computation should not take more than 1 ms assuming the code of multiply is carefully optimized.
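
As a rough sketch (the names col_idx and row_start and the tiny sizes are illustrative only; in practice the buffers would hold one entry per non-zero, about 200,000 of them), the layout and the multiply loop could look like this:

#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative sizes only; the real buffers would hold one entry per non-zero.
constexpr std::size_t ROWS = 4;
constexpr std::size_t NNZ = 6;

// Column index of every non-zero entry, stored row after row.
constexpr std::array<std::uint16_t, NNZ> col_idx = {3, 9, 1, 4, 7, 2};
// row_start[r] .. row_start[r + 1] is the slice of col_idx belonging to row r.
// std::uint32_t because with ~200k non-zeros the offsets no longer fit in 16 bits.
constexpr std::array<std::uint32_t, ROWS + 1> row_start = {0, 2, 2, 5, 6};

template<typename Bit, std::size_t N>
void multiply_sparse(const std::array<Bit, N> &in, std::array<Bit, ROWS> &out) {
    for (std::size_t r = 0; r < ROWS; ++r) {
        Bit acc = 0;
        for (std::uint32_t j = row_start[r]; j < row_start[r + 1]; ++j)
            acc ^= in[col_idx[j]];  // XOR is addition modulo 2
        out[r] = acc;
    }
}

Whether these arrays live in the executable or are read from a file at startup does not change this loop; only where the data comes from.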

Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.

Bit-packing is good only if your matrix is not too sparse. With a matrix filled with 50% of non-zero values, the bit-packing method is great. With 0.01% of non-zero values, the bit-packing method is clearly bad as it takes too much space.
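
For the dense case, one packed row times packed vector product modulo 2 could look roughly like this (a sketch assuming both are packed into 64-bit words and C++20's std::popcount is available; the name dot_mod2 is made up for illustration):

#include <bit>      // std::popcount (C++20)
#include <cstddef>
#include <cstdint>

// Dot product modulo 2 of a packed matrix row and a packed input vector:
// AND the words, XOR-accumulate them, then take the parity of the result.
inline unsigned dot_mod2(const std::uint64_t *row, const std::uint64_t *vec,
                         std::size_t num_words) {
    std::uint64_t acc = 0;
    for (std::size_t w = 0; w < num_words; ++w)
        acc ^= row[w] & vec[w];
    return static_cast<unsigned>(std::popcount(acc)) & 1u;  // parity of the overlapping ones
}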

I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.

As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much much faster than loading data).

Here are the approximate timings for various operations a computer can do:

[image: table of approximate latencies for various operations]

Jérôme Richard
  • Thanks for the answer! I will try looping over the sparse storage. However, your answer seems to suggest that there is no way to make use of the fact that the matrix is known at compile time. Somehow, this seems counterintuitive to me. Also, surely there must be a way of storing the matrix within the executable anyway without a performance penalty? – Adomas Baliuka Apr 25 '21 at 13:31
  • I agree that it sounds counter-intuitive, but the matrix is really big and the access pattern seems random, which prevents many advanced optimizations at compile time. Your computation can be optimized on the hardware side, but AFAIK not much in software. Modern processors can make very advanced optimizations, removing most of the overhead of loops and sometimes even of dynamic operations (like moving data between registers). However, an actual huge problem on modern computers is *memory*, and more specifically data transfers. They are much more costly than computation, so you should avoid them. – Jérôme Richard Apr 25 '21 at 15:35
  • You can store it in the executable without a performance penalty, but to do so, you need to load the matrix data before so that the OS can fetch it from the storage device (OSes are lazy: they move data only when it is needed). As long as the matrix is loaded in RAM and clearly fits in the cache, this is fine. – Jérôme Richard Apr 25 '21 at 15:39
  • "you need to load the matrix data before" I don't understand what this means. Load it before what? Load it how? My current plan is to try saving the "compressed sparse column" representation (except the value-array, since all values are 1) as two constexpr arrays, where I again automatically generate a large `.cpp` file defining them. Would you do it differently? – Adomas Baliuka Apr 25 '21 at 15:43
  • Your executable is not fully loaded into RAM by your OS when you run it. To force the matrix to be loaded into RAM, you need to read it in your program (e.g. in main, before anything else). If `multiply` is called many times or does not need to have a low latency, it may not be a problem though. Your solution should work, but it puts a lot of pressure on the compiler. Some compilers can crash or raise errors when you define huge constant arrays in .cpp files. Loading the data from a (compressed) binary file at startup time (at the beginning of main) should be safer and probably as fast. – Jérôme Richard Apr 25 '21 at 15:54
  • @JérômeRichard That image about the latency is pretty nice – dreamcrash Apr 25 '21 at 19:44
  • @dreamcrash Yeah (but unfortunately not quite up to date ^^'') – Jérôme Richard Apr 26 '21 at 23:40

I implemented the second method (constexpr arrays storing the matrix in compressed sparse column format) and it is a lot better. For a 64'000 x 22'000 binary matrix containing 35'000 ones, it takes <1 min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 MByte.

Probably one can still do a lot better. If anyone has an idea, let me know!

Below is a code example (showing a 5x10 matrix) illustrating what I did.

#include <iostream>
#include <array>
#include <cstdint>

// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;
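// colptr[c] .. colptr[c + 1] delimit the entries of row_idx holding the row indices of column c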
constexpr std::array<std::uint16_t, N + 1> colptr = {
0x0,0x1,0x2,0x3,0x4,0x5,0x5,0x5,0x5,0x5,0x5
};
constexpr std::array<std::uint16_t, num_nz> row_idx = {
0x0,0x1,0x2,0x3,0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {

    for (std::size_t col = 0; col < N; col++) {
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
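            // != on bools is XOR, i.e. addition modulo 2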
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    
    for (auto i : input) std::cout << i;
    std::cout << std::endl;

    encode(input, output);

    for (auto i : output) std::cout << i;
}
Adomas Baliuka