Masked gemm: How to write a matrix-matrix product A=B*c so that only some elements in A are calculated?

Question

I am trying to calculate the product of two matrices, say A=B×C. But I only care about some elements in A, not all. For example, A_ij needs to be calculated, if E_ij > 0. Is there any relevant c/c++ or python library to accomplish this task.

I can't calculate all the elements in A because I have to do it in a shorter time. Any help is appreciated.

What did you try?

Eigen: (E > 0).select(B*C), verly slow.

for loop: ret = B * C

#pragma omp parallel for
for (int i = 0; i < B.rows(); i++){
    vector<int> cal_ind;
    for (int j = 0; j < C.columns(); j++){
        if (E(i, j) > 0){
            cal_ind.push_back(j);
        }
    }
    if (cal_ind.size() == 0){
        continue;
    }
    VectorXd d_xi_c = B.row(i) * C(Eigen::all, cal_ind);
}

blaze: I modified the mmm function so that it only calculates some elements, but both the original mmm and the modified mmm are very slow.

what were you expecting?

For example, size(A) = (100, 100), there are 10,000 elements in A, if only 1,000 elements need to be calculated, what operation can be used to complete this task in a shorter time?

Is there really no structure to `E` such as it removing entire rows or columns? And 10% desired values is a realistic number? — Homer512, Jun 16 '23 at 14:03
Well you could note the fact that Aij is the inner product of row i of B with column j of C and maybe the library will allow you to do the inner product fast. Actually, due to the order of storage, it may be quicker to transpose C first. Then Aij is the inner product of row i of B with row j of (C transpose) and, with luck, whole rows may be in cache together. — lastchance, Jun 16 '23 at 14:26
`(E > 0).select(B*C)` can't possibly help, because in this case the entire product needs to be evaluated. You can try `(E > 0).select(B.lazyProduct(C), 0.0)` -- but this likely will only be beneficial, if `E` has really few entries. — chtz, Jun 16 '23 at 15:23
@Homer512 There is no structure to E. In other words, it can be considered that the positions of elements greater than 0 are random. — shenfei pei, Jun 18 '23 at 07:52
@lastchance Thanks, I know it, but I don't think it should be as fast as the above example where two for loop is involved and Eigen library is used. — shenfei pei, Jun 18 '23 at 07:57

score 3 · Answer 1 · answered Jun 16 '23 at 13:44

I don't think you can easily achieve what you want: when doing a full matrix multiplication, the main speedup of the current matrix multiplication algorithms comes from the fact that they do the multiplication block-wise and leverage the CPU's SIMD instructions as well as are optimized towards the CPU's various cache sizes. An optimized algorithm in a library is easily 10-100 times faster than doing it the naive way by 3 nested loops. However, those algorithms don't calculate individual elements one at a time, but calculate multiple elements all at once. (As I said: they operate block-wise.)

If in your case you only want to calculate a certain subset of elements, unless the elements are organized in a systematic way (e.g. only the left N columns, or top M rows, or something like that), then the only way to restrict the calculation to just those elements is to explicitly write all 3 nested loops. And that will be less efficient than using the optimized GEMM algorithms modern libraries provide.

The best you can probably do is probably roughly what you're doing in your OpenMP example, because there you're already using the product implementation of your library for the inner-most loop. (You could probably avoid some memory allocations for the index vector if you do some clever tricks, but other than that I don't see much optimization potential there.)

If there is some additional systematic towards the elements (e.g. nearly all cluster in a specific block of the matrix), then you could probably optimize that a bit more for your specific use case, using that information. (Assuming you know this behavior in advance and don't have to detect that during runtime.)

But to be honest, if the numbers you provide are realistic, and you are calculating 1000 elements out of 10'000, then I don't think you'll ever be faster than using the full GEMM implementation of a decent library to calculate all elements, even if most are not needed, just because of how much speedup the full algorithms provide compared to calculating the elements individually. You either need to only require a calculation of a lot less elements than that, or you need to have a very systematic way of determining which elements need to be calculated. (That doesn't depend on evaluating a condition for each element individually.)

Perhaps if you provide more information about the actual problem you want to solve here, there may be a clever way of achieving that - but when looking at just the matrix multiplication part, I doubt you'll be able to speed that up much, if at all.

Thank you for your reply. First of all, There is no structure to E. In other words, it can be considered that the positions of elements greater than 0 are random. In addition, the size provided is just an example, the realistic number is A=B×C, size(B)=10000, 256; size(C)=256, 10575. — shenfei pei, Jun 18 '23 at 08:04

Masked gemm: How to write a matrix-matrix product A=B*c so that only some elements in A are calculated?

1 Answers1