
I have C code for generic sub-matrix-vector (gesubmv) multiplication that computes y(rinds) = A(rinds, cinds)*x(cinds) + y(rinds), where rinds and cinds are vectors of row and column indices (not necessarily sorted), respectively. For genericity, rinds = rindsA_data(rindsA_1:rindsA_m) and cinds = cindsA_data(cindsA_1:cindsA_n).

My code is the following:

void gesubmv(const double A_data[], const int A_size[2],
             const int rindsA_data[], int rindsA_1, int rindsA_m,
             const int cindsA_data[], int cindsA_1, int cindsA_n,
             const double x_data[], double y_data[]) {
    int i;
    int j;
    for (i = rindsA_1; i <= rindsA_m; i++) {
        int k;
        double tmp_sum = 0.0; /* dot product of A(rinds(i), cinds) with x(cinds) */
        for (j = cindsA_1; j <= cindsA_n; j++) {
            k = cindsA_data[j - 1] - 1; /* 0-based column index */
            tmp_sum +=
                x_data[k] * A_data[k + (A_size[1] * (rindsA_data[i - 1] - 1))];
        }
        k = rindsA_data[i - 1] - 1; /* 0-based row index */
        y_data[k] += tmp_sum;
    }
}

The issue is that it seems to be more efficient to compute y(:) = A(:, cinds)*x(cinds) + y(:) and then pick y(rinds). I suspect this might be related to how the generated instructions are executed (https://godbolt.org/). Or I could be doing something (more obvious) wrong. Any ideas?

The code for the alternative, y(:) = A(:, cinds)*x(cinds) + y(:), is the following.

void gesubmv2(const double A_data[], const int A_size[2],
              const int cindsA_data[], int cindsA_1, int cindsA_n,
              const double x_data[], double y_data[]) {
    int i;
    int j;
    for (i = 0; i < A_size[1]; i++) {
        int k;
        double tmp_sum = 0.0; /* dot product of row i of A(:, cinds) with x(cinds) */
        for (j = cindsA_1; j <= cindsA_n; j++) {
            k = cindsA_data[j - 1] - 1; /* 0-based column index */
            tmp_sum += x_data[k] * A_data[k + (A_size[1] * i)];
        }
        y_data[i] += tmp_sum;
    }
}
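
For context, here is a minimal sketch of how the "compute y(:) and then pick y(rinds)" variant can be driven. gesubmv_via_full and y_work are names made up for illustration, and y_work is assumed to start out as a copy of y_data so that the y(rinds) entries end up the same as with gesubmv.

void gesubmv_via_full(const double A_data[], const int A_size[2],
                      const int rindsA_data[], int rindsA_1, int rindsA_m,
                      const int cindsA_data[], int cindsA_1, int cindsA_n,
                      const double x_data[], double y_work[], double y_data[]) {
    int i;
    /* y_work(:) = A(:, cinds)*x(cinds) + y_work(:), over all rows */
    gesubmv2(A_data, A_size, cindsA_data, cindsA_1, cindsA_n, x_data, y_work);
    /* pick only the y(rinds) entries out of the full result */
    for (i = rindsA_1; i <= rindsA_m; i++) {
        int k = rindsA_data[i - 1] - 1;
        y_data[k] = y_work[k];
    }
}
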
    If this code works, but you're looking for optimizations, probably a better candidate for [code review exchange](https://codereview.stackexchange.com/) – yano May 18 '23 at 15:56
  • Before posting on Code Review please read [A guide to Code Review for Stack Overflow users](https://codereview.meta.stackexchange.com/questions/5777/a-guide-to-code-review-for-stack-overflow-users/5778#5778) and [How do I ask a good question?](https://codereview.stackexchange.com/help/how-to-ask) – pacmaninbw May 18 '23 at 16:35
  • The performance of this code can be greatly affected by the sizes of the arrays, the relationship of `A_size[1]` to the cache geometry, the ordering of values in `rindsA_data` and `cindsA_data`, the locations of the arrays relative to each other, and features of the target platform. Without that information, it is hard to say how to optimize the code. – Eric Postpischil May 18 '23 at 16:36

1 Answer


You can take the (A_size[1] * (rindsA_data[i - 1] - 1)) expression out of the j loop. The optimiser might already be doing this, so you might not see much, if any, performance difference. Likewise, in the second snippet, (A_size[1] * i) doesn't need to be inside the j loop. If your optimiser isn't hoisting the expression in the first snippet, that could be why you're seeing the performance difference.

Also, you increment j in the for statement and then use j - 1 as an index. You can save an operation per iteration by computing the first k before the loop, loading the next k from cindsA_data[j] (no subtraction) at the end of each iteration, and adding the final term after the loop, once the rest of tmp_sum has been accumulated. e.g.

    kk = (A_size[1] * (rindsA_data[i - 1] - 1)); /* row offset, invariant in the j loop */
    k = cindsA_data[cindsA_1 - 1] - 1;           /* first 0-based column index */
    for (j = cindsA_1; j < cindsA_n; j++) {
        tmp_sum += x_data[k] * A_data[k + kk];
        k = cindsA_data[j] - 1;                  /* column index for the next iteration */
    }
    tmp_sum += x_data[k] * A_data[k + kk];       /* last column */
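
Putting both suggestions together, the first routine could look roughly like the sketch below. gesubmv_opt is just an illustrative name, and it assumes there is at least one column index (cindsA_n >= cindsA_1).

    void gesubmv_opt(const double A_data[], const int A_size[2],
                     const int rindsA_data[], int rindsA_1, int rindsA_m,
                     const int cindsA_data[], int cindsA_1, int cindsA_n,
                     const double x_data[], double y_data[]) {
        int i;
        int j;
        for (i = rindsA_1; i <= rindsA_m; i++) {
            /* row offset, hoisted out of the j loop */
            const int kk = A_size[1] * (rindsA_data[i - 1] - 1);
            /* first 0-based column index */
            int k = cindsA_data[cindsA_1 - 1] - 1;
            double tmp_sum = 0.0;
            for (j = cindsA_1; j < cindsA_n; j++) {
                tmp_sum += x_data[k] * A_data[k + kk];
                k = cindsA_data[j] - 1; /* column index for the next iteration */
            }
            tmp_sum += x_data[k] * A_data[k + kk]; /* last column */
            y_data[rindsA_data[i - 1] - 1] += tmp_sum;
        }
    }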

– Simon Goater