I have C code for generic sub-matrix-vector (gesubmv) multiplication that computes y(rinds) = A(rinds, cinds)*x(cinds) + y(rinds), where rinds and cinds are vectors of row and column indices (not necessarily sorted), respectively. For genericity, rinds = rindsA_data(rindsA_1:rindsA_m) and cinds = cindsA_data(cindsA_1:cindsA_n).
My code is the following:
void gesubmv(const double A_data[], const int A_size[2],
             const int rindsA_data[], int rindsA_1, int rindsA_m,
             const int cindsA_data[], int cindsA_1, int cindsA_n,
             const double x_data[], double y_data[]) {
  /* A is stored row-major, so A_size[1] (the column count) is the row
     stride. All index vectors hold 1-based, MATLAB-style indices. */
  int i;
  int j;
  for (i = rindsA_1; i <= rindsA_m; i++) {
    int k;
    double tmp_sum = 0.0;
    /* Dot product of row rinds(i) of A with x(cinds). */
    for (j = cindsA_1; j <= cindsA_n; j++) {
      k = cindsA_data[j - 1] - 1; /* 0-based column index */
      tmp_sum += x_data[k] * A_data[k + A_size[1] * (rindsA_data[i - 1] - 1)];
    }
    k = rindsA_data[i - 1] - 1; /* 0-based row index */
    y_data[k] += tmp_sum;
  }
}
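For concreteness, here is a minimal driver (the 3-by-4 row-major matrix and the 1-based index vectors are made up purely for illustration):

#include <stdio.h>

void gesubmv(const double A_data[], const int A_size[2],
             const int rindsA_data[], int rindsA_1, int rindsA_m,
             const int cindsA_data[], int cindsA_1, int cindsA_n,
             const double x_data[], double y_data[]);

int main(void) {
  const double A_data[] = {1.0, 2.0,  3.0,  4.0,   /* row 1 */
                           5.0, 6.0,  7.0,  8.0,   /* row 2 */
                           9.0, 10.0, 11.0, 12.0}; /* row 3 */
  const int A_size[2] = {3, 4};       /* 3 rows, 4 columns */
  const int rindsA_data[] = {3, 1};   /* rows 3 and 1 (unsorted) */
  const int cindsA_data[] = {4, 2};   /* columns 4 and 2 */
  const double x_data[] = {1.0, 1.0, 1.0, 1.0};
  double y_data[3] = {0.0, 0.0, 0.0};
  int i;
  /* Expected: y(3) += A(3,4) + A(3,2) = 12 + 10 = 22
               y(1) += A(1,4) + A(1,2) =  4 +  2 =  6 */
  gesubmv(A_data, A_size, rindsA_data, 1, 2, cindsA_data, 1, 2,
          x_data, y_data);
  for (i = 0; i < 3; i++) {
    printf("y[%d] = %g\n", i, y_data[i]);
  }
  return 0;
}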
The issue is that it seems to be more efficient to compute y(:) = A(:, cinds)*x(cinds) + y(:) over all rows and then pick out y(rinds), even though that does strictly more arithmetic. I suspect this might be related to how the generated instructions are executed or scheduled (the assembly can be compared on https://godbolt.org/). Or I could be doing something (more obvious) wrong. Any ideas?
Code for the alternative y(:) = A(:, cinds)*x(cinds) + y(:) is the following.
void gesubmv2(const double A_data[], const int A_size[2],
              const int cindsA_data[], int cindsA_1, int cindsA_n,
              const double x_data[], double y_data[]) {
  int i;
  int j;
  /* Loop over every row; A_size[0] is the row count of the row-major A,
     and A_size[1] is the row stride, as in gesubmv. */
  for (i = 0; i < A_size[0]; i++) {
    int k;
    double tmp_sum = 0.0;
    /* Dot product of row i of A with x(cinds). */
    for (j = cindsA_1; j <= cindsA_n; j++) {
      k = cindsA_data[j - 1] - 1; /* 0-based column index */
      tmp_sum += x_data[k] * A_data[k + A_size[1] * i];
    }
    y_data[i] += tmp_sum;
  }
}
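Note that the two functions are not drop-in equivalents: gesubmv updates only y(rinds), while gesubmv2 updates every entry of y. When the untouched entries of y matter, the "compute everything, then pick" variant has to go through a scratch vector, along these lines (the wrapper name gesubmv_via_full and the work buffer are mine, just to sketch the idea):

#include <string.h>

void gesubmv2(const double A_data[], const int A_size[2],
              const int cindsA_data[], int cindsA_1, int cindsA_n,
              const double x_data[], double y_data[]);

/* Compute the full product A(:, cinds)*x(cinds) into a zeroed scratch
   vector, then scatter only the y(rinds) entries back, so the result
   matches gesubmv. work must hold at least A_size[0] doubles. */
void gesubmv_via_full(const double A_data[], const int A_size[2],
                      const int rindsA_data[], int rindsA_1, int rindsA_m,
                      const int cindsA_data[], int cindsA_1, int cindsA_n,
                      const double x_data[], double y_data[], double work[]) {
  int i;
  memset(work, 0, (size_t)A_size[0] * sizeof(double));
  gesubmv2(A_data, A_size, cindsA_data, cindsA_1, cindsA_n, x_data, work);
  for (i = rindsA_1; i <= rindsA_m; i++) {
    int k = rindsA_data[i - 1] - 1; /* 0-based row index */
    y_data[k] += work[k];
  }
}

Even with the extra rows and the final gather pass, this is the variant that comes out faster in my timings, which is what prompted the question.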