[Meta: This question has been weighing on me due to its being unanswered for so long. I've personally shied away from it due to its use of the phrase "most computationally efficient." In practice, its hard to guarantee that any solution meets that description given variations that may occur from one target machine or dataset to the next. But since nobody else has stepped up, I'll make some comments in hopes that they might be of use.]
A few things that stand out for me in your code:
1) Unless I'm missing something, you are computing norm(V[r, ..])
redundantly many, many times during the computation. Asymptotically speaking, this suggests that you're doing quadratic work where only linear work is required. I'd suggest computing the norm for each row once and storing it in an array to avoid this redundant computation:
var normVrow: [docs] real = [r in docs] norm(V[r,..]);
Then, within the inner loop, you can just refer to normVrow[i]
or normVrow[j]
.
2) Since this is Chapel and your loop seems to have no cross-loop dependencies, rather than using serial for
loops, you should probably use a parallel forall
loop for this computation. There's a question about whether to:
(a) change your outer loop to a forall
(which would result in a load imbalance since the overall iteration space is triangular),
(b) change both loops to forall
loops (which would help the load imbalance problem by over-decomposing, but likely also increase overhead), or
(c) make the outer loop into a dynamically scheduled loop to address the load imbalance.
My instinct would be to do option c using Chapel's dynamic iterator:
use DynamicIters;
forall i in dynamic(ndocs) {
...
}
3) A final thing to consider would be to avoid the triangular iteration space and just redundantly compute X[i,j]
and X[j,i]
even though they'll have the same value. This may not make sense on a shared-memory run, but if you were computing on a distributed array X
, you'd likely reduce communication since those matrix values will be stored by different processors. In this approach, you could iterate with a single forall
loop over X.domain
, and the result would be well-load-balanced by default without the need for the dynamic iterator.