1

I’m aware of Fortran’s column major ordering and to ensure the inner-most loop is indexing the first dimension of the array. What I’m unsure about is which of the following would be faster:

do k=1,100
  do j=1, 12
      x(:,k) * y(j,:)
  enddo
enddo

Or:

do k=1,100
  do j=1, 12
     x(:,k) * y_transpose(:,j)
  enddo
enddo

Where x is (k x k) and y is (j x k).

I assume the second option would be faster as the first dimension of y_transpose would be contiguous in memory and faster to access?

Appreciate the help.

cjp1032
  • 23
  • 3
  • They aren't actually Fortran statements (they appear to be RHS only), so it's hard to tell. You might be surprised what compiler optimisers are capable of doing, so I would time it and see. Your code doesn't give enough information about what you are ultimately trying to do, but I suspect that standard inner_product(), or even matmul(), might help you. – lastchance Jul 26 '23 at 14:08
  • 1
    I apologise for the pseudo code, I wanted to make the example simple. Basically it’s doing element wise multiplication which is then summed and assigned to an array index: `z(j,k)=sum(x(:,k) * y(j,:))`. Timing is showing an approx 10% speed up with transposing. – cjp1032 Jul 26 '23 at 14:31
  • 1
    Yes, as lastchance says, context is vitally important so "simple" can really hide what's important. For example, at the most extreme, your use of that expression could be something like `n=size(x(:,k)*y_transpose(:,j))` and your compiler (along with the rest of us) will just laugh at your worrying about memory access patterns. Further, there's no basic answer: more advanced implementations will think about cache blocking and all manner of other access optimizations and these again depend on specifics of use and target architecture. – francescalus Jul 26 '23 at 14:51
  • Right, so for example even if k and j were an order of magnitude higher in dimension the structure of `y` vs `y_transpose` might not always produce any changes in speed? It sounds like my thinking is too simplistic, I was hoping with all other variables constant the best memory access pattern would be obvious! – cjp1032 Jul 27 '23 at 00:35
  • Instead of analyzing, why not just time it. With today's modern optimizing compilers, there may be very little or not difference or it may be significantly different. Even worse, it may be significantly different from what you analyzed - then you will need to re-analyze. – cup Aug 02 '23 at 05:06

0 Answers0