Is it possible to optimize these fortran loops?

Question

Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?

I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.

Here are two very similar examples of loops that I have, first:

do p=1,N
  do q=1,N

    do ab=1,nVV

      cd = 0
      do c=nO+1,N
        do d=c+1,N

          cd = cd + 1
          A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)

        end do
      end do
      
      kl = 0
        do k=1,nO
          do l=k+1,nO

            kl = kl + 1
            A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)

            end do
          end do

        end do

       do ij=1,nOO
         
         cd = 0
         do c=nO+1,N
           do d=c+1,N

             cd = cd + 1
              E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)

            end do
          end do


          kl = 0
          do k=1,nO
            do l=k+1,nO

              kl = kl + 1
              E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)

            end do
          end do

        end do

      end do
    end do

and the other one is:

    do p=1,N
      do q=1,N

        do ab=1,nVV

          cd = 0
          do c=nO+1,N
            do d=nO+1,N

              cd = cd + 1
              A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)

            end do
          end do

          kl = 0
          do k=1,nO
            do l=1,nO

              kl = kl + 1
              A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)

            end do
          end do

        end do

        do ij=1,nOO

          cd = 0
          do c=nO+1,N
            do d=nO+1,N

              cd = cd + 1
              E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)

            end do
          end do

          kl = 0
          do k=1,nO
            do l=1,nO

              kl = kl + 1
              E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)

            end do
          end do

        end do

      end do
    end do

The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).

I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.

We need more context on what the various entities you use are. For example, we haven't the slightest idea what effect `A(p,q,ab) = A(p,q,ab) + ...` has: is `A` an array or function; an intrinsic or derived type; is `=` intrinsic or defined assignment; is `+` intrinsic or defined? These aspects can completely override any assumptions, however obvious a guess may seem. — francescalus, Jan 17 '23 at 10:35
How big are the constants like `N`, `nVV`, `nO` in practice? How did you compile your program (compiler and flags)? Did you enabled optimizations? What kind of machine do you use to run the program? — Jérôme Richard, Jan 17 '23 at 10:43
@francescalus The two examples that I showed are part of a subroutine where all the quantities A, B,... are arrays. This subroutine just build the different arrays and everything is double precision. Then for your question about intrinsic or defined I'm not sure to understand it... — User3000, Jan 17 '23 at 10:44
@JérômeRichard In practice ```nO``` is small compared to ```N``` which is itself small compared to ```nVV```. To give some values, ```nO``` is between 2 and 10 while ```N``` is between 100 and 200, ```nVV``` is between 180 and 400. — User3000, Jan 17 '23 at 10:51
*This subroutine just build the different arrays* which kind of suggests (a) that it is a one-off operation and (b) may not be worth the effort required to optimise it significantly. Have you profiled your whole code and determined that this is a bottleneck? (Oh, and as always when asked about software the question *is it possible* has the answer *yes*.) — High Performance Mark, Jan 17 '23 at 10:59
@HighPerformanceMark I checked many parts of the code and yes this part is one of the bottleneck (the other bottlenecks are diagonalisations of matrices). — User3000, Jan 17 '23 at 11:05
Ok so the array are huge (several GiB) and the code is clearly memory-bound. This means OpenMP should not really help if you are running this on a PC but it could significantly help on a server though it will certainly not scale. The main memory is slow (compared to cache and the computational speed of CPU cores) nowadays so reading/writing huge arrays is pretty slow. — Jérôme Richard, Jan 17 '23 at 11:16
I notice that the access patterns to the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses depending on the size of B. It's difficult to know up to which point it can slow down your code, but changing the loop order if possible is a way to go. — PierU, Jan 17 '23 at 11:18
How *exactly* are all the arrays dimensioned? It could just be a series of calls to dgemv. A working example would really, *really* help. — Ian Bush, Jan 17 '23 at 11:48
Thanks @PierU but I'm not sure how to change the loop order for the B array... — User3000, Jan 17 '23 at 12:27
@IanBush So for the dimensions of the arrays I have: A(N,N,nVV), B(N,N,N,N), C(nVV,nVV), D(nO,nVV), E(N,N,nOO), F(nVV,nOO), G(nOO,nOO) — User3000, Jan 17 '23 at 12:30
I concur to Ian. It seems to me you are just doing some form of matrix multiplication. Is that correct? There are dedicated high-performance subroutines for that. — Vladimir F Героям слава, Jan 17 '23 at 12:31
I will post an answer to illustrate, as it's difficult to post code in comments. Alternatively, you may change the way the arrays are filled, if you have the choice + if it makes sense + if it doesn't hurt other parts of the code. — PierU, Jan 17 '23 at 12:33
@VladimirFГероямслава It was one thing that I tried to do but it is not straightforward to rewrite everything in terms of matrix multiplication (at least for me) so I didn't manage to do it... But you are probably right it might be possible to do it. Actually the formula do not imply any matrix multiplication due to the different dimensions of the arrays. Maybe it's possible to rewrite the 4 dimensional B array in a 2 dimensional array and do the matrix multiplication... — User3000, Jan 17 '23 at 12:41
@IanBush I'll try to give you a working example but maybe with small arrays because in practice they are huge and I don't know how to give them to you. — User3000, Jan 17 '23 at 12:44
If they don't fit into the RAM then any code optimization is (almost) pointless. The performance hit because of the usage of the swap file (even if it is on SSD) is way bigger than any gain you can get by rearranging the code... How much RAM do you have ? — PierU, Jan 17 '23 at 12:53
@User3000 What is nOO in your matrix sizes? N0*N0? And from the dimensions you give us it should probably B is the biggest array you have so if N is 200 you will require about 12GByte of memory. And if you don't I agree with PierU - use of swap will dominate anything else — Ian Bush, Jan 17 '23 at 12:59
@PierU Maybe there are cases where the arrays can fit in the RAM (depending on the values of N, nVV,...) — User3000, Jan 17 '23 at 13:00
I made some tests and in some cases I have a nVV=2500 or more so I guess it doesn't fit in memory... — User3000, Jan 17 '23 at 13:25
N is the crucial dimension - 200 * 200 * 200 * 200 > 2500 * 2500. That fourth order scaling in the size of B is a potential killer - unless you bite the bullet and go distributed memory parallel and use a cluster. — Ian Bush, Jan 17 '23 at 13:33
Assuming B is a default REAL array its size is (4*N^4/2^30)GiB. If you say that B should not be larger than 10GiB to leave some room to the other parts of your program, the other applications, the OS, this means N<228. — PierU, Jan 17 '23 at 13:54
@IanBush `8*200**4/1024**3 ~= 12` GiB for the biggest array. While this is big, modern computing machines typically have significantly more RAM than this. AFAIK, most HPC computing nodes have at least 64 GiB nowadays. Having an access to a machine with 128~256 GiB RAM is pretty easy. Cloud machines also typically reach 256~512 GiB of RAM and anyone (paying) can use them relatively easily. Such machine have a much bigger RAM throughput. Distributed memory is not required though this may be a good way to improve performance using an efficient partitioning (at the expense of a more complex code). — Jérôme Richard, Jan 17 '23 at 14:04
@JérômeRichard I am perfectly aware of this. However the OP may be developing on his/her laptop. He/she may not have access to HPC nodes, many people do not or simply do not want the hassle of using a remote machine. Personally I suspect this is an XY problem and the user doesn't really need all this memory, but again we can't tell from what is provided. — Ian Bush, Jan 17 '23 at 14:16
@JérômeRichard and IanBush thank you for all your comments! Actually I do have access to HPC nodes, I'm a PhD student in a quantum chemistry and physics lab but I do more theory than coding (this explain my post and my lack of knowledge). I identified this part as one the bottleneck of the code and I just wanted to know if I could write it in a better way to go a little bit faster without putting to much effort and time using OpenMP/MPI (principally because I know nothing about those tools). So again thanks for your comments! And if I have time I will try to use those tools in our HPC nodes. — User3000, Jan 17 '23 at 15:20
@PierU Thanks a lot for your answer! I did try (without OpenMP) your solution for a small case and it works faster that what I had! I will try with OpenMP! — User3000, Jan 17 '23 at 16:01
There has been no mention of sparsity. Is this a feature of the equations that could be utilised ? I agree with @Pieru; either place DO p,q as an inner loop or change the subscript order for arrays A,B,E. Essentially, by making the inner-most DO the first subscript does suggest vectorising gains can be achieved. an inner DO q p can achieve this and mitigate the problems of referencing B. — johncampbell, Jan 19 '23 at 02:39
From what you've described, the size `nVV` seems to be too small. The index `cd` will be incremented `(N-n0)**2` times. For N = 100, and n0 = 10, this would mean `cd` reaches the value 90^2 = 8100 which exceeds the array bound `C(nVV)`. — IPribec, Jan 28 '23 at 00:41
@IPribec rather about half that value (still, nVV looks too small indeed) — PierU, Jan 28 '23 at 07:56
It's about half that value in the first code block, where a triangular section is addressed. The OP should clarify in more detail. — IPribec, Jan 28 '23 at 15:20
@IPribec I was just trying to give an idea of the different values. But the biggest case that I looked at is N = 210, nOO = 8, nVV = 40 804. — User3000, Jan 30 '23 at 17:49
You said earlier that `n00` is `n0*n0`. 8 is not a perfect square. You probably mean n0; this would then give (210 - 8)**2 = 40804, as I expected. — IPribec, Jan 30 '23 at 21:34

PierU · Answer 1 · 2023-01-17T15:38:15.477

From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).

Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.

Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):

!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
   cd = 0
   do c=nO+1,N
      do d=c+1,N
         cd = cd + 1         
         A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
      end do
   end do
   kl = 0
   do k=1,nO
      do l=k+1,nO
         kl = kl + 1
         A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
      end do
   end do
end do  
!!$OMP END PARALLEL DO

What I did:

moved the loops on p and q from outer to inner positions (it's not always as easy than it is here)
replaced them with array syntax (no performance gain to expect, just a code easier to read)

Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.

EDIT/Hint

Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).

Does `C` have multidimensional arrays? I thought no, so there is no column-major order vs. row-major order in a 1D array. Anyway to point stands that things nearest in the first index are also nearest in memory in Fortran. — John Alexiou, Jan 17 '23 at 19:32
@JohnAlexiou Yes, C has multidimentional arrays, with some limitations. — PierU, Jan 17 '23 at 20:29

IPribec · Answer 2 · 2023-01-28T15:40:27.443

It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.

For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.

Working from the bottom up, you start with a simple sum of two matrices.

!
! a_ij := a_ij + beta * b_ij 
!
pure subroutine apb(A,B,beta)
   real(dp), intent(inout) :: A(:,:)
   real(dp), intent(in) :: B(:,:)
   real(dp), intent(in) :: beta
   A = A + beta*B
end subroutine

(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)

One step higher is a tensor contraction

!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
   real(dp), intent(inout) :: A(:,:)
   real(dp), intent(in) :: B(:,:,:,:)
   real(dp), intent(in) :: C(:,:)

   integer :: k, l

   do l = 1, size(B,4)
      do k = 1, size(B,3)

         call apb( A, B(:,:,k,l), C(k,l) )
   
      end do
   end do

end subroutine

Note the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk instead of c_kl.)

Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:

!
! A_n := A_n  +  B1_cd C_cdn +  B2_kl D_kln
!
!   The elements of  A_n    are  a_ijn
!   The elements of  B1_cd  are  B1_ijcd
!   The elements of  B2_kl  are  B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
   real(dp), intent(inout), contiguous :: A(:,:,:)
   real(dp), intent(in) :: B1(:,:,:,:)
   real(dp), intent(in) :: B2(:,:,:,:)
   real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)

   real(dp), pointer :: p_C(:,:,:) => null()
   real(dp), pointer :: p_D(:,:,:) => null()

   integer :: k
   integer :: nc, nd

   nc = size(B1,3)*size(B1,4)
   nd = size(B2,3)*size(B2,4)

   if (nc /= size(C,1)) then
      error stop "FATAL ERROR: Dimension mismatch between B1 and C"
   end if

   if (nd /= size(D,1)) then
      error stop "FATAL ERROR: Dimension mismatch between B2 and D"
   end if

   ! Pointer remapping of arrays C and D to rank-3
   p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
   p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D

   !$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
   do k = 1, size(A,3)

      call reduce_b( A(:,:,k), B1, p_C(:,:,k))
      call reduce_b( A(:,:,k), B2, p_D(:,:,k))

   end do
   !$omp end parallel do

end subroutine

Finally, we reach the main level where we select the subblocks of B

program doit

   use transform, only: abcd, dp

   implicit none

   ! n0  [2,10]
   !
   integer, parameter :: n0  = 6
   integer, parameter :: n00 = n0*n0

   integer, parameter :: N, nVV

   real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)

   ! N   [100,200]
   !
   read(*,*) N
   nVV = (N - n0)**2

   allocate(A(N,N,nVV))
   allocate(B(N,N,N,N))
   allocate(C(nVV,nVV))
   allocate(D(n00,nVV))

   print *, "Memory occupied (MB): ", & 
      real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2

   A = 0
   call random_number(B)
   call random_number(C)
   call random_number(D)

   call abcd(A=A, &
             B1=B(:,:,n0+1:N,n0+1:N), &
             B2=B(:,:,1:n0,1:n0), &
             C=C, &
             D=D)

   deallocate(A,B,C,D)

end program

Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.

The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d

With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.

One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

Thanks a lot for your answer ! As you said I imagine that it would be easier with Python but I would prefer to stick with fortran because the example that I showed is just a small part of a much bigger fortran code. I have to admit that I am not (at all) a fortran expert so there are few things that I am not sure to understand like use transform, target, contiguous (I guess it's to have access to contiguous memory) but I'll try your answer ! Actually for B it's because in other places in the code we use the other parts of B. — User3000, Jan 30 '23 at 18:01
@User3000 I guess Ivan forgot to define the module named `transform`that contains the routine `abcd()`, along with a real kind `integer, parameter :: dp = kind(1d0)`. `target` is an attribute that specifies that the variable can be the target of a pointer, and indeed `contiguous` is an attribute that specifies that the array shall be contiguous in memory. — PierU, Jan 30 '23 at 21:57
I didn't include the full module to keep the answer shorter. The full code along with some instructions is available at the link: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d. — IPribec, Jan 30 '23 at 22:17

Is it possible to optimize these fortran loops?

2 Answers2

EDIT/Hint