
I have the following subroutine, which is the stepper function in a larger DEM program. It computes every pairwise interaction between particles i and j and then updates the forces. As a first effort, I am trying to parallelize this heavy O(N^2) loop before moving on to fancier search algorithms.

But I can't find a way to make it work. I know the problem comes from variables being modified by different threads, but I don't know how to deal with that properly. One possibility would be to redefine all my arrays as matrices, but I still don't know if this is the correct way.

Also, note that I use include files for my variables, since I have a lot of variables that need to be accessed and modified by many subroutines and this felt like the easiest way to make that work. I am open to suggestions.

Here is my subroutine:

subroutine stepper (tstep)

    use omp_lib

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"

    integer :: i, j
    integer, intent(in) :: tstep

    ! reinitialize force arrays for contact and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
    end do
    
    ! work in the reference frame of the ith particle
    ! loop through all j particles and compute interactions

    !$omp parallel do schedule(dynamic) &
    !$omp private(i,j) &
    !$omp reduction(+:tfx,tfy,fcx,fcy,fbx,fby,m,mc,mb)
    do i = 1, n
        do j = i + 1, n

            ! compute relative position and velocity
            call rel_pos_vel (i, j)

            ! bond initialization
            if ( tstep .eq. 1 ) then
                if ( -deltan(i, j) .le. 5d-1 * r(i)) then ! can be fancier
                    !bond (i, j) = 1
                end if
                call bond_properties (i, j)
            end if

            ! verify if two particles are colliding
            if ( deltan(i,j) .gt. 0 ) then

                call contact_forces (i, j)
                !call bond_creation (i, j) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(i,j) * cosa(i,j)
                fcy(i) = fcy(i) - fcn(i,j) * sina(i,j)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(i,j) - mcc(i,j)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(i,j) * cosa(i,j)
                fcy(j) = fcy(j) + fcn(i,j) * sina(i,j)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(i,j) + mcc(i,j)

            end if

            ! compute forces from bonds between particle i and j
            if ( bond (i, j) .eq. 1 ) then

                call bond_forces (i, j)
                !call bond_breaking (i, j)

                ! update force on particle i by particle j due to bond
                fbx(i) = fbx(i) - fbn(i,j) * cosa(i,j) +    &
                                    fbt(i,j) * sina(i,j)
                fby(i) = fby(i) - fbn(i,j) * sina(i,j) -    &
                                    fbt(i,j) * cosa(i,j)

                ! update moment on particle i by particle j due to bond
                mb(i) = mb(i) - r(i) * fbt(i,j) - mbb(i, j)

                ! Newton's third law
                ! update force on particle j by particle i due to bond
                fbx(j) = fbx(j) + fbn(i,j) * cosa(i,j) -    &
                                    fbt(i,j) * sina(i,j)
                fby(j) = fby(j) + fbn(i,j) * sina(i,j) +    &
                                    fbt(i,j) * cosa(i,j)
 
                
                ! update moment on particle j by particle i due to bond
                mb(j) = mb(j) - r(i) * fbt(i,j) + mbb(j, i)

            end if

            ! compute sheltering height of particle j on particle i for air and water drag
            call sheltering(i, j)

        end do

        ! compute the total forcing from winds, currents and coriolis on particle i
        call forcing (i)
        !call coriolis(i)

        ! reinitialize total force arrays before summing everything
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0

        ! sum all forces together on particle i
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)

        ! sum all moments on particle i together
        m(i) =  mc(i) + mb(i) + ma(i) + mw(i)
    end do
    !$omp end parallel do

    ! forces on side particles for experiments
    call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper
  • rel_pos_vel(i, j) computes relative positions, velocities, angles, etc. between particles i and j. Everything inside is a 2D array from common blocks.

  • bond_properties(i, j) same thing here

  • contact_forces(i, j) uses some local variables as well as some arrays from common blocks

  • bond_forces(i, j) same thing

  • sheltering and forcing: same as above

I tried to put different variables (the ones that are being changed: tfx, tfy, m, etc.) in the private clause, but without success. I also don't think I understand the reduction clause well. Also, I know that some constant arrays like the radius r(i) are accessed by multiple threads, but I don't know how to deal with that.

Actually, when printing the number of threads in the parallel region (using omp_get_num_threads()), the output is 1, indicating that no parallelization is taking place.

Also, when trying to parallelize the first initialization loop as a first step towards understanding:

!$omp do private(i) schedule(dynamic) num_threads(10)
print*, omp_get_num_threads()
do i = 1, n
    mc(i)  = 0d0
    mb(i)  = 0d0
    fcx(i) = 0d0
    fcy(i) = 0d0
    fbx(i) = 0d0
    fby(i) = 0d0
end do
!$omp parallel do

it is still not working: the output of omp_get_num_threads is 1, even though the number of threads is explicitly set to 10.

I am compiling using gfortran, with the flags -ffast-math, -O3, -fopenmp
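
As a sanity check, a tiny standalone file (the file name and commands below are just an illustration, not part of my program) can confirm whether OpenMP itself works outside of the larger build:

! omp_check.f90 -- minimal standalone OpenMP check (hypothetical file name)
program omp_check
    use omp_lib
    implicit none

    ! default number of threads the runtime would use
    print *, "max threads available:", omp_get_max_threads()

    !$omp parallel
    !$omp single
    ! should be greater than 1 on a multi-core machine,
    ! unless OMP_NUM_THREADS forces it to 1
    print *, "threads in parallel region:", omp_get_num_threads()
    !$omp end single
    !$omp end parallel
end program omp_check

Compiled directly with gfortran -fopenmp omp_check.f90 -o omp_check, this should report more than one thread; if it does but the full program still runs on a single thread, the build system is most likely not passing -fopenmp to the compiler (which is what the comments below eventually point to).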

  • Does OpenMP work for you elsewhere? How did you set up the environment? How did you compile? Unfortunately, we cannot test your code - see [mcve]. – Vladimir F Героям слава May 11 '23 at 16:30
  • My code is too big to be posted here. This is the only place openmp is used. I am compiling using gfortran, with the flags -o3, -ffast-math -fopenmp – Sogapi May 11 '23 at 17:28
  • "My code is too big to be posted here." That's why you should create a [mcve]. Pay special attention to words "minimal' and "reproducible". Show us the settings of the environment in which you are executing the code and the exact complete compilation command. – Vladimir F Героям слава May 12 '23 at 05:03
  • 1) What does "I can't find a way to make it work" mean? No parallelization but correct results? Wrong results? Compilation error? Runtime error?... Do you expect us to guess? 2) Does it work properly in serial mode, without `-fopenmp`? – PierU May 12 '23 at 09:02
  • How does `call rel_pos_vel (i, j)` work, for instance? You are passing `i` and `j` as input, but where is the result? Does it update some variables that are in the include files? But then we need to see the include files. – PierU May 12 '23 at 09:10
  • Before the end of the `i` do loop you are updating the `tfx` and `tfy` arrays, but only with the `i` index. Consequently, all that has been calculated with the `j` index does not contribute there. So you should update them either inside the `j` do loop (and also for the `j` index), or in a separate loop after the parallel loop – PierU May 12 '23 at 09:25
  • Another question: what is the result that you need at the end? Only `tfx`, `tfy`, and `m`? Or also the individual components `fbx`, `fby`, etc... ? – PierU May 12 '23 at 09:28
  • Ok, so @PierU, what happens is no parallelization and the same results as in serial, as a quick htop shows that only one thread is being used. All subroutines that are called are modifying 2D arrays with indices (i,j) if provided with input (i,j). And at the end I want tfx, tfy and m, which are 1D arrays of the size of the number of particles. – Sogapi May 13 '23 at 13:19
  • You may then proceed a bit differently: all the `fbx`, `fby`, etc., could be just scalars declared within the parallel region, which you would directly add/subtract to `tfx` and `tfy` (which would still be reduced) (a sketch of this idea follows the comment thread below). – PierU May 14 '23 at 08:14
  • Another comment on the scheduling: as the iterations are unbalanced, `schedule(dynamic)` looks like a good idea at first, but in practice it's not that efficient. `schedule(nonmonotonic:dynamic)` can possibly help, but experience shows that in such cases it's better to have the iterations with a low workload first (in your case `do i = n, 1, -1`) and use `schedule(guided)` – PierU May 14 '23 at 08:20
  • Sorry, it's Fortran, not C, so the variables cannot be declared within the parallel region. The `fbx`, `fby`,... scalars would have to be private. – PierU May 14 '23 at 08:21
  • Now, if the parallelism is not working at all (a single thread whatever you are doing), it's an entirely different problem that deserves a separate question. – PierU May 14 '23 at 08:23
  • @PierU I tried your solutions, but none worked for me as there is only a single thread running whatever I am doing. I edited my question accordingly with my complete compilation file. – Sogapi May 15 '23 at 15:51
  • I rolled back your edit. What you are asking is a new question. You already have an answer to your previous question. Please start a new question about SCons. Do not forget to show the **complete** output of the Scons command including all compilation and linking commands it issued. And as I asked originally, tell us more about your environment. Is the value of `OMP_NUM_THREADS` set in any way? – Vladimir F Героям слава May 15 '23 at 16:06
  • You can/should actually show the compilation commands that were performed by SCons for you also here. But do not change a question to something completely different. If you want to ask how to compile OpenMP programs in SCons, start a new question. – Vladimir F Героям слава May 15 '23 at 16:13
  • Yes, well, I figured out my error in SCons; now the solution provided by @PierU is working. – Sogapi May 15 '23 at 16:17
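
A rough sketch of the scalar-accumulation approach suggested in the comments above, assuming (as stated in the thread) that only tfx, tfy and m are needed at the end. The scalars fx and fy are hypothetical private locals introduced only for illustration, and only the contact branch is shown:

! fx, fy: hypothetical real(8) locals holding the per-pair force components
!$omp parallel do schedule(guided) private(j, fx, fy) &
!$omp reduction(+:tfx, tfy, m)
do i = n, 1, -1                 ! low-workload iterations first, as suggested above
    do j = i + 1, n
        call rel_pos_vel (i, j)
        if ( deltan(i,j) .gt. 0 ) then
            call contact_forces (i, j)
            fx = fcn(i,j) * cosa(i,j)
            fy = fcn(i,j) * sina(i,j)
            ! accumulate directly into the reduced total-force arrays
            tfx(i) = tfx(i) - fx
            tfy(i) = tfy(i) - fy
            ! Newton's third law for particle j
            tfx(j) = tfx(j) + fx
            tfy(j) = tfy(j) + fy
            m(i) = m(i) - r(i) * fct(i,j) - mcc(i,j)
            m(j) = m(j) - r(j) * fct(i,j) + mcc(i,j)
        end if
        ! ... same pattern for the bond branch; sheltering(i, j) as before ...
    end do
end do
!$omp end parallel do

In this variant tfx, tfy and m would have to be zeroed before the parallel loop, and the forcing contributions (fax, fwx, ma, mw, etc.) added in a separate loop afterwards.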

1 Answer

As I wrote in a comment: before the end of the i do loop you are updating the tfx and tfy arrays, but only with the i index. Consequently, all that has been calculated with the j index does not contribute there. So you should update them either inside the j do loop (and also for the j index), or in a separate loop after the parallel loop.

Here is a demonstration code:

program foo
implicit none

integer, parameter :: N = 40
integer :: s(N), t(N), i, j

s(:) = 0
t(:) = 0
!$ call omp_set_num_threads(2)
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s,t) SCHEDULE(dynamic,1)
do i = 1, N
    do j = i+1, N
        s(i) = s(i)+1
        s(j) = s(j)+1
    end do
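    ! with t in the reduction, this copies only this thread's partial s(i)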
    t(i) = s(i)
end do
!$OMP END PARALLEL DO

write(*,*) "s array:"
write(*,"(10I5)") s(:)
write(*,*) "t array:"
write(*,"(10I5)") t(:)
end program

If I run it without OpenMP (or optionally with a single thread) the output is as expected:

s array:
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
t array:
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39

Now if I run it with OpenMP I get garbage in t(:):

 s array:
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
   39   39   39   39   39   39   39   39   39   39
 t array:
   39   38   38   37   37   36   36   35   35   34
   34   33   33   32   32   31   31   30   30   29
   29   28   28   27   27   26   26   25   25   24
   24   23   23   22   22   21   21   20   20   19

The solution is then obviously to update t only after the parallel loop:

program foo
implicit none

integer, parameter :: N = 40
integer :: s(N), t(N), i, j

s(:) = 0

!$ call omp_set_num_threads(2)
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:s) SCHEDULE(dynamic,1)
do i = 1, N
    do j = i+1, N
        s(i) = s(i)+1
        s(j) = s(j)+1
    end do
end do
!$OMP END PARALLEL DO

t(:) = s(:)

write(*,*) "s array:"
write(*,"(10I5)") s(:)
write(*,*) "t array:"
write(*,"(10I5)") t(:)
end program

Explanation:

In the example above (with OpenMP and the reduction on t), let's assume thread 0 processes the odd i iterations and thread 1 the even i iterations. Because of the reduction, s and t are implicitly private in the OpenMP region: let's denote the private versions s0, t0, s1, t1.

  • in thread 0, iter 1, at the end of the loop on j:
    • s0(:)=[N-1,1,1,...,1]
    • t0(:)=[N-1,0,0,...,0]
  • in thread 1, iter 2, at the end of the loop on j:
    • s1(:)=[0,N-2,1,...,1]
    • t1(:)=[0,N-2,0,...,0]
  • in thread 0, iter 3, at the end of the loop on j:
    • s0(:)=[N-1,1,N-2,2,2,...,2]
    • t0(:)=[N-1,0,N-2,0,0,...,0]
  • in thread 1, iter 4, at the end of the loop on j:
    • s1(:)=[0,N-2,1,N-3,1,...,1]
    • t1(:)=[0,N-2,0,N-3,0,...,0]

...

  • in thread 0, iter N-1, at the end of the loop on j:
    • s0(:)=[N-1,1,N-2,2,N-3,3,N-4,4,...,N/2-1,N/2]
    • t0(:)=[N-1,0,N-2,0,N-3,0,N-4,0,...,N/2-1,0]
  • in thread 1, iter N, at the end of the loop on j:
    • s1(:)=[0,N-2,1,N-3,1,N-4,...,N/2,N/2-1]
    • t1(:)=[0,N-2,0,N-3,0,N-4,...,0 ,N/2]

At the end of the parallel region, the reduction consists of s=s0+s1 and t=t0+t1. You can see that this works out for s, but not for t.
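
Transposed to your stepper routine, the same idea would mean reducing only the per-pair accumulators and forming the totals after the parallel region. The following is only a sketch under that assumption (loop bodies elided; whether forcing(i) can stay inside the parallel loop depends on which variables it modifies):

!$omp parallel do schedule(dynamic) private(j) &
!$omp reduction(+:fcx, fcy, fbx, fby, mc, mb)
do i = 1, n
    do j = i + 1, n
        call rel_pos_vel (i, j)
        ! ... contact and bond updates for both particles i and j, as in the question ...
    end do
end do
!$omp end parallel do

! form the totals only once every pair (i, j) has contributed
do i = 1, n
    call forcing (i)
    tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
    tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)
    m(i)   = mc(i)  + mb(i)  + ma(i)  + mw(i)
end do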
