
I have been playing with some Fortran routines and I would like to ask for guidance to understand why some of them take more time than others.

The problem all routines solve is the following: given a set of points, find those that are too close to another one. For that, the components of the two points are compared and, if all the absolute differences are below a threshold, the two points are considered too close.

The first routine is the following:

subroutine routine1(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close

    integer :: i, i_check
    
    is_close = .FALSE.
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          clos = all(abs(points(i,:) - points(i_check,:)) < r)
       enddo
       is_close(i) = clos
    enddo
end subroutine

In the second routine there is just one "small" change: I break `clos = all(abs(points(i,:) - points(i_check,:)) < r)` into two lines. First the difference is calculated, and then the `all` is evaluated:

subroutine routine2(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close

    integer :: i, i_check
    real(dp), dimension(ndis) :: diff
    
    is_close = .FALSE.
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          diff = abs(points(i,:) - points(i_check,:))
          clos = all(diff < r)
       enddo
       is_close(i) = clos
    enddo
end subroutine

The last routine starts from routine1, but since Fortran stores arrays column-major (the leftmost index varies fastest in memory), I first transpose the points so that the slice over the components of a point is contiguous:

subroutine routine3(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close
    real(dp), dimension(ndis,npoints) :: points_T

    integer :: i, i_check
    
    is_close = .FALSE.
    points_T = transpose(points)
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          clos = all( abs(points_T(:,i) - points_T(:,i_check))< r)
       enddo
       is_close(i) = clos
    enddo
end subroutine  

With this, I would expect routines 1 and 2 to perform basically the same, and routine 3 to be slightly faster. But when the performance is measured, the results I get are the following (the full code to reproduce this calculation is available at the end of the question):

   npoints         routine1/s                 routine2/s                   routine3/s
      10   1.2979999999999999E-006   3.1230000000000000E-006   6.3010000000000003E-006
     100   2.7813000000000000E-005   1.9997000000000000E-004   3.8600999999999999E-005
    1000   2.5518790000000000E-003   1.4855343999999999E-002   1.1216130000000000E-003
   10000   8.0283021999999996E-002  0.64925195400000002        5.5415089000000001E-002
  100000   7.8920265199999999        79.243228029999997        16.829428664999998     

My questions are:

  1. Why does routine2 take so much longer than routine1, when they are basically the same?
  2. What am I missing about the order of the indices that explains routine3 being slower than routine1 for the largest case?

Full code to reproduce these calculations, compiled with gfortran and the -O3 flag:

program test                                                                               
    implicit none
    integer, parameter :: dp = kind(0.d0)
    integer, parameter :: ndis = 21
    integer :: ipoint, npoints
    real(dp), parameter :: r = 1d-5
    real(dp), dimension(:, :), allocatable :: points
    logical, dimension(:), allocatable :: is_close
    real(dp) :: t1, t2, t3
    integer(8) :: ini, fin, count_rate

    do ipoint = 1,5
        npoints = 10**ipoint
        allocate(points(npoints, ndis), is_close(npoints))
        ! Generate data
        call RANDOM_NUMBER(points)

        ! routine 1
        call SYSTEM_CLOCK(ini,count_rate=count_rate)
        call routine1(npoints, ndis, points, r, is_close)
        call SYSTEM_CLOCK(fin)
        t1 = fin - ini

        ! routine 2
        call SYSTEM_CLOCK(ini,count_rate=count_rate)
        call routine2(npoints, ndis, points, r, is_close)
        call SYSTEM_CLOCK(fin)
        t2 = fin - ini

        ! routine 3
        call SYSTEM_CLOCK(ini,count_rate=count_rate)
        call routine3(npoints, ndis, points, r, is_close)
        call SYSTEM_CLOCK(fin)
        t3 = fin - ini

        deallocate(points, is_close)

        print *, npoints, dble(t1/count_rate), dble(t2/count_rate), dble(t3/count_rate)
    end do
end program

subroutine routine1(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close

    integer :: i, i_check
    
    is_close = .FALSE.
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          clos = all(abs(points(i,:) - points(i_check,:)) < r)
       enddo
       is_close(i) = clos
    enddo
end subroutine

subroutine routine2(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close

    integer :: i, i_check
    real(dp), dimension(ndis) :: diff
    
    is_close = .FALSE.
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          diff = abs(points(i,:) - points(i_check,:))
          clos = all(diff < r)
       enddo
       is_close(i) = clos
    enddo
end subroutine

subroutine routine3(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close
    real(dp), dimension(ndis,npoints) :: points_T

    integer :: i, i_check
    
    is_close = .FALSE.
    points_T = transpose(points)
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          clos = all( abs(points_T(:,i) - points_T(:,i_check))< r)
       enddo
       is_close(i) = clos
    enddo
end subroutine
pablo
  • I didn't try anything but maybe the transpose (and maybe a little the allocation / deallocation, depending on whether it will land on the stack or the heap) is the reason that routine3 is slower. What happens when you move the transpose outside of the timing? – albert Mar 10 '21 at 17:53
  • One thing worth noting is that you should generally NOT time your code via `cpu_time()`, which will not represent the actual amount of time spent but the sum of the time spent by all processors. Instead, you should time via `system_clock()`, which measures the wallclock time. With the O3 optimization level, I suspect multithreading might be at work (though I am not sure, since you do not specify the compiler). In such a case, `cpu_time()` does not represent the wallclock time. – Scientist Mar 10 '21 at 18:12
  • @King, I compiled with gfortran – pablo Mar 10 '21 at 18:15
  • @King I have also modified the code as to measure time with `system_clock()`. I haven't used it before so I hope it is correct. I guess so, since the same trends are found. – pablo Mar 10 '21 at 18:30
  • If you are on a Unix-like system, use `cpu_time`. It will return the actual elapsed CPU time used by your process. `system_clock` returns the elapsed wall-clock time. If, for whatever reason, your process is put to sleep by the operating system (e.g., ctrl-z after starting the process), then `system_clock` counts that time as part of your process's time. Not sure why @King thinks `cpu_time` is the sum of the time of all processors. – steve Mar 10 '21 at 19:02
  • @Steve in a multithreaded program cpu_time typically sums the time across the threads (which is of course the process time), which can be confusing, see questions here passim - though of course the code is allowed to report whatever it wants. This is probably what King is referring to. Personally I much prefer to use system_clock and average across a number of runs as that measures what I am actually interested in, how long I will usually have to wait for my program to complete. – Ian Bush Mar 10 '21 at 19:14
  • @steve What I meant related to threading: https://stackoverflow.com/a/6880133/2088694 – Scientist Mar 10 '21 at 19:15
  • @pablo I think the issue with `routine2` is that with each looping, you generate a temporary array `diff` and then check the condition for all elements. There is no need for a temporary array. As soon as any element of `abs(points(i,:) - points(i_check,:)) < r` does not hold, you do not need to check the rest; `clos = .false.` will hold anyway. That might be what the compiler is doing in `routine1` and `routine3`. Fixing this will likely make it comparable to or even faster than the two others. – Scientist Mar 10 '21 at 19:34
  • @IanBush, Last time I looked, the standard did not address multi-threading, so of course, neither `cpu_time` nor `system_clock` have any constraints based on multi-threading. There are two problems with `system_clock`. 1. It measures elapsed wall-clock time. If your process is suspended (say, swapped out to let others have a crack at the cpu), `system_clock` includes that time. 2. There are portability issues with `system_clock`. A long discussion recently occurred in the J3 mailing list, because `system_clock` is (IMHO) poorly defined. J3 is updating the language in the standard. – steve Mar 10 '21 at 19:41
  • @Steve Yes, I saw some of the discussion. Here is not the place to discuss, so let's agree to differ. – Ian Bush Mar 10 '21 at 20:33
  • Try mixing up the execution order as the processor scheduling might skew the results. – John Alexiou Mar 10 '21 at 22:08
  • Note you can put the subroutines inside the program after a `contains` keyword. This way explicit interfaces are not required. – John Alexiou Mar 10 '21 at 22:10
  • Interestingly enough I get a huge performance increase for `routine3` when I enable the `AVX2` instruction set. I repeated the calls in order to get the timings about even, using the following formula `repeat = max(1,244140625/npoints**2)`. Without AVX2 I get `(t1=1.3960, t3=1.1800)` and with `(t1=1.9500, t3=0.56200)`. – John Alexiou Mar 10 '21 at 22:50
  • @King. As you suggested I modified `routine2` to check the condition `abs(points(i,:) - points(i_check,:)) < r` for each component, and exit as soon as one of them fails. For more than 1000 points the execution experiences a great boost; for `npoints = 1e5` the execution time goes to just 3.5 s (from 79 s in the original version). – pablo Mar 11 '21 at 07:50
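
For reference, here is a minimal sketch of the early-exit variant described in the last two comments. The name `routine2_exit` and the exact loop structure are my own guesses, since the modified code itself is not shown in the question:

subroutine routine2_exit(npoints, ndis, points, r, is_close)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    logical :: clos
    logical, dimension(npoints) :: is_close

    integer :: i, i_check, k

    is_close = .FALSE.
    do i=1,npoints
       clos = .FALSE.
       i_check = i
       do while (.not. clos .and. i_check < npoints)
          i_check = i_check + 1
          ! Assume the pair is close and bail out at the first component
          ! that violates the threshold; no temporary array is built.
          clos = .TRUE.
          do k=1,ndis
             if (abs(points(i,k) - points(i_check,k)) >= r) then
                clos = .FALSE.
                exit
             end if
          enddo
       enddo
       is_close(i) = clos
    enddo
end subroutine

This does by hand what the compiler can, in principle, do for the single `all()` expression in routine1: it avoids both the temporary `diff` array and the comparison of components that can no longer change the result.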

1 Answer

  1. The explicit diff array probably constrains the optimizer into computing and storing all ndis differences before all(diff < r) is evaluated, whereas the single-expression form in routine1 can stop at the first component that fails the test.
  2. Of course, points_T(k,i) and points_T(k+1,i) are directly adjacent in RAM, so if a whole point had to be read the overhead would be substantially less; in particular, on most computers the amount of data moved between RAM and cache would be several times smaller. But the cost of the transpose is just as high, and judging by the time difference between routine1() and routine2(), all() in the main loop processes no more than 1/10 of a row of the points() array, so the transposed layout never gets a chance to pay off.
Serge3leo
  • I moved the transpose out of the `system_clock` timing and the execution time is almost the same. But if I understand correctly, you are saying that maybe I don't see such a great boost because only a small part of each point is actually used in the evaluation? – pablo Mar 11 '21 at 08:09
  • «... and judging by the time difference between routine1() and routine2(), all() in the main loop processes no more than 1/10 of a row of the points() array.» The probability of meeting the condition abs(points(i,1) - points(i_check,1)) < r, for a uniform distribution over 0..1 and r = 1.e-5, is approximately 2.e-5. Therefore, points(i,1) and, sometimes, points(i,2) are used to calculate all(). The rest of the array, points(:,3:), is practically not used in the calculations. Therefore, when using the points() array, there are fewer cache misses than when using points_T(). – Serge3leo Mar 11 '21 at 11:21
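
To check that estimate, here is a hypothetical diagnostic (not part of either post; the name `count_components` and its interface are assumptions) that counts how many components are actually examined per pair when the comparison stops at the first failing component:

subroutine count_components(npoints, ndis, points, r, n_compared)
    implicit none
    integer, parameter :: dp = kind(0d0)
    integer, intent(in) :: npoints, ndis
    real(dp), dimension(npoints, ndis), intent(in) :: points
    real(dp), intent(in) :: r
    integer(8), intent(out) :: n_compared

    integer :: i, i_check, k

    n_compared = 0
    do i=1,npoints
       do i_check=i+1,npoints
          ! Count each component comparison, stopping at the first one
          ! that already rules the pair out (absolute difference >= r).
          do k=1,ndis
             n_compared = n_compared + 1
             if (abs(points(i,k) - points(i_check,k)) >= r) exit
          enddo
       enddo
    enddo
end subroutine

For uniform data in [0,1) and r = 1.e-5, n_compared divided by the number of pairs, npoints*(npoints-1)/2, should come out very close to 1, consistent with the ~2.e-5 chance that the first component alone passes the test and with the observation that points(:,3:) is essentially never touched.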