FORTRAN: Access array via pointer matrix, performance

Question

I am having a issue here for using pointers. Before I do that I have a performance concern. Suppose there is a 2D matrix like this:

0.0  0.0  0.0.....
0.0  0.7  0.5.....
0.0  0.5  0.8.....
0.0  0.3  0.8.....

.....

And I need to calculate the gradient of this thing. Therefore, for each number, I'll need the number as well as all its 4 nearest neighbors of this 2D matrix. Besides the first and last row and column are 0.

Now I have two method:

Make such a NxN matrix directly and calculate the gradient. Exactly follow the description. Here the memory use is NxNxreal*8, The loop start from calculating the (2,2) element then (2,3), ...
Make a (N-2)x(N-2)+1 array, and a NxN pointer matrix (use type at the moment). the (N-2)x(N-2) elements of the array will store the numbers except the 0.0s on the border. The last element of the is 0.0. For the pointer matrix, all the elements on the boarder will point to the last element of the array, 0.0. Other pointers should point to the places where they suppose to point to.

Here comes a issue of performance since the matrix that I am handling can be really huge or maybe 3D.

For method 1, there is nothing to say since it is just a straight forward method.

For method 2, I am wondering if the compiler can handle the issue properly. Since each FORTRAN pointer is like a structure according to my understanding so far. IF that is the case, FORTRAN pointer is slower than c pointer since it is not just a simple de-reference? I am also wondering if the type warp of the pointer decrease the performance (that warp is needed to make a pointer matrix). Is ther a particular reason that I should give up method 2 since it should be slower?

Let's take IVF on windows, gfortran and ifort on Linux for example. Since it can be compiler dependent.

UPDATE: Appreciate Stefan's code. I wrote on by my self as well.

program stencil
    implicit none
    type pp
        real*8, pointer :: ptr
    endtype pp
    type(pp), allocatable :: parray(:,:)
    real*8, allocatable, target :: array(:)
    real*8, allocatable :: grad(:,:,:), direct(:,:)
    integer, parameter :: n = 5000
    integer :: i, j
    integer :: clock_rate, clock_start, clock_stop

    allocate(array(n**2+1))
    allocate(parray(0:n+1, 0:n+1))
    allocate(grad(2, n, n))
    call random_number(array)
    array(n**2+1) = 0
    do i = 0, n + 1
        parray(0,i)%ptr => array(n**2+1)
        parray(n+1,i)%ptr => array(n**2+1)
        parray(i,0)%ptr => array(n**2+1)
        parray(i,n+1)%ptr => array(n**2+1)
    enddo
    do i = 1, n
        do j = 1, n
            parray(i,j)%ptr => array((i-1) * n + j)
        enddo
    enddo
    !now stencil
    call system_clock(count_rate=clock_rate)
    call system_clock(count=clock_start)
    do j = 1, n
        do i = 1, n
            grad(1, i, j) = (parray(i + 1,j)%ptr - parray(i - 1,j)%ptr)/2.D0
            grad(2, i, j) = (parray(i,j + 1)%ptr - parray(i,j - 1)%ptr)/2.D0
        enddo
    enddo
    call system_clock(count=clock_stop)
    print *, "pointer, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
    deallocate(array)
    deallocate(parray)
    allocate(direct(0:n+1, 0:n+1))
    call random_number(direct)
    do i = 0, n + 1
        direct(0,i) = 0
        direct(n+1,i) = 0
        direct(i,0) = 0
        direct(i,n+1) = 0
    enddo
    !now stencil directly
    call system_clock(count_rate=clock_rate)
    call system_clock(count=clock_start)
    do j = 1, n
        do i = 1, n
            grad(1, i, j) = (direct(i + 1,j) - direct(i - 1,j))/2.D0
            grad(2, i, j) = (direct(i,j + 1) - direct(i,j - 1))/2.D0
        enddo
    enddo
    call system_clock(count=clock_stop)
    print *, "direct, time cost= ", real(clock_stop-clock_start)/real(clock_rate)
endprogram stencil

result (o0):

pointer, time cost= 2.170000

direct, time cost= 1.127000

result (o2):

pointer, time cost= 0.5110000

direct, time cost= 9.4999999E-02

So FORTRAN pointer is much slower. Stefan has pointed that out earlier. Now I am wondering if there is room for improvement. As I know so far, if I did it with c, the difference should not be this much.

Try it and measure it (and report your results :) ). What do you mean with `type warp`? — Stefan, Sep 07 '14 at 08:36
Most of us Fortran programmers (well, this one anyway, I can't really speak for the rest) are wondering why you would contemplate using pointers for an N-point stencil operation. I'm struggling to see any advantage in your option 2, even leaving the question of performance to one side. Padding an array with a halo of 0s so that stencil operations can be carried out without special cases for the edges is an entirely standard approach. Gosh, in modern Fortran you can even index your array elements from `(0,N+1)` and run your stencil over `(1,N)`. — High Performance Mark, Sep 07 '14 at 10:46
@HighPerformanceMark Because you need a general way to handle the boundary conditions. I know method 1 is a standard way for finite boundary, but when you calculate periodic boundary conditions, the ghost points must be update each time according to the input matrix. If you use pointer, you get the new matrix with boundary by just a shift (add a constant to the pointer matrix). Besides this way you need fewer ghost points or even no ghost points. So using pointer matrix will give you more flexibility for different boundary especially for non-finite boundaries. Plus save RAM is many cases. — FortCpp, Sep 07 '14 at 16:21
@Stefan I am asking to see if anyone did something similar before. If no, I'll try it. The warp means you must define a type with a pointer member and they make an array of the type for an array of pointer. This is the only way of doing so. You can find it on Stackoverflow. — FortCpp, Sep 07 '14 at 16:23
Again, people have been writing stencil calculations, in Fortran, on large arrays with periodic boundaries, without using pointers, for many years. For many years after the Fortran standard introduced pointers even. But don't let my scepticism about the performance of your suggested pointer-based approach deter you, you obviously have good reasons to think it would be a sensible approach for your code. But please, as @Stefan first commented, try it, measure it, and let us know how you get on. — High Performance Mark, Sep 07 '14 at 16:55
It seems people think it is worthless to check the performance or even think about it. That's what I learnt here. — FortCpp, Sep 07 '14 at 18:41
@Stefan certainly doesn't think it worthless to check the performance of your two options, he advised you to do just that. I spend half my working life checking the performance of codes; I don't think checking performance is worthless either. But you came here asking for an opinion and in my opinion, based on my experiences, using pointers as you outline will not perform as quickly as the straightforward array-based solution you also propose. What is worthless is trying to resolve questions of performance by opinion. Follow Stefan's original advice. — High Performance Mark, Sep 07 '14 at 19:35
I'm sorry, but I'm still confused with this type thing. You could either create an array with `(N-2)x(N-2)x5` pointers or create a `(N-2)x(N-2)` array of a type with 5 pointers. Both approaches would require a massive amount of memory and time for setup and I would guess, that this time could be comparable to the actual computation time of the direct calculation. But as @HighPerformanceMark pointed out, I don't think that it's worthless to check the performance because you never know, what wonders your compiler might achieve. — Stefan, Sep 07 '14 at 20:14
@Stefan So far one cannot create an array whose elements are pointers due to the FORTRAN standard. In order to do it, you must define a type then make an allocatable array of the type. See here: http://stackoverflow.com/questions/8900336/arrays-of-pointers — FortCpp, Sep 07 '14 at 21:03
@HighPerformanceMark I understand do it and check is the only way. But there are some clues to predict. For example I'll never worry about such kind of issue when I write the same program using c. Since in c the pointer is just a memory address and pointer and array are equivalent. All one needs to do is to de-reference such a memory address. However, in FORTRAN, everything is very different. I don't have a good knowledge of what pointer is essentially is in FORTRAN though I know how do program it. One can write better program base on some deep knowledge of the language. — FortCpp, Sep 07 '14 at 21:12
Fortran array pointers are more complicated than the C ones because they can point to non-contiguous parts of memory. You may try to declare the pointer as `contiguous` and see if the compiler is able to optimize it. If not write to your compiler vendor to think about it. — Vladimir F Героям слава, Sep 08 '14 at 06:52
@VladimirF I just added some code to my post. Could you please show me where to add "contiguous"? I don't know it very well. — FortCpp, Sep 08 '14 at 07:27
There is some serious misunderstanding. There is no array pointer in your code! You have only a scalar pointer component. In fact, this thing is exactly the same thing as in C, it is only the address. The contiguous attribute is only used for pointers to arrays! — Vladimir F Героям слава, Sep 08 '14 at 11:27

Stefan · Accepted Answer · 2014-09-08T18:05:54.817

At first, I have to apologize, because I misunderstood the way, pointer work in Fortran...

Finally, I was so intrigued from the topic, that I created a test on my own. It is based on an array, which has a surrounding for zeros.

Declaration:

real, dimension(:,:), allocatable, target :: array
real, dimension(:,:,:), allocatable :: res
real, dimension(:,:), pointer :: p1, p2, p3, p4
allocate(array(0:n+1, 0:n+1), source=0.)
allocate(res(n,n,2), source=0.)

Now the methods:

Loops:

do j = 1, n
    do i = 1, n
        res(i,j,1) = array(i+1,j) - array(i-1,j)
        res(i,j,2) = array(i,j+1) - array(i,j-1)
    end do
end do

Array assignment:

res(:,:,1) = array(2:n+1,1:n) - array(0:n-1,1:n)
res(:,:,2) = array(1:n,2:n+1) - array(1:n,0:n-1)

Pointers:

p1 => array(0:n-1,1:n)
p2 => array(1:n,2:n+1)
p3 => array(2:n+1,1:n)
p4 => array(1:n,0:n-1)
res(:,:,1) = p3 - p1
res(:,:,2) = p2 - p4

While the last two methods do rely on the extra layer of zeros, the loops can introduce some conditionals to care for these.

The timings are interesting:

loops:     0.17528710301849060
array:     0.21127231500577182
pointers:  0.21367537401965819

While the array and pointer assignments yield approximately the same timings, the loop construct (mind the loop order! this was a factor of 5!!!) is the fastest method.

UPDATE: I tried to squeeze a bit of performance out of your code and found one small thing. Your code performs with -O2 in 0.95s and 0.30s (with n = 10000).

Transposing your matrix to get a more linear memory access yields a runtime of 0.50s for the pointer part.

parray(i,j)%ptr => array((j-1) * n + i)

IMHO, the problem is the missing information about the pointers, which forbid additional optimization. Using -O3 -fopt-info-missed you get complaints about unknown alignment and non-consecutive accesses. The additional factor 2 compared to my results should stem from the fact, that you are using double precision, while my code is written in single precision...

I guess you can speed up your loop approach even more by unrolling the loop manually to gain some benefits from vectorization. But this is just a guess and heavily depending on the hardware you are using. — PVitt, Sep 08 '14 at 08:00
@PVitt I'm lazy, so I only tried `-funroll-loops` which resulted in a speedup of 5% for the loops and 7% for the arrays and pointers. The loops were reportedly unrolled 7 times. FYI, I'm using an i5-750. — Stefan, Sep 08 '14 at 09:59
@Stefan It's nice to learn -fopt-info-missed. The non-consecutive accesses is unsolvable issue, right? I mean since no matter how you optimize it, calculation of the gradient for a 2D matrix requires non-consecutive access. Do you agree? — FortCpp, Sep 09 '14 at 21:28
I think the problem is, that the compiler doesn't know, that the date access is "not so nonconsecutive". In the loop case, the access pattern is in principle the same, but the compiler can directly see, what you want to do. Your pointer array might have the same properties, but the compiler can't know this (they *could* point anywhere), because this would require a deep analysis of the complete code. On the other hand, the normal Fortran array pointers tell the compiler that their access is contiguous along the first axis, which yields a comparable timing to the normal array case. — Stefan, Sep 10 '14 at 04:42
@Stefan I agree with your opinion when I first saw it two weeks ago. I don't see any other answers to this question for a while. Now I'd like to make a conclusion to my question. I appreciate your example code and explanation. — FortCpp, Sep 24 '14 at 16:44

score 0 · Answer 2 · answered Sep 24 '14 at 17:06

I accept Stefan's answer as the best answer. But I personally want to make a conclusion for the discussion and my own question.

The FORTRAN pointer is different from the C pointer according to Vladimir. It seems FORTRAN standard is aimed to make the array pointer a "subsetter" for the array. Therefore the "array of pointer" in FORTRAN is almost pointless unlike the situation in C. Please read Stefan's code for the detail of using FORTRAN pointers. Besides, "array of pointer" in FORTRAN is possible, but low performance for it is definitely not a simple dereference.
The performance of the calculation can be boosted using direct access with loop unrolling. Please find details in Stefan's code. When using the direct access, the compiler optimization can be done better according to Stefan's comment. I think that is the reason why people doing it without using pointers to solve Stencil problems.
The idea of using pointer for handling the Stencil is to reduce the memory cost and make the boundary condition VERY flexible. But it is not a choice for performance for the moment. The main reason is the "non-consecutive" memory access and compiler optimization cannot be performed for no knowledge of the pointer pattern.

Please refer to Stefan's answer for the FORTRAN pointer.

FORTRAN: Access array via pointer matrix, performance

2 Answers2