An attempt at an answer - I can't claim to understand all that is going on here but thought I would report what I have found.
First we have to make some assumptions about how arguments are passed in Fortran with and without the value attribute. This will be implementation dependent, but as the question mentions gfortran I'll concentrate on that. In comp.lang.fortran, Thomas Koenig, a gfortran developer, says:
"Since the example is for gfortran, maybe I can add a little here.
It is indeed a possible choice for a compiler to pass an argument via
the C passing conventions, which effectively means that the temporary
copy in question is made in a register or on the stack. For a
sufficiently small number of arguments, most ABIs will use registers.
This method does not work as such with OPTIONAL VALUE arguments, but
it is possible to get around that with hidden arguments which indicate
the presence or absence of the optional argument.
Gfortran does indeed use a C-like argument passing convention for
VALUE arguments (including the hidden arguments for optional
arguments). One advantage is that this saves one pointer dereference
if the value is indeed passed in a register, which can lead to speed
advantages."
So I'm going to assume that the default argument passing method is by reference, and that arguments with the value attribute are passed by value, C-style, as described above.
For compilation I shall use:
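As a rough illustration of the two conventions being assumed, here is a minimal sketch in C (C rather than Fortran, because the convention described for value arguments is the C one). The function names are hypothetical and chosen only for this example. This is my own rendering of the idea, not anything gfortran emits.

```c
/* Default Fortran-style passing (assumed): the callee receives the
   address of the argument and has to load the value through it. */
int succ_by_reference(const int *n)
{
    return *n + 1;   /* one pointer dereference to reach the value */
}

/* VALUE-style passing: the callee receives a copy of the argument,
   which most ABIs will place in a register for a small argument list. */
int succ_by_value(int n)
{
    return n + 1;    /* no dereference needed */
}

/* Hypothetical sketch of an OPTIONAL VALUE argument: a hidden flag
   tells the callee whether the caller supplied the argument at all.
   The fallback of 0 is arbitrary and only for this illustration. */
int succ_optional_value(int n, int n_is_present)
{
    return n_is_present ? n + 1 : 0;
}
```

The by-reference form needs its argument to live somewhere addressable in memory, whereas the by-value form can keep it in a register throughout.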
ian@eris:~/work/stack$ gfortran-10 --version
GNU Fortran (GCC) 10.0.1 20200225 (experimental)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The codes I will look at are as follows. First the one that uses default argument passing:
ian@eris:~/work/stack$ cat ack_default.f90
Program ackermann
  Interface
    Recursive Function ack( m, n ) Result( a )
      Integer, Intent(in) :: m
      Integer, Intent(in) :: n
      Integer :: a
    End Function ack
  End Interface
  Integer :: start, finish, rate
  Call system_Clock( start, rate )
  Write(*,*) ack(3, 12)
  Call system_Clock( finish, rate )
  Write( *, * ) 'Time: ', Real( finish - start ) / rate
End Program ackermann

Recursive Function ack(m, n) Result(a)
  Integer, Intent(in) :: m
  Integer, Intent(in) :: n
  Integer :: a
  If (m == 0) Then
    a=n+1
  Else If (n == 0) Then
    a=ack(m-1,1)
  Else
    a=ack(m-1, ack(m, n-1))
  End If
End Function ack
And next the value version:
Program ackermann
  Interface
    Recursive Function ack( m, n ) Result( a )
      Integer, Intent(in), Value :: m
      Integer, Intent(in), Value :: n
      Integer :: a
    End Function ack
  End Interface
  Integer :: start, finish, rate
  Call system_Clock( start, rate )
  Write(*,*) ack(3, 12)
  Call system_Clock( finish, rate )
  Write( *, * ) 'Time: ', Real( finish - start ) / rate
End Program ackermann

Recursive Function ack(m, n) Result(a)
  Integer, Intent(in), Value :: m
  Integer, Intent(in), Value :: n
  Integer :: a
  If (m == 0) Then
    a=n+1
  Else If (n == 0) Then
    a=ack(m-1,1)
  Else
    a=ack(m-1, ack(m, n-1))
  End If
End Function ack
It can be seen that the only difference is the value attribute on the arguments. Compiling both and comparing I get:
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 ack_default.f90
ian@eris:~/work/stack$ ./a.out
32765
Time: 1.01900005
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 ack_value.f90
ian@eris:~/work/stack$ ./a.out
32765
Time: 0.602999985
So the value version is appreciably quicker (about 0.60 s against 1.02 s) than the one in which the arguments are passed by the default mechanism.
Asking for an optimisation report from gfortran gives the following:
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 -fopt-info ack_default.f90
ack_default.f90:27:0: optimized: Inlined ack/13 into ack/0 which now has time 18.062500 and size 95, net change of +65.
ack_default.f90:11:0: optimized: basic block part vectorized using 16 byte vectors
ack_default.f90:13:0: optimized: basic block part vectorized using 16 byte vectors
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 -fopt-info ack_value.f90
ack_value.f90:11:0: optimized: Inlined ack.constprop/12 into ackermann/1 which now has time 174.107273 and size 60, net change of -7.
ack_value.f90:27:0: optimized: Inlined ack/14 into ack/0 which now has time 455.794475 and size 79, net change of +64.
ack_value.f90:11:0: optimized: basic block part vectorized using 16 byte vectors
ack_value.f90:13:0: optimized: basic block part vectorized using 16 byte vectors
Thus it appears that the value code has an extra level of inlining applied, and this was my first thought at an answer. However, turning off inlining gives:
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 -fno-inline ack_default.f90
ian@eris:~/work/stack$ ./a.out
32765
Time: 1.46000004
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 -fno-inline ack_value.f90
ian@eris:~/work/stack$ ./a.out
32765
Time: 0.958999991
so the value version is still much quicker than the default version (about 0.96 s against 1.46 s). Something else is going on.
Thomas Koenig also said:
"With gfortran, it can also be instructive to inspect the output of -fdump-tree-original."
So I took a look at that. First with default passing (keeping only the relevant parts):
ian@eris:~/work/stack$ gfortran-10 -O3 -Wall -Wextra -std=f2008 -fdump-tree-original ack_default.f90
ian@eris:~/work/stack$ cat ack_default.f90.004t.original
ack (integer(kind=4) & restrict m, integer(kind=4) & restrict n)
{
  integer(kind=4) a;
  if (*m == 0)
    {
      a = *n + 1;
    }
  else
    {
      if (*n == 0)
        {
          {
            integer(kind=4) D.3903;
            static integer(kind=4) C.3904 = 1;
            D.3903 = *m + -1;
            a = ack (&D.3903, &C.3904);
          }
        }
      else
        {
          {
            integer(kind=4) D.3905;
            integer(kind=4) D.3906;
            integer(kind=4) D.3907;
            D.3905 = *m + -1;
            D.3906 = *n + -1;
            D.3907 = ack ((integer(kind=4) *) m, &D.3906);
            a = ack (&D.3905, &D.3907);
          }
        }
      L.2:;
    }
  L.1:;
  return a;
}
And now for the value version:
ian@eris:~/work/stack$ cat ack_value.f90.004t.original
ack (integer(kind=4) m, integer(kind=4) n)
{
  integer(kind=4) a;
  if (m == 0)
    {
      a = n + 1;
    }
  else
    {
      if (n == 0)
        {
          a = ack (m + -1, 1);
        }
      else
        {
          a = ack (m + -1, ack (m, n + -1));
        }
      L.2:;
    }
  L.1:;
  return a;
}
It can be seen that the value version is a lot simpler and is pretty much a transliteration of the source code. The default version, however, has a lot more going on, in particular:
{
  integer(kind=4) D.3905;
  integer(kind=4) D.3906;
  integer(kind=4) D.3907;
  D.3905 = *m + -1;
  D.3906 = *n + -1;
  D.3907 = ack ((integer(kind=4) *) m, &D.3906);
  a = ack (&D.3905, &D.3907);
}
Now I am not an expert here ... but that looks to me very much like the compiler setting up temporaries on the stack to hold the values of intermediate results, since the originals can't be overwritten. In fact it looks quite similar to what I would expect the compiler to have to do to implement passing by value. Thus it looks to me that:
- Because the compiler has to create new "variables" on the stack to hold the intermediate results when passing by reference, there is no advantage to be gained from that method in this case.
- The compiler is better at optimising the standard "pass by value" method than the more generic "pass by reference with intermediate temporaries". I really am beginning to guess now, but I suspect that how the compiler uses registers underlies the improved performance.
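To make the comparison concrete, here is a hand-written C sketch of the two shapes seen in the dumps. The names ack_ref and ack_val and the temporaries m1, n1 and inner are my own, standing in for the compiler-generated D.39xx variables. It is an approximation of the structure only, not what gfortran generates.

```c
/* By-reference shape, mirroring the default dump: every intermediate
   result must live in an addressable temporary before it is passed on. */
int ack_ref(const int *m, const int *n)
{
    if (*m == 0)
        return *n + 1;
    if (*n == 0) {
        int m1 = *m - 1;            /* plays the role of D.3903 */
        static const int one = 1;   /* plays the role of the static C.3904 */
        return ack_ref(&m1, &one);
    }
    int m1 = *m - 1;                /* D.3905 */
    int n1 = *n - 1;                /* D.3906 */
    int inner = ack_ref(m, &n1);    /* D.3907: m itself is passed straight through */
    return ack_ref(&m1, &inner);
}

/* By-value shape, mirroring the value dump: intermediate results are
   plain expressions and need no addressable storage of their own. */
int ack_val(int m, int n)
{
    if (m == 0)
        return n + 1;
    if (n == 0)
        return ack_val(m - 1, 1);
    return ack_val(m - 1, ack_val(m, n - 1));
}
```

Both functions compute the same values. The difference is only in whether each recursive call has to be fed the address of a temporary or can be handed the value directly.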
To go further we need somebody who reads x86 assembler. That's not me.