
I was comparing the performance of a sum followed by an assignment of two arrays, in the form c = a + b, between a native Fortran type, real, and a derived data type that contains only one array of reals. The type is very simple: it contains operators for addition and assignment and a destructor, as follows:

module type_mod

use iso_fortran_env

type :: class_t
  real(8), dimension(:,:), allocatable :: a
contains
  procedure :: assign_type
  generic, public :: assignment(=) => assign_type
  procedure :: sum_type
  generic :: operator(+) => sum_type
  final :: destroy
end type class_t

contains

  subroutine assign_type(lhs, rhs)
    class(class_t), intent(inout) :: lhs
    type(class_t), intent(in) :: rhs
    lhs % a = rhs % a
  end subroutine assign_type

  subroutine destroy(this)
    type(class_t), intent(inout) :: this
    if (allocated(this % a)) deallocate(this % a)
  end subroutine destroy

  function sum_type (lhs, rhs) result(res)
    class(class_t), intent(in) :: lhs
    type(class_t), intent(in) :: rhs
    type(class_t) :: res
    res % a = lhs % a + rhs % a
  end function sum_type

end module type_mod

The listing above shows only the simplest form of the assignment; in the full benchmark the assign subroutine contains different modes of operation, selected with a select case, just for the sake of benchmarking.
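
Roughly, that fuller version looks like the sketch below. This is only a reconstruction: the case names ASSUMED_SIZE and CHECK_SIZE and the argument names this and lhs match the answer further down, while the module-level assign_mode switch and its values are illustrative.

  ! In type_mod's specification part (illustrative mode switch):
  integer, parameter :: ASSUMED_SIZE = 1, CHECK_SIZE = 2
  integer :: assign_mode = CHECK_SIZE

  ! Bound to assignment(=) in place of the simple assign_type above:
  subroutine assign(this, lhs)
    class(class_t), intent(inout) :: this
    type(class_t), intent(in) :: lhs    ! despite the name, this is the right-hand side

    select case (assign_mode)
    case (ASSUMED_SIZE)
      ! rely on intrinsic assignment to (re)allocate this % a as needed
      this % a = lhs % a
    case (CHECK_SIZE)
      ! do the allocation/shape bookkeeping by hand, then copy in place
      if (.not. allocated(this % a)) then
        allocate(this % a, mold = lhs % a)
      else if (any(shape(this % a) /= shape(lhs % a))) then
        deallocate(this % a)
        allocate(this % a, mold = lhs % a)
      end if
      this % a(:,:) = lhs % a
    end select
  end subroutine assign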

To test it against performing the same operation on plain real arrays, I created the following module:

module subroutine_mod

  use type_mod, only: class_t

  contains

  subroutine sum_real(a, b, c)
    real(8), dimension(:,:), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_real


  subroutine sum_type(a, b, c)
    type(class_t), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_type

end module subroutine_mod

Everything is executed in the program below, considering arrays of size (10000,10000) and repeating the operation 100 times:

program test

  use subroutine_mod

  integer :: i
  integer :: N = 100 ! Number of times to repeat the assign
  integer :: M = 10000 ! Size of the arrays
  real(8) :: tf, ts
  real(8), dimension(:,:), allocatable :: a, b, c
  type(class_t) :: a2, b2, c2

  allocate(a2%a(M,M), b2%a(M,M), c2%a(M,M))
  a2%a = 1.0d0
  b2%a = 2.0d0
  c2%a = 3.0d0

  allocate(a(M,M), b(M,M), c(M,M))
  a = 1.0d0
  b = 2.0d0
  c = 3.0d0

  ! Benchmark timing with cpu_time
  call cpu_time(ts)
  do i = 1, N
    call sum_type(a2, b2, c2)
  end do
  call cpu_time(tf)
  write(*,*) "Type : ", tf-ts

  call cpu_time(ts)
  do i = 1, N
    call sum_real(a, b, c)
  end do
  call cpu_time(tf)
  write(*,*) "Real : ", tf-ts

end program test

To my surprise, the operation with my derived data type consistently underperformed the operation with plain Fortran arrays, by a factor of about 2 with gfortran and a factor of 10 with ifort. For instance, using the CHECK_SIZE mode, which saves allocation time, I got the following timings when compiling with the -O2 flag:

gfortran

  • Derived type: 33 s
  • Real: 13 s

ifort

  • Derived type: 30 s
  • Real: 3 s

Question

Is this normal behaviour? If so, are there any recommendations to achieve better performance?

Context

To provide some context, the type with a single array component will be very useful for a code-refactoring task, where we need to keep interfaces similar to those of a previous type.

Compiler versions

  • gfortran 9.4.0
  • ifort 2021.6.0 20220226

1 Answer


You are worried about allocation time, but you do a lot of allocations of arrays of shape [M,M] for the derived type, and almost none for the intrinsic type.

The only allocations for the intrinsic type are in the main program, for a, b and c. These are outside the timing loop.

For the derived type, you allocate for a2%a, b2%a and c2%a (again outside the timing loop), but also res%a in the function sum_type, N times inside the timing loop.

Equally, inside the sum_real subroutine the assignment statement c=a+b involves no allocatable object, but inside sum_type the statement c=a+b goes through the defined operation and defined assignment, where the target component c%a is an allocatable array: the compiler checks whether it is allocated and, if so, whether its shape matches that of the right-hand side expression.
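
Written out by hand, the derived-type statement c = a + b behaves roughly like the following sketch (tmp stands in for the compiler's unnamed temporary; this assumes the simple assign_type shown in the question):

  subroutine sum_type_expanded(a, b, c)
    use type_mod, only: class_t
    type(class_t), intent(in)    :: a, b
    type(class_t), intent(inout) :: c
    type(class_t) :: tmp            ! the function result of sum_type
    tmp % a = a % a + b % a         ! sum_type: allocates an M-by-M array on every call
    c % a = tmp % a                 ! assign_type: allocation/shape check on c % a, then the copy
  end subroutine sum_type_expanded  ! tmp is finalized here, deallocating tmp % a

None of that happens for the intrinsic arrays, where c = a + b is a single loop with no allocation at all.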

In summary: you are not comparing like with like. There's a lot of overhead in wrapping an intrinsic array as an allocatable component of a derived type.
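
If the wrapper type has to stay, one way to remove most of that overhead is to add into an already-allocated result instead of going through the defined operator and assignment. This is just a sketch, and add_into is a name of my choosing rather than anything in your code:

  ! Hypothetical helper, placed in type_mod: write lhs + rhs straight into res,
  ! so no temporary class_t is allocated or finalized per call.
  subroutine add_into(lhs, rhs, res)
    class(class_t), intent(in)    :: lhs
    type(class_t),  intent(in)    :: rhs
    type(class_t),  intent(inout) :: res   ! res % a assumed already allocated to the right shape
    res % a(:,:) = lhs % a + rhs % a       ! (:,:) sidesteps the reallocation-on-assignment check
  end subroutine add_into

In the benchmark this would be called as call add_into(a2, b2, c2) in place of the c = a + b inside sum_type.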


Tangential to your timing concerns is the "cleverness" of the subroutine assign. It's horrible.

Calling an argument lhs when it's associated with the right-hand side of the assignment statement is a little confusing, but the select case construct is confusing beyond a little.

In

case (ASSUMED_SIZE)
  this % a = lhs % a

the assignment, under rules where the rest of the program makes any sense, invokes a couple of checks:

  • is this%a allocated? If not, allocate it to the shape of lhs%a.
  • if it is allocated, check whether its shape matches lhs%a; if not, deallocate it and then allocate it to the shape of lhs%a.

In other words, those are exactly the checks and actions which are done manually in the CHECK_SIZE case.

The final subroutine does nothing of value, so the entire assign subroutine's execution can be replaced by this%a = lhs%a.

(Things would be different if the final subroutine had a substantive effect, or the compiler had been asked to ignore the reallocation rules of intrinsic assignment with -fno-realloc-lhs (gfortran) or -nostandard-realloc-lhs (ifort), for example, or if this%a(:,:)=lhs%a had been used.)

francescalus
  • I appreciate your answer. Maybe I was not completely clear, and I fully understand that the select case is a little bit confusing: I left it there so that I could test some variations. Also thank you for the comments on the variable names, it really went under the radar. – António Carneiro Dec 28 '22 at 09:06
  • The whole idea of the code above is to compare the time required to perform the operation `c=a+b` using native arrays and derived data types (in truth, I am really interested in the operation `a=a+b`, which I assumed would fall in the same context). In the updated version of the code, I think the number of explicit allocations is identical in both cases. However, the timings are exactly the same as before: the derived data type is slower. Maybe I am missing something, but does this change your position in any way? – António Carneiro Dec 28 '22 at 09:10
  • The `sum_type` still allocates the (large) result array every time it is called. There's just no way round having to allocate the allocatable component of the function result. (A clever compiler may be able to optimize things away but will have to account for how you may want to also do `c=a+c`. With intrinsic types and operations, compilers know lots of "tricks" and can do `a=a+1`, for `a` an array, in place, but this is not safe to do for arbitrary defined operations.) – francescalus Dec 28 '22 at 09:49
  • You can perhaps try making `sum_type` elemental. – francescalus Dec 28 '22 at 09:50
  • Ok, now I really see your point. The issue is with the allocation of the result variable in the `sum_type` function. I replaced the `c=a+b` in the `sum_real` subroutine with `c=sum_real_func(a,b)`, where `sum_real_func` just does `c=a+b` and now I get identical run times in both cases. I see that this will be very difficult to circumvent. Unfortunately making `sum_type` elemental did not improve noticeably. – António Carneiro Dec 28 '22 at 10:37
  • You can see additional detail in another [closely related question](https://stackoverflow.com/q/28241473/3157076). – francescalus Dec 28 '22 at 10:59
  • Thank you for the link. I think this is all clear now. I will mark this as answered! – António Carneiro Dec 29 '22 at 12:00