Threadprivate allocatable performance issues with OpenMP and Fortran

Question

I have a parallel section of code that uses a THREADPRIVATE ALLOCATABLE array of a derived type which, in turn, contains other ALLOCATABLE variables:

MODULE MYMOD
  TYPE OBJ
    REAL, DIMENSION(:), ALLOCATABLE :: foo1
    REAL, DIMENSION(:), ALLOCATABLE :: foo2
  END TYPE

  TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  priv

  TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  shared

  !$OMP THREADPRIVATE(priv)

END MODULE

Each thread uses the variable "priv" as a buffer for heavy calculations, and its contents are then copied to a shared variable.

MODULE MOD2

CONTAINS

  SUBROUTINE DOSTUFF()
    INTEGER :: n, dim

    !$OMP PARALLEL PRIVATE(n,dim)

    CALL ALLOCATESTUFF(n,dim)
    CALL HEAVYSTUFF()
    CALL COPYSTUFFONSHARED()

    !$OMP END PARALLEL

  END SUBROUTINE DOSTUFF

  SUBROUTINE ALLOCATESTUFF(n,dim)
  USE MYMOD, ONLY : priv
    INTEGER, INTENT(IN) :: n, dim
    INTEGER :: i

    ! Each thread allocates its own copy of priv and of its components
    ALLOCATE(priv(n))
    DO i=1,n
      ALLOCATE(priv(i)%foo1(dim))
      ALLOCATE(priv(i)%foo2(dim))
    ENDDO

  END SUBROUTINE ALLOCATESTUFF

  SUBROUTINE COPYSTUFFONSHARED()
  USE MYMOD
    ...
  END SUBROUTINE COPYSTUFFONSHARED

  SUBROUTINE HEAVYSTUFF()
  USE MYMOD, ONLY : priv
    ...
  END SUBROUTINE HEAVYSTUFF

END MODULE

I'm running this code on a machine with two CPUs, each with 10 cores, and I see a sharp loss of performance beyond 10 threads: the code scales linearly up to 10 threads, and past that point the slope of the speed-up curve drops markedly. I see very similar behavior on a machine with 8 CPUs, each with 4 cores, but there the loss appears at around 5 or 6 threads.

As an order of magnitude, "n" of priv is small (less than 10), whereas "dim" for each "foo" is on the order of a few million.

My guess from this behavior is that there is some kind of bottleneck in memory access caused by the interconnect between the CPUs. The strange part is that if I measure the time spent in HEAVYSTUFF and COPYSTUFFONSHARED separately, it is HEAVYSTUFF that slows down, whereas COPYSTUFFONSHARED shows an "almost linear" speed-up.

The question is: am I guaranteed that the memory of a THREADPRIVATE derived type is actually allocated locally, on the CPU where the owning thread runs? If so, what else could explain this behavior? If not, how can I force data locality?

Thank you

You need an implementation of OpenMP which supports affinity. Prior to OpenMP 4, environment variables were implementation dependent. — tim18, Sep 07 '17 at 12:10
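To check whether the runtime in use actually binds threads, a small diagnostic along the following lines may help. It is only a sketch: the program name SHOW_BINDING is made up for illustration, and it assumes a compiler whose OpenMP runtime provides the OpenMP 4.5 routines omp_get_place_num and omp_get_num_places.

```fortran
! Hypothetical diagnostic (not part of the original code): prints the
! OpenMP place each thread is bound to.
PROGRAM SHOW_BINDING
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER :: tid, place, nplaces

  !$OMP PARALLEL PRIVATE(tid, place, nplaces)
  tid     = OMP_GET_THREAD_NUM()
  place   = OMP_GET_PLACE_NUM()     ! -1 if the thread is not bound to a place
  nplaces = OMP_GET_NUM_PLACES()    ! 0 if no place list is defined

  ! Serialise the output so that lines from different threads do not mix
  !$OMP CRITICAL
  PRINT '(A,I0,A,I0,A,I0)', 'thread ', tid, ' is on place ', place, &
        ' of ', nplaces
  !$OMP END CRITICAL
  !$OMP END PARALLEL
END PROGRAM SHOW_BINDING
```

Running it with the standard OpenMP 4.0 environment variables set, for example OMP_PLACES=cores and OMP_PROC_BIND=spread, should show each thread on a distinct place. If every thread reports place -1, the threads are not bound and the operating system is free to migrate them between sockets.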
The `threadprivate` data sharing class only means that each thread gets a separate copy of the variable. It has nothing to do with the place where the memory gets allocated, though some memory allocators are more thread-aware than others. Most operating systems have a "first-touch" NUMA allocation policy, which means that physical memory is allocated preferentially from the NUMA node where the thread that first writes to some memory page executes. — Hristo Iliev, Sep 08 '17 at 16:02
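Under a first-touch policy, what matters is which thread first writes to the buffers, not which thread calls ALLOCATE. The sketch below allocates and immediately initialises the thread-private buffers inside the parallel region, so each thread touches its own pages. The module name FIRSTTOUCH_DEMO, the routine ALLOCATE_AND_TOUCH and the sizes are illustrative assumptions, not the asker's code. It also assumes that threads are bound to cores and that the operating system really applies first-touch placement. Whether this accounts for the slowdown described in the question is not established here.

```fortran
! Sketch: allocate AND initialise thread-private buffers inside the
! parallel region so that each thread is the first to write to its pages.
MODULE FIRSTTOUCH_DEMO
  IMPLICIT NONE
  TYPE OBJ
    REAL, DIMENSION(:), ALLOCATABLE :: foo1
    REAL, DIMENSION(:), ALLOCATABLE :: foo2
  END TYPE
  TYPE(OBJ), DIMENSION(:), ALLOCATABLE, SAVE :: priv
  !$OMP THREADPRIVATE(priv)
CONTAINS
  SUBROUTINE ALLOCATE_AND_TOUCH(n, dim)
    INTEGER, INTENT(IN) :: n, dim
    INTEGER :: i
    ALLOCATE(priv(n))
    DO i = 1, n
      ALLOCATE(priv(i)%foo1(dim))
      ALLOCATE(priv(i)%foo2(dim))
      ! The first write maps the pages; with first-touch placement they are
      ! taken from the NUMA node of the thread executing these lines.
      priv(i)%foo1 = 0.0
      priv(i)%foo2 = 0.0
    END DO
  END SUBROUTINE ALLOCATE_AND_TOUCH
END MODULE FIRSTTOUCH_DEMO

PROGRAM DEMO
  USE FIRSTTOUCH_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 4, dim = 1000000   ! illustrative sizes only
  INTEGER :: nready

  nready = 0
  !$OMP PARALLEL REDUCTION(+:nready)
  CALL ALLOCATE_AND_TOUCH(n, dim)
  ! Count the threads that now own an allocated, initialised buffer
  IF (ALLOCATED(priv)) nready = nready + 1
  DEALLOCATE(priv)   ! also releases the allocatable components
  !$OMP END PARALLEL

  PRINT '(A,I0)', 'threads with their own buffers: ', nready
END PROGRAM DEMO
```

The design point is that ALLOCATE usually only reserves virtual address space, while physical pages are assigned on the first write. If one thread initialised all the buffers, or if unbound threads later migrated to another socket, the data could end up on a remote NUMA node even though the variable is THREADPRIVATE.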
