Inside an OpenMP parallel region, all the threads execute the same code (much as in MPI), and the work is only divided among them when they reach a worksharing construct such as a loop, sections, or tasks.
In your example, the iterations of the loops inside the OMP DO constructs are distributed among the available threads. At the end of each loop an implicit barrier synchronizes all the threads, and then every thread executes the call to DGTSV (so the call is made redundantly, in parallel). After the subroutine has returned, the work of the next loop is split again.
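Roughly, this is the pattern being described (a sketch only; your actual code will differ):

!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO                ! implicit barrier here
CALL DGTSV(...)             ! no worksharing construct: every thread makes this call
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL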
@HristoIliev proposed using an OMP SINGLE construct. It restricts the enclosed piece of code to be executed by only one thread and makes all the other threads wait at the end of it (unless you specify nowait).
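For reference, in Fortran the nowait clause goes on the closing directive; a minimal sketch:

!$OMP SINGLE
CALL DGTSV(...)             ! executed by exactly one thread
!$OMP END SINGLE NOWAIT     ! the other threads do not wait here

In your case you would presumably keep the implicit barrier (i.e. no NOWAIT), since the following loop needs the result of DGTSV.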
Nested parallelism, on the other hand, refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library from inside a parallel region.
By default, OpenMP does not add parallelism for nested parallel regions; instead, only the thread that encounters the inner parallel region executes it. This behaviour can be changed by setting the environment variable OMP_NESTED to true.
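The same can also be requested from within the code. A minimal sketch (the program name is just for illustration):

PROGRAM nested_demo
  USE omp_lib
  IMPLICIT NONE
  ! Equivalent to setting OMP_NESTED=true in the environment
  CALL omp_set_nested(.TRUE.)
  !$OMP PARALLEL NUM_THREADS(4)
  !$OMP PARALLEL NUM_THREADS(2)
  ! With nesting enabled, the inner region gets its own team of 2 threads;
  ! with nesting disabled, it would be executed by a team of 1.
  PRINT *, 'outer thread', omp_get_ancestor_thread_num(1), &
           'inner thread', omp_get_thread_num()
  !$OMP END PARALLEL
  !$OMP END PARALLEL
END PROGRAM nested_demo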
The OMP SINGLE solution is far better than splitting the parallel region in two, since the same thread team is reused for the next loop:
!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL
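Note that both !$OMP END DO and !$OMP END SINGLE carry an implicit barrier, so the second loop does not start until DGTSV has returned, and its iterations are then distributed over the same threads.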
To illustrate the effect of OMP_NESTED, I'll show you some results I obtained with an application that used FFTW (a Fast Fourier Transform library) configured to use OpenMP. The runs were performed on a two-socket, 16-core Intel Xeon E5 node @ 2.46 GHz.
The following graphs show the time spent in the whole application: parallel regions appear where CPUs > 1, serialized regions where CPUs = 1, and synchronization regions where CPUs = 0.
The application is embarrassingly parallel, so in this particular case nesting is not worthwhile (FFTW does not scale that well).
This is the OMP_NESTED=false execution. Observe how the parallelism is limited by the number of threads in the outer parallel region (ftdock).

This is the OMP_NESTED=true execution. In this case, parallelism can be increased beyond the number of threads in the outer parallel region. The maximum parallelism possible here is 16, reached either when the 8 outer threads each create one additional peer to execute the inner parallel region, or when 4 outer threads create 3 additional threads each (8x2 = 4x4 = 16).
