No speedup with OpenMP when using Matlab MEX in Linux

Question

I'm using OpenMP to speed up Fortran code in a Matlab MEX-file. However, OpenMP does not seem to work on Linux, although it does work on Windows. The code is as follows:

1) Matlab Mex file:

clc; clear all; close all;   tic

FLAG_SYS = 0; % 0 for Windows; 1 for Linux

%--------------------------------------------------------------------------
% Mex Fortran code 
%--------------------------------------------------------------------------
if FLAG_SYS == 0
    mex COMPFLAGS="-Qopenmp $COMPFLAGS"...
        LINKFLAGS="/Qopenmp $LINKFLAGS"...   
        OPTIMFLAGS="/Qopenmp $OPTIMFLAGS"...
        '-IC:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.5.267\windows\mkl\include'...
        '-LC:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.5.267\windows\mkl\lib\intel64'...
        -lmkl_intel_ilp64.lib -lmkl_intel_thread.lib -lmkl_core.lib libiomp5md.lib...
        Test_OpenMP_Mex.f90...
        -output Test_OpenMP_Mex  

elseif FLAG_SYS == 1        
    mex COMPFLAGS="-fopenmp $COMPFLAGS"...
        LINKFLAGS="-fopenmp $LINKFLAGS"...  
        FFLAGS='$FFLAGS -fdec-math -cpp' ...
        '-I${MKLROOT}/include'...
        '-L${MKLROOT}/lib'...
        -lmkl_avx2 -lmkl_gf_ilp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl...
        Test_OpenMP_Mex.f90...
        -output Test_OpenMP_Mex           
end

Test_OpenMP_Mex;

2) Fortran code

#include "fintrf.h"

     !GATEWAY ROUTINE
      SUBROUTINE MEXFUNCTION(NLHS, PLHS, NRHS, PRHS)

     !DECLARATIONS
      IMPLICIT NONE

     !MEXFUNCTION ARGUMENTS:
      MWPOINTER PLHS(*), PRHS(*)
      INTEGER NLHS, NRHS

     !FUNCTION DECLARATIONS:
      MWPOINTER MXCREATEDOUBLEMATRIX

      MWPOINTER MXGETM, MXGETN
      INTEGER MXISNUMERIC 

     !POINTERS TO INPUT MXARRAYS:
      MWPOINTER MIV1, MIV2

     !POINTERS TO OUTPUT MXARRAYS:
      MWPOINTER MOV1, MOV2

     !CALL FORTRAN CODE
     CALL  TEST_OPENMP


      RETURN

      END

!-----------------------------------------------------------------------
    SUBROUTINE TEST_OPENMP

        USE OMP_LIB

        IMPLICIT NONE

        INTEGER I, J, K, STEP
        REAL*8  STARTTIME, ENDTIME,Y


        OPEN(1,FILE='1.TXT') 

        !COUNT ELAPSED TIME START
        STARTTIME = OMP_GET_WTIME() 

        DO I = 1,1000000
            DO J = 1,50000
                DO K = 1,1000
                    Y=(I+10)*J-SQRT(789.1)+SQRT(789.1)-(I+10)*J
                END DO
            END DO
        END DO     


        ENDTIME = OMP_GET_WTIME()
        WRITE(1,*) ENDTIME-STARTTIME

        !COUNT ELAPSED TIME START
        STARTTIME = OMP_GET_WTIME() 

!$OMP PARALLEL
!$OMP DO PRIVATE(I,J)
        DO I = 1,1000000
            DO J = 1,50000
                DO K = 1,1000
                    Y=(I+10)*J-SQRT(789.1)+SQRT(789.1)-(I+10)*J
                END DO
            END DO
        END DO     
!$OMP END DO  
!$OMP END PARALLEL        

        ENDTIME = OMP_GET_WTIME()
        WRITE(1,*) ENDTIME-STARTTIME 

!$OMP PARALLEL        
        ! GET THE NUMBER OF THREADS
        WRITE(1,*) OMP_GET_THREAD_NUM(), OMP_GET_NUM_THREADS() 
!$OMP END PARALLEL         
        CLOSE(1)

        RETURN

      END SUBROUTINE TEST_OPENMP

The output on Windows is:

   1.09620520001044     
   4.50355500000296     
   0           6
   1           6
   3           6
   5           6
   2           6
   4           6

and the output on Linux is:

   0.0000   
   0.0000    
   0           1

It's obvious that OpenMP works on Windows, since the calculation time reduces from 4.5 s to 1.0 s, and I can see that 6 threads are being used for the calculation. However, on Linux, no calculation seems to be executed, and only one thread is reported (the Linux machine has 36 threads available, but only 1 of them is used).

Any suggestions are welcome!

You can directly download code from this link: https://www.dropbox.com/sh/crkuwhu22407sjs/AAAQrtzAvTmFOmAxv_jpTCBaa?dl=0

If you are worried about the number of threads you are getting, it's simpler just to `Write( *, * ) omp_get_thread_num(), omp_get_num_threads()` just after you enter the parallel region – Ian Bush, May 27 '20 at 07:36
@IanBush I tried what you suggested, and it outputs 0 and 1, only once! I think this means that there is only one thread, numbered 0, in the parallel region. So it comes back to the question: is `!$OMP PARALLEL` not recognized by MEX, and how can I fix it? – kun zhao, May 27 '20 at 08:20
@VladimirF This code is one of the subroutines that prepares variables for Pardiso in MKL. The code runs correctly, but without speedup. I've added the code to the question – kun zhao, May 27 '20 at 08:20
But the subroutine you call in the parallel region is needed, INITIAL_BANK_STRESS. – Vladimir F Героям слава, May 27 '20 at 08:25
@CrisLuengo I changed the question, which is now much simplified – kun zhao, May 27 '20 at 10:43
@IanBush I changed the question, which is now much simplified – kun zhao, May 27 '20 at 10:43
So, how do you set the number of threads on Linux? What are the values of environment variables like `OMP_NUM_THREADS`? Which part of the code shows 36 threads? – Vladimir F Героям слава, May 27 '20 at 10:44
Sorry for the spamming @. I know that through `omp_get_max_threads`, and it also corresponds to the total number of threads of one core on the Linux HPC. Another thing is that I cannot use `CALL OMP_SET_NUM_THREADS()` on Linux; it reports `undefined symbol: omp_set_num_threads_8_`, but I can use the function to set the thread number on Windows. – kun zhao, May 27 '20 at 12:33
Your execution time goes from 1s for the sequential part to 4.5s for the parallel part. The parallel part writes to a shared `K` and `Y`, which is likely why it's slower. To see a speedup you need to do a parallel for, which you don't have. Each thread runs through the same calculations. – Cris Luengo, May 27 '20 at 14:02
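
The data-sharing concern in the last comment can be illustrated with a minimal sketch. The subroutine name `TEST_OPENMP_PRIVATE`, the reduced loop bounds, and the `TOTAL` accumulator are assumptions made for illustration and are not part of the original code. Note that in Fortran the indices of sequential loops nested inside a parallel construct are already private by default, so `Y` is the variable that actually needs the clause; `J` and `K` are listed only to make the intent explicit.

```fortran
! Sketch only: hypothetical routine, not the asker's code.
SUBROUTINE TEST_OPENMP_PRIVATE(TOTAL, ELAPSED)
    USE OMP_LIB
    IMPLICIT NONE
    DOUBLE PRECISION, INTENT(OUT) :: TOTAL, ELAPSED
    INTEGER I, J, K
    DOUBLE PRECISION Y, STARTTIME

    TOTAL = 0.0D0
    STARTTIME = OMP_GET_WTIME()

    ! I is private as the work-shared loop index. Y gets one copy per
    ! thread via PRIVATE. The REDUCTION gives the loop an observable
    ! result, so the compiler cannot discard the work.
!$OMP PARALLEL DO PRIVATE(J, K, Y) REDUCTION(+:TOTAL)
    DO I = 1, 1000
        DO J = 1, 1000
            DO K = 1, 100
                Y = SQRT(DBLE(I + 10) * J + K)
                TOTAL = TOTAL + Y
            END DO
        END DO
    END DO
!$OMP END PARALLEL DO

    ELAPSED = OMP_GET_WTIME() - STARTTIME
END SUBROUTINE TEST_OPENMP_PRIVATE
```

Returning `TOTAL` to the caller matters for timing: a loop whose result is never used may be removed entirely by the optimizer, which would make any measured time meaningless.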

Cris Luengo · Accepted Answer · 2020-05-27T15:43:11.510

When compiling MEX-files under Linux (and MacOS) the COMPFLAGS variable is ignored. It is a Windows-specific environment variable. You need to use CFLAGS for C, CXXFLAGS for C++, or FFLAGS for Fortran, and LDFLAGS for the linker. These are the standard Unix environment variables to control compilation.

Your compile command will look like this:

mex LDFLAGS='-fopenmp $LDFLAGS'...
    FFLAGS='-fopenmp -fdec-math -cpp $FFLAGS' ...
    '-I${MKLROOT}/include'...
    '-L${MKLROOT}/lib'...
    -lmkl_avx2 -lmkl_gf_ilp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl...
    Test_OpenMP_Mex.f90...
    -output Test_OpenMP_Mex

Or, in other words, "You're not compiling with OpenMP enabled, so it's not a surprise that it has no effect!" — Jim Cownie, May 27 '20 at 15:37
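
One way to confirm whether OpenMP was actually enabled when the MEX-file was built is the OpenMP conditional-compilation sentinel `!$`. The following is a minimal sketch, assuming free-form source; `CHECK_OPENMP` is a hypothetical helper name, not part of the original code.

```fortran
! Sketch only: CHECK_OPENMP is a hypothetical helper.
SUBROUTINE CHECK_OPENMP(ENABLED, NTHREADS)
!$  USE OMP_LIB
    IMPLICIT NONE
    LOGICAL, INTENT(OUT) :: ENABLED
    INTEGER, INTENT(OUT) :: NTHREADS

    ! Defaults that remain in place when OpenMP is not enabled
    ENABLED  = .FALSE.
    NTHREADS = 1

    ! Lines beginning with the !$ sentinel are compiled only when the
    ! compiler runs with OpenMP enabled (e.g. -fopenmp for gfortran);
    ! otherwise they are treated as ordinary comments.
!$  ENABLED  = .TRUE.
!$  NTHREADS = OMP_GET_MAX_THREADS()
END SUBROUTINE CHECK_OPENMP
```

If `ENABLED` comes back false, the `!$OMP` directives in the same build were ignored as comments as well, which would match the single-thread output seen on Linux.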

score 1 · Answer 2 · answered May 28 '20 at 08:49

There is one point you shouldn't miss when linking against the Intel MKL ILP64 versions of the libraries: you need to add the compiler option that makes default integers 64-bit (-i8 for Intel Fortran, -fdefault-integer-8 for gfortran); otherwise, you may see unexpected segfaults. Please refer to the MKL Link Line Advisor for more details: https://software.intel.com/content/www/us/en/develop/articles/intel-mkl-link-line-advisor.html

Gennady.F

That is not strictly true. It is enough to pass 64-bit integers in the arguments. The compiler option is not necessary and may well be incompatible with other libraries being used. And it does not answer the question either. But the OP probably wants LP64. – Vladimir F Героям слава May 28 '20 at 10:23
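
The alternative mentioned in this comment, passing 64-bit integers explicitly instead of changing the default integer size, might look like the following sketch. It assumes the ILP64 interface library (such as `-lmkl_gf_ilp64`) is linked; `DOT_ILP64` is a hypothetical wrapper name, and `DDOT` is the standard BLAS dot product.

```fortran
! Sketch only: assumes an ILP64 BLAS/MKL interface is linked.
SUBROUTINE DOT_ILP64(DOTVAL)
    USE ISO_FORTRAN_ENV, ONLY: INT64, REAL64
    IMPLICIT NONE
    REAL(REAL64), INTENT(OUT) :: DOTVAL
    REAL(REAL64), EXTERNAL :: DDOT      ! standard BLAS routine
    INTEGER(INT64) :: N, INCX, INCY     ! 64-bit, as ILP64 expects
    REAL(REAL64) :: X(3), Y(3)

    X = (/ 1.0_REAL64, 2.0_REAL64, 3.0_REAL64 /)
    Y = (/ 4.0_REAL64, 5.0_REAL64, 6.0_REAL64 /)
    N = 3_INT64
    INCX = 1_INT64
    INCY = 1_INT64

    ! Only the integers handed to the library are 64-bit; the rest of
    ! the program keeps its default integer kind.
    DOTVAL = DDOT(N, X, INCX, Y, INCY)
END SUBROUTINE DOT_ILP64
```

With the LP64 interface instead, the same call would use default (32-bit) integers and need no special declarations.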