
I've been trying to use f2py to interface an optimized Fortran code for vector and matrix multiplication with Python. To obtain a performance comparison useful for my purposes, I perform the same product inside a loop 100000 times. With pure Fortran the product takes 2.4 s (ifort), while with f2py it takes approx. 11 s. Just for reference, with NumPy it takes approx. 20 s. I have both the Fortran and the Python parts print the time difference before and after the loop, and with f2py they both report 11 s, so the code is not losing time passing arrays. I tried to work out whether it is related to the way NumPy arrays are stored, but I can't find the problem. Do you have any idea? Thanks in advance

Fortran main:

program Main
    implicit none

    integer :: i, j, k
    integer, parameter :: states = 15
    integer, parameter :: tessere = 400
    real, dimension(tessere,states,states) :: matrix
    real, dimension(states) :: vector
    real :: prod(tessere)

    ! Fill the matrix and the vector with simple test values
    do i=1,tessere
       do j=1,states
          do k=1,states
              matrix(i,j,k) = i+j+k
          end do
       end do
    end do
    do i=1,states
        vector(i) = i
    end do

    call doubleSum(vector,vector,matrix,states,tessere,prod)

end program Main

Fortran subroutine:

subroutine doubleSum(ket, bra, M, states, tessere, prod)
    implicit none
    integer, intent(in) :: states
    integer, intent(in) :: tessere
    real, dimension(tessere,states,states), intent(in) :: M
    real, dimension(states), intent(in) :: ket
    real, dimension(states), intent(in) :: bra
    real, dimension(tessere), intent(out) :: prod
    real, dimension(tessere,states) :: ctmp
    integer :: its, j, k, t
    real :: start, finish

    call cpu_time(start)
    ! Repeat the product 100000 times for timing purposes
    do t=1,100000
        ! ctmp(its,k) = sum over j of M(its,k,j)*ket(j)
        ctmp = 0.0
        do k=1,states
             do j=1,states
                do its=1,tessere
                   ctmp(its,k) = ctmp(its,k) + M(its,k,j)*ket(j)
                end do
             end do
        end do
        ! prod(its) = sum over k of bra(k)*ctmp(its,k)
        do its=1,tessere
            prod(its) = dot_product(bra, ctmp(its,:))
        end do
    end do
    call cpu_time(finish)
    print '("Time = ",f6.3," seconds.")', finish-start
end subroutine doubleSum

Python script:

import numpy as np
import time
import cicloS

# Default NumPy arrays: C-ordered, 64-bit floats
M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)

#M = np.asfortranarray(M)
#ket = np.asfortranarray(ket)

start = time.time()
prod = cicloS.doublesum(ket, ket, M)
end = time.time()
print(end-start)

.pyf file generated with f2py and edited:

!    -*- f90 -*-
! Note: the context of this file is case sensitive.

python module cicloS 
    interface  
        subroutine doublesum(ket,bra,m,states,tessere,prod) 
            real dimension(states),intent(in) :: ket
            real dimension(states),depend(states),intent(in) :: bra
            real dimension(tessere,states,states),depend(states,states),intent(in) :: m
            integer, optional,check(len(ket)>=states),depend(ket) :: states=len(ket)
            integer, optional,check(shape(m,0)==tessere),depend(m) :: tessere=shape(m,0)
            real dimension(tessere),intent(out) :: prod
        end subroutine doublesum
    end interface
end python module cicloS
MartaR
  • There seems to be an error in the Fortran subroutine you provided: shouldn't `integer, dimension(tessere) :: prod` be `real, dimension(tessere) :: prodotto`, and all occurrences of `prod` in the subroutine be replaced with `prodotto`? What kind of compilation flags do you use to compile the ifort version of your code, and are you using the same ones for f2py? (Using `gfortran -O3 sub.f90 main.f90` to compile, I get the same run time with plain Fortran as with the f2py version - around 1.3 seconds for both) – jbdv Jan 30 '19 at 12:26
    Errors in the Fortran routine were typos in the question, which I corrected. Actually, checking the compilation flags, I discovered that the default on my lab cluster when calling f2py is gfortran -O3, while if the ifort module is loaded it becomes ifort -O1, so this was the reason for the difference in performance. Do you know if it is possible to tell f2py which compiler and flags to use? – MartaR Jan 31 '19 at 15:35
  • Thought that was it; I should rather have said 'typos'/'inconsistencies' than 'errors' :) Different compiler flags (and `ifort` vs. `gfortran`!) could definitely explain the timing differences you observe. You can specify (additional) compiler flags by adding e.g. `--opt='-O1'` to your f2py command, and see `--help-fcompiler` for finding out how to specify which Fortran compiler f2py should use. – jbdv Jan 31 '19 at 16:13
  • f2py flags work perfectly! I'm sure they are there, but I wasn't able to find them online! Thank you very much for your help! – MartaR Feb 01 '19 at 09:58
  • I just posted an answer summarizing everything (with links to the appropriate documentation), for future reference. – jbdv Feb 01 '19 at 10:34

1 Answer


The OP has indicated that the observed execution time difference, between standalone and F2PY compiled versions of the code, was due to different compilers and compiler flags being used.

In order to obtain consistent results, and thereby answer the question, it is necessary to ensure that F2PY uses the desired 1) compiler and 2) compiler flags.

Part 1: Specify which Fortran compiler should be used by F2PY

A list of Fortran compilers available to F2PY on the target system can be displayed by executing e.g. python -m numpy.f2py -c --help-fcompiler. On my system, this produces (truncated):

Fortran compilers found:
  --fcompiler=gnu95    GNU Fortran 95 compiler (7)
  --fcompiler=intelem  Intel Fortran Compiler for 64-bit apps (19.0.1.144)

You can instruct F2PY which of the available Fortran compilers to use by adding an appropriate --fcompiler flag to your compile command. To use ifort, e.g. (assuming you have already created and edited a cicloS.pyf file):

python -m numpy.f2py --fcompiler=intelem -c cicloS.pyf sub.f90
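
For reference, the signature file itself can be initially generated with F2PY's -h option before hand-editing; a sketch, assuming the subroutine is saved as sub.f90:

python -m numpy.f2py sub.f90 -m cicloS -h cicloS.pyf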

Part 2: Specify additional compiler flags to be used by F2PY

Note that the output from --help-fcompiler in the previous step also displays the default compiler flags (see e.g. compiler_f90) that F2PY defines for each available compiler. Again on my system, this was (truncated and simplified to the most relevant flags):

  • gnu95: -O3 -funroll-loops
  • intelem: -O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512

You can then specify preferred optimisation flags for F2PY with the --opt flag in your compile command (see also --f90flags in the documentation), which now becomes e.g.:

python -m numpy.f2py --fcompiler=intelem --opt='-O1' -c cicloS.pyf sub.f90
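
Flags that are not covered by --opt can be passed in the same way via --f90flags; for example (-xHost here is just an illustrative ifort flag, not something this question requires):

python -m numpy.f2py --fcompiler=intelem --opt='-O1' --f90flags='-xHost' -c cicloS.pyf sub.f90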

Comparing run times for the standalone and F2PY versions:

Compiling a standalone executable with ifort -O1 sub.f90 main.f90 -o main, and the F2PY compiled version from Part 2, I get the following output:

./main
Time =  5.359 seconds.

python test.py
Time =  5.297 seconds.
5.316878795623779

Then, compiling a standalone executable with ifort -O3 sub.f90 main.f90 -o main, and the (default) F2PY compiled version from Part 1, I get these results:

./main
Time =  1.297 seconds.

python test.py
Time =  1.219 seconds.
1.209657907485962

This shows similar results for the standalone and F2PY versions, as well as the influence of the compiler flags.

Comment on temporary arrays

Although not the cause of the slowdown you observe, do note that F2PY is forced to make temporary copies of the arrays M (and ket) in your Python example for two reasons:

  • the 3D array M that you pass to cicloS.doublesum() is a default NumPy array, with C (row-major) ordering. Since Fortran uses column-major ordering, F2PY will make array copies. The commented-out np.asfortranarray() calls in your script would correct this part of the problem.
  • the second reason for array copies (which also affects ket) is the mismatch between the real kinds on the Python side (default 64-bit, double-precision floats) and the Fortran side (real gives the compiler's default kind, likely a 32-bit float) of your example, so copies are again made to account for this.

You can be notified when array copies are made by adding a -DF2PY_REPORT_ON_ARRAY_COPY=1 flag (see the documentation) to your F2PY compile command. In your case, array copies can be avoided completely by changing the dtype of your M and ket arrays in Python (i.e. M=np.asfortranarray(M, dtype=np.float32) and ket=np.asfortranarray(ket, dtype=np.float32)), or alternatively by defining the real variables in your Fortran code with the appropriate kind (e.g. add use, intrinsic :: iso_fortran_env, only : real64 to your subroutine and main program, and declare the reals with real(kind=real64)).
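
As a minimal sketch of the first, copy-free Python variant (assuming the Fortran side keeps its default 32-bit reals and the module is still compiled as cicloS):

import numpy as np
import cicloS

# Fortran (column-major) ordering plus 32-bit floats, matching the default
# `real` kind on the Fortran side, so F2PY can pass the arrays through
# without making temporary copies.
M = np.asfortranarray(np.random.rand(400, 15, 15), dtype=np.float32)
ket = np.asfortranarray(np.random.rand(15), dtype=np.float32)

prod = cicloS.doublesum(ket, ket, M)

Compiled with -DF2PY_REPORT_ON_ARRAY_COPY=1, this version should no longer trigger any copy reports.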

jbdv