0

In doing some performance testing in Python, I compared the timing for different methods to calculate the Euclidean distance between an array of coordinates. I found my Fortran code compiled with F2PY to be roughly 4x slower than the C implementation used by SciPy. Comparing that C code, to my Fortran code I see no fundamental difference that would lead to the factor of 4 difference. Here is my code (with some comments explaining its use):

        subroutine distance(coor,dist,n)
        double precision coor(n,3),dist(n,n)
        integer n,i,j
        double precision xij,yij,zij

cf2py   intent(in):: coor,n
cf2py   intent(in,out):: dist
cf2py   intent(hide):: xij,yij,zij,

       do 200,i=1,n-1
           do 300,j=i+1,n
               xij=coor(i,1)-coor(j,1)
               yij=coor(i,2)-coor(j,2)
               zij=coor(i,3)-coor(j,3)

               dist(i,j)=dsqrt(xij*xij+yij*yij+zij*zij)

  300   continue
  200   continue

        end

c         1         2         3         4         5         6         7
c123456789012345678901234567890123456789012345678901234567890123456789012
c
c     to setup and incorporate into python (requires numpy):
c
c     # python setup_distance.py build
c     # cp build/lib*/distance.so ./
c
c     to call this from python add the following lines:
c
c     >>> import sys ; sys.path.append('./')
c     >>> from distance import distance
c
c     >>> dist = distance(coor, dist)

Looking at the compile command run by F2PY, I recognized there is no avx compile flag. I tried adding it in the Python setup file using extra_compile_args=['-mavx]` but this had no change to the compile command run by F2PY:

compiling Fortran sources
Fortran f77 compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops
Fortran f90 compiler: /usr/bin/gfortran -Wall -g -fno-second-underscore -fPIC -O3 -funroll-loops
Fortran fix compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -Wall -g -fno-second-underscore -fPIC -O3 -funroll-loops
compile options: '-I/home/user/anaconda/lib/python2.7/site-packages/numpy/core/include -Ibuild/src.linux-x86_64-2.7 -I/home/user/anaconda/lib/python2.7/site-packages/numpy/core/include -I/home/user/anaconda/include/python2.7 -c'
gfortran:f77: ./distance.f
creating build/lib.linux-x86_64-2.7
/usr/bin/gfortran -Wall -g -Wall -g -shared build/temp.linux-x86_64-2.7/build/src.linux-x86_64-2.7/distancemodule.o build/temp.linux-x86_64-2.7/build/src.linux-x86_64-2.7/fortranobject.o build/temp.linux-x86_64-2.7/distance.o -L/home/user/anaconda/lib -lpython2.7 -lgfortran -o build/lib.linux-x86_64-2.7/distance.so
Steven C. Howell
  • 16,902
  • 15
  • 72
  • 97
  • 1
    By default, numpy arrays are stored in C order, so when you call `distance(coor, dist)`, a Fortran-ordered copy of the arguments is made and passed to the actual Fortran routine. I don't know if that would account for the 4x difference, but you could run some timing experiments with Fortran-ordered numpy arrays to find out. – Warren Weckesser Oct 28 '16 at 21:34
  • ^Good one^ @WarrenWeckesser If the fortran code can be compiled on its own, then some compilers will show where it vectorises or not. Assuming you need to do this often, then it may be possible to use a library and link in the library which allows for reuse... – Holmz Oct 29 '16 at 04:26

2 Answers2

1

To answer how to add the avx flag into compiler options.
In your case the f77 complier is being picked gfortran:f77: ./distance.f < That is the key line.
You could try specifying --f77flags=-mavx

Ash Sharma
  • 470
  • 3
  • 18
  • Note that I am using F2PY, so how to set this flag is not obvious. Before I had `extra_compile_args=['-mavx']` which did not work. I just tried changing this to `extra_compile_args=['--f77flags=-mavx']` which gave the following error: `gcc: error: unrecognized command line option ‘--f77flags=-mavx’`. – Steven C. Howell Feb 07 '17 at 12:45
  • I originally used a numpy distutils based `setup.py` file for compilation. I did not figure out how to include the `-mavx` flag. Reading the [F2PY docs](https://docs.scipy.org/doc/numpy/f2py/usage.html) I saw I could compile from the command line using `f2py -c .f --f77flags='-mavx'`. I am not certain this flag is working as the resulting code does not run faster than w/o it but I think this a step in the right direction. – Steven C. Howell Feb 07 '17 at 13:04
  • You can check if the option is being considered in the line that reads: `Fortran f77 compiler:`. Also `f2py -c .f --verbose` option will give you a detailed output. – Ash Sharma Feb 08 '17 at 04:09
  • If you're always going to use the `-mavx` option, you can edit the `numpy/distutils/fcompiler/gnu.py` at line numbers 87 and 260, where the compiler arguments are being set (Look for `compiler_f77` string in case you have a different version and the line numbers don't match). Editing these files may not be recommended, but you can be sure the option is being considered. – Ash Sharma Feb 08 '17 at 04:17
0

In the comments Warren Weckesser explained that Fortran arrays are stored transposed with regards to the C ones. However, one important implication was not mentioned. To traverse the array in the right order, you have to switch the order of the loops. In C the first index is the outer loop, in Fortran the first index should be the inner loop. You are indexing as dist(i,j) so your loops are in wrong order.

But because your j loop depends on the i loop value, you may have to switch the role of the indexes in your array (transpose it).

Some compilers are able to correct some simple loop ordering for you with high enough level of optimizations.

BTW the -funroll-loops is known to be often too aggressive and actually detrimental. Some limits usually should be set.