
I get a 512^3 array representing a temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, the maximum, and the mean of this array, and since I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very simple routine.

  integer gridsize,unit,j
  real mini,maxi,tmp
  double precision mean

  gridsize=512
  unit=40
  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  mini=tmp
  maxi=tmp
  mean=tmp
  do j=2,gridsize**3
      read(unit=unit) tmp
      if(tmp>maxi)then
          maxi=tmp
      elseif(tmp<mini)then
          mini=tmp
      end if
      mean=mean+tmp
  end do
  mean=mean/gridsize**3
  close(unit=unit)

This takes about 25 seconds per file on the machine I use. That struck me as rather long, so I went ahead and did the following in Python:

    import numpy

    mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\
                                  shape=(512,512,512),order='F')
    mini=numpy.amin(mmap)
    maxi=numpy.amax(mmap)
    mean=numpy.mean(mmap)

Now, I expected this to be faster of course, but I was really blown away: under identical conditions it takes less than a second. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I trust it somewhat more), but only in the 7th significant digit or so.

How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?

EDIT:

To answer the questions in the comments:

  • Yes, I also ran the Fortran routine with 32-bit and 64-bit floats, but it had no impact on performance.
  • I used `iso_fortran_env`, which provides 128-bit floats.
  • Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
  • I ran both routines on different files and in a different order, so the caching should have been fair in the comparison, I guess?
  • I actually tried OpenMP, but only to read from the file at different positions at the same time. Having read your comments and answers, this sounds really stupid now, and it made the routine take a lot longer as well. I might give it a try on the array operations (see the sketch after this list), but maybe that won't even be necessary.
  • The files are actually 1/2G in size; that was a typo, thanks.
  • I will try the array implementation now.
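
For reference, here is a minimal sketch (my own illustration, not code from the thread) of what OpenMP reductions over the array operations could look like, assuming the whole file has already been read into an array `tmp(gridsize,gridsize,gridsize)` as the answers below suggest:

! Sketch only: assumes tmp and gridsize are declared and tmp is filled.
real :: mini, maxi
double precision :: mean
integer :: i, j, k

mini = tmp(1,1,1)
maxi = tmp(1,1,1)
mean = 0.d0
! OpenMP combines each thread's private partial results at the end.
!$omp parallel do reduction(min:mini) reduction(max:maxi) reduction(+:mean)
do k = 1, gridsize
  do j = 1, gridsize
    do i = 1, gridsize
      mini = min(mini, tmp(i,j,k))
      maxi = max(maxi, tmp(i,j,k))
      mean = mean + dble(tmp(i,j,k))   ! widen before accumulating
    end do
  end do
end do
!$omp end parallel do
mean = mean / dble(gridsize)**3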

EDIT 2:

I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy, but now I have the precision problem that @Luaan pointed out I might get. Using a 32-bit float array, the mean computed by `sum` is 20% off. Doing

...
real,allocatable :: tmp (:,:,:)
double precision,allocatable :: tmp2(:,:,:)
...
tmp2=tmp
mean=sum(tmp2)/size(tmp)
...

solves the issue, but increases the computing time (not by very much, but noticeably). Is there a better way to get around this? I couldn't find a way to read singles from the file directly into doubles. And how does numpy avoid this?
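
For what it's worth, one way to avoid allocating a second full-size array (a sketch of mine, not something suggested in the thread) is to keep the file data in single precision and widen only one 2D slice at a time while accumulating the sum in double precision:

! Sketch only: tmp is the single-precision array read from the file.
double precision :: mean
integer :: k

mean = 0.d0
do k = 1, gridsize
   ! real(..., kind=kind(1.d0)) widens one 512x512 slice to double,
   ! so only a small temporary is needed instead of a full copy.
   mean = mean + sum(real(tmp(:,:,k), kind=kind(1.d0)))
end do
mean = mean / dble(gridsize)**3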

Thanks for all the help so far.

user35915
  • Did you try the Fortran routine without 128-bit floats? I'm not aware of any hardware that actually supports those, so they'd have to be done in software. – user2357112 Nov 15 '15 at 19:33
  • 4
    What if you try the Fortran version using an array (and in particular using one read rather than a billion)? – francescalus Nov 15 '15 at 19:57
  • 9
    Did you consider using array operators in Fortran as well? Then, you could try `minval()`, `maxval()`, and `sum()`? Furthermore, you are mixing IO with the operations in Fortran, but not in Python - that is not a fair comparison ;-) – Alexander Vogt Nov 15 '15 at 20:00
  • 4
    When benchmarking something involving a large file, be sure it's cached the same for all runs. – Tom Zych Nov 15 '15 at 22:42
  • Because of the complete parallelizability of your problem I would propose the use of OpenMP (wrappers available for many languages including Fortran (90)). It has the function `MP_Reduce` that has a couple of functions already built in, for example `MPI_MIN`, `MPI_MAX`, and `MPI_MEAN`. Simply using it recursively should do the trick. See e.g.: https://doc.itc.rwth-aachen.de/download/attachments/3474275/OpenMP_and_MPI_for_Dummies_fortran.pdf?version=1&modificationDate=1391432801000&api=v2 for a start. Because doing number crunching linearly is, well... – deamentiaemundi Nov 16 '15 at 01:28
  • 1
    Also note that precision is a rather big deal in Fortran, and it comes at a cost. Even after you fix all those obvious issues with your Fortran code, it might very well be that the extra precision is required and will cause a significant speed loss. – Luaan Nov 16 '15 at 08:56
  • 1
    If the file is roughly 1 GB I would think it contains 64 bits floats? I.e. `1024**3 == 512**3 * 8` –  Nov 16 '15 at 09:20
  • @deamentiaemundi You seem to be mixing up MPI with OpenMP. They are 2 completely different things. BTW, a question on the `fortran` tag with so many upvotes is seen very rarely (envying). – Vladimir F Героям слава Nov 16 '15 at 15:38
  • @VladimirF you normally use both, OpenMP and OpenMPI. It makes a lot of things simpler. A very colorful (sorry for that, not mine) example for pi: https://doc.itc.rwth-aachen.de/download/attachments/8750213/05-OpenMP_C_Quicky.pdf?version=1&modificationDate=1397737196000&api=v2 And Fortran seems to have its fans, still! ;-) – deamentiaemundi Nov 16 '15 at 17:11
  • 1
    @deamentiaemundi Yes, you can use both, but they are separate and very, very different. Don't mix them that way. And it is just MPI. OpenMPI is just one particular implementation of MPI (others are MPICH, MVAPICH, Intel MPI, MS MPI,...). P.S. I just envy the guys their fake internet points ;) – Vladimir F Героям слава Nov 16 '15 at 17:16
  • 1
    You can tell numpy to use double precision accumulator when computing the mean: `mean=numpy.mean(mmap, dtype=np.float64)` – credondo Nov 16 '15 at 17:28
  • @VladimirF they mix *very* well ;-) (BTW: if I give an example for an implementation I always point to an open one if one is available; that's the single reason for the naming of OpenMP(I)). But seriously: if you do number crunching you have clusters of independent processors with their own memory. Each processor has multiple cores nowadays, which can share their memory directly. So you use OpenMP for the CPUs and MPI for communication in the cluster, and in reality it's way more complex. – deamentiaemundi Nov 16 '15 at 17:42

2 Answers


Your Fortran implementation suffers from two major shortcomings:

  • You mix IO and computations (and read from the file entry by entry).
  • You don't use vector/matrix operations.

This implementation performs the same operation as yours and is faster by a factor of 20 on my machine:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3
  print *, mini, maxi, mean

end program

The idea is to read the whole file into one array `tmp` in one go. Then, I can use the functions `MAXVAL`, `MINVAL`, and `SUM` on the array directly.


For the accuracy issue: Simply using double precision values and doing the conversion on the fly as

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that only increased the required time at the default optimization level.

At `-O3`, the element-wise addition performs ~3% better than the array operation. The difference between double- and single-precision operations is less than 2% on my machine, on average (the individual runs deviate by far more).
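
For concreteness, the element-wise double-precision accumulation could look like the following (my sketch; the answer does not show the exact loop that was benchmarked):

! Sketch only: element-wise accumulation, widening each value on the fly.
double precision :: dmean
integer :: i, j, k

dmean = 0.d0
do k = 1, gridsize
  do j = 1, gridsize
    do i = 1, gridsize
      dmean = dmean + real(tmp(i,j,k), kind=kind(1.d0))
    end do
  end do
end do
dmean = dmean / dble(gridsize)**3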


Here is a very fast implementation using LAPACK:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)
  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *,real(t2-t1)/real(rate)

end program

This uses the single-precision matrix 1-norm routine SLANGE on the matrix columns. The run-time is even faster than the approach using single-precision array functions, and it does not show the precision issue. (Note that for a single column the 1-norm is the sum of the absolute values, so this actually computes the mean of |T|; the two agree as long as the temperatures are non-negative.)

Alexander Vogt
  • 4
    Why does mixing input with calculation slow it down so much? They both have to read the entire file, that will be the bottleneck. And if the OS does readahead, the Fortran code shouldn't have to wait much for I/O. – Barmar Nov 17 '15 at 22:07
  • 3
    @Barmar You'll still have the function call overhead and logic for checking if the data is in cache every time. – Overv Nov 18 '15 at 00:42

numpy is faster because you wrote much more efficient code in Python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

Look at your Python code: you load the entire array at once and then call functions that operate on the whole array.

Look at your Fortran code: you read one value at a time and do some branching logic with it.

The majority of the discrepancy comes from the fragmented IO you have written in Fortran.

You can write the Fortran just about the same way you wrote the Python, and you'll find it runs much faster that way.

program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40
  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array
  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)
  close(unit)
end program test
casey
  • Does the mean computed in this way obtain the same precision as `numpy`'s `.mean` call? I have some doubts about that. – Bakuriu Nov 17 '15 at 08:40
  • 1
    @Bakuriu No, it doesn't. See Alexander Vogt's answer and my edits on the question. – user35915 Nov 17 '15 at 10:37