How does NumPy use ufunc buffers to speed up sum calculation?

Question

NumPy uses buffers for ufuncs:

Internally, buffers are used for misaligned data, swapped data, and data that has to be converted from one data type to another. The size of internal buffers is settable on a per-thread basis.

Apparently this buffer size parameter can have a significant effect on speed, e.g.,

In [1]: x = np.ones(int(1e7))

In [2]: timeit x.sum()
100 loops, best of 3: 4.88 ms per loop

In [3]: np.getbufsize()
Out[3]: 8192

In [4]: np.setbufsize(64)
Out[4]: 8192

In [5]: timeit x.sum()
10 loops, best of 3: 18 ms per loop

Furthermore, if I, for example, write a Fortran function:

function compute_sum(n, x) result(s)
    !f2py integer, intent(hide), depend(x) :: n = shape(x,0)
    !f2py double precision, intent(in)     :: x(n)
    !f2py double precision, intent(out)    :: s

    integer n
    double precision x(n)
    double precision s

    do i=1,n
        s = s + x(i)
    end do
end function

and compile with f2py:

f2py -c mysum.f90 -m mysum

then the Fortran code is twice as slow as the NumPy sum:

In [1]: import mysum

In [2]: x = np.ones(int(1e7))

In [3]: timeit mysum.compute_sum(x)
100 loops, best of 3: 10.2 ms per loop

I imagine there's a little overhead (although, the array is not copied in) in passing back and forth between the Fortran code, but certainly not 5 milliseconds worth!

Two questions:

What exactly does the NumPy documentation mean by "buffers are used for misaligned data, etc ..." and how do this affect how sum is calculated?
What would the analogous (i.e., running nearly as fast the NumPy sum) Fortran code look like for a given buffer size?

@user2357112 Actually, I guess he is talking about the original sum without setting the buffer size which IS actually 5ms faster than the Fortran version. — jotasi, Oct 16 '17 at 08:11
When the data is not within the bounds or cannot be accessed by some integer multiple of the strides then it is misaligned and loaded to buffers to regularize it. — percusse, Oct 16 '17 at 08:19
@user2357112 I'm referring to the initial ~5ms NumPy sum with the default buffer size. — Matt Hancock, Oct 16 '17 at 10:29
@percusse can you expand on that? Why would the data be "not within bounds" here? bounds of what? Why would the elements not be accessible by integer multiple strides in this example? What does "loaded to buffers" mean? Are these auxiliary arrays? Why does this increase speed? Apologies if these are naive questions :) — Matt Hancock, Oct 16 '17 at 10:30
@ChesterVonWinchester No not at all. I don't have all the details unfortunately but suppose your array is not laid out in the memory as a single continuous block. Then what you need to do is to gather the data and then perform vectorized operations. However if you are exactly sure that it is a aligned block you can simply say I want the 43th item and the dtype is ... then if I multiply the basic element memory unit multiplied by 43 it is going to be there. — percusse, Oct 16 '17 at 10:57
My guess would be the compiler isn't vectorizing your sum. The [pairwise summation routine](https://github.com/numpy/numpy/blob/v1.13.3/numpy/core/src/umath/loops.c.src#L1614) NumPy calls for a floating-point sum is unrolled to encourage compiler vectorization. Try unrolling your loop. (Come to think of it, I think the block handling may be unintentionally limiting the numerical stability benefits of the pairwise summation.) — user2357112, Oct 16 '17 at 16:50
(Note that while there's a block-based, vectorized base case in the pairwise sum code, that's a separate layer of block handling from the general NumPy ufunc block logic. The general block handling happens in an outer loop around this code.) — user2357112, Oct 16 '17 at 16:52
Cool. I didn't realize numpy used this algorithm. `f2py` calls `gfortran` with the flag `-funroll-loops` by default, so I'm not sure this is the source of the speed decrease (if that's what you're suggesting). — Matt Hancock, Oct 16 '17 at 17:41

How does NumPy use ufunc buffers to speed up sum calculation?

0 Answers0