
I have a large 2D numpy array. I would like to be able to efficiently run row-wise operations on subsets of the columns, without copying the data.

In what follows, a = np.arange(10000000).reshape(1000, 10000) and columns = np.arange(1, 10000, 2). For reference,

In [4]: %timeit a.sum(axis=1)
7.26 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The approaches I am aware of are:

  1. fancy indexing with list of columns
In [5]: %timeit a[:, columns].sum(axis=1)
42.5 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  2. fancy indexing with mask of columns
In [6]: cols_mask = np.zeros(10000, dtype=bool)
   ...: cols_mask[columns] = True

In [7]: %timeit a[:, cols_mask].sum(axis=1)
42.1 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  3. masked array
In [8]: cells_mask = np.ones((1000, 10000), dtype=bool)

In [9]: cells_mask[:, columns] = False

In [10]: am = np.ma.masked_array(a, mask=cells_mask)

In [11]: %timeit am.sum(axis=1)
80 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  4. python loop
In [12]: %timeit sum([a[:, i] for i in columns])
31.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Somewhat surprisingly to me, the last approach is the most efficient; moreover, it avoids copying the full data, which for me is a prerequisite. However, it is still much slower than the simple sum (which runs on double the data size), and most importantly, it is not trivial to generalize to other operations (e.g., cumsum).
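To make the cumsum point concrete, here is a minimal sketch (using the setup above) of where the loop trick stops working:

import numpy as np

a = np.arange(10000000).reshape(1000, 10000)
columns = np.arange(1, 10000, 2)

# sum works because the reduction can be built up column view by column view:
row_sums = sum([a[:, i] for i in columns])

# but an accumulation needs the selected columns materialized side by side,
# which, as far as I can tell, forces a copy:
partial_sums = np.cumsum(a[:, columns], axis=1)  # a[:, columns] copies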

Is there any approach I am missing? I would be fine with writing some Cython code, but I would like the approach to work for any numpy function, not just sum.

Pietro Battiston

4 Answers


On this one, Pythran seems a bit faster than Numba, at least on my rig:

import numpy as np

# export signatures for the Pythran compiler (float and int 2D inputs)
#pythran export col_sum(float[:,:], int[:])
#pythran export col_sum(int[:,:], int[:])

def col_sum(data, idx):
    # rows of the transpose are columns of data: select them, then reduce
    return data.T[idx].sum(0)

Compile with pythran <filename.py>
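The compiled module can then be imported like any other; a usage sketch, assuming the file above was saved as cs_pythran.py (the module name used in the timings below):

import numpy as np
import cs_pythran  # the extension produced by `pythran cs_pythran.py`

a = np.arange(10000000).reshape(1000, 10000)
columns = np.arange(1, 10000, 2)

# same result as plain numpy fancy indexing
assert np.array_equal(cs_pythran.col_sum(a, columns),
                      a[:, columns].sum(axis=1))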

Timings:

from timeit import timeit

timeit(lambda: cs_pythran.col_sum(a, columns), number=1000)
# 1.644187423051335
timeit(lambda: cs_numba.col_sum(a, columns), number=1000)
# 2.635075871949084
Paul Panzer

If you want to beat C-compiled block summation, you're probably best off with Numba. Any indexing that stays in Python is going to have Python overhead (Numba creates C-compiled functions with jit).

from numba import jit

@jit
def col_sum(block, idx):
    # both the column gather and the reduction are compiled by numba
    return block[:, idx].sum(1)
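One caveat when reproducing the timings: the first call to a jit function includes the compilation time, so call it once beforehand:

col_sum(a, columns)  # first call triggers JIT compilation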

%timeit a.sum(axis=1)
100 loops, best of 3: 5.25 ms per loop

%timeit a[:, columns].sum(axis=1)
100 loops, best of 3: 7.24 ms per loop

%timeit col_sum(a, columns)
100 loops, best of 3: 2.46 ms per loop
Daniel F
  • Thanks for the reply... unfortunately, on my setup this is still slower than `sum([a[:, i] for i in columns])` (33 ms vs. 30 ms), which is strange, since on your setup it is way faster than the simple sum, which on my setup is way faster than anything else! Maybe you used a different `idx`? – Pietro Battiston Jul 28 '19 at 07:06
  • I used the exact same `columns` from your problem specification – Daniel F Jul 29 '19 at 07:15

You can use Numba. For best performance, it is usually necessary to write simple loops as you would in C. (Numba is basically a Python-to-LLVM-IR code translator, quite like Clang for C.)

Code

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def row_sum(arr, columns):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    # parallelize over the rows
    for i in nb.prange(arr.shape[0]):
        sum_ = 0.
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res
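The same loop structure can be adapted to operations that fancy indexing would otherwise have to copy for. For example, a row-wise cumulative sum over the selected columns might look like this (a sketch in the same style, not benchmarked; the output allocation is unavoidable, since cumsum produces new data anyway):

@nb.njit(fastmath=True, parallel=True)
def row_cumsum(arr, columns):
    # one running total per row, over the selected columns only
    res = np.empty((arr.shape[0], columns.shape[0]), dtype=arr.dtype)
    for i in nb.prange(arr.shape[0]):
        acc = 0.
        for j in range(columns.shape[0]):
            acc += arr[i, columns[j]]
            res[i, j] = acc
    return res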

Timings

a = np.arange(1_000_000).reshape(1_000, 1_000)
columns = np.arange(1, 1000, 2)

%timeit res_1=a[:, columns].sum(axis=1)
1.29 ms ± 8.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit res_2=row_sum(a,columns)
59.3 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

np.allclose(res_1,res_2)
True
max9111
  • This is impressive: it takes way less than the other solutions, and even less than the simple (full) sum. The need to write down the "simple loops", however, makes it difficult to generalize (say, to `a[:, columns].cumsum(axis=1)`, or any other ufunc known to numpy): is there any obvious solution? By the way, adding different columns is not conceptually different from adding vectors at random positions in memory, but if I rewrite `row_sum` to directly accept a `nb.typed.List` of (copies of the same columns) arrays, results are 40x worse. Any better way to do it? – Pietro Battiston Jul 28 '19 at 07:46
  • @Pietro Battiston Many ufuncs are supported, but not all of them. The main problem is that all numpy ufuncs (except BLAS calls) are replaced by several loops. The compiler tries to optimize that away; sometimes this works and sometimes it doesn't. There are also quite a few changes planned regarding lists: http://numba.pydata.org/numba-doc/latest/reference/deprecation.html?highlight=deprecation#deprecation-of-reflection-for-list-and-set-types. Normally I avoid using lists whenever possible, at least in code where performance matters. – max9111 Jul 29 '19 at 09:21
  • Yes, I had seen that deprecation notice... and indeed in my version of numba (0.45.0), `typed.List` is present, and I used it; but I wasn't able to declare the type of its contents, and performance is still pretty low. – Pietro Battiston Jul 29 '19 at 15:12

With Transonic (https://transonic.readthedocs.io), it's easy to write code that can be accelerated by different Python accelerators (in practice Cython, Pythran and Numba).

For example, with the boost decorator, one can write

import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"


@boost
def row_sum_loops(arr: T0, columns: T1):
    # locals type annotations are used only by Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res


@boost
def row_sum_transpose(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

On my computer, I obtain:

TRANSONIC_BACKEND="python" python row_sum_boost.py
Checks passed: results are consistent
Python
row_sum_loops        108.57 s
row_sum_transpose    1.38 s

TRANSONIC_BACKEND="cython" python row_sum_boost.py
Checks passed: results are consistent
Cython
row_sum_loops        0.45 s
row_sum_transpose    1.32 s

TRANSONIC_BACKEND="numba" python row_sum_boost.py
Checks passed: results are consistent
Numba
row_sum_loops        0.27 s
row_sum_transpose    1.16 s

TRANSONIC_BACKEND="pythran" python row_sum_boost.py
Checks passed: results are consistent
Pythran
row_sum_loops        0.27 s
row_sum_transpose    0.76 s

See https://transonic.readthedocs.io/en/stable/examples/row_sum/txt.html for the full code and a more complete comparison on the example of this question.

Note that Pythran is also very efficient with the transonic.jit decorator.
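A minimal sketch of that variant (following the boost example above; with jit, the types are inferred from the first call instead of being declared):

import numpy as np

from transonic import jit


@jit
def row_sum_transpose(arr, columns):
    return arr.T[columns].sum(0)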

paugier