
In NumPy, I have a d x n array A and a list L of length n, describing where I want each column of A to end up in matrix B. The idea is that column i of matrix B is the sum of all columns of A for which the corresponding value in L is i.

I can do this with a for loop:

import numpy as np

A = np.arange(15).reshape(3, 5)
L = [0, 1, 2, 1, 1]
n_cols = 3
B = np.zeros((len(A), n_cols))
# assume I know the desired number of columns,
# which is also one more than the maximum value of L
for i, a in enumerate(A.T):
    B[:, L[i]] += a
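
For this small example, the loop produces (columns 1, 3 and 4 of A are summed into column 1 of B):

B
# array([[ 0.,  8.,  2.],
#        [ 5., 23.,  7.],
#        [10., 38., 12.]])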

I was wondering if there is a way to do this by slicing array A (or in some other vectorized manner)?

Kevin Yang
  • Do you think you could post a complete example? (i.e. a [mcve]) It would help if we could see the effect of this code on small matrices and an example of a list `L` too. – Praveen Sep 11 '16 at 06:18
  • @Praveen I filled out the example a bit more. – Kevin Yang Sep 12 '16 at 00:11

2 Answers


You are sum-reducing/collapsing columns of A, with L selecting which columns are grouped together; each unique value in L corresponds to one column of the output array.

Thus, you can use np.add.reduceat for a vectorized solution, like so -

L = np.asarray(L)   # reduceat needs an array here, not a list
sidx = L.argsort()  # order that puts columns with the same label next to each other
col_idx, grp_start_idx = np.unique(L[sidx], return_index=True)
B_out = np.zeros((len(A), n_cols))
B_out[:, col_idx] = np.add.reduceat(A[:, sidx], grp_start_idx, axis=1)
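
For context, np.add.reduceat sums consecutive slices of its input between the given start indices, which is why the columns are sorted first so that equal labels sit next to each other. A minimal illustration (not part of the solution):

x = np.array([1, 2, 3, 4, 5, 6])
np.add.reduceat(x, [0, 2, 5])   # sums x[0:2], x[2:5], x[5:] -> array([ 3, 12,  6])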

Runtime test -

In [129]: def org_app(A,n_cols):
     ...:     B = np.zeros((len(A), n_cols)) 
     ...:     for i, a in enumerate(A.T):
     ...:         B[:, L[i]] += a
     ...:     return B
     ...: 
     ...: def vectorized_app(A,n_cols):
     ...:     sidx = L.argsort()
     ...:     col_idx, grp_start_idx = np.unique(L[sidx],return_index=True)
     ...:     B_out = np.zeros((len(A), n_cols))
     ...:     B_out[:,col_idx] = np.add.reduceat(A[:,sidx],grp_start_idx,axis=1)
     ...:     return B_out
     ...: 

In [130]: # Set up inputs with many columns and relatively few rows,
     ...: # so that the memory bandwidth needed to handle a huge number
     ...: # of row elements doesn't become the bottleneck
     ...: d,n_cols = 10,5000
     ...: A = np.random.rand(d,n_cols)
     ...: L = np.random.randint(0,n_cols,(n_cols,))
     ...: 

In [131]: np.allclose(org_app(A,n_cols),vectorized_app(A,n_cols))
Out[131]: True

In [132]: %timeit org_app(A,n_cols)
10 loops, best of 3: 33.3 ms per loop

In [133]: %timeit vectorized_app(A,n_cols)
100 loops, best of 3: 1.87 ms per loop

As the number of rows becomes comparable to the number of columns in A, the high memory bandwidth requirement of the vectorized approach would offset any noticeable speedup from it.

Divakar

Is this iteration on `B` the same (not tested)?

L = np.asarray(L)   # elementwise comparison needs an array, not a list
for I in range(B.shape[1]):
    B[:, I] = A[:, L == I].sum(axis=1)   # sum all columns of A labelled I

The number of loops will be fewer. But more importantly, it may give other vectorization insights.

(edit) On testing, this works, but it is much slower.

==========

scipy.sparse computes column sums via a matrix product with a matrix of ones. Can we do the same here? Make an array C with 1s in the right places:

def my2(A, L):
    L = np.asarray(L)
    n_cols = L.max() + 1                     # one more than the largest label, as in the question
    C = np.zeros((L.shape[0], n_cols), int)  # one row per column of A
    C[np.arange(L.shape[0]), L] = 1          # C[j, L[j]] = 1: indicator of each column's target
    return A.dot(C)                          # column i of the product sums the columns of A with L == i
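
On the question's small inputs this reproduces B from the loop (as integers here, since A and C are both integer arrays); the indicator matrix C and the result look like this:

A = np.arange(15).reshape(3, 5)
L = np.array([0, 1, 2, 1, 1])
# C is
# array([[1, 0, 0],
#        [0, 1, 0],
#        [0, 0, 1],
#        [0, 1, 0],
#        [0, 1, 0]])
my2(A, L)
# array([[ 0,  8,  2],
#        [ 5, 23,  7],
#        [10, 38, 12]])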

This is about 7x faster than your loop, and slightly faster than @Divakar's reduceat code.

==========

In [126]: timeit vectorized_app(A,L)
The slowest run took 8.16 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 99.7 µs per loop
In [127]: timeit val2 = my2(A,L)
The slowest run took 10.46 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 81.6 µs per loop
In [128]: timeit org1(A,n_cols)
1000 loops, best of 3: 623 µs per loop
In [129]: d,n_cols
Out[129]: (10, 100)
hpaulj