
I am looking for a way to select rows from a NumPy array several times over, once for each row of a boolean index array.

Given M and indexes, I would like to get N without a for loop, since the loop is slow for large arrays.

import numpy as np
M = np.array([[1, 0, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1]])
indexes = np.array([[True, False, False, True],
                    [False, True, True, True],
                    [False, False, True, False],
                    [True, True, False, True]])
N = [M[index] for index in indexes]


Out: 
[array([[1, 0, 1, 1, 0],
        [1, 0, 0, 1, 1]]),
 array([[1, 1, 1, 1, 0],
        [0, 0, 0, 1, 1],
        [1, 0, 0, 1, 1]]),
 array([[0, 0, 0, 1, 1]]),
 array([[1, 0, 1, 1, 0],
        [1, 1, 1, 1, 0],
        [1, 0, 0, 1, 1]])]
marc_s
allucas
  • The fact that you get a list of arrays that differ in shape strongly suggests that this list comprehension is the best you can do. – hpaulj Oct 29 '20 at 22:05
  • Numpy is usually at its best when handling homogeneous data while your expected output is not. Loop seems like the best choice here. – Quang Hoang Oct 29 '20 at 22:06
  • @hpaul is list comprehension really better than `np.split` here? – mathfux Oct 29 '20 at 22:37

2 Answers


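We can take advantage of the fact that the output is homogeneous along at least one dimension: every selected row has the same number of columns, so all selections can be gathered into one array and split afterwards.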

x, y = np.where(indexes)
split_idx = np.flatnonzero(np.diff(x))+1
output = np.split(M[y], split_idx)

Sample run:

>>> x
array([0, 0, 1, 1, 1, 2, 3, 3, 3], dtype=int32)
>>> y
array([0, 3, 1, 2, 3, 2, 0, 1, 3], dtype=int32)
>>> split_idx
array([2, 5, 6], dtype=int32)
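
Putting the pieces together, a self-contained check of the snippet above against the list comprehension from the question might look like the following. The helper name `select_rows_split` and the variable `mask_with_empty` are made up for illustration. One caveat worth flagging: `np.diff(x)` only marks places where the row number changes, so a mask row with no `True` entries produces no piece at all, and the result then has fewer entries than `indexes` has rows.

```python
import numpy as np

M = np.array([[1, 0, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1]])
indexes = np.array([[True, False, False, True],
                    [False, True, True, True],
                    [False, False, True, False],
                    [True, True, False, True]])


def select_rows_split(M, indexes):
    """Hypothetical helper wrapping the np.where / np.split approach."""
    # x: which mask row each True belongs to; y: which row of M it selects.
    x, y = np.where(indexes)
    # Positions where x changes value mark the start of the next group.
    split_idx = np.flatnonzero(np.diff(x)) + 1
    return np.split(M[y], split_idx)


output = select_rows_split(M, indexes)
expected = [M[index] for index in indexes]  # the loop from the question

# Compare piece by piece, since the pieces differ in shape.
same = len(output) == len(expected) and all(
    np.array_equal(a, b) for a, b in zip(output, expected)
)
print(same)

# Caveat: a mask row with no True entries is silently skipped.
mask_with_empty = indexes.copy()
mask_with_empty[2] = False
print(len(select_rows_split(M, mask_with_empty)))  # 3 pieces, not 4
```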
mathfux
  • For this small example, the straightforward list comprehension is faster. But the alternatives may scale differently. `split` has to loop as well, taking multiple slices. Scaling may depend on the number of rows versus columns. – hpaulj Oct 29 '20 at 23:43
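
To see how the two approaches scale on a given machine, a rough timing sketch along these lines could be used. The array sizes, the random data, and the function names `with_loop` and `with_split` are arbitrary choices for illustration, not figures from this thread, and the results will vary with the ratio of rows to columns.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Arbitrary sizes chosen for illustration; vary them to see how each approach scales.
n_rows, n_cols, n_masks = 200, 50, 300
M = rng.integers(0, 2, size=(n_rows, n_cols))
indexes = rng.random((n_masks, n_rows)) < 0.5  # random boolean masks


def with_loop():
    # The list comprehension from the question.
    return [M[index] for index in indexes]


def with_split():
    # The np.where / np.split approach from the answer.
    x, y = np.where(indexes)
    split_idx = np.flatnonzero(np.diff(x)) + 1
    return np.split(M[y], split_idx)


repeats = 20
for fn in (with_loop, with_split):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e3:.3f} ms per call")
```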

A slightly different approach that uses broadcasting and a different way of identifying the split points:

b_shape = (indexes.shape[0],) + M.shape  # New shape for broadcasted M. Here, (4,4,5)
M_b = np.broadcast_to(M, b_shape)        # Broadcasted M with the new shape.
                                         # (it uses views instead of replicating data)
r,c = np.nonzero(indexes)
result_joined = M_b[r,c,:]                             # The stack of all the selected rows from M
split_points = np.cumsum(np.sum(indexes, axis=1))[:-1] # Identify where to split.
result_split = np.split(result_joined, split_points)   # Final result, obtained by splitting.

print(result_split)

Output:

[array([[1, 0, 1, 1, 0],
        [1, 0, 0, 1, 1]]),
 array([[1, 1, 1, 1, 0],
        [0, 0, 0, 1, 1],
        [1, 0, 0, 1, 1]]),
 array([[0, 0, 0, 1, 1]]),
 array([[1, 0, 1, 1, 0],
        [1, 1, 1, 1, 0],
        [1, 0, 0, 1, 1]])]
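
As a side note, every slice of the broadcast array is `M` itself, so `M_b[r, c, :]` selects the same rows as `M[c]` and the broadcast step can be skipped. A minimal sketch of that shortcut, keeping the cumulative-sum split points from this answer (the names `select_rows_cumsum` and `mask_with_empty` are hypothetical):

```python
import numpy as np

M = np.array([[1, 0, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1]])
indexes = np.array([[True, False, False, True],
                    [False, True, True, True],
                    [False, False, True, False],
                    [True, True, False, True]])


def select_rows_cumsum(M, indexes):
    """Hypothetical helper: one block of rows of M per row of the boolean mask."""
    # c holds, for every True in the mask (row-major order), the row of M it selects.
    _, c = np.nonzero(indexes)
    # Number of selected rows per mask row; the running total gives the split points.
    counts = indexes.sum(axis=1)
    split_points = np.cumsum(counts)[:-1]
    return np.split(M[c], split_points)


result = select_rows_cumsum(M, indexes)
expected = [M[index] for index in indexes]  # the loop from the question
print(all(np.array_equal(a, b) for a, b in zip(result, expected)))

# A mask row with no True entries yields an empty block instead of being dropped.
mask_with_empty = indexes.copy()
mask_with_empty[2] = False
pieces = select_rows_cumsum(M, mask_with_empty)
print(len(pieces), pieces[2].shape)  # 4 (0, 5)
```

Because the split points come from per-row counts rather than from changes in the row index, the result always has one entry per row of `indexes`.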
fountainhead