3

I could perform filtering of numpy arrays via

a[np.where(a[:,0]==some_expression)]

or

a[a[:,0]==some_expression]

What are the (dis)advantages of each of these versions - especially with regard to performance?

jpp
user7468395
  • In my view, the main advantage of np.where is that it can also give you results where the expression is false. This is the main way I use it. You can also check how it affects performance, especially run time. – Lior T Jan 22 '19 at 13:59
  • @LiorT - not sure I follow your comment; do you mean the case where `a[:,0]==some_expression` is all False? `a[a[:,0]==some_expression]` works in that case too – Mr_and_Mrs_D Jan 22 '19 at 14:50
  • @Mr_and_Mrs_D np.where is a bit like Excel's IF function: you can ask it to return one value where a condition is true and another where it is not. For example, np.where(x==10,10,0) returns an array of 10s and 0s depending on whether each element of x equals 10. This can be useful sometimes. – Lior T Jan 22 '19 at 15:03
  • Oh I see - but in this question this should not be relevant – Mr_and_Mrs_D Jan 22 '19 at 15:06
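
For reference, the two forms of np.where discussed in these comments can be sketched as follows. The array x and its values are made up for illustration. The one-argument form is the one the question is about.

```python
import numpy as np

x = np.array([10, 3, 10, 7])  # illustrative data, not from the question

# Three-argument form: pick 10 where the condition holds, 0 elsewhere.
flags = np.where(x == 10, 10, 0)
print(flags.tolist())  # [10, 0, 10, 0]

# One-argument form: a tuple of index arrays, one per dimension.
idx = np.where(x == 10)
print(idx[0].tolist())  # [0, 2]

# Indexing with that tuple filters the array, as in the question.
print(x[idx].tolist())  # [10, 10]
```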

2 Answers

3

Boolean indexing is transformed into integer indexing internally. This is indicated in the docs:

In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above.
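
The equivalence described in that quote can be checked directly. This is a minimal sketch on a small made-up array (the seed, shape and the value 5 are arbitrary choices), not a statement about how NumPy implements the conversion internally.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed for reproducibility
a = rng.integers(0, 10, size=(20, 3))    # small illustrative array
mask = a[:, 0] == 5                      # Boolean mask over the rows

by_mask = a[mask]                # Boolean indexing
by_nonzero = a[mask.nonzero()]   # integer indexing via nonzero()
by_where = a[np.where(mask)]     # np.where(mask) returns the same tuple as mask.nonzero()

# Both comparisons print True: all three select the same rows.
print(np.array_equal(by_mask, by_nonzero))
print(np.array_equal(by_mask, by_where))
```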

So the complexity of the two approaches is the same. In the timing below, however, np.where is somewhat faster on a large array:

import numpy as np

np.random.seed(0)
a = np.random.randint(0, 10, (10**7, 1))
# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit a[np.where(a[:, 0] == 5)]  # 50.1 ms per loop
%timeit a[a[:, 0] == 5]            # 62.6 ms per loop

np.where has another benefit: the integer index arrays it returns work well with advanced indexing across multiple dimensions. For an example where Boolean indexing is unintuitive in this respect, see "NumPy indexing: broadcasting with Boolean arrays". Since np.where was also faster than Boolean indexing in the timing above, this is one more reason to prefer it.
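
To illustrate the multi-dimensional point, here is a small sketch with made-up masks. It assumes the goal is to select the block of rows and columns marked True, which may or may not be the exact case covered in the linked question.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
row_mask = np.array([True, False, True])          # rows 0 and 2
col_mask = np.array([True, True, False, False])   # columns 0 and 1

# Two Boolean arrays are paired element-wise, like a[[0, 2], [0, 1]],
# so this returns a[0, 0] and a[2, 1] rather than a 2x2 block.
paired = a[row_mask, col_mask]
print(paired.tolist())  # [0, 9]

# With integer indices from np.where, broadcasting can be made explicit.
rows = np.where(row_mask)[0]
cols = np.where(col_mask)[0]
block = a[rows[:, None], cols]
print(block.tolist())  # [[0, 1], [8, 9]]
```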

jpp
  • But np.where() documentation says: "If only condition is given, return condition.nonzero()". Why is that not the same as the default indexing by Boolean as both use bool_array.nonzero()? – Dusch Jan 22 '19 at 15:17
  • @Dusch, Not sure, I haven't looked at the source code. Likely an implementation detail. My basic opinion is `np.where` is both faster and more versatile, so if efficiency is important it's worth the verbosity. – jpp Jan 22 '19 at 15:23
1

To my surprise, the first one seems to perform slightly better:

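import timeit

import numpy as np

# np.random.random_integers is deprecated; randint(1, 101) covers the same range [1, 100]
a = np.random.randint(1, 101, size=(1000, 1))

repeat = 3
numbers = 1000
setup = """from __main__ import np, a"""

def time(statement, _setup=None):
  # print the best of `repeat` runs, each executing the statement `numbers` times
  print(min(
    timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')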

prints (for instance):

0.017856399000000023
0.019185326999999974

Increasing the size of the array makes the numbers differ even more.
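
One way to check this on your own machine is to repeat the measurement for several array sizes. The sketch below is only a suggestion: the sizes, the seed and the repeat counts are arbitrary assumptions, and the timings will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
results = {}

for n in (10**3, 10**4, 10**5):  # illustrative sizes; increase as needed
    a = rng.integers(0, 100, size=(n, 1))
    env = {"np": np, "a": a}  # names visible to the timed statements
    t_where = min(timeit.repeat("a[np.where(a[:, 0] == 99)]",
                                globals=env, repeat=3, number=20))
    t_mask = min(timeit.repeat("a[a[:, 0] == 99]",
                               globals=env, repeat=3, number=20))
    results[n] = (t_where, t_mask)
    print(f"n={n:>7}: where={t_where:.6f}s  mask={t_mask:.6f}s")
```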

Mr_and_Mrs_D