Let's say I have an array of values, r, which range anywhere from 0 to 1. I want to remove all values that are more than some threshold away from the median. Let's assume the threshold is 0.5 and len(r) = 3000. To mask out all values outside this range, I can use a simple list comprehension, which I like:

mask = np.array([ri < np.median(r)-0.5 or ri > np.median(r)+0.5 for ri in r])

And if I use a timer on it:

import time
import numpy as np

start = time.time()
r = np.random.random(3000)
m = np.median(r)
minr,maxr = m-0.5, m+0.5
mask = [ri<minr or ri>maxr for ri in r]
end = time.time()
print('Took %.4f seconds'%(end-start))

>>> Took 0.0010 seconds

Is there a faster way to do this list comprehension and make the mask using NumPy?


Edit:

I've tried several suggestions below, including:

  • An element-wise or operator: (r<minr) | (r>maxr)

  • A NumPy logical or: r[np.logical_or(r<minr, r>maxr)]

  • An absolute difference boolean array: abs(m-r) > 0.5

And here is the average time each one took over 300 runs:

Python list comprehension: 0.6511 ms
Elementwise or: 0.0138 ms
Numpy logical or: 0.0241 ms
Absolute difference: 0.0248 ms

As you can see, the element-wise or was consistently the fastest of the NumPy approaches, by nearly a factor of two (I don't know how that scales with the number of array elements). Who knew.
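For anyone who wants to reproduce the comparison above, here is a minimal sketch using `timeit` instead of manual `time.time()` calls. The seed, the `candidates` dict, and the `repeat=3` setting are assumptions made for illustration, not part of the original measurements, and the numbers will differ by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)   # seed is an assumption, only for repeatability
r = rng.random(3000)
m = np.median(r)
thresh = 0.5
minr, maxr = m - thresh, m + thresh

# The four approaches from the question, each wrapped so timeit can call it.
candidates = {
    "list comprehension": lambda: [ri < minr or ri > maxr for ri in r],
    "element-wise or": lambda: (r < minr) | (r > maxr),
    "np.logical_or": lambda: np.logical_or(r < minr, r > maxr),
    "absolute difference": lambda: np.abs(r - m) > thresh,
}

results = {}
for name, fn in candidates.items():
    # 3 batches of 300 calls each; keep the fastest batch to reduce noise.
    best = min(timeit.repeat(fn, number=300, repeat=3)) / 300
    results[name] = best
    print(f"{name}: {best * 1e3:.4f} ms")
```

Array creation and the median are computed once, outside the timed calls, so only the mask construction is measured.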

pretzlstyle
  • `[ri<-maxr ` is the minus sign a typo here? – ayhan Jul 14 '16 at 19:31
  • @ayhan yea sorry, also the min and max were flopped – pretzlstyle Jul 14 '16 at 19:33
  • Try `(r > maxr) | (r < minr)` instead. `|` is for element-wise OR. – ayhan Jul 14 '16 at 19:38
  • Are you sure that `maxr` must be less than `minr`? Or maybe you want to use `and` instead of `or`? – Arnial Jul 14 '16 at 19:53
  • @Arnial No, one value can never be both less than the minimum and greater than the maximum – pretzlstyle Jul 14 '16 at 21:13
  • @jphollowed You should change `maxr,minr = m-0.5, m+0.5` to `minr,maxr = m-0.5, m+0.5`, otherwise you will have all elements in mask. – Arnial Jul 14 '16 at 21:16
  • @Arnial It's a bit weird, but I like to mask like this: `r[~mask]` – pretzlstyle Jul 14 '16 at 21:23
  • @ayhan Your suggestion was fastest, see my edit. Make an answer and I'll accept it! – pretzlstyle Jul 14 '16 at 21:23
  • I think that's just random deviation. You can accept either one, I think they are both nice. :) – ayhan Jul 14 '16 at 21:26
  • @ayhan What do you mean random deviation? It was very consistently faster. – pretzlstyle Jul 14 '16 at 21:27
  • I believe `|` and `np.logical_or` are using the same methods so the difference should stem from something else http://i.imgur.com/7Y4frKM.png `abs` could take longer though, you are right. – ayhan Jul 14 '16 at 21:35
  • @ayhan No, they are not the same. [`|` is bitwise_or](http://docs.scipy.org/doc/numpy/reference/generated/numpy.bitwise_or.html). They work the same way only on boolean arrays. I get the same results on performance tests as you, but I believe it's hardware dependent. – Arnial Jul 14 '16 at 22:35

2 Answers

One liner...

new_mask = abs(np.median(r) - r) > 0.5
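
As a quick check that the one-liner marks the same elements as the two-sided comparison from the question, here is a small sketch. The sample values and the 0.25 threshold are assumptions picked so that every number is exactly representable in floating point; they are not from the original post.

```python
import numpy as np

r = np.array([0.0, 0.375, 0.5, 0.625, 1.0])   # illustrative values
thresh = 0.25                                  # illustrative threshold
m = np.median(r)                               # 0.5 for this sample

two_sided = (r < m - thresh) | (r > m + thresh)
one_liner = np.abs(r - m) > thresh

print(np.array_equal(two_sided, one_liner))   # True
print(r[~one_liner].tolist())                 # [0.375, 0.5, 0.625]
```

Since the mask is True for the values to discard, `r[~one_liner]` keeps the values within the threshold of the median.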
Bruce Pucci
  • Very useful when the tails of distributions are needed. I like clarity. –  Jul 15 '16 at 02:03

You can use NumPy boolean (conditional) indexing to create a new array without those values.

start = time.time()
m = np.median(r)
minr,maxr = m-0.5, m+0.5
# the mask marks values outside [minr, maxr]; ~ inverts it to keep the rest
filtered_array = r[ ~((r < minr) | (r > maxr)) ]
end = time.time()
print('Took %.4f seconds'%(end-start))

filtered_array is a new array holding the elements of r without the masked values (the values that the mask would remove later are already gone from filtered_array).

Update: used shorter syntax suggested by @ayhan.
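
Here is a minimal sketch of how the mask splits the array into kept and removed parts. The seed and the 0.25 threshold are assumptions for illustration: with values in [0, 1) and a median near 0.5, a threshold of 0.5 removes almost nothing, so a smaller one makes the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is an assumption, for repeatability
r = rng.random(3000)
m = np.median(r)
thresh = 0.25                    # illustrative; the question uses 0.5
minr, maxr = m - thresh, m + thresh

mask = (r < minr) | (r > maxr)   # True where a value is too far from the median
removed = r[mask]                # the outliers
kept = r[~mask]                  # everything within the threshold

print(kept.size, removed.size, kept.size + removed.size == r.size)

# Boolean indexing returns a copy, so editing `kept` leaves `r` untouched.
print(np.shares_memory(kept, r))   # False
```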

Arnial