Fastest way to zero out low values in array?

Question

So, let's say I have 100,000 float arrays with 100 elements each. I need to keep the highest X values, but only if they are greater than Y. Any element not matching this should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.

sample variables:

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]

highCountX is the maximum number of non-zero elements that I want to exist in the array – David Oct 26 '09 at 09:47
If it was 2 the expected result would be: [0, .25, 0, 0, .5, 0, 0, 0, 0, 0] - highCountX limits the number of non-zero elements in result. – Abgan Oct 26 '09 at 09:49
How do you choose which one to keep and which to throw away if the number of values exceeds highCountX? – James Anderson Oct 26 '09 at 10:31
you keep the highest values... if there are duplicate values, it doesn't matter which one is used – David Oct 26 '09 at 10:40
@David: You should consider validating one of the responses, so as to tell readers that it really did solve your problem! – Eric O. Lebigot Mar 01 '10 at 08:42

score 79 · Accepted Answer · edited Apr 12 '19 at 12:35

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.asarray(array)
low_values_flags = array_np < lowValY  # Where values are low
array_np[low_values_flags] = 0  # All low values set to 0

Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements:

array_np = numpy.asarray(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
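
The two snippets above do not yet return the original array with only the top values kept in place. A minimal sketch of one way to combine both steps follows, assuming NumPy 1.15 or later (for take_along_axis and put_along_axis). The function name keep_top_x is hypothetical and not part of the answer. It uses numpy.argpartition rather than a full sort, and it also accepts a 2-D array holding one input array per row:

```python
import numpy as np

def keep_top_x(data, high_count_x, low_val_y):
    """Hypothetical helper: keep at most high_count_x values > low_val_y per row."""
    data = np.asarray(data, dtype=float)
    # Zero everything that is not strictly greater than the threshold.
    kept = np.where(data > low_val_y, data, 0.0)
    if high_count_x <= 0:
        return np.zeros_like(kept)
    if high_count_x >= kept.shape[-1]:
        return kept  # fewer elements than the limit: nothing more to drop
    # Indices of the high_count_x largest entries along the last axis
    # (a partial sort; the order inside the selection is unspecified).
    top_idx = np.argpartition(kept, -high_count_x, axis=-1)[..., -high_count_x:]
    result = np.zeros_like(kept)
    # Copy only the selected entries back to their original positions.
    np.put_along_axis(result, top_idx,
                      np.take_along_axis(kept, top_idx, axis=-1), axis=-1)
    return result

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
print(keep_top_x(array, 3, .1))
```

Because exactly high_count_x indices are selected, tied values cannot push the number of non-zero elements over the limit.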

edited Apr 12 '19 at 12:35

Kamil S Jaron

answered Oct 26 '09 at 09:49

Eric O. Lebigot

Nice... using proper libraries can take you really far :-) – Abgan Oct 26 '09 at 09:51
I keep running into this numPy, guess I'll have to check it out :) Thanks for the help (everyone). – David Oct 26 '09 at 11:06
@David NumPy really fills a need. I would suggest that you start with the tutorial I linked to: it's probably the fastest way of getting up to speed with NumPy and learning its most important concepts. – Eric O. Lebigot Oct 26 '09 at 13:16
What would be faster: `array_np[low_values_indices] = 0` or `array_np *= low_values_indices`? – Radio Controlled Oct 24 '16 at 11:21
assuming that you import numpy as np ... then you can also just use index = np.where(array < lowValY); array[index] = 0; – user1270710 Oct 09 '17 at 23:32
While this works, this is wasteful and should thus arguably be avoided: in the context of the question, there is no need to add `np.where()`, because it only adds another, unneeded layer of computations. In fact, NumPy knows how to select array elements based on an array of booleans (as in this answer), so there is no need to transform it into the array of true indices. – Eric O. Lebigot Oct 11 '17 at 10:15
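
The three variants discussed in these comments can be compared in a short sketch (the variable names are illustrative). Note that in-place multiplication only matches the other two when the mask marks the values to keep, so the low-value mask has to be inverted first:

```python
import numpy as np

array_np = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
low_values_flags = array_np < 0.1

# 1. Boolean-mask assignment, as in the answer.
by_mask = array_np.copy()
by_mask[low_values_flags] = 0

# 2. Same result via np.where, which first converts the mask to index arrays.
by_where = array_np.copy()
by_where[np.where(low_values_flags)] = 0

# 3. Multiplication by the inverted mask (True counts as 1, False as 0).
by_product = array_np * ~low_values_flags

print(by_mask)
```

Which variant is faster depends on the array size and the NumPy version, so timing them with timeit on representative data is the safest way to decide.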

score 20 · Answer 2 · answered Mar 10 '14 at 02:42

from scipy.stats import threshold
thresholded = threshold(array, 0.5)

:)

omygaudio

Deprecated from scipy 0.17.1 onwards, see https://docs.scipy.org/doc/scipy-0.17.1/reference/generated/scipy.stats.threshold.html#scipy.stats.threshold – weiji14 Apr 02 '18 at 01:19
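
Since scipy.stats.threshold is no longer available in current SciPy releases, the same effect can be sketched with plain NumPy. The helper name threshold_below and its parameters are hypothetical, chosen here to resemble the old call. It replaces every value below the minimum with a new value:

```python
import numpy as np

def threshold_below(values, thresh_min, new_val=0.0):
    """Hypothetical stand-in: replace values below thresh_min with new_val."""
    arr = np.asarray(values, dtype=float)
    # np.where picks new_val where the condition holds, the original value otherwise.
    return np.where(arr < thresh_min, new_val, arr)

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
thresholded = threshold_below(array, 0.1)
print(thresholded)
```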

score 7 · Answer 3 · answered Oct 26 '09 at 11:05

There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any condition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when computing the mean).

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]

As an added benefit, masked arrays are well supported in the matplotlib visualisation library if you need this.

Docs on masked arrays in numpy
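
To illustrate the point about operations ignoring masked values, here is a small sketch using the question's sample data. It uses ma.masked_less_equal rather than masked_inside, which is an alternative chosen here for the example and not what the answer used:

```python
import numpy as np
import numpy.ma as ma

values = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
masked = ma.masked_less_equal(values, 0.1)  # mask everything <= 0.1

print(masked.count())    # number of unmasked elements
print(masked.mean())     # mean over the unmasked elements only
print(masked.filled(0))  # plain ndarray with masked slots set to 0
```

Here the mean is taken over the three surviving values only, whereas the mean of the zero-filled array would be diluted by the seven zeroes.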

jfs · Answer 4 · 2009-10-26T10:05:35.103

Using numpy:

import numpy

a = numpy.asarray(array, dtype=float)
# assign zero to all elements less than or equal to `lowValY`
a[a <= lowValY] = 0
# find n-th largest element in the array (where n=highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# zero everything below the n-th largest element
a[a < x] = 0  # NOTE: it might leave more than highCountX non-zero elements
              # if there are duplicates

Where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n]

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0

Greg Hewgill · Answer 5 · 2009-10-26T10:22:25.183

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. It is then sorted so that the largest values come first:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

There is a boundary condition where two or more equal elements tie for (in your example) the 3rd highest value. The resulting array will then contain that value more than once, so it can hold more than highCountX non-zero elements.

There are other boundary conditions as well, such as if len(array) < highCountX. Handling such conditions is left to the implementor.
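
One possible way to cover the short-list case is sketched below. The function name zero_low_values is hypothetical, and the choice to fall back to the smallest candidate when fewer than highCountX values exceed the threshold is an assumption about the desired behaviour. Ties at the cutoff are still all kept, as described above:

```python
def zero_low_values(array, high_count_x, low_val_y):
    """Hypothetical wrapper around the approach above that avoids IndexError."""
    candidates = sorted((x for x in array if x > low_val_y), reverse=True)
    if not candidates or high_count_x <= 0:
        return [0 for _ in array]  # nothing qualifies
    # With fewer candidates than the limit, the smallest candidate is the cutoff.
    cutoff = candidates[min(high_count_x, len(candidates)) - 1]
    return [x if x >= cutoff else 0 for x in array]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
print(zero_low_values(array, 3, .1))
print(zero_low_values(array, 5, .1))  # only three values exceed the threshold
```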

edited Oct 26 '09 at 10:22

answered Oct 26 '09 at 09:29

You can use x for x in array if x > lowValY instead of [x for x in array if x > lowValY] to just enumerate over original array without copying it (if original data is quite large this might be a good thing to do). – Abgan Oct 26 '09 at 09:47
That's true. `sorted()` will probably need the whole list anyway, though. – Greg Hewgill Oct 26 '09 at 09:48
Heh, 3x faster than my noob code, but I would need the equal elements to maintain the highCountX limit. The arrays should have anywhere from 20-200 elements... they are actually segments of a larger array that I process in chunks. Thanks for the help so far. – David Oct 26 '09 at 09:58
I can't see how you are zeroing elements in the original array. – jfs Oct 26 '09 at 10:04
If `highCountX > len([x for x in array if x > lowValY])` then you'll get IndexError. – jfs Oct 26 '09 at 10:08
This wouldn't work (IndexError) if the number of elements larger than lowValY is smaller than highCountX – ThisIsMeMoony Oct 26 '09 at 10:09
Yes, there are other boundary conditions. Error handling is left to the implementor, I have provided an outline of a possible solution. – Greg Hewgill Oct 26 '09 at 10:21
+1. Elegantly solved. N.B.: the last list comprehension only works with Python 2.5+ because of the ternary operation. – Bite code Oct 26 '09 at 10:28

score 2 · Answer 6 · answered Oct 26 '09 at 09:51

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

(plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?

You could sort the array first, then set the threshold to the value of the Nth element:

threshold = sorted(array, reverse=True)[N-1]
array = [ x if x >= threshold else 0.0 for x in array ]

Note: this solution is optimized for readability not performance.

In this case, it doesn't matter which one is truncated... more important is that highCountX is followed. – David Oct 26 '09 at 10:04

score 1 · Answer 7 · answered Oct 26 '09 at 09:56

You can use map with a lambda; it should be fast enough.

new_array = list(map(lambda x: x if x > lowValY else 0, array))

nnrcschmdt

Egon · Answer 8 · 2009-10-26T10:46:34.317

Use a heap.

This works in time O(n*lg(HighCountX)).

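import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# Seed the heap with highCountX copies of lowValY, so heap[0] never
# drops below the threshold.
heap = [lowValY] * highCountX

for x in array:
    if x > heap[0]:
        # Drop the smallest kept value and push x in one O(lg k) step.
        heapq.heapreplace(heap, x)

# Smallest of the highCountX largest values (or lowValY if there are fewer).
cutoff = heap[0]

array = [x if x >= cutoff else 0 for x in array]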

In a heap of size k, delete-min takes O(lg(k)) and insertion takes O(lg(k)) or O(1), depending on which heap type you use.

score 0 · Answer 9 · answered Oct 27 '09 at 04:32

Using a heap is a good idea, as Egon says. But you can use the heapq.nlargest function to cut down on some effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]

Matt Anderson

I like this homemade solution that only uses standard modules. However, it should be upgraded so as to really return the largest highCountX elements (if many elements in the array have value `threshold`, the final array has too many non-zero elements). – Eric O. Lebigot Mar 01 '10 at 08:40
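
One way the limit could be enforced with the standard library alone is to select positions instead of a threshold value. This is a sketch under that assumption, and the function name keep_largest is hypothetical. It passes (value, index) pairs to heapq.nlargest and zeroes every position that was not selected:

```python
import heapq

def keep_largest(array, high_count_x, low_val_y):
    """Hypothetical helper: keep at most high_count_x values > low_val_y."""
    # Only values above the threshold compete; the index records the position.
    candidates = ((x, i) for i, x in enumerate(array) if x > low_val_y)
    keep = {i for _, i in heapq.nlargest(high_count_x, candidates)}
    # Order is preserved because the original list is walked left to right.
    return [x if i in keep else 0 for i, x in enumerate(array)]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
print(keep_largest(array, 3, .1))
print(keep_largest([.5, .5, .5, .5], 2, .1))
```

With tied values, the tuple comparison falls back to the index, so the later positions win. The question states that it does not matter which duplicate is kept.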
