37

Let's say I have 100,000 float arrays with 100 elements each. I need the X highest values, BUT only if they are greater than Y. Any element not matching this should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.

sample variables:

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]
mskfisher
David
  • highCountX is the maximum number of non-zero elements that I want to exist in the array – David Oct 26 '09 at 09:47
  • If it was 2 the expected result would be: [0, 0, 0, .15, .5, 0, 0, 0, 0, 0] - highCountX limits the number of non-zero elements in result. – Abgan Oct 26 '09 at 09:49
  • How do you choose which one to keep and which to throw away if the number of values exceeds highCountX? – James Anderson Oct 26 '09 at 10:31
  • you keep the highest values... if there are duplicate values, it doesn't matter which one is used – David Oct 26 '09 at 10:40
  • @David: You should consider validating one of the responses, so as to tell readers that it really did solve your problem! – Eric O. Lebigot Mar 01 '10 at 08:42

9 Answers

79

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.asarray(array)
low_values_flags = array_np < lowValY  # Boolean mask: True where values are low
array_np[low_values_flags] = 0  # All low values set to 0

Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements:

array_np = numpy.asarray(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
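The two ideas above (thresholding, then keeping only the largest values) can be combined while preserving positions. The following is only a sketch under stated assumptions: `keep_top_values` and its parameter names are hypothetical, and it uses `numpy.argpartition` (a partial selection) in place of the full sort shown above.

```python
import numpy as np

def keep_top_values(values, high_count, low_val):
    """Return a copy where only the high_count largest values above low_val survive."""
    a = np.asarray(values, dtype=float)
    result = np.zeros_like(a)
    if high_count <= 0:
        return result
    # Positions of the values strictly above the threshold
    candidates = np.flatnonzero(a > low_val)
    if candidates.size > high_count:
        # argpartition puts the indices of the high_count largest values last,
        # without fully sorting the candidates
        top = np.argpartition(a[candidates], -high_count)[-high_count:]
        candidates = candidates[top]
    # Copy the surviving values back into their original positions
    result[candidates] = a[candidates]
    return result
```

Because the selection works on positions rather than on a cutoff value, ties cannot push the number of non-zero elements above `high_count`.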

Kamil S Jaron
Eric O. Lebigot
  • Nice... using proper libraries can take you really far :-) – Abgan Oct 26 '09 at 09:51
  • I keep running into this numPy, guess I'll have to check it out :) Thanks for the help (everyone). – David Oct 26 '09 at 11:06
  • @David NumPy really fills a need. I would suggest that you start with the tutorial I linked to: it's probably the fastest way of getting up to speed with NumPy and learning its most important concepts. – Eric O. Lebigot Oct 26 '09 at 13:16
  • What would be faster: `array_np[low_values_indices] = 0` or `array_np *= low_values_indices`? – Radio Controlled Oct 24 '16 at 11:21
  • assuming that you import numpy as np ... then you can also just use index = np.where(array < lowValY); array[index] = 0; – user1270710 Oct 09 '17 at 23:32
  • While this works, this is wasteful and should thus arguably be avoided: in the context of the question, there is no need to add `np.where()`, because it only adds another, unneeded layer of computations. In fact, NumPy knows how to select array elements based on an array of booleans (as in this answer), so there is no need to transform it into the array of true indices. – Eric O. Lebigot Oct 11 '17 at 10:15
20
from scipy.stats import threshold
thresholded = threshold(array, 0.5)

:)
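Since `scipy.stats.threshold` is reported as deprecated (see the comment below), the same lower-bound thresholding can be sketched with plain NumPy. The helper name `threshold_below` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def threshold_below(values, threshmin, newval=0.0):
    """Replace every value smaller than threshmin with newval."""
    a = np.asarray(values, dtype=float)
    # np.where picks newval where the condition holds, the original value elsewhere
    return np.where(a < threshmin, newval, a)
```

This only covers the "greater than Y" half of the question; limiting the number of surviving values still needs a separate step.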

omygaudio
  • Deprecated from scipy 0.17.1 onwards, see https://docs.scipy.org/doc/scipy-0.17.1/reference/generated/scipy.stats.threshold.html#scipy.stats.threshold – weiji14 Apr 02 '18 at 01:19
7

There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any precondition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when computing the mean).

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]

As an added benefit, masked arrays are well supported in the matplotlib visualisation library, if you need this.
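To illustrate the point about operations ignoring masked values, here is a small sketch comparing the mean of the masked array with the mean of the zero-filled array. The variable names `kept`, `mean_kept`, and `mean_all` are introduced only for this example.

```python
import numpy.ma as ma

x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
# Mask every value in the closed interval [0, 0.1]
x1 = ma.masked_inside(x, 0, 0.1)

kept = x1.count()                # number of unmasked elements
mean_kept = x1.mean()            # mean over the unmasked elements only
mean_all = x1.filled(0).mean()   # mean after replacing masked values with 0
```

Here the masked mean averages only the three surviving values, while the zero-filled mean is pulled down by the seven zeros.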

Docs on masked arrays in numpy

Alexander Lebedev
6

Using numpy:

# assign zero to all elements less than or equal to `lowValY`
# (`a` must be a NumPy array for this boolean-mask assignment)
a[a <= lowValY] = 0
# find the n-th largest element in the array (where n = highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# zero everything smaller than that element
a[a < x] = 0  # NOTE: this may leave more than highCountX non-zero elements
              # if there are duplicates

Where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n] 

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0
jfs
5

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. Sorting it then puts the largest values at the start:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

There is a boundary condition where there are two or more equal elements that (in your example) are 3rd highest elements. The resulting array will contain that element more than once.

There are other boundary conditions as well, such as if len(array) < highCountX. Handling such conditions is left to the implementor.
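One possible way to cover both boundary conditions is to rank positions instead of values, so that ties and short lists need no special handling. This is a sketch, not the method of the answer above; `keep_top_sorted` and its parameter names are hypothetical.

```python
def keep_top_sorted(array, high_count, low_val):
    """Zero everything except the high_count largest values above low_val."""
    # Indices of values above the threshold, ordered from largest value to smallest
    candidates = sorted(
        (i for i, x in enumerate(array) if x > low_val),
        key=lambda i: array[i],
        reverse=True,
    )
    # Slicing never raises, even when fewer than high_count candidates exist
    keep = set(candidates[:max(high_count, 0)])
    return [x if i in keep else 0 for i, x in enumerate(array)]
```

Slicing avoids the IndexError when fewer than `high_count` values exceed the threshold, and selecting by index keeps at most `high_count` entries even when values repeat.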

Greg Hewgill
  • You can use x for x in array if x > lowValY instead of [x for x in array if x > lowValY] to just enumerate over original array without copying it (if original data is quite large this might be a good thing to do). – Abgan Oct 26 '09 at 09:47
  • That's true. `sorted()` will probably need the whole list anyway, though. – Greg Hewgill Oct 26 '09 at 09:48
  • Heh, 3x faster than my noob code, but I would need the equal elements to maintain the highCountX limit. The arrays should have anywhere from 20-200 elements... they are actually segments of a larger array that I process in chunks. Thanks for the help so far. – David Oct 26 '09 at 09:58
  • I can't see how do you `zero`ing elements in the original array. – jfs Oct 26 '09 at 10:04
  • If `highCountX > len([x for x in array if x > lowValY])` then you'll get IndexError. – jfs Oct 26 '09 at 10:08
  • This wouldn't work (IndexError) if the number of elements larger than lowValY is smaller than highCountX – ThisIsMeMoony Oct 26 '09 at 10:09
  • Yes, there are other boundary conditions. Error handling is left to the implementor, I have provided an outline of a possible solution. – Greg Hewgill Oct 26 '09 at 10:21
  • +1. Elegantly solved. N.B.: the last list comprehension only works with Python 2.5+ because of the ternary operation. – Bite code Oct 26 '09 at 10:28
2

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

(plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?

You could sort the array first, then set the threshold to the value of the Nth element:

threshold = sorted(array, reverse=True)[N - 1]  # the Nth largest value is at index N - 1
array = [ x if x >= threshold else 0.0 for x in array ]

Note: this solution is optimized for readability not performance.

digitalarbeiter
  • in this case, it doesn't matter which one is truncated... more important is that highCountX is followed – David Oct 26 '09 at 10:04
1

You can use map and lambda, it should be fast enough.

new_array = list(map(lambda x: x if x > lowValY else 0, array))
nnrcschmdt
0

Use a heap.

This works in time O(n*lg(HighCountX)).

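import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# Seed the heap with highCountX copies of lowValY, so that values at or
# below the threshold can never enter it.
heap = [lowValY] * highCountX

for x in array:
    if x > heap[0]:
        # Swap out the smallest kept value; the heap stays at highCountX entries
        heapq.heapreplace(heap, x)

# Smallest of the highCountX largest values (or lowValY if there are fewer)
cutoff = heap[0]

array = [x if x > lowValY and x >= cutoff else 0 for x in array]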

On a heap of size k, delete-min takes O(lg(k)), and insertion takes O(lg(k)) or O(1) depending on which heap type you use.

Egon
0

Using a heap is a good idea, as Egon says. But you can use the heapq.nlargest function to cut down on some effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]
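If exactly highCountX elements must survive even when values repeat, one option is to let `heapq.nlargest` select (value, index) pairs instead of a cutoff value. This is a sketch under that assumption; the function name `keep_top_with_heap` and its parameters are hypothetical.

```python
import heapq

def keep_top_with_heap(array, high_count, low_val):
    """Zero everything except the high_count largest values above low_val."""
    # Pair each qualifying value with its position so ties stay distinguishable
    above = ((x, i) for i, x in enumerate(array) if x > low_val)
    # nlargest returns at most high_count pairs; collect the positions to keep
    keep = {i for _, i in heapq.nlargest(high_count, above)}
    return [x if i in keep else 0 for i, x in enumerate(array)]
```

Among equal values, the tuple comparison prefers the larger index, which is acceptable here because the question states that it does not matter which duplicate is kept.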
Matt Anderson
  • I like this homemade solution that only uses standard modules. However, it should be upgraded so as to really return the largest highCountX elements (if many elements in the array have value `threshold`, the final array has too many non-zero elements). – Eric O. Lebigot Mar 01 '10 at 08:40