Improving performance of operations on a NumPy array

Question

I'm using numpy.delete to remove elements from an array that is inside a while loop. This while loop is valid only if the array is not empty. This code works fine but slows down considerably when the array has over 1e6 elements. Here is an example:

while(array.shape[0] > 0):
     ix = where((array >= x) & (array <= y))[0]
     array = delete(array,ix,None)

I've tried to make this code efficient but I cannot find a good way to speed up the while loop. The bottleneck here is, I think, the delete which must involve a copy of some kind. I've tried using masked array in order to avoid copying but I'm not that good at python and masked array are not that easy to search. Is there a good and fast way to use delete or replace it so that 7e6 elements can be handled by the loop above without taking 24 hours?

Thanks

Why are you deleting elements one at a time instead of doing it all at once? — mgilson, May 14 '12 at 19:42
I think that we may need more of this while loop to figure out what it is that you're trying to accomplish here. As pointed out by @JoshAdel, you have a (likely) infinite loop if x and y are static -- You'll (likely) get to a point where ix is an empty array and then you'll just loop forever (even on small lists). If you don't get to that point, you're just emptying the array which you could do with a single command... — mgilson, May 14 '12 at 20:02
Yes, x and y are changing in the while loop as I search the array for these conditions. I need to "remove" these elements before the next search criteria so that I don't search the same area twice. Array is a time series of values and I need to find the frequency of occurrence of an event by taking the most intense cases. Therefore when I find an event, determined by x and y, I remove this value and all values in a small window before moving on to the next. This is why I thought I needed a while loop, but I'm learning here. Yes, (array >= x) & (array <= y) is the correct form. — Shejo284, May 15 '12 at 05:04

doug · Accepted Answer · 2013-07-05T21:18:53.380

9

So you can substantially improve the performance of your code by:

eliminating the loop; and
avoiding the delete operations (which cause a copy of the original array)

NumPy 1.7 introduced a new mask that is far easier to use than the original; it's performance is also much better because it's part of the NumPy core array object. I think this might be useful to you because by using it you can avoid the expensive delete operation.

In other words, instead of deleting the array elements you don't want, just mask them. This has been suggested in other Answers, but i am suggesting to use the new mask

to use NA, just import NA

>>> from numpy import NA as NA

then for a given array, set the maskna flag to True

>>> A.flags.maskna = True

Alternatively, most array constructors (as of 1.7) have the parameter maskna, which you can set to True

>>> A[3,3] = NA

array([[7, 5, 4, 8, 4],
       [2, 4, 3, 7, 3],
       [3, 1, 3, 2, 1],
       [8, 2, 0, NA, 7],
       [0, 7, 2, 5, 5],
       [5, 4, 2, 7, 4],
       [1, 2, 9, 2, 3],
       [7, 5, 1, 2, 9]])

>>> A.sum(axis=0)
array([33, 30, 24, NA, 36])

Often this is not what you want--i.e., you still want the sum of that column with the NA treated as if it were 0:

To get that behavior, pass in True for the skipma parameter (most NumPy array constructors have this parameter in NumPy 1.7):

>>> A.sum(axis=0, skipna=True)
array([33, 30, 24, 33, 36])

In sum, to speed up your code, eliminate the loop and use the new mask:

>>> A[(A<=3)&(A<=6)] = NA

>>> A
array([[8, 8, 4, NA, NA],
       [7, 9, NA, NA, 8],
       [NA, 6, 9, 5, NA],
       [9, 4, 6, 6, 5],
       [NA, 6, 8, NA, NA],
       [8, 5, 7, 7, NA],
       [NA, 4, 5, 9, 9],
       [NA, 8, NA, 5, 9]])

The NA placeholders--in this context--behave like 0s, which i believe is what you want:

>>> A.sum(axis=0, skipna=True)
array([32, 50, 39, 32, 31])

edited Jul 05 '13 at 21:18

answered May 14 '12 at 20:25

doug

69,080
24
165
199

Very interesting. I didn't know about this feature (+1) ... although I still contend that we need to know more about the while loop to figure out what the best solution is. – mgilson May 14 '12 at 20:42
@mgilson i agree completely; i made a few assumptions about what that code is supposed to do, so that i could answer it. those could easily be way off, but at least the principles (don't loop, don't delete) might still be useful/valid. – doug May 14 '12 at 20:50
Great answer. I thought of something like this using masked array but it was difficult to get it to work as described here. Y and x are changing in the while loop and I need to "remove" these elements before the next search criteria. Array is a time series of values and I need to find the frequency of occurrence of an event by taking the most intense cases. When I find an event, I remove this value and all values in a small window before moving on to the next. This is why I thought I needed a while loop, but I'm learning here. I'll give this a try today and see if it works. – Shejo284 May 15 '12 at 05:13
I'm afraid my numpy version is 1.6.1. I'm not able to update numpy myself as I use it on a professional linux cluster. Can python 2.7.2 be used with numpy 1.7? I'll try and find a work-around, but in the meantime I'll try the first suggestion of setting all hits in the search criteria to False. – Shejo284 May 15 '12 at 06:41
@Shejo284 that's frustrating. You can use NumPy 1.7 and Python 2.7.2 are indeed "compatible" (i used 1.7 dev and 2.7.2). Of course, you could always just use the regular Masked Array class (NumPy 1.6). I didn't mention it because of the new class in 1.7 which i think is better. – doug May 15 '12 at 07:30
4

The `np.NA` stuff apparently did not actually make it into 1.7. – Reid Jul 05 '13 at 20:48

mgilson · Answer 2 · 2012-05-14T19:59:14.753

3

Correct me if I'm wrong, but I think you can just do:

mask=np.where((array >= x) & (array <= y),True,False)
array=array[mask]

and forgo the whole loop?

Also, in my interpreter, array >= x & array <= y produces an exception. You probably meant: (array >= x) & (array <= y)

edited May 14 '12 at 19:59

answered May 14 '12 at 19:43

mgilson

300,191
65
633
696

Yes, x and y are changing in the while loop as I search the array for these conditions. I need to "remove" these elements before the next search criteria so that I don't search the same area twice. Array is a time series of values and I need to find the frequency of occurrence of an event by taking the most intense cases. Therefore when I find an event, determined by x and y, I remove this value and all values in a small window before moving on to the next. This is why I thought I needed a while loop, but I'm learning here. Yes, (array >= x) & (array <= y) is correct. – Shejo284 May 15 '12 at 05:03

score 1 · Answer 3 · answered May 14 '12 at 19:41

1

According to the documentation for numpy.delete, the function returns a copy of the input array with the specified elements removed. So the larger the array you're copying, the slower the function will be.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html

Why exactly do you need to frequently delete chunks of the array? If your array is extremely dynamic, you might be better off using a list to store pieces of the array and doing deletions only on smaller bits at a time.

answered May 14 '12 at 19:41

Brendan Wood

6,220
3
30
28

As I search the array for these conditions. I need to "remove" these elements before the next search criteria so that I don't search the same area twice. Array is a time series of values and I need to find the frequency of occurrence of an event by taking the most intense cases. Therefore when I find an event, determined by x and y, I remove this value and all values in a small window before moving on to the next. This is why I thought I needed a while loop, but I'm learning here. I tried a list before but the performance is equally poor or I just don't know how to do this efficiently. – Shejo284 May 15 '12 at 05:05
I tried and the following according to the suggestion above, since I don't have numpy 1.7 as yet: mask=np.where((array >= x) & (array <= y),False,True)# reversed array=array[mask] And this work much much faster. A simple and elegant solution, actually. Guess I was over-thinking the problem. This solution reduces a 24 hour operation to about 2 hours. Thanks everyone, I learned a lot! – Shejo284 May 15 '12 at 16:08
That's great to hear, I'm glad you found a simple solution to your problem. – Brendan Wood May 15 '12 at 16:22

Improving performance of operations on a NumPy array

3 Answers3

Linked