Average of numpy array ignoring specified value

Question

I have a number of 1-dimensional numpy ndarrays containing the path length between a given node and all other nodes in a network for which I would like to calculate the average. The matter is complicated though by the fact that if no path exists between two nodes the algorithm returns a value of 2147483647 for that given connection. If I leave this value untreated it would obviously grossly inflate my average as a typical path length would be somewhere between 1 and 3 in my network.

One option of dealing with this would be to loop through all elements of all arrays and replace 2147483647 with NaN and then use numpy.nanmean to find the average though that is probably not the most efficient method of going about it. Is there a way of calculating the average with numpy just ignoring all values of 2147483647?

I should add that, I could have up to several million arrays with several million values to average over so any performance gain in how the average is found will make a real difference.

JohnW · Accepted Answer · 2016-08-24T17:23:33.363

6

Why not using your usual numpy filtering for this?

m = my_array[my_array != 2147483647].mean()

By the way, if you really want speed, your whole algorithm description seems certainly naive and could be improved by a lot.

Oh and I guess that you are calculating the mean because you have rigorously checked that the underlying distribution is normal so that it means something, aren't you?

edited Aug 24 '16 at 17:23

answered Aug 24 '16 at 16:09

JohnW

505
3
14

Is this the most efficient method? I could have up to several million arrays with several million values to average over so any performance gain will make a real difference. (I just made an edit specifying that.) – P-M Aug 24 '16 at 16:12
And how do you thing `nanmean` works exactly? It is there: https://github.com/numpy/numpy/blob/master/numpy/lib/nanfunctions.py#L803 basically the same thing except that their version can be more memory efficient if there are a lot of NANs. Remember that NANs are only defined for floats which would induce an extra cast of the whole array. – JohnW Aug 24 '16 at 16:20
1

Regarding speed, test it first and **if** this becomes a problem, come back. – JohnW Aug 24 '16 at 16:21
The `where` approach is slower if used with `nanmean`. – hpaulj Aug 24 '16 at 16:44
As @hpaulj pointed out in a comment in another answer, `np.where(a == 2147483647, np.nan, a).mean()` is not correct. It should be `np.nanmean(np.where(a == 2147483647, np.nan, a))`. – Warren Weckesser Aug 24 '16 at 16:55
The average path length is one measure amongst many. Simply using the value of the average path length may not be that insightful, agreed, but looking at how it varies with other variables can be useful. Do you have any suggestions on how I could improve the alogirthm? – P-M Aug 24 '16 at 16:57
Nope, that's what is called a bias, i.e. an accuracy problem. Everyday, your computer takes 15 seconds to boot. One day, you forgot the charger and you had to spend 30 minutes fetching it back. Some days after that, you discover that each boot time has been recorded by your computer over a month. You get an average boot time of 72 seconds. Is 72 seconds representative of your typical boot time? Nope. This is all caused by one outlier. This time you know it because you know the process. In your distance case, this is not that obvious. And in the general case, it isn't. – JohnW Aug 24 '16 at 17:15
It's not because everyone uses the mean that it is the correct answer. By doing this, people do a tremendous number of approximation like normality and independence. This is generally wrong and you can test that. For instance with lognormally distributed data, you most likely want to calculate the geometric mean which is also the median of the distribution. – JohnW Aug 24 '16 at 17:17
Regarding the algorithm, sorry, that's really not something that can be conveyed in comments on a QA website. But be assured that a huge amount of theory and (Python) **tools** are available to answer any questions you have about trees, graphs, etc. But you have to dig for that yourself and ask specific questions if needed. – JohnW Aug 24 '16 at 17:22
The purpose of my comment on algorithms as not to tell you that you do it wrong, it was to tell you that there is most certainly a lot of room for improvement there, speed-wise. (and of course, by using better algorithms and getting up to speed with the theory, you could also do more useful things and answer more interesting questions). – JohnW Aug 24 '16 at 17:25

Alexander · Answer 2 · 2016-08-24T17:10:09.543

1

np.nanmean(np.where(my_array == 2147483647, np.nan, my_array))

Timings

a = np.random.randn(100000)
a[::10] = 2147483647

%timeit np.nanmean(np.where(a == 2147483647, np.nan, a))
1000 loops, best of 3: 639 µs per loop

%timeit a[a != 2147483647].mean()
1000 loops, best of 3: 259 µs per loop

import pandas as pd

%timeit pd.Series(a).ne(2147483647).mean()
1000 loops, best of 3: 493 µs per loop

edited Aug 24 '16 at 17:10

answered Aug 24 '16 at 16:13

Alexander

105,104
32
201
196

Is this more efficient than @JohnW 's method? – P-M Aug 24 '16 at 16:17
1

Did you look at the results, or just the timings? The correct where use would be `np.nanmean(a == ..., np.nan, a)`. If you don't use `nanmean` the result could be `nan`. – hpaulj Aug 24 '16 at 16:42

Divakar · Answer 3 · 2016-08-24T18:21:36.020

One way would be to get the sum for all elements in one go and then removing the contribution from the invalid ones. Finally, we need to get the average value itself, divide by the number of valid elements. So, we would have an implementation like so -

def mean_ignore_num(arr,num):
    # Get count of invalid ones
    invc = np.count_nonzero(arr==num)

    # Get the average value for all numbers and remove contribution from num
    return (arr.sum() - invc*num)/float(arr.size-invc)

Verify results -

In [191]: arr = np.full(10,2147483647).astype(np.int32)
     ...: arr[1] = 5
     ...: arr[4] = 4
     ...: 

In [192]: arr.max()
Out[192]: 2147483647

In [193]: arr.sum() # Extends beyond int32 max limit, so no overflow
Out[193]: 17179869185

In [194]: arr[arr != 2147483647].mean()
Out[194]: 4.5

In [195]: mean_ignore_num(arr,2147483647)
Out[195]: 4.5

Runtime test -

In [38]: arr = np.random.randint(0,9,(10000))

In [39]: arr[arr != 7].mean()
Out[39]: 3.6704609489462414

In [40]: mean_ignore_num(arr,7)
Out[40]: 3.6704609489462414

In [41]: %timeit arr[arr != 7].mean()
10000 loops, best of 3: 102 µs per loop

In [42]: %timeit mean_ignore_num(arr,7)
10000 loops, best of 3: 36.6 µs per loop

`2147483647 = 2**31 - 1`. Will your method have an overflow problem if the data type of the array is `np.int32`? — Warren Weckesser, Aug 24 '16 at 17:08
@WarrenWeckesser Seems like it should take care of that, Edited post. — Divakar, Aug 24 '16 at 18:21
Looks like it will: `np.sum()` upcasts integer arrays to "the default platform integer". If that is `np.int64` then there shouldn't be a problem. — Warren Weckesser, Aug 24 '16 at 18:53

Average of numpy array ignoring specified value

3 Answers3

Linked

Related