fast numpy addnan

Question

I would like to add thousands of 4D arrays element wise and accounting for nans. A simple example using 1D arrays would be:

X = array([4,7,89,nan,89,65, nan])
Y = array([0,5,4, 9,  8, 100,nan])
z = X+Y
print z = array([4,12,93,9,97,165,nan])

I've written a simple for loop around this but it takes forever - not a smart solution. Another solution could be creating a larger array and use bottleneck nansum but this would take too much memory for my laptop. I need a running sum over 11000 cases.

Does anyone have a smart and fast way to do this?

score 10 · Accepted Answer · answered Aug 23 '12 at 17:20

10

Here is one possibility:

>>> x = np.array([1, 2, np.nan, 3, np.nan, 4])
... y = np.array([1, np.nan, 2, 5, np.nan, 8])
>>> x = np.ma.masked_array(np.nan_to_num(x), mask=np.isnan(x) & np.isnan(y))
>>> y = np.ma.masked_array(np.nan_to_num(y), mask=x.mask)
>>> (x+y).filled(np.nan)
array([  2.,   2.,   2.,   8.,  nan,  12.])

The real difficulty is that you seem to want nan to be interpreted as zero unless all values at a particular position are nan. This means that you must look at both x and y to determine which nans to replace. If you are okay with having all nan values replaced, then you can simply do np.nan_to_num(x) + np.nan_to_num(y).

answered Aug 23 '12 at 17:20

BrenBarn

242,874
37
412
384

Masked arrays are the way to go here if your numpy implementation is new enough to support it (mine isn't -- maybe it's time for an upgrade) (+1). – mgilson Aug 23 '12 at 17:25
@mgilson: Heh, probably is time! I think masked arrays have been in numpy for a few years now. – BrenBarn Aug 23 '12 at 17:29
Well my computer's a few years old ;^) – mgilson Aug 23 '12 at 17:35
The masked array method would mean a "for" loop and be much slower for my problem.The generator expression works frighteningly fast and the results are accurate. – Shejo284 Aug 23 '12 at 19:31
@mgilson masked arrays were always part of numpy as objects of their own. Since version 1.2, they became a subclass of standard ndarrays. @Shejo284 Where do you see a `for` loop ? @BrenBarn The use of `np.nan_to_num` is overkill, have a look on the solution I posted below (that I would very humbly call the "way to go" if you can use masked arrays...) – Pierre GM Aug 24 '12 at 20:47

score 3 · Answer 2 · edited Dec 30 '12 at 18:26

3

You could do something like:

arr1 = np.array([1.0, 1.0, np.nan, 1.0, 1.0, np.nan])
arr2 = np.array([1.0, 1.0, 1.0, 1.0, 1.0, np.nan])
flags = np.isnan(arr1) & np.isnan(arr2)
copy1 = arr1.copy()
copy2 = arr2.copy()
copy1[np.isnan(copy1)] = 0.0
copy2[np.isnan(copy2)] = 0.0
out = copy1 + copy2
out[flags] = np.NaN
print out
array([  2.,   2.,   1.,   2.,   2.,  NaN])

to find the locations in the arrays where both have a NaN in that index. Then, do essentially what @mgilson suggested, as in make copies and replace the NaNs with 0.0, add the two arrays together, and then replace the flagged indices above with np.NaN.

edited Dec 30 '12 at 18:26

Andy Hayden

359,921
101
625
535

answered Aug 23 '12 at 17:23

reptilicus

10,290
6
55
79

@mgilson: I'm trying to write a generator expression as it consumes less memory but I'm a bit confused as to how this works when dealing with very large numbers and reading a netcdf file, slice for slice: for i in cases: array = np.array(netcdfvar[i]) # Then sum these slices accounting for nan not sure how this generator would look. – Shejo284 Aug 23 '12 at 18:32
@Shejo284 -- I think you posted this on the wrong answer ;-). Anyway, I'm not familiar with reading slices from a netcdf file, but, you might try the following: `sum( nan_to_zero(np.array(netcdfvar[i])) for i in cases )`, or as BrenBarn points out: `sum( np.nan_to_num(netcdfvar[i]) for i in cases )` – mgilson Aug 23 '12 at 18:45
@mgilson: yes, you are right. I'm still learning how to use this site. Thanks. I've been trying several variations with varying success. Your solution is a bit counter intuitive. I'll test it. – Shejo284 Aug 23 '12 at 19:01
@mgilson: works like a charm and super fast. Thanks for teaching me this (and the generator expression). Much obliged. – Shejo284 Aug 23 '12 at 19:14

score 3 · Answer 3 · edited Sep 24 '13 at 12:50

3

import numpy as np
z=np.nansum([X,Y],axis=0)

edited Sep 24 '13 at 12:50

Radim Köhler

122,561
47
239
335

answered Sep 24 '13 at 12:29

kevin

31
1

1

This almost works. The issue is that this solution does not produce the desired output. The output should include NaNs where *both* input vectors have NaNs in the same positions. We can put the NaNs back with the addition of a third line to this solution: `z[np.isnan(x) & np.isnan(y)] = np.NaN` – Jack Kelly Dec 18 '14 at 11:03

score 1 · Answer 4 · answered Aug 23 '12 at 17:12

1

Not sure how this would perform, but it's worth a shot :)

def nan_to_zero(array):
    new_arr = array.copy()
    new_arr[np.isnan(array)] = 0.
    return new_arr

sum( nan_to_zero(arr) for arr in array_generator )

This doesn't result in a NaN in the last place of your array though. It results in a 0 ...

answered Aug 23 '12 at 17:12

mgilson

300,191
65
633
696

@mgilson: a list comprehension after removing the nans. I never thought about the list comprehension part. But I suspect this assumes a 1D array. Can't see how I could code this method for a 4D array. – Shejo284 Aug 23 '12 at 17:16
1

@Shejo284 -- It's actually a generator expression, but works similarly. I don't see any reason why this couldn't be used with 4D arrays though. Really, 4D arrays are just 1D arrays in memory anyway (Unless you really have view objects, but it should still work with those as well) – mgilson Aug 23 '12 at 17:20
2

@Shejo284, `sum` just calls `__add__`, however that's defined. Since `__add__` is defined for 4D arrays as for 1D arrays, it works. – senderle Aug 23 '12 at 17:24
@BrenBarn -- You're correct (I didn't know that existed, neat). And, looking at the source, it looks like it does almost exactly what I have coded up above (except that it needs to do a whole lot more type-checking). – mgilson Aug 23 '12 at 17:24
@mgilson: giving this try and will come back with the results soon. – Shejo284 Aug 23 '12 at 17:27
@senderle -- Thanks for explaining that for me. I think you did a better job than I would have done. – mgilson Aug 23 '12 at 17:27
I'm always happy to improve my answers if you leave a comment saying how this is either incorrect, or could be better. Thanks! – mgilson Aug 23 '12 at 18:42
@BrenBarn: works like a charm and super fast. Thanks for your great contribution. Much obliged. – Shejo284 Aug 23 '12 at 19:16

Pierre GM · Answer 5 · 2012-08-24T22:16:08.610

1

I see several simpler solutions:

(EDITED) Using np.ma

mX = np.ma.masked_array(X, mask=np.isnan(X))
mY = np.ma.masked_array(Y, mask=np.isnan(Y))
mZ = np.ma.masked_array(mX.filled(0) + mY.filled(0),
                        mask=mX.mask * mY.mask)
Z = mZ.filled(np.nan)

(EDITED) Not using np.ma

mx = np.isnan(x)
my = np.isnan(y)
z = np.where(mx,0,x) + np.where(my,0,y)
z[mx&my] = np.nan

edited Aug 24 '12 at 22:16

answered Aug 24 '12 at 20:42

Pierre GM

19,809
3
56
67

1

These solutions do not produce the desired output. He wants the non-nan terms to be added, with nan appearing in the result only if *all* values at a particular position are nan. Your solutions produce additional nans at positions where only one of the two input vectors has a nan. – BrenBarn Aug 24 '12 at 20:55
OK, fixed. Thanks for keeping me on my toes – Pierre GM Aug 24 '12 at 21:26
Also note that your last solution is something the OP explicitly said he didn't want to do (create a larger array containing both). The second solution looks nice, though. – BrenBarn Aug 24 '12 at 21:31

fast numpy addnan

5 Answers5