8

I would like to add thousands of 4D arrays element wise and accounting for nans. A simple example using 1D arrays would be:

X = array([4,7,89,nan,89,65, nan])
Y = array([0,5,4, 9,  8, 100,nan])
z = X+Y
print z = array([4,12,93,9,97,165,nan])

I've written a simple for loop around this but it takes forever - not a smart solution. Another solution could be creating a larger array and use bottleneck nansum but this would take too much memory for my laptop. I need a running sum over 11000 cases.

Does anyone have a smart and fast way to do this?

Christian Hudon
  • 1,881
  • 1
  • 21
  • 42
Shejo284
  • 4,541
  • 6
  • 32
  • 44

5 Answers5

10

Here is one possibility:

>>> x = np.array([1, 2, np.nan, 3, np.nan, 4])
... y = np.array([1, np.nan, 2, 5, np.nan, 8])
>>> x = np.ma.masked_array(np.nan_to_num(x), mask=np.isnan(x) & np.isnan(y))
>>> y = np.ma.masked_array(np.nan_to_num(y), mask=x.mask)
>>> (x+y).filled(np.nan)
array([  2.,   2.,   2.,   8.,  nan,  12.])

The real difficulty is that you seem to want nan to be interpreted as zero unless all values at a particular position are nan. This means that you must look at both x and y to determine which nans to replace. If you are okay with having all nan values replaced, then you can simply do np.nan_to_num(x) + np.nan_to_num(y).

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • Masked arrays are the way to go here if your numpy implementation is new enough to support it (mine isn't -- maybe it's time for an upgrade) (+1). – mgilson Aug 23 '12 at 17:25
  • @mgilson: Heh, probably is time! I think masked arrays have been in numpy for a few years now. – BrenBarn Aug 23 '12 at 17:29
  • Well my computer's a few years old ;^) – mgilson Aug 23 '12 at 17:35
  • The masked array method would mean a "for" loop and be much slower for my problem.The generator expression works frighteningly fast and the results are accurate. – Shejo284 Aug 23 '12 at 19:31
  • @mgilson masked arrays were always part of numpy as objects of their own. Since version 1.2, they became a subclass of standard ndarrays. @Shejo284 Where do you see a `for` loop ? @BrenBarn The use of `np.nan_to_num` is overkill, have a look on the solution I posted below (that I would very humbly call the "way to go" if you can use masked arrays...) – Pierre GM Aug 24 '12 at 20:47
3

You could do something like:

arr1 = np.array([1.0, 1.0, np.nan, 1.0, 1.0, np.nan])
arr2 = np.array([1.0, 1.0, 1.0, 1.0, 1.0, np.nan])
flags = np.isnan(arr1) & np.isnan(arr2)
copy1 = arr1.copy()
copy2 = arr2.copy()
copy1[np.isnan(copy1)] = 0.0
copy2[np.isnan(copy2)] = 0.0
out = copy1 + copy2
out[flags] = np.NaN
print out
array([  2.,   2.,   1.,   2.,   2.,  NaN])

to find the locations in the arrays where both have a NaN in that index. Then, do essentially what @mgilson suggested, as in make copies and replace the NaNs with 0.0, add the two arrays together, and then replace the flagged indices above with np.NaN.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
reptilicus
  • 10,290
  • 6
  • 55
  • 79
  • @mgilson: I'm trying to write a generator expression as it consumes less memory but I'm a bit confused as to how this works when dealing with very large numbers and reading a netcdf file, slice for slice: for i in cases: array = np.array(netcdfvar[i]) # Then sum these slices accounting for nan not sure how this generator would look. – Shejo284 Aug 23 '12 at 18:32
  • @Shejo284 -- I think you posted this on the wrong answer ;-). Anyway, I'm not familiar with reading slices from a netcdf file, but, you might try the following: `sum( nan_to_zero(np.array(netcdfvar[i])) for i in cases )`, or as BrenBarn points out: `sum( np.nan_to_num(netcdfvar[i]) for i in cases )` – mgilson Aug 23 '12 at 18:45
  • @mgilson: yes, you are right. I'm still learning how to use this site. Thanks. I've been trying several variations with varying success. Your solution is a bit counter intuitive. I'll test it. – Shejo284 Aug 23 '12 at 19:01
  • @mgilson: works like a charm and super fast. Thanks for teaching me this (and the generator expression). Much obliged. – Shejo284 Aug 23 '12 at 19:14
3
import numpy as np
z=np.nansum([X,Y],axis=0)
Radim Köhler
  • 122,561
  • 47
  • 239
  • 335
kevin
  • 31
  • 1
  • 1
    This almost works. The issue is that this solution does not produce the desired output. The output should include NaNs where *both* input vectors have NaNs in the same positions. We can put the NaNs back with the addition of a third line to this solution: `z[np.isnan(x) & np.isnan(y)] = np.NaN` – Jack Kelly Dec 18 '14 at 11:03
1

Not sure how this would perform, but it's worth a shot :)

def nan_to_zero(array):
    new_arr = array.copy()
    new_arr[np.isnan(array)] = 0.
    return new_arr

sum( nan_to_zero(arr) for arr in array_generator )

This doesn't result in a NaN in the last place of your array though. It results in a 0 ...

mgilson
  • 300,191
  • 65
  • 633
  • 696
  • @mgilson: a list comprehension after removing the nans. I never thought about the list comprehension part. But I suspect this assumes a 1D array. Can't see how I could code this method for a 4D array. – Shejo284 Aug 23 '12 at 17:16
  • 1
    @Shejo284 -- It's actually a generator expression, but works similarly. I don't see any reason why this couldn't be used with 4D arrays though. Really, 4D arrays are just 1D arrays in memory anyway (Unless you really have view objects, but it should still work with those as well) – mgilson Aug 23 '12 at 17:20
  • 2
    @Shejo284, `sum` just calls `__add__`, however that's defined. Since `__add__` is defined for 4D arrays as for 1D arrays, it works. – senderle Aug 23 '12 at 17:24
  • @BrenBarn -- You're correct (I didn't know that existed, neat). And, looking at the source, it looks like it does almost exactly what I have coded up above (except that it needs to do a whole lot more type-checking). – mgilson Aug 23 '12 at 17:24
  • @mgilson: giving this try and will come back with the results soon. – Shejo284 Aug 23 '12 at 17:27
  • @senderle -- Thanks for explaining that for me. I think you did a better job than I would have done. – mgilson Aug 23 '12 at 17:27
  • I'm always happy to improve my answers if you leave a comment saying how this is either incorrect, or could be better. Thanks! – mgilson Aug 23 '12 at 18:42
  • @BrenBarn: works like a charm and super fast. Thanks for your great contribution. Much obliged. – Shejo284 Aug 23 '12 at 19:16
1

I see several simpler solutions:

  • (EDITED) Using np.ma

    mX = np.ma.masked_array(X, mask=np.isnan(X))
    mY = np.ma.masked_array(Y, mask=np.isnan(Y))
    mZ = np.ma.masked_array(mX.filled(0) + mY.filled(0),
                            mask=mX.mask * mY.mask)
    Z = mZ.filled(np.nan)
    
  • (EDITED) Not using np.ma

    mx = np.isnan(x)
    my = np.isnan(y)
    z = np.where(mx,0,x) + np.where(my,0,y)
    z[mx&my] = np.nan
    
Pierre GM
  • 19,809
  • 3
  • 56
  • 67
  • 1
    These solutions do not produce the desired output. He wants the non-nan terms to be added, with nan appearing in the result only if *all* values at a particular position are nan. Your solutions produce additional nans at positions where only one of the two input vectors has a nan. – BrenBarn Aug 24 '12 at 20:55
  • OK, fixed. Thanks for keeping me on my toes – Pierre GM Aug 24 '12 at 21:26
  • Also note that your last solution is something the OP explicitly said he didn't want to do (create a larger array containing both). The second solution looks nice, though. – BrenBarn Aug 24 '12 at 21:31