I am writing code to append data along the length of a NumPy array (for combining satellite data records). To do this, my code reads two arrays and then uses the function:

import numpy as np
import numpy.ma as ma

def swath_stack(array1, array2):
    """Takes two arrays of swath data and compares their dimensions.
    The arrays should be equally sized in dimension 1. If they aren't, the
    narrower one is padded along dimension 1 with masked -999. values until
    they are. The arrays are then stacked on top of each other."""
    if array1.shape[1] > array2.shape[1]:
        no_lines = array1.shape[1] - array2.shape[1]
        a = np.zeros((array2.shape[0], no_lines))
        a.fill(-999.)
        new_array = np.hstack((array2, a))
        mask = np.zeros(new_array.shape, dtype=int)
        mask[np.where(new_array==-999.)] = 1
        array2 = ma.masked_array(new_array, mask=mask)
    elif array1.shape[1] < array2.shape[1]:
        no_lines = array2.shape[1] - array1.shape[1]
        a = np.zeros((array1.shape[0], no_lines))
        a.fill(-999.)
        new_array = np.hstack((array1, a))
        mask = np.zeros(new_array.shape, dtype=int)
        mask[np.where(new_array==-999.)] = 1
        array1 = ma.masked_array(new_array, mask=mask)
    return np.vstack((array1, array2))

to combine the two into one array in the line:

window_data = swath_stack(window_data, stack_data)

When the arrays under consideration are equal in width, the swath_stack() function reduces to np.vstack(). My problem is that I keep encountering a MemoryError during this stage. I know that for arithmetic operators it is more memory efficient to work in place (i.e. array1 += array2 as opposed to array1 = array1 + array2), but I don't know how to avoid this kind of memory issue whilst using my swath_stack() function.

Can anyone please help?

1 Answer

I changed your last line to use np.ma.vstack instead of np.vstack, and got:

In [474]: swath_stack(np.ones((3,4)),np.zeros((3,6)))
Out[474]: 
masked_array(data =
 [[1.0 1.0 1.0 1.0 -- --]
 [1.0 1.0 1.0 1.0 -- --]
 [1.0 1.0 1.0 1.0 -- --]
 [0.0 0.0 0.0 0.0 0.0 0.0]
 [0.0 0.0 0.0 0.0 0.0 0.0]
 [0.0 0.0 0.0 0.0 0.0 0.0]],
             mask =
 [[False False False False  True  True]
 [False False False False  True  True]
 [False False False False  True  True]
 [False False False False False False]
 [False False False False False False]
 [False False False False False False]],
       fill_value = 1e+20)

This preserves the masking that you created during the padding.
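
As a minimal sketch of the difference (the array sizes are just the ones from the example above, and masked_equal stands in for the manual mask construction in the question), the padded array can be stacked both ways and the result inspected:

```python
import numpy as np
import numpy.ma as ma

# Pad a 3x4 array to 3x6 with the -999. fill value and mask the padding.
narrow = np.ones((3, 4))
wide = np.zeros((3, 6))
padded = ma.masked_equal(np.hstack((narrow, np.full((3, 2), -999.0))), -999.0)

# np.vstack does not carry the padding mask through to the result.
plain = np.vstack((padded, wide))

# np.ma.vstack returns a MaskedArray with the padding still masked.
masked = ma.vstack((padded, wide))
print(ma.count_masked(masked))
```

The six padded cells in the top-right corner remain masked in the second result.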

The masked padding doubles the memory use of the intermediate array.
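
To put rough numbers on this, here is a small sketch comparing buffer sizes. The shape is an arbitrary example, and the size of dtype=int depends on the platform and NumPy version (it is typically 8 bytes, the same as float64). Building the mask with masked_equal avoids the temporary integer array, since it produces a boolean mask directly:

```python
import numpy as np
import numpy.ma as ma

shape = (1000, 500)  # arbitrary example size
data = np.full(shape, -999.0)            # float64: 8 bytes per element

int_mask = np.zeros(shape, dtype=int)    # integer mask, as built in the question
bool_mask = np.zeros(shape, dtype=bool)  # boolean mask: 1 byte per element

# masked_equal creates a boolean mask without the integer intermediate.
padded = ma.masked_equal(data, -999.0)

print(data.nbytes, int_mask.nbytes, bool_mask.nbytes, padded.mask.nbytes)
```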

Do you get memory errors when using two equal-size arrays, i.e. with just the plain vstack? There's no way of doing in-place stacking: it must create one or more new arrays. Arrays have a fixed size, so they can't grow in place, and the final array must have a contiguous data buffer, so it can't reuse the buffers of the originals.
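
Since a new array has to be created anyway, one way to limit the number of temporaries is to allocate the final array once and copy both inputs into it. This is only a sketch of that idea, not something tested on real swath data; the name swath_stack_prealloc is hypothetical, and the values under the masked padding are left uninitialised rather than set to -999.:

```python
import numpy as np
import numpy.ma as ma

def swath_stack_prealloc(array1, array2):
    """Stack two 2-D arrays vertically, masking columns the narrower one lacks."""
    rows1, cols1 = array1.shape
    rows2, cols2 = array2.shape
    width = max(cols1, cols2)
    # One allocation for the final result; every element starts out masked.
    out = ma.masked_all((rows1 + rows2, width),
                        dtype=np.result_type(array1, array2))
    # Assigning unmasked data into a slice unmasks just that region.
    out[:rows1, :cols1] = array1
    out[rows1:, :cols2] = array2
    return out

result = swath_stack_prealloc(np.ones((3, 4)), np.zeros((3, 6)))
print(result.shape, ma.count_masked(result))
```

This still needs enough memory for both inputs plus the output, so it does not help if the output itself is too large to fit.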

It won't use masking, but np.pad might make padding a bit easier.
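
A short sketch of the np.pad route, assuming the same -999. fill value as in the question (target_width is a hypothetical stand-in for the wider array's shape[1]). The mask, if wanted, is applied as a separate step afterwards:

```python
import numpy as np
import numpy.ma as ma

narrow = np.ones((3, 4))
target_width = 6  # hypothetical width of the wider swath

# Pad only the right-hand side of axis 1 with the fill value.
padded = np.pad(
    narrow,
    ((0, 0), (0, target_width - narrow.shape[1])),
    mode="constant",
    constant_values=-999.0,
)

# np.pad returns a plain ndarray; mask the fill value afterwards if needed.
masked = ma.masked_equal(padded, -999.0)
print(padded.shape, ma.count_masked(masked))
```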

hpaulj
  • I found the issue, and it was indeed to do with the array sizes (it was a little tricky to uncover but essentially had to do with an erroneous timestamp causing the creation of a huge array to fill gaps in the data). – Thomas Eldridge Sep 22 '16 at 11:54
  • Separately, I later ran into the issue of np.vstack not preserving masking. My solution to this was clumsier than the above, so thank you for showing me np.ma.vstack instead! – Thomas Eldridge Sep 22 '16 at 11:55