
I have a large ndarray X (roughly (1e3, 1e3, 1e3)), where I want to do manipulations of X including and not including particular elements of the 0th axis (for each element of the 1st and 2nd axes). i.e. there are (1e3, 1e3) elements which I want to (at times) mask in or out.

The simplest thing to do would be to construct a masked array like,

import numpy as np

Z = np.zeros_like(X, dtype=bool)
# assume `inds` is some indexing array which will target
#    the particular (1e3 x 1e3) elements I'm interested in
Z[inds] = True
Y = np.ma.masked_array(X, mask=Z)

But this uses an extra gigabyte of memory just for the mask. Is there any way to do this without constructing a second 10^9-element array of booleans? For example, is it possible to use a sparse matrix for the mask?

DilithiumMatrix
    Nope; `scipy.sparse` does not implement any sort of masking, and `np.ma` cannot use `sparse` matrices. Keep in mind that `np.ma`, when doing calculations, either fills the masked values with innocuous values (e.g. 0s, 1s) or compresses the array to 1d without the masked values. You could implement those steps directly if appropriate. – hpaulj Jun 07 '17 at 16:20
  • @hpaulj thanks! that's very helpful. For a function like, `np.ma.std`, how does it deal with masked values? If there is no `axis` argument, then presumably the array is flattened... but what if there is an `axis` argument --- it can neither flatten, nor fill 0s, right? – DilithiumMatrix Jun 07 '17 at 16:34
  • Looks like we need to study `numpy/ma/core.py`. `np.ma.std` uses the ma `std` method, which uses the `var`, which uses the `mean`, which in turn uses `sum` and `count`. `ma.sum` uses `filled(0)`. Looks like `count` uses `sum` on the `~mask` - ie counts unmasked values per axis. – hpaulj Jun 07 '17 at 16:56
  • @hpaulj yikes... but okay, yeah, the idea makes sense! – DilithiumMatrix Jun 07 '17 at 19:00
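The fill-and-count recipe from the comments above can be implemented directly. A minimal sketch with small made-up data (`mask` plays the role of `Z` from the question; shapes and values are illustrative only):

```python
import numpy as np

# Small stand-in for the (1e3, 1e3, 1e3) array.
X = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.zeros_like(X, dtype=bool)
mask[0, :, :2] = True  # pretend these elements are masked out

# Mean over axis 0, ignoring masked values: fill with 0,
# then divide the sum by the per-position count of unmasked values.
filled = np.where(mask, 0.0, X)
count = (~mask).sum(axis=0)
mean = filled.sum(axis=0) / count

# Variance/std follow the same fill-and-count pattern (ddof=0).
var = np.where(mask, 0.0, (X - mean) ** 2).sum(axis=0) / count
std = np.sqrt(var)

# Agrees with the masked-array version, without storing a full mask
# in the real use case (here we only built one to compare against):
print(np.ma.allclose(std, np.ma.masked_array(X, mask=mask).std(axis=0)))
```

Note this divides by `count`, so positions where *all* values along the axis are masked would divide by zero; `np.ma` masks those positions in its result instead.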

1 Answer


If you just want to take "clean" slices, as opposed to only taking some elements from some "rows", then you could use numeric indices instead of a mask.

E.g.:

arr = np.array([[[1,2,3,4], [5,6,7,8]], [[9,8,9,8], [7,6,7,6]]])
sub_idx = np.array([0,2])
sub_arr = arr[:, :, sub_idx]

This is a copy of a subset of arr, namely the 0th and 2nd "slices" in the last dimension:

array([[[1, 3],
        [5, 7]],

       [[9, 9],
        [7, 7]]])

Note that the array that defines which indexes to use is only one-dimensional, severely reducing its memory requirements. (Though of course the copy still takes up a significant chunk of memory in your case.)
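To put numbers on that, here is a back-of-the-envelope check using the shapes from the question (no gigabyte arrays are actually allocated; the index length of 500 is just an example):

```python
import numpy as np

# A boolean mask covering the full (1e3, 1e3, 1e3) array:
full_shape = (1000, 1000, 1000)
mask_bytes = np.prod(full_shape) * np.dtype(bool).itemsize
print(mask_bytes / 1e9)  # 1.0 -> roughly 1 GB just for the mask

# A 1-D integer index selecting, say, 500 positions along one axis:
sub_idx = np.arange(0, 1000, 2, dtype=np.int64)
print(sub_idx.nbytes)  # 4000 -> a few kB instead
```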

Also note that this gives you a copy, so any changes you make to the result (sub_arr) do not propagate to the original array. To write them back, you have to assign the sub-array into the original at the same indices:

sub_arr[:] = 0  # manipulate the values
arr[:, :, sub_idx] = sub_arr  # assign back along the same axis and indices
acdr
  • Hmm, yeah, I guess I could just store those sub array values, and actually zero them out and replace them as needed... this wouldn't work for all situations (i.e. sometimes you really want to ignore the element instead of have it be zero; e.g. calculating a standard deviation or something) --- but it might work for my situation, thanks for the suggestion – DilithiumMatrix Jun 07 '17 at 15:14
  • I'm not saying you have to set them to zero - that was just my example of manipulating the sub-array. – acdr Jun 07 '17 at 15:28
  • Sure, but still, I think there are some problems for which there isn't a fill value that will work. – DilithiumMatrix Jun 07 '17 at 16:15