python fancy indexing with a boolean masked array

Question

I have a numpy masked array of data:

data = masked_array(data = [7 -- 7 1 8 -- 1 1 -- -- 3 -- -- 3 --],
                    mask = [False True False False False True False False True True False True True False True])

I have a flag of a specific type of data, which is a boolean masked array:

flag = masked_array(data = [True False False True -- -- -- False -- True -- -- -- -- True],
                    mask = [False False False False True True True False True False True True True True False])

I want to do something like data[flag] and get the following output:

output_wanted = [7 1 -- --]

which corresponds to the data elements where the flag is True. Instead I get this:

output_real = [7 -- 7 1 8 -- 1 1 -- -- 3 -- -- 3 --]

I have left out the masks of the outputs for clarity.

I don't mind having an output with the size of the flag, as long as it selects the data I want (the elements corresponding to the True values of the flag). But I cannot figure out why I get these values in the actual output!

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

What about something like:

import numpy as np
from numpy.ma import masked_array

data = masked_array(data = [7,     0,     7,     1,     8,     0,    1,     1,     0,    0,     3,     0,    0,    3,     0],
                    mask = [False, True,  False, False, False, True, False, False, True, True,  False, True, True, False, True])
flag = masked_array(data = [True,  False, False, True,  0,     0,    0,     False, 0,    True,  0,     0,    0,    0,     True],
                    mask = [False, False, False, False, True,  True, True,  False, True, False, True,  True, True, True,  False])

print(repr(data))
print(repr(flag))

indices = np.where(flag & ~flag.mask)
print(data[indices])

Note, you may get into trouble if the masked values in flag can't be compared with &, but it doesn't look like that's the case for you.

Output:

masked_array(data = [7 -- 7 1 8 -- 1 1 -- -- 3 -- -- 3 --],
             mask = [False  True False False False  True False False  True  True False  True  True False  True],
       fill_value = 999999)

masked_array(data = [1 0 0 1 -- -- -- 0 -- 1 -- -- -- -- 1],
             mask = [False False False False  True  True  True False  True False  True  True  True  True False],
       fill_value = 999999)

[7 1 -- --]

Edit:

An alternative way of getting the indices might also be:

indices = np.where(flag.filled(False))
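
A minimal sketch of why the filled(False) form is robust, assuming two hypothetical flags (flag_hiding_false and flag_hiding_true, names not from the question) that display identically but store different values under the mask. Once the masked slots are filled with False, both give the same indices:

```python
import numpy as np
from numpy.ma import masked_array

mask = [False, False, False, False, True, True, True, False,
        True, False, True, True, True, True, False]
visible = [True, False, False, True, False, False, False, False,
           False, True, False, False, False, False, True]

# Two flags with the same mask but different values hidden under it.
flag_hiding_false = masked_array(visible, mask=mask)
flag_hiding_true = masked_array(np.array(visible) | np.array(mask), mask=mask)

# filled(False) overwrites every masked slot with False, so the hidden
# values no longer influence which positions are selected.
idx_a = np.where(flag_hiding_false.filled(False))[0]
idx_b = np.where(flag_hiding_true.filled(False))[0]

print(idx_a)
print(idx_b)
```

Both index arrays point at the positions where the flag is True and not masked.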

Update (Edit 2):

Beware of the subtleties of indexing arrays using arrays.

Consider the following code:

import numpy as np

data = np.array([1,2,3,4,5])
mask = np.array([True, False, True, False, True])

res  = data[mask]
print(res)

As you might (or might not) expect, the mask here serves as a "filter", dropping the elements of data where the corresponding location in the mask is False. Because of the values I chose for data and mask, the indexing filters out the even values and leaves only the odd ones.

The output here is: [1 3 5].

Now, consider the very similar code:

import numpy as np

data = np.array([1,2,3,4,5])
mask = np.array([1, 0, 1, 0, 1])

res  = data[mask]
print(res)

Here, the only thing that changed is the datatype of the mask elements; their boolean values are the same. Let's call the first mask (made up of True/False values) mask1 and the second mask (made up of 1/0 values) mask2.

You can inspect the datatype of an array through its dtype attribute (e.g. print(mask.dtype)). mask1 has a dtype of bool, while mask2 has an integer dtype (int32 on my machine; the default integer width depends on the platform, so you may see int64).

Here, however, the output is different: [2 1 2 1 2].

What's going on here?

In fact, indexing behaves differently depending on the datatype of the array used to index. As mentioned, when the datatype of the "mask" is boolean, it serves a filtering function. But when the datatype of the "mask" is integral, it serves a "selection" function, using the elements of the index as indices of the original array.

So, in the second example, since data[1] = 2 and data[0] = 1, data[mask2] is [2 1 2 1 2]: an array of length 5, not of length 3 as in the boolean case.

Put another way, given the following code:

res = data[mask]

If mask.dtype == int, the length of res will be equal to the length of mask.

If mask.dtype == bool, the length of res will be equal to the number of True values in mask.

Quite a difference.

Lastly, you can coerce an array of one datatype to another using the astype method.

Demonstration snippet:

import numpy as np

data = np.array([1,2,3,4,5])

# Create a boolean mask
mask1 = np.array([True, False, True, False, True])

# Create an integer "mask", using the same logical values 
mask2 = np.array([1,0,1,0,1])

# Coerce mask2 into a boolean mask
mask3 = mask2.astype(bool)

print(data)         # [1 2 3 4 5]
print("-" * 80)
print(mask1)        # [ True False  True False  True]
print(mask1.dtype)  # bool
print(data[mask1])  # [1 3 5]
print("-" * 80)
print(mask2)        # [1 0 1 0 1]
print(mask2.dtype)  # int32 here; int64 on many platforms
print(data[mask2])  # [2 1 2 1 2]
print("-" * 80)
print(mask3)        # [ True False  True False  True]
print(mask3.dtype)  # bool
print(data[mask3])  # [1 3 5]

Yes, that works, thanks. But do you know how numpy deals with the indexing when given a boolean masked array? I don't understand the output — JoVe, Jul 19 '16 at 11:18

hpaulj · Accepted Answer · 2016-08-09T19:50:28.033

If I reconstruct your arrays with:

In [28]: d=np.ma.masked_equal([7,0,7,1,8,0,1,1,0,0,3,0,0,3,0],0)

In [29]: f=np.ma.MaskedArray([True,False,False,True, False,False,False,False,True,True,True,True,True,True,True],[False, False, False, False, True, True, True, False, True, False, True, True, True, True, False])

In [30]: d
Out[30]: 
masked_array(data = [7 -- 7 1 8 -- 1 1 -- -- 3 -- -- 3 --],
             mask = [False  True False False False  True False False  True  True False  True
  True False  True],
       fill_value = 0)

In [31]: f
Out[31]: 
masked_array(data = [True False False True -- -- -- False -- True -- -- -- -- True],
             mask = [False False False False  True  True  True False  True False  True  True
  True  True False],
       fill_value = True)

The masked displays match, but I'm guessing at what the masked values are.

In [32]: d[f]
Out[32]: 
masked_array(data = [7 1 -- -- 3 -- -- 3 --],
             mask = [False False  True  True False  True  True False  True],
       fill_value = 0)

In [33]: d[f.data]
Out[33]: 
masked_array(data = [7 1 -- -- 3 -- -- 3 --],
             mask = [False False  True  True False  True  True False  True],
       fill_value = 0)

Indexing with f is the same as indexing with its data attribute. Its mask does nothing. Evidently my masked values are different from yours.

But if I index with a filled array, I get the desired array:

In [34]: d[f.filled(False)]
Out[34]: 
masked_array(data = [7 1 -- --],
             mask = [False False  True  True],
       fill_value = 0)

filled is used a lot in np.ma code, with different fill values depending on the np operation (e.g. 0 for sum vs. 1 for product). Masked arrays don't usually iterate over their values while skipping the masked ones; instead they convert the masked ones to innocuous values and use regular numpy operations. The other strategy is to remove the masked values with compressed.
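
To illustrate the two strategies, here is a small sketch on a made-up array (the values are not from the question). It mimics by hand what the paragraph above describes; it is not a copy of the np.ma internals:

```python
import numpy as np

d = np.ma.masked_array([7, 5, 2, 9], mask=[False, True, False, True])

# Sum: masked entries become 0, the identity for addition.
sum_via_fill = d.filled(0).sum()      # 7 + 0 + 2 + 0
# Product: masked entries become 1, the identity for multiplication.
prod_via_fill = d.filled(1).prod()    # 7 * 1 * 2 * 1
# compressed() drops the masked entries and returns a plain 1-D ndarray.
kept = d.compressed()

print(sum_via_fill, d.sum())
print(prod_via_fill, d.prod())
print(kept)
```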

indices = np.where(flag.filled(False)) is mentioned in another answer, but indexing directly with the plain boolean array, as in d[f.filled(False)], works just as well.
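
As a quick check of that equivalence, the sketch below rebuilds d and f as above and compares the two forms (selector, via_bool and via_where are illustrative names):

```python
import numpy as np

d = np.ma.masked_equal([7, 0, 7, 1, 8, 0, 1, 1, 0, 0, 3, 0, 0, 3, 0], 0)
f = np.ma.MaskedArray(
    [True, False, False, True, False, False, False, False,
     True, True, True, True, True, True, True],
    mask=[False, False, False, False, True, True, True, False,
          True, False, True, True, True, True, False],
)

selector = f.filled(False)           # plain boolean ndarray
via_bool = d[selector]               # boolean indexing
via_where = d[np.where(selector)]    # integer indexing at the same positions

print(via_bool)
print(via_where)
```

Both results hold the same four elements with the same mask, so the np.where step is optional here.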

A masked array has a data attribute and a mask attribute. Masking does not change the data values directly. That task is left to methods like filled.
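
A short sketch of that point, on a made-up three-element array: masking an element leaves the stored value in place, and only filled substitutes something for it.

```python
import numpy as np

a = np.ma.masked_array([10, 20, 30], mask=[False, False, False])
a[1] = np.ma.masked           # hide the middle element

hidden_value = a.data[1]      # the stored value is still there
substituted = a.filled(-1)    # filled() is what replaces masked slots

print(a.mask)
print(hidden_value)
print(substituted)
```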

It strikes me as a bug that indexing with a masked array is even allowed, given that doing so seems to break the semantics of the masked array, namely that the fill_value shouldn't matter. I'd expect there to be an `if isinstance(boolmask, MaskedArray) and getmask(boolmask) is not False: raise ValueError` or something — Eric, Aug 09 '16 at 07:31
Indexing is implemented by passing an `indx` tuple to a `d.__getitem__` method. If `d` is a masked array, it delegates the task to `d.data[indx]` and `d.mask[indx]`. That `indx` tuple could have slices, scalars, lists, arrays, etc. Blocking masked arrays from such a tuple would require a lot of extra logic - and not only for masked array indexing. `np.arange(15)[f]`? — hpaulj, Aug 09 '16 at 19:49

score 0 · Answer 3 · edited Aug 09 '16 at 07:24

I figured out how indexing with masked arrays works.

In fact, numpy has no special handling for this kind of indexing.

When doing something like data[flag] with flag a boolean masked array, numpy uses the underlying data of flag. In other words, it uses the values that the masked elements held before they were masked.

So beware: if the masked values are not explicitly filled with their fill_value, the indexing may look random.

Example:

>>> import numpy as np
>>> arr = np.array([0, 1, 2, 3, 4])
>>> flag = np.ma.masked_array([True, False, False, True, True],
...                           [False, True, False, False, True])

>>> arr[flag]
array([0, 3, 4])
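
For comparison, a minimal sketch of the same example with the masked positions filled explicitly (raw and safe are illustrative names). It indexes with flag.data to make the use of the hidden values explicit:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4])
flag = np.ma.masked_array([True, False, False, True, True],
                          mask=[False, True, False, False, True])

raw = arr[flag.data]             # hidden values take part in the selection
safe = arr[flag.filled(False)]   # masked positions count as False

print(raw)
print(safe)
```

The last element is masked in flag but still stored as True, which is why it shows up in the first result and not in the second.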

One way to handle this is shown in jedwards' answer.

But I think masked arrays should be avoided for flagging data; the mask does not bring sufficient insight.

If the flag array is used to access a certain type of data, masked values should be set to False, for example when you want to interpolate the data that are not flagged.

If the flag array is used to mask a certain type of data, masked values should be set to True.
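
The two conventions could be sketched like this, on made-up data (selected and masked_data are illustrative names). It is only one way to apply the rule of thumb above:

```python
import numpy as np

data = np.array([10, 20, 30, 40])
flag = np.ma.masked_array([True, False, True, False],
                          mask=[False, False, True, True])

# Flag used to select data: unknown flags count as "not selected".
selected = data[flag.filled(False)]

# Flag used to mask data: unknown flags count as "masked out".
masked_data = np.ma.masked_array(data, mask=flag.filled(True))

print(selected)
print(masked_data)
```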

`data[flag[~flag.mask]]` won't work as you expect for two reasons. The first is that `flag[~flag.mask]` returns an integer array, not boolean as you would need (see update 2 to my post). The second is that the length of the array, even if coerced to boolean, will be too small, and it will filter the wrong elements. In my version of numpy, attempting this shows a `VisibleDeprecationWarning` informing me of the length difference. — jedwards, Jul 19 '16 at 13:14
Yep, I noticed the length problem when testing. But why does `flag[~flag.mask]` return an int array? — JoVe, Jul 19 '16 at 14:08
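
On the int array question: a likely explanation, assuming flag was built as in Answer 1 (True/False mixed with the integer 0 for the masked slots), is that numpy picks one common dtype for the whole list, and bool combined with int gives an integer dtype. A minimal sketch with illustrative names mixed and pure:

```python
import numpy as np

mixed = np.array([True, False, 0])       # bools mixed with the integer 0
pure = np.array([True, False, False])    # bools only

mixed_is_integer = np.issubdtype(mixed.dtype, np.integer)
pure_is_bool = pure.dtype == np.bool_

print(mixed, mixed.dtype)
print(pure, pure.dtype)
```

Using False instead of 0 for the masked slots keeps the flag boolean from the start.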
