
I was looking into numpy issue 2972 and several related problems. It turns out that all those problems are related to the situation where the array itself is structured, but its mask is not:

In [38]: R = numpy.zeros(10, dtype=[("A", "<f2"), ("B", "<f4")])

In [39]: Rm = numpy.ma.masked_where(R["A"]<5, R)

In [41]: Rm.dtype
Out[41]: dtype([('A', '<f2'), ('B', '<f4')])

In [42]: Rm.mask.dtype
Out[42]: dtype('bool')

# Now, both `__getitem__` and `__repr__` will result in errors (see issue #2972)

If I create a masked array differently, the mask dtype is structured like the dtype of the array itself:

In [44]: Q.dtype
Out[44]: dtype([('A', '<f4'), ('B', '<f4')])

In [45]: Q.mask.dtype
Out[45]: dtype([('A', '?'), ('B', '?')])

The former situation exposes several problems. For example, Rm.__repr__() and Rm["A"] both result in IndexError, although it was a ValueError in the past.

By design, is it supposed to be possible for A.dtype to be structured while A.mask.dtype is not?

In other words: is the bug in the __repr__ and __getitem__ methods of numpy.ma.core.MaskedArray, or does the real bug occur earlier, in permitting such a masked structured array to exist in the first place?

JamesENL
gerrit

1 Answer


The errors in your first case indicate that the methods expect the mask to have the same number (and names) of fields as the base array:

__getitem__:  dout._mask = _mask[indx]
_recursive_printoption: (curdata, curmask) = (result[name], mask[name])

If the masked array is made with the 'main' constructor, the mask has the same structure:

Rn = np.ma.masked_array(R, mask=R['A']>5)
Rn.mask.dtype  # dtype([('A', '?'), ('B', '?')])

In other words, there is a mask value for each field of each element.

The masked_array documentation evidently intends 'same shape' to include the dtype structure. It describes the mask parameter as: "Must be convertible to an array of booleans with the same shape as 'data'."
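As a minimal sketch of what "a mask value for each field" means, the snippet below compares the mask dtype produced by the constructor with the one that np.ma.make_mask_descr derives from the data dtype. The values written into field A are made up for illustration; the original R was all zeros.

```python
import numpy as np

# Same dtype as in the question; the values in "A" are illustrative only.
R = np.zeros(10, dtype=[("A", "<f2"), ("B", "<f4")])
R["A"] = np.arange(10)

# The main constructor expands a plain boolean mask to one flag per field.
Rn = np.ma.masked_array(R, mask=R["A"] > 5)

# make_mask_descr maps every field of a dtype to a boolean field.
expected = np.ma.make_mask_descr(R.dtype)

print(Rn.mask.dtype)
print(expected)
```

Both fields receive the same flags here, because the one-dimensional boolean condition is applied record by record.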

If I try to set the mask in the same way that masked_where does:

Rn._mask=R['A']>5

I get the same print error: the structured mask gets overwritten with the new boolean array, changing its dtype. In contrast, if I use

Rn.mask=R['A']<5

Rn prints fine. .mask is a property whose setter evidently handles the structured mask correctly.
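The difference between the two assignments can be sketched by looking only at the resulting mask dtype, which avoids triggering the print error itself. The field values are again made up for illustration, and writing to the private _mask attribute is shown only to reproduce the problem, not as something to do in real code.

```python
import numpy as np

R = np.zeros(10, dtype=[("A", "<f2"), ("B", "<f4")])
R["A"] = np.arange(10)  # illustrative values

Rn = np.ma.masked_array(R, mask=R["A"] > 5)
structured = Rn.mask.dtype

# Public property: the setter spreads the boolean over every field,
# so the mask keeps its structured dtype.
Rn.mask = R["A"] < 5
dtype_after_property = Rn.mask.dtype

# Private attribute: the structured mask is simply replaced by a plain
# boolean array, leaving the array in the inconsistent state.
Rn._mask = R["A"] > 5
dtype_after_private = Rn._mask.dtype

print(dtype_after_property)
print(dtype_after_private)
```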

Without digging into the code history (on github), my guess is that masked_where is a convenience function that wasn't updated when structured dtypes were added to other parts of the ma code. Compared to ma.masked_array it's a simple function that does not look at the dtype at all. Other convenience functions like ma.masked_greater use masked_where. Changing result._mask = cond to result.mask = cond might be all that is needed to correct this issue.


How thoroughly have you tested the consequences of an unstructured mask?

Rm.flatten()

returns an array with a structured mask, even when it started with an unstructured one. That's because it uses Rm.__setmask__, which is sensitive to fields and is also the set function for the mask property.

Rm.tolist()  # same error as str()

masked_where starts with:

cond = make_mask(condition)

make_mask returns the simple 'bool' dtype. It can also be called with a dtype, producing a structured mask: np.ma.make_mask(R['A']<5, dtype=R.dtype). But such a structured mask gets flattened when used in masked_where. masked_where not only allows an unstructured mask, it forces the mask to be unstructured.

Your unstructured mask is already partly implemented, as the recordmask property:

recordmask = property(fget=_get_recordmask)

I say partly because it has a get method, but the set method is not yet implemented. See def _set_recordmask(self).
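A small sketch of the getter, assuming a four-record array with hand-written per-field flags (both made up for illustration): recordmask collapses the structured mask to one boolean per record, and a record counts as masked only when all of its fields are masked. The try/except around the setter reflects the "not yet implemented" state described above and may behave differently in other NumPy versions.

```python
import numpy as np

R = np.zeros(4, dtype=[("A", "<f2"), ("B", "<f4")])

# Structured mask given field by field: one (A, B) flag pair per record.
field_mask = [(True, True), (True, False), (False, False), (False, True)]
Rn = np.ma.masked_array(R, mask=field_mask)

# One boolean per record; True only where every field is masked.
rec = Rn.recordmask
print(rec)

# The setter is expected to be unimplemented; guard the call accordingly.
try:
    Rn.recordmask = rec
    setter_available = True
except NotImplementedError:
    setter_available = False
print(setter_available)
```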

The more I look at this, the more I'm convinced that masked_where is wrong. It could be changed to set a structured mask, but then it's not much different from masked_array. It might be better if it raised an error when the array is structured (has dtype.names). That way masked_where would remain useful for unstructured numeric arrays, while preventing misapplication to structured ones.
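One way to picture that suggestion is a thin wrapper around masked_where. The name masked_where_strict, the choice of TypeError, and the error message are all hypothetical; this is a sketch of the idea, not how NumPy itself behaves.

```python
import numpy as np

def masked_where_strict(condition, a, copy=True):
    """Hypothetical wrapper: reject structured input, else defer to masked_where."""
    # dtype.names is None for plain arrays and a tuple of field names otherwise.
    if np.asanyarray(a).dtype.names is not None:
        raise TypeError(
            "masked_where_strict does not support structured arrays; "
            "use np.ma.masked_array(a, mask=...) instead"
        )
    return np.ma.masked_where(condition, a, copy=copy)

# Plain numeric arrays pass straight through.
plain = np.arange(5.0)
result = masked_where_strict(plain < 2, plain)

# Structured arrays are refused up front.
R = np.zeros(3, dtype=[("A", "<f2"), ("B", "<f4")])
try:
    masked_where_strict(R["A"] < 5, R)
    rejected = False
except TypeError:
    rejected = True

print(result)
print(rejected)
```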

I should also look at the test code.

hpaulj
  • I realise all of that. The question is: Is the bug in `masked_where` (the mask of the structured array should be structured), or is the bug in the other functions (they should expect the mask of a structured array to be either structured or unstructured)? Logically, both make sense to me, so there is an argument that both should be permitted. The mask must have the same shape, of course. But must its dtype also have the same shape? – gerrit Jan 28 '15 at 18:17
  • I was trying to make the case that the core `ma` functionality assumes a structured mask, and that the fact that `masked_where` allows you to set an unstructured mask is a bug, left over from pre-structured days. – hpaulj Jan 28 '15 at 19:06