I have a question about the behaviour of numpy.median() on masked arrays created with numpy.ma.masked_array().

As I've understood from debugging my own code, numpy.median() does not work as expected on masked arrays (see "Using numpy.median on a masked array" for a definition of the problem).

The answer provided was:

Explanation: If I remember correctly, the np.median does not support subclasses, so it fails to work correctly on np.ma.MaskedArray.

The conclusion, therefore, is that the way to calculate the median of the elements in a masked array is to use numpy.ma.median(), since this is a median function dedicated to masked arrays.

My problem is that I've just spent a considerable amount of time tracking this down, since nothing indicates that anything is wrong.

There is no warning or exception raised when trying to calculate the median of a masked array via numpy.median().

The value returned by this function is not what is expected, and can cause serious problems when people are not aware of this.

Does anyone know if this might be considered a bug?

In my opinion, the expected behaviour should be that using numpy.median on a masked array raises an exception of some sort.

Any thoughts?

The test script below shows the unwanted and unexpected behaviour of using numpy.median on a masked array (note that the correct and expected median of the valid elements is 2.5):

In [1]: import numpy as np

In [2]: test = np.array([1, 2, 3, 4, 100, 100, 100, 100])

In [3]: valid_elements = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

In [4]: testm = np.ma.masked_array(test, ~valid_elements)

In [5]: testm
Out[5]: 
masked_array(data = [1 2 3 4 -- -- -- --],
             mask = [False False False False  True  True  True  True],
       fill_value = 999999)

In [6]: np.median(test)
Out[6]: 52.0

In [7]: np.median(test[valid_elements])
Out[7]: 2.5

In [8]: np.median(testm)
Out[8]: 4.0

In [9]: np.ma.median(testm)
Out[9]: 2.5
Joris
  • So your complaint here is that `np.median` does not work, but `np.ma.median` does? – Eric Feb 22 '17 at 12:52
  • Sort of. My complaint is that nothing (not even the docs) indicates that np.median gives a wrong result when applied to a masked array. Being quite sloppy sometimes, I would not notice the wrong np.median value. So the complaint is not that np.ma.median works, but that np.median gives no indication of incorrect results on masked arrays – Joris Feb 22 '17 at 12:58
  • A large number of `np.*` functions do not work correctly on masked arrays. The problem is, the `np.*` functions do not even know that masked arrays exist. – Eric Feb 22 '17 at 13:00
  • Is there a list somewhere of the computations that do not work properly? The problem is not that a function does not work, but that it runs without warnings or exceptions and gives wrong results – Joris Feb 22 '17 at 13:02
  • No such list exists, I'm afraid. The problem in your case is that `median` relies on `partition`ing at the halfway point, and I can't think of a good definition of `partition` for masked arrays – Eric Feb 22 '17 at 13:09
  • OK, that I believe, because I know that calculating the median is an order-statistics kind of thing. However, do you agree that raising an exception in this case is better than giving a wrong value? – Joris Feb 22 '17 at 13:14
  • Yes, but given that `np.ma` is intended to perform as an extension to `np`, there's no way for `np.median` to know that something is wrong. At any rate, you should probably file an issue on GitHub, since this isn't really suited to Stack Overflow. The behaviour is certainly unfortunate, but it might be a compromise by design. Worth discussing with the rest of the numpy team, even if it turns out not to be a bug. – Eric Feb 22 '17 at 13:17
  • `testm.compressed()` can be used to fetch the valid elements. That and `filled` are widely used in MA code to produce an array that works properly in a general numpy function. – hpaulj Feb 22 '17 at 17:48
  • thanks hpaulj, I'll give that a go – Joris Feb 22 '17 at 17:59
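
The `compressed()` and `filled` suggestion from the comments can be sketched roughly as follows. This is only an illustration on the question's example data; pairing `filled(np.nan)` with `np.nanmedian` is an assumption added here, not something named in the thread, and it requires converting to float first because NaN cannot be stored in an integer array.

```python
import numpy as np

test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
valid_elements = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
testm = np.ma.masked_array(test, mask=~valid_elements)

# Option 1: drop the masked entries, leaving a plain 1-D ndarray
# that any ordinary np.* function can handle.
valid_values = testm.compressed()
median_compressed = np.median(valid_values)

# Option 2 (assumption, not from the thread): replace masked entries
# with NaN and use a NaN-aware reduction. The cast to float is needed
# because an integer array cannot hold NaN.
filled_values = testm.astype(float).filled(np.nan)
median_filled = np.nanmedian(filled_values)

print(median_compressed, median_filled)  # 2.5 2.5
```

Note that `compressed()` always flattens the result, so for multi-dimensional data with an `axis` argument the `filled` route (or `np.ma.median` itself) is probably the more convenient one.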

1 Answer

Does anyone know if this might be considered a bug?

Well, it is a bug! I reported it a few months ago on the NumPy issue tracker (link to the bug report).

The reason for this behaviour is that np.median uses the partition method of the input array, but np.ma.MaskedArray doesn't override partition. So when arr.partition is called inside np.median, it simply falls back to the basic numpy.ndarray.partition method, which is bogus for a masked array because it takes no account of the mask.
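
To make that concrete, here is a minimal sketch of how a mask-blind partition can produce the 4.0 from the question. It is a hand-written reconstruction, not NumPy's actual `np.median` code, and it assumes an even-length input whose median is the mean of the two middle elements. The names `part`, `middle` and `mask_blind` are made up for illustration.

```python
import numpy as np

test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
mask = np.array([False, False, False, False, True, True, True, True])
testm = np.ma.masked_array(test, mask)

n = test.size
lo, hi = n // 2 - 1, n // 2  # positions of the two middle elements

# Partition the raw data only: the masked 100s take part in the
# ordering exactly like valid values.
part = np.partition(testm.data, [lo, hi])

# The mask itself was never reordered, so re-attaching it hides
# whatever now sits in the second half of the array.
middle = np.ma.masked_array(part[lo:hi + 1], mask[lo:hi + 1])

# The middle elements are 4 (unmasked) and 100 (masked); the masked
# mean skips the 100 and returns 4.0.
mask_blind = middle.mean()

# Partitioning only the valid elements gives the expected result.
correct = np.median(testm.compressed())

print(mask_blind, correct)  # 4.0 2.5
```

The point of the sketch is that the data gets reordered while the mask stays where it was, so the final masked mean is taken over the wrong elements.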

MSeifert