I noticed the following strange behavior of rankdata with a masked_array. Here is the code:

import numpy as np
import scipy.stats as stats

m = [True, False]
print(stats.mstats.rankdata(np.ma.masked_array([1.0, 100], mask=m)))
# result [0. 1.]

print(stats.mstats.rankdata(np.ma.masked_array([1.0, np.nan], mask=m)))
# result [1. 0.]

print(stats.mstats.rankdata([1.0, np.nan]))
# result [1. 2.]

According to the scipy docs, masked values are assigned 0 (use_missing=False). So why does the second call output [1. 0.]? Is this a bug?

fivelements

1 Answer

After tracing, I found that this is related to the argsort method of masked_array. When mstats.rankdata calls argsort, it does not pass the fill_value and endwith parameters, which default to None and True respectively. Based on the following code from numpy, fill_value then becomes np.nan for floating-point data.

if fill_value is None:
    if endwith:
        # nan > inf
        if np.issubdtype(self.dtype, np.floating):
            fill_value = np.nan

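So in the case of the masked_array of [1, 100], it is argsorting [nan, 100], which gives [1, 0]. In the case of the masked_array of [1, np.nan], it is argsorting [nan, nan], which can give [0, 1]. The rankdata function then assumes that the first n (n=1) indices from argsort point to the unmasked values, which is not correct here: index 0 is the masked entry, so it receives rank 1, while the unmasked nan at index 1 is treated as missing and gets 0.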

n = data.count()
rk = np.empty(data.size, dtype=float)
idx = data.argsort()
rk[idx[:n]] = np.arange(1,n+1)
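
To make the above concrete, here is a minimal sketch for 1-D input. It first prints what argsort effectively sorts once masked slots are filled with nan, then shows one possible workaround. The helper name `rank_masked_as_zero` is hypothetical and not part of scipy or numpy. It assumes that sorting on the mask first (via `np.lexsort`) is an acceptable way to keep masked entries at the end, and it does not average ties the way scipy's rankdata does.

```python
import numpy as np

def rank_masked_as_zero(data):
    """Hypothetical helper: ranks 1..n for unmasked entries, 0 for masked ones.

    1-D only, and ties are NOT averaged (unlike scipy's rankdata).
    """
    data = np.ma.asarray(data)
    mask = np.ma.getmaskarray(data)
    values = np.ma.getdata(data)
    # lexsort uses the LAST key as the primary key, so unmasked entries
    # (mask == 0) always come first; within each group, order is by value.
    idx = np.lexsort((values, mask.astype(np.int8)))
    n = data.count()
    rk = np.zeros(data.size, dtype=float)  # masked entries keep rank 0
    rk[idx[:n]] = np.arange(1, n + 1)
    return rk

m = [True, False]
a = np.ma.masked_array([1.0, 100.0], mask=m)
b = np.ma.masked_array([1.0, np.nan], mask=m)

# What argsort effectively sorts: masked slots are replaced by nan.
print(a.filled(np.nan), a.argsort())
print(b.filled(np.nan))  # both entries are nan, so their order is ambiguous

print(rank_masked_as_zero(a))  # [0. 1.]
print(rank_masked_as_zero(b))  # [0. 1.]
```

With the mask as the primary sort key, an unmasked nan can no longer be confused with a masked slot, so the first n indices really are the unmasked ones.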
fivelements

There is a `np.ma.argsort` function. It may handle masked arrays better than `np.argsort`, even though it appears to delegate to the method. – hpaulj Jun 28 '18 at 21:50