
I have a Pandas dataframe in which each column represents a separate property, and each row holds the properties' value on a specific date:

import io

import pandas as pd

dfstr = \
'''         AC        BO         C       CCM        CL       CRD        CT        DA        GC        GF
2010-01-19  0.844135 -0.194530 -0.231046  0.245615 -0.581238 -0.593562  0.057288  0.655903  0.823997  0.221920
2010-01-20 -0.204845 -0.225876  0.835611 -0.594950 -0.607364  0.042603  0.639168  0.816524  0.210653  0.237833
2010-01-21  0.824852 -0.216449 -0.220136  0.234343 -0.611756 -0.624060  0.028295  0.622516  0.811741  0.201083'''
# the unlabeled first column (the dates) becomes the index
df = pd.read_csv(io.StringIO(dfstr), sep=r'\s+')

Using the rank method, I can find the percentile rank of each property with respect to a specific date:

df.rank(axis=1, pct=True)

Output:

             AC   BO    C  CCM   CL  CRD   CT   DA   GC   GF
2010-01-19  1.0  0.4  0.3  0.7  0.2  0.1  0.5  0.8  0.9  0.6
2010-01-20  0.4  0.3  1.0  0.2  0.1  0.5  0.8  0.9  0.6  0.7
2010-01-21  1.0  0.4  0.3  0.7  0.2  0.1  0.5  0.8  0.9  0.6

What I'd like to get instead is the quantile (e.g. quartile, quintile, decile) rank of each property. For example, for quintile rank my desired output would be:

            AC  BO  C  CCM  CL  CRD  CT  DA  GC  GF
2010-01-19   5   2  2    4   1    1   3   4   5   3
2010-01-20   2   2  5    1   1    3   4   5   3   4
2010-01-21   5   2  2    4   1    1   3   4   5   3

I might be missing something, but there doesn't seem to be a built-in way to do this kind of quantile ranking with Pandas. What's the simplest way to get my desired output?

tel
  • Interested in a one-line solution as well. Although once you have the rank by `percentile`, getting the quartile and so on is just one more line of `map`. – Quang Hoang May 27 '19 at 21:36
  • @QuangHoang Yeah, it's surprisingly tricky. As well, I think there might be some extra edge cases to account for if there is repeated or missing data. – tel May 27 '19 at 21:44

1 Answer


Method 1: mul & np.ceil

You were quite close with rank. Multiply the percentile rank by 5 with .mul to scale it to the quintile range, then round up with np.ceil:

import numpy as np

np.ceil(df.rank(axis=1, pct=True).mul(5))

Output

             AC   BO    C  CCM   CL  CRD   CT   DA   GC   GF
2010-01-19  5.0  2.0  2.0  4.0  1.0  1.0  3.0  4.0  5.0  3.0
2010-01-20  2.0  2.0  5.0  1.0  1.0  3.0  4.0  5.0  3.0  4.0
2010-01-21  5.0  2.0  2.0  4.0  1.0  1.0  3.0  4.0  5.0  3.0

If you want integers and the data has no missing values, use astype:

np.ceil(df.rank(axis=1, pct=True).mul(5)).astype(int)

Better still, since pandas 0.24.0 there is a nullable integer type, Int64, which also copes with missing values:

np.ceil(df.rank(axis=1, pct=True).mul(5)).astype('Int64')

Output

            AC  BO  C  CCM  CL  CRD  CT  DA  GC  GF
2010-01-19   5   2  2    4   1    1   3   4   5   3
2010-01-20   2   2  5    1   1    3   4   5   3   4
2010-01-21   5   2  2    4   1    1   3   4   5   3
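
The same expression generalizes to other bin counts by swapping the multiplier. A minimal sketch, assuming a hypothetical helper named `quantile_rank` (it is not part of pandas) and using only the first row of the question's data:

```python
import numpy as np
import pandas as pd

def quantile_rank(df, q):
    """Hypothetical helper: row-wise quantile rank from 1 to q.

    q=4 gives quartiles, q=5 quintiles, q=10 deciles.
    """
    pct = df.rank(axis=1, pct=True)             # percentile rank within each row
    return np.ceil(pct.mul(q)).astype('Int64')  # round up to the bin number

# First row of the question's data
row = pd.DataFrame(
    [[0.844135, -0.194530, -0.231046, 0.245615, -0.581238,
      -0.593562, 0.057288, 0.655903, 0.823997, 0.221920]],
    index=['2010-01-19'],
    columns=['AC', 'BO', 'C', 'CCM', 'CL', 'CRD', 'CT', 'DA', 'GC', 'GF'],
)

quintiles = quantile_rank(row, 5)
quartiles = quantile_rank(row, 4)
```

With q=5 this should reproduce the first row of the output above. Tied values receive the average rank by default, so the `method` argument of `rank` may need adjusting if the data contains repeats.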

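Method 2: scipy.stats.percentileofscore

import numpy as np
from scipy import stats

# percentileofscore returns a 0-100 percentile; multiplying by 0.05 rescales it to 0-5
d = df.apply(lambda x: [np.ceil(stats.percentileofscore(x, a, 'rank')*0.05) for a in x], axis=1).values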

pd.DataFrame(data=np.concatenate(d).reshape(d.shape[0], len(d[0])), 
             columns=df.columns, 
             dtype='int', 
             index=df.index)

Output

            AC  BO  C  CCM  CL  CRD  CT  DA  GC  GF
2010-01-19   5   2  2    4   1    1   3   4   5   3
2010-01-20   2   2  5    1   1    3   4   5   3   4
2010-01-21   5   2  2    4   1    1   3   4   5   3
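
To make Method 2 easier to follow, here is a sketch of what it computes for a single cell. The five-value list is made up for illustration (rounded values, not the full row from the question):

```python
import numpy as np
from scipy import stats

# Illustrative values only
values = [0.84, -0.19, -0.23, 0.25, -0.58]

# Percentile (0-100) of 0.25 within the list: 4 of the 5 values are <= 0.25
score = stats.percentileofscore(values, 0.25, kind='rank')

# 0.05 == 5 / 100, so this maps the percentile onto quintiles 1..5
quintile = int(np.ceil(score * 0.05))
```

Method 2 repeats this for every value in every row, which is why it needs the list comprehension and the reshape.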
Erfan
  • Ah, nice. I didn't know `np.ceil` would just work on a dataframe without further coercion. – tel May 27 '19 at 21:53
  • Yes, since the underlying data of DataFrames are arrays. So you can apply a `numpy` function to them. – Erfan May 27 '19 at 22:06
  • Just for your convenience: I knew I had used a method from the `scipy` module for this once, so I added another method that generates the same output. @tel – Erfan May 27 '19 at 22:13
  • Neat. I knew that dataframes wrap NumPy arrays, but I wonder what trickery the NumPy/pandas devs came up with that allows `np.ceil` to return the desired type (i.e. `pd.DataFrame`) instead of a standard `np.ndarray`. – tel May 27 '19 at 22:26
  • Also, one little nitpick: `.astype(int)` doesn't work when you have missing data, since `NaN` is a float. Good news is that so far that's the only edge case failure I found with your solution(s). – tel May 27 '19 at 22:29
  • Yes good point about the `NaN`, added a solution with `nullable integer` type. @tel – Erfan May 27 '19 at 22:33
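
The `NaN` edge case raised in the comments can be reproduced with a small frame. This is a sketch on made-up data (three columns, one missing value), not the question's full dataset:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value
df = pd.DataFrame({'AC': [0.84, -0.20],
                   'BO': [-0.19, np.nan],
                   'C': [-0.23, 0.84]})

ranked = np.ceil(df.rank(axis=1, pct=True).mul(5))

# Applying a NumPy ufunc to a DataFrame gives back a DataFrame
is_frame = isinstance(ranked, pd.DataFrame)

# The plain int cast fails because NaN has no integer representation
try:
    ranked.astype(int)
    int_cast_failed = False
except (ValueError, TypeError):
    int_cast_failed = True

# The nullable integer type keeps the gap as <NA>
nullable = ranked.astype('Int64')
```

Note that `rank(pct=True)` divides by the number of non-missing values in each row, so a row with gaps is binned relative to the values it actually has.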