
I have a numeric column in a dataframe, and I need to categorize each row based on its value. For example,

id   value
1    2.0
2    3.0
3    4.5
4    5.5

I need a new category variable group based on the quantile value of the rows that have come before that row. So, for id=2, it will consider rows 1 and 2 in the quantile calculation. And then do the categorization like:

if value > quantile(90%)                              category = 'Very High'
if value > quantile(75%) & value <= quantile(90%)     category = 'High'
if value > quantile(25%) & value <= quantile(75%)     category = 'Normal'
if value <= quantile(25%)                             category = 'Low'

How would I calculate the quantile like that and do the comparison?
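For reference, here is a literal (unoptimized) sketch of the rules above, assuming pandas and the example data from the question; the names `categorize` and `group` are mine, not from any library:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4], "value": [2.0, 3.0, 4.5, 5.5]})

def categorize(window):
    """Categorize the last value of `window` against quantiles of the
    whole window (the current row plus every row before it)."""
    v = window.iloc[-1]
    q25, q75, q90 = window.quantile([0.25, 0.75, 0.90])
    if v > q90:
        return "Very High"
    elif v > q75:
        return "High"
    elif v > q25:
        return "Normal"
    return "Low"

# expanding windows: row i sees rows 0..i
df["group"] = [categorize(df["value"].iloc[:i + 1]) for i in range(len(df))]
```

Note that for a strictly increasing series, every row after the first falls above its own window's 90% quantile, so everything after the first row comes out 'Very High'.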

sfactor
  • can you give us an expected output? I feel like this will return "Very High" for each value. Are you expecting multiple dataframe outputs or just 1? – Steven G Oct 28 '16 at 13:36

1 Answer

Maybe I don't understand the question very well, but since your series is always increasing, the quantile of the last value will always be 1, so the output should be "Very High" all the time.

This is not super efficient because it creates a Series for each iteration, but it does the trick. You could replace the apply function with something from scipy to calculate the percentile, which would speed up the process.

This is all because DataFrame.expanding() doesn't have a rank() method; if it did, this would have been very straightforward.

import pandas as pd

# sx is the question's "value" column as a Series
sx = pd.Series([2.0, 3.0, 4.5, 5.5])

def ranking(x):
    # percentile rank of the last (current) value within the expanding window
    return pd.Series(x).rank(pct=True).values[-1]

ranked = sx.expanding().apply(lambda x: ranking(x))
pd.cut(ranked, (0, 0.25, 0.75, 0.90, 1), labels=['low', 'Normal', 'High', 'Very High'])
Out[97]: 
0    Very High
1    Very High
2    Very High
3    Very High
dtype: category
Categories (4, object): [low < Normal < High < Very High]

A faster solution with a scipy dependency:

from scipy.stats import rankdata
ranked = sx.expanding().agg(lambda x: rankdata(x)[-1]/len(x))

In[108]: import timeit
In[109]: %timeit sx.expanding().agg(lambda x: rankdata(x)[-1]/len(x))
1000 loops, best of 3: 611 us per loop
In[110]: %timeit sx.expanding().apply(lambda x: ranking(x))
1000 loops, best of 3: 1.15 ms per loop
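Putting the scipy variant end to end, a sketch assuming the same example data as above (the name `groups` is mine):

```python
import pandas as pd
from scipy.stats import rankdata

# the question's "value" column as a Series
sx = pd.Series([2.0, 3.0, 4.5, 5.5])

# percentile rank of the current value within its expanding window
ranked = sx.expanding().agg(lambda x: rankdata(x)[-1] / len(x))

# bin the expanding percentile ranks into the four categories
groups = pd.cut(ranked, (0, 0.25, 0.75, 0.90, 1),
                labels=['low', 'Normal', 'High', 'Very High'])
```

Because the example series is increasing, every value ranks at the top of its own window (percentile 1.0), so every row lands in 'Very High', matching the output of the slower version.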
Steven G
  • maybe this would be faster `from scipy.stats import rankdata ranked = df.expanding().agg(lambda x: rankdata(x)[-1]/len(x))` – Bob Baxley Oct 28 '16 at 14:05