Example DataFrame values (column INCOME of temp):
0 78
1 38
2 42
3 48
4 31
5 89
6 94
7 102
8 122
9 122
stats.percentileofscore(temp['INCOME'].values, 38, kind='mean')
15.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='strict')
10.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='weak')
20.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='rank')
20.0
temp['INCOME'].rank(pct=True)
1 0.20 (showing only the row for the value 38)
temp['INCOME'].quantile(0.11)
37.93
temp['INCOME'].quantile(0.12)
38.31999999999999
Based on the results above, none of these methods is consistent with pandas' quantile() method.
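To make the comparison reproducible, the calls above can be collected into one self-contained script. This is only a sketch: `temp` and `INCOME` are the names used in the snippets above, while `scores`, `rank_pct`, `q11` and `q12` are illustrative names introduced here.

```python
import pandas as pd
from scipy import stats

# Values copied from the example above.
temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})
values = temp["INCOME"].values

# percentileofscore for the value 38 under each supported `kind`.
scores = {
    kind: stats.percentileofscore(values, 38, kind=kind)
    for kind in ("mean", "strict", "weak", "rank")
}

# rank(pct=True) for the row that holds 38 (index 1).
rank_pct = temp["INCOME"].rank(pct=True).loc[1]

# quantile() with its default linear interpolation: 38 falls between
# the results for q=0.11 and q=0.12.
q11 = temp["INCOME"].quantile(0.11)
q12 = temp["INCOME"].quantile(0.12)

print(scores, rank_pct, q11, q12)
```

None of the four `kind` values, nor `rank(pct=True)`, gives a number between 11 and 12 percent, which is where quantile() places the value 38.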
I need to get the percentile of one column for each row in a DataFrame (255M rows), but I can't find any function or method that uses the same 'linear interpolation' method as pandas' quantile() and np.percentile.
I've tried the following methods/functions -
.rank(pct=True)
This method only returns each value's rank as a fraction of the row count; it does not use the interpolation method I'm looking for, so it is inconsistent with pandas' quantile().
scipy.stats.percentileofscore
This method comes closer to what I'm looking for, but for some reason it is still not fully consistent with the 'linear interpolation' method. There is a related question about this problem, but it has no real answer.
I've looked through every SO answer related to this question, but none of them uses the interpolation method I need, so please do not mark this as a duplicate unless you can verify that it uses the same method.
At this point, my last option is to find the bin cut-offs for all 100 percentiles and apply them that way, or to calculate the linear interpolation myself, but this seems very inefficient and would take forever to apply to 255M records.
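A minimal sketch of the "calculate the linear interpolation myself" option, assuming the default linear method places the k-th sorted value (0-based) at q = k / (n - 1), which matches the quantile() results shown earlier. The function name `linear_percentile` is hypothetical, and mapping tied values to the lowest matching q is an assumption, since the inverse is not unique for duplicates.

```python
import numpy as np

def linear_percentile(values):
    """Return, for each value, the q in [0, 1] at which a linear-interpolation
    quantile of the same data equals that value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    sorted_vals = np.sort(values)
    n = sorted_vals.size
    # Under linear interpolation the k-th sorted value sits at q = k / (n - 1).
    grid = np.linspace(0.0, 1.0, n)
    # np.interp needs increasing x-coordinates, so keep one entry per distinct
    # value. Assumption: ties map to the q of their first occurrence.
    uniq, first_idx = np.unique(sorted_vals, return_index=True)
    return np.interp(values, uniq, grid[first_idx])

data = [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]
pct = linear_percentile(data)
print(pct[1])  # percentile of the value 38: 1/9, about 0.111

# Round trip: feeding the result back into np.quantile recovers the data.
print(np.allclose(np.quantile(data, pct), data))
```

The work is one sort plus one vectorized lookup, so nothing is applied row by row in Python; whether that is fast enough for 255M rows would still need to be measured.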
Are there any other suggestions for doing this?
Thanks!