Example DataFrame values:

0     78
1     38
2     42
3     48
4     31
5     89
6     94
7    102
8    122
9    122  

stats.percentileofscore(temp['INCOME'].values, 38, kind='mean')
15.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='strict')
10.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='weak')
20.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='rank')
20.0

temp['INCOME'].rank(pct=True)
1    0.20   (only the row containing the value 38 is shown)

temp['INCOME'].quantile(0.11)
37.93

temp['INCOME'].quantile(0.12)
38.31999999999999
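(For reference, a self-contained setup that reproduces the numbers above; `temp` is built from the example values at the top.)

```python
import pandas as pd
from scipy import stats

# Example data from the question
temp = pd.DataFrame({'INCOME': [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

print(stats.percentileofscore(temp['INCOME'].values, 38, kind='mean'))    # 15.0
print(stats.percentileofscore(temp['INCOME'].values, 38, kind='strict'))  # 10.0
print(stats.percentileofscore(temp['INCOME'].values, 38, kind='weak'))    # 20.0
print(temp['INCOME'].quantile(0.11))                                      # 37.93
```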

Based on the results above, none of these methods is consistent
with the pd.Series.quantile() method.

I need to get the percentile of one column's value for each row of a DataFrame (255M rows), but I can't find any function/method that uses the same 'linear interpolation' scheme that pd.quantile and np.percentile use.

I've tried the following methods/functions -

.rank(pct=True)

This method only returns the values ranked in order, not the percentile method I'm looking for, and is inconsistent with pd.quantile.

scipy.stats.percentileofscore  

This method comes closer to what I'm looking for, but for some reason is still not 100% consistent with the 'linear interpolation' method. A related question on this problem has no real answer.

I've looked through every SO answer related to this question, but none of them use the interpolation method I need, so please don't mark this as a duplicate unless you can verify they use the same method.

At this point my last option is to find the bin cut-offs for all 100 percentiles and apply them that way, or to calculate the linear interpolation myself, but both seem very inefficient and will take forever over 255M records.

Any other suggestions to do this?

Thanks!

bbennett36
  • The bin-cutoffs would be very simple, unless I'm missing something: `pd.qcut(df.col_name, q=100)` – ALollz Jun 11 '18 at 20:13
  • @ALollz Yes, that would work. However I would need to match all 255M records to the specific bin based on values for 3 different columns. I was hoping for a more efficient way but this may be what I need to do. I'm not sure of the best way to go about matching into those bins though. – bbennett36 Jun 11 '18 at 21:08
  • Not totally sure I understand, but perhaps use `stack` first to get just one column of all of the values you need to find the percentiles for? you should then be able to group based on the output of `pd.qcut` or just group based on that to begin with and do some calculation on each percentile without explicitly creating them. – ALollz Jun 11 '18 at 21:11
  • It's not surprising you get different answers with 10 rows, but I'd think all the answers converge (at least almost) with 255 million rows, no? I don't know this specific case but it's common to get slightly different answers for `qcut` between various programs (sas/stata/r/etc) as you have fewer rows and also as you have mass points. E.g. if 25% of your values are, say, zero or one, then different programs can take different strategies for splitting the zeros or ones between groups (since they inevitably span 2 or more groups). – JohnE Jun 12 '18 at 09:38
  • Can you be more explicit about exactly what percentile value you expect for *all* of the input values? – BrenBarn Jun 16 '18 at 07:27
  • @BrenBarn the linear interpolation percentile method as stated. – bbennett36 Jun 19 '18 at 00:44
  • @bbennett36: The linear percentile method is a means of computing the *percentile* from the *data values*. But your question asks about computing a new *data value* from a given *percentile*. I gather from your question that what you are looking for is the *inverse* of the linear-interpolation percentile method, but you don't say this explicitly. Is that correct? – BrenBarn Jun 19 '18 at 18:37

3 Answers


TL;DR

Use

sz = temp['INCOME'].size-1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)

   INCOME    PCNT_LIN
0      78   44.444444
1      38   11.111111
2      42   22.222222
3      48   33.333333
4      31    0.000000
5      89   55.555556
6      94   66.666667
7     102   77.777778
8     122  100.000000
9     122  100.000000

Answer

It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is realizing that you need the percentage of values that are less than or equal to the selected value. This is exactly what kind='weak' of scipy.stats.percentileofscore() and method='max' of DataFrame.rank() do. To invert it, run Series.quantile() with interpolation='lower'.

So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, see below:

In[]:
import pandas as pd
import scipy.stats

temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
temp['RANK'] = temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x - 1) / sz)
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))

temp

Out[]:
   INCOME  PCNT_RANK    POF  QUANTILE_VALUE  RANK  PCNT_LIN    CHK
0      78        0.5   50.0              78   5.0  0.444444   78.0
1      38        0.2   20.0              38   2.0  0.111111   38.0
2      42        0.3   30.0              42   3.0  0.222222   42.0
3      48        0.4   40.0              48   4.0  0.333333   48.0
4      31        0.1   10.0              31   1.0  0.000000   31.0
5      89        0.6   60.0              89   6.0  0.555556   89.0
6      94        0.7   70.0              94   7.0  0.666667   94.0
7     102        0.8   80.0             102   8.0  0.777778  102.0
8     122        1.0  100.0             122  10.0  1.000000  122.0
9     122        1.0  100.0             122  10.0  1.000000  122.0

Now the column PCNT_RANK holds the ratio of values that are smaller than or equal to the one in the INCOME column. If you want the "interpolated" ratio, it is in the column PCNT_LIN. And because Series.rank() does the heavy lifting, it is pretty fast and will crunch your 255M numbers in seconds.


Here is how quantile() arrives at a value using linear interpolation:

temp['INCOME'].quantile(0.11)
37.93

Our data temp['INCOME'] has only ten values. According to the formula from the Wikipedia link, the rank of the 11th percentile is

rank = 11*(10-1)/100 + 1 = 1.99

The truncated (integer) part of the rank is 1, which corresponds to the value 31; the value with rank 2 (i.e. the next bin) is 38. The fractional part of the rank, 0.99, is the interpolation weight. This leads to the result:

 31 + (38-31)*(0.99) = 37.93

For the data values themselves the fractional part is zero, so the inverse calculation that recovers the percentile is simple:

p = (rank - 1)*100/(10 - 1)

I hope this makes it clearer.
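As a quick numeric check of that inverse formula (a sketch using the same ten example values; `recovered` is just an illustrative name):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'INCOME': [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})
n = len(temp)

# rank(method='max') counts how many values are <= x (the "weak" rank)
rank = temp['INCOME'].rank(method='max')
p = (rank - 1) / (n - 1)  # inverse of the linear-interpolation formula

# Feeding these percentiles back into quantile() recovers the original values
recovered = p.apply(lambda q: temp['INCOME'].quantile(q))
print(np.allclose(recovered, temp['INCOME']))  # True
```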

igrinis
  • Thank you for this solution and in depth explanation! This works and is very quick as well. – bbennett36 Jun 20 '18 at 14:23
  • How can I do this only taking in the previous 'n' rows in the calculation? For example, if I have 100 entries but I want to calculate percentile on a rolling basis of 10 rows, how would I go about that? – User9123 Sep 10 '22 at 05:06
  • Consider using `.rolling(window_size)` before `rank()` – igrinis Sep 12 '22 at 11:13

This seems to work:

import numpy as np

A = np.sort(temp['INCOME'].values)
# `sample` holds the value(s) whose percentiles you want
np.interp(sample, A, np.linspace(0, 1, len(A)))

For example:

>>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
0.103175     37.5
0.111111     38.0
1.000000    122.0
0.883333    121.0
Name: INCOME, dtype: float64

Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
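To score every row of the column in one vectorized pass, the same idea can be applied to the whole column (a sketch; `PCNT` is just an illustrative column name):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'INCOME': [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

A = np.sort(temp['INCOME'].values)
# Map each value onto [0, 1] by linear interpolation over the sorted data
temp['PCNT'] = np.interp(temp['INCOME'], A, np.linspace(0, 1, len(A)))
print(temp['PCNT'].iloc[1])  # value 38 maps to 1/9 = 0.111...
```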

Paul Panzer

Let's consider the DataFrame below:

[image: example survey DataFrame]

To get the proportion (relative frequency) of each value in a pandas DataFrame column, use:

 survey['Nationality'].value_counts(normalize=True)

Output:

USA          0.333333
China        0.250000
India        0.250000
Bangadesh    0.166667
Name: Nationality, dtype: float64

To get the distribution of one categorical column with respect to another, use:

pd.crosstab(survey.Sex,survey.Handedness,normalize = 'index')

The output would look something like:

[image: row-normalized crosstab output]
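Since the screenshots are not reproduced here, a minimal self-contained version with made-up data (the column values below are assumptions, not the original survey):

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame in the screenshots
survey = pd.DataFrame({
    'Sex':        ['M', 'F', 'M', 'F', 'M', 'F'],
    'Handedness': ['R', 'R', 'L', 'R', 'R', 'L'],
})

# Share of each handedness within each sex (each row sums to 1)
tab = pd.crosstab(survey.Sex, survey.Handedness, normalize='index')
print(tab)
```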