0

I have a dataframe:

d = [f1  f2  f3 
     1    2   3 
     5    1   2 
     3    3   1 
     2    4   7
     ..  ..  ..]

For a subset of features, I want to add, per feature, the percentile of each row's value within that feature. So for subset = [f1, f2, f3] my dataframe will be

new_d =[f1   f2   f3  f1_per   f2_per   f3_per
         1    2   3    0         0.25     0.5
         5    1   2    1          0       0.25
         3    3   1    0.5        0.5     0
         2    4   5    0.25      0.75     1
         4    5   4    0.75       1       0.75]

What is the best way to do so?

Cranjis
  • Could you explain how the percentile is computed here? What is `q` in the percentile function? Is it [`DataFrame.quantile`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)? – jezrael Nov 30 '20 at 08:15
  • @jezrael yep, meaning what percentage of the values of this feature are smaller than the value in the specific row – Cranjis Nov 30 '20 at 08:21
  • @jezrael sorry, not sure I understand? For each number, I want to know the ratio of numbers that are smaller. So, if the axis is [1,2,3,4,5], then the number 4 gets 0.75, since 3/4 of the other numbers are smaller than it. – Cranjis Nov 30 '20 at 08:28
  • Yes, I was not sure, because the input data has 4 rows and the output has 5 rows, so my output differs from the one in your data (if I understand what you need). – jezrael Nov 30 '20 at 08:29

3 Answers

2

A way to do this is the following:

df['pct_f1'] = df.f1.rank(pct=True)
df['pct_f2'] = df.f2.rank(pct=True)
df['pct_f3'] = df.f3.rank(pct=True)

which gives:

   f1  f2  f3  pct_f1  pct_f2  pct_f3
0   1   2   3    0.25    0.50    0.75
1   5   1   2    1.00    0.25    0.50
2   3   3   1    0.75    0.75    0.25
3   2   4   7    0.50    1.00    1.00
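
Since the question asks for a subset of features, the same idea can be applied to a list of columns in one call rather than one assignment per column. This is only a sketch: the `subset` list and the `_per` suffix are assumptions chosen to mirror the question's naming, and the frame is the four-row input from the question.

```python
import pandas as pd

# The four rows from the question's input frame.
df = pd.DataFrame({"f1": [1, 5, 3, 2], "f2": [2, 1, 3, 4], "f3": [3, 2, 1, 7]})

# Hypothetical subset of features to rank; other columns are left untouched.
subset = ["f1", "f3"]

# rank(pct=True) divides each rank by the number of non-null values in the column.
pct = df[subset].rank(pct=True).add_suffix("_per")

# Join the new columns back onto the original frame by index.
out = df.join(pct)
print(out)
```

Note that `rank(pct=True)` gives the smallest value 1/n rather than 0, so it does not match the "share of smaller values" definition from the comments exactly.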
Mayank Porwal
2

Your expected output has 5 rows but the input has 4, so the output here differs from yours. Use `DataFrame.rank` on all columns and join the result back to the original:

df = df.join(df.rank(pct=True).add_prefix('pct'))
print (df)
   f1  f2  f3  pctf1  pctf2  pctf3
0   1   2   3    0.2    0.4    0.6
1   5   1   2    1.0    0.2    0.4
2   3   3   1    0.6    0.6    0.2
3   2   4   7    0.4    0.8    1.0
4   4   5   4    0.8    1.0    0.8

If you need the percentile rank scaled by the number of rows minus 1, so that the smallest value maps to 0 and the largest to 1:

df = df.join(df.rank().sub(1).div(len(df) - 1).add_prefix('pct'))
print (df)
   f1  f2  f3  pctf1  pctf2  pctf3
0   1   2   3   0.00   0.25   0.50
1   5   1   2   1.00   0.00   0.25
2   3   3   1   0.50   0.50   0.00
3   2   4   7   0.25   0.75   1.00
4   4   5   4   0.75   1.00   0.75
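
One caveat with the rank-based formula is how ties are handled. By default `rank()` uses `method="average"`, so repeated values share the mean of their ranks; with `method="min"` each rank is 1 plus the number of strictly smaller values, which is closer to the "ratio of numbers that are smaller" definition given in the comments. The sketch below uses a made-up column with a duplicate to show the difference; the data is an assumption, not taken from the question.

```python
import pandas as pd

# Hypothetical column with a repeated value, to show how ties are treated.
s = pd.Series([1, 2, 2, 4, 5], name="f1")

# Default method="average": tied values share the mean of their ranks.
avg = s.rank().sub(1).div(len(s) - 1)

# method="min": rank is 1 + the number of strictly smaller values,
# so (rank - 1) / (n - 1) is the share of the other values that are smaller.
strict = s.rank(method="min").sub(1).div(len(s) - 1)

print(pd.DataFrame({"f1": s, "average": avg, "min": strict}))
```

When the column has no duplicates, both variants give the same result.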
jezrael
0

Here is another approach, doing explicitly what you want:

import numpy as np

res = df.apply(lambda x: np.greater.outer(x.values, x.values).sum(axis=1) / (len(x) - 1))
res.columns = [f'{c}_per' for c in df.columns]
res = df.join(res)
print(res)

Output

   f1  f2  f3  f1_per  f2_per  f3_per
0   1   2   3    0.00    0.25    0.50
1   5   1   2    1.00    0.00    0.25
2   3   3   1    0.50    0.50    0.00
3   2   4   7    0.25    0.75    1.00
4   4   5   4    0.75    1.00    0.75
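
As a sanity check, the pairwise comparison can be compared against a rank-based version. The sketch below assumes the five-row frame shown in the output above. Keep in mind that `np.greater.outer` builds an n x n boolean matrix per column, so for long frames the rank-based form is likely to be lighter on memory.

```python
import numpy as np
import pandas as pd

# The five rows shown in the output above.
df = pd.DataFrame({"f1": [1, 5, 3, 2, 4], "f2": [2, 1, 3, 4, 5], "f3": [3, 2, 1, 7, 4]})

# Pairwise comparison: for each value, count how many values in the column are smaller.
outer = df.apply(lambda x: np.greater.outer(x.values, x.values).sum(axis=1) / (len(x) - 1))

# Rank-based equivalent: rank(method="min") - 1 is the number of strictly smaller values.
ranked = df.rank(method="min").sub(1).div(len(df) - 1)

# Both frames should hold the same numbers.
print(np.allclose(outer.to_numpy(), ranked.to_numpy()))
```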
Dani Mesejo