Compute percentile rank relative to a given population

Question

I have a "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set of scores (say, np.array([0.3, 0.5, 0.7])).

It is easy to compute one by one:

def percentile_rank(x):
    return (v<x).sum() / len(v)
percentile_rank(0.4)
=> 0.4

(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)

np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33  0.48  0.71]

This produces the expected results, but I have a feeling that there should be a built-in for this.

I can also cheat:

pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]

0    0.330097
1    0.485437
2    0.718447

This is bad on two counts:

I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.

So, what is the idiomatic way to accomplish this?

MaxU - stand with Ukraine · Accepted Answer · 2018-01-24T22:08:25.397

4

Setup:

In [62]: v=np.random.rand(100)

In [63]: x=np.array([0.3, 0.4, 0.7])

Using Numpy broadcasting:

In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18,  0.28,  0.6 ])

Check:

In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999

In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003

In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
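The broadcasting expression above builds a len(x) by len(v) boolean matrix, which can get large for big inputs. As a hedged alternative (not part of the original answer), the same strict "fraction below" rank can be sketched with np.searchsorted on a sorted copy of the population. The seed and the names by_broadcast and by_search are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
v = rng.random(100)              # reference population
x = np.array([0.7, 0.3, 0.4])    # scores to rank; they need not be sorted

# Broadcasting: compare every score against every population value,
# then average each row of the resulting boolean matrix.
by_broadcast = (v < x[:, None]).mean(axis=1)

# Alternative: sort the population once, then binary-search each score.
# side="left" returns the number of population values strictly below the score.
v_sorted = np.sort(v)
by_search = np.searchsorted(v_sorted, x, side="left") / len(v)

print(by_broadcast)
print(by_search)
```

Both arrays should agree; the searchsorted version trades the full comparison matrix for one sort of v.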

edited Jan 24 '18 at 22:08

answered Jan 24 '18 at 22:02

MaxU - stand with Ukraine


when both `v` and `x` are `Series` (columns in a `DataFrame`), I get `ValueError: Lengths must match to compare`. – sds Jan 24 '18 at 22:15

@sds, in this case you can do it this way: `(v.values – MaxU - stand with Ukraine Jan 24 '18 at 22:16
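For the Series case raised in these comments, one possible workaround is to drop down to plain NumPy arrays before broadcasting. This is a minimal sketch under that assumption and not necessarily the expression the reply had in mind; the Series v and x and the name ranks are hypothetical.

```python
import numpy as np
import pandas as pd

v = pd.Series(np.random.rand(100))   # reference population as a Series
x = pd.Series([0.3, 0.4, 0.7])       # scores to rank, also a Series

# Convert both to NumPy arrays so the comparison broadcasts by shape
# instead of being aligned element by element by pandas.
ranks = (v.to_numpy() < x.to_numpy()[:, None]).mean(axis=1)

# Re-attach the index of x so the result lines up with the scores.
ranks = pd.Series(ranks, index=x.index)
print(ranks)
```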

score 2 · Answer 2 · answered Jan 24 '18 at 21:59


I think pd.cut can do that

s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]: 
[-inf, 0.3)    0.37
[0.3, 0.5)     0.54
[0.5, 0.7)     0.71
dtype: float64

Result from your function

np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])


BENY


this seems to rely on the test scores being sorted. I would rather avoid that if possible. – sds Jan 24 '18 at 22:16
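If the scores are not sorted, one way to keep the pd.cut idea is to sort them first and then undo the sort on the result. The following is only a sketch: it assumes the scores are distinct (pd.cut needs unique bin edges), and the names order, counts and ranks are hypothetical.

```python
import numpy as np
import pandas as pd

v = np.random.rand(100)            # reference population
x = np.array([0.7, 0.3, 0.5])      # unsorted, distinct scores

order = np.argsort(x)              # positions that would sort x
bins = [-np.inf, *x[order]]        # increasing edges, as pd.cut requires

# Bin code of each population value; values >= max(x) fall outside
# every bin and get code -1, so they are dropped before counting.
codes = pd.cut(v, bins, right=False).codes
counts = np.bincount(codes[codes >= 0], minlength=len(x))

# The cumulative count up to each edge is the number of values below that score.
ranks = np.empty(len(x))
ranks[order] = counts.cumsum() / len(v)   # scatter back to the original order
print(ranks)
```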

score 2 · Answer 3 · answered Jan 24 '18 at 21:59


You can use quantile:

np.random.seed(123)
v=np.random.rand(100)

s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])

s.quantile(arr)

Output:

0.3    0.352177
0.5    0.506130
0.7    0.644875
dtype: float64


Scott Boston



I think this is the _inverse_ of the function I am looking for. – sds Jan 24 '18 at 22:10
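The inverse relationship mentioned in this comment can be checked with a short round trip: take quantiles of the population, then compute the percentile rank of those values. This is a sketch assuming the default linear interpolation of Series.quantile; the names q, values and back are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
v = np.random.rand(100)
q = np.array([0.3, 0.5, 0.7])

# quantile: probability -> value in the population's range
values = pd.Series(v).quantile(q).to_numpy()

# percentile rank: value -> fraction of the population strictly below it
back = (v < values[:, None]).mean(axis=1)

# The round trip recovers q up to the resolution of the sample (1 / len(v)).
print(np.abs(back - q).max())
```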

score 0 · Answer 4 · answered May 12 '22 at 16:37


I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
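Note that Series.rank ranks the elements of a Series against each other, so on its own it gives ranks for the reference population rather than for external scores. A minimal sketch of what pct=True computes, assuming method="max" so that each rank counts the values less than or equal to the element; the names own_ranks and by_hand are hypothetical.

```python
import numpy as np
import pandas as pd

v = np.random.rand(100)

# Percentile rank of every population member within the population itself.
own_ranks = pd.Series(v).rank(pct=True, method="max").to_numpy()

# The same quantity by hand: fraction of values <= each element.
by_hand = (v[None, :] <= v[:, None]).mean(axis=1)

print(np.allclose(own_ranks, by_hand))
```

To rank scores that are not in the population this way, they would have to be appended to the Series first, which is the approach the question describes as undesirable.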


Raisin

