
Consider the following Python code:

In [1]: import numpy as np
In [2]: import scipy.stats as stats
In [3]: ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
In [4]: x = ar[-1]
In [5]: stats.percentileofscore(ar, x, kind='strict')
Out[5]: 30.0
In [6]: stats.percentileofscore(ar, x, kind='rank')
Out[6]: 40.0
In [7]: stats.percentileofscore(ar, x, kind='weak')
Out[7]: 40.0
In [8]: stats.percentileofscore(ar, x, kind='mean')
Out[8]: 35.0

The `kind` argument selects how the resulting score is interpreted.
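For reference, the four `kind` options can be reproduced by hand from two counts, following the counting definitions in the SciPy documentation (a sketch using the same array and score as above):

```python
import numpy as np

ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
               0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
x = ar[-1]
n = len(ar)

left = np.count_nonzero(ar < x)    # values strictly below x: 3
right = np.count_nonzero(ar <= x)  # values at or below x: 4

print(left * 100.0 / n)                            # 'strict' -> 30.0
print(right * 100.0 / n)                           # 'weak'   -> 40.0
print((left + right + (right > left)) * 50.0 / n)  # 'rank'   -> 40.0
print((left + right) * 50.0 / n)                   # 'mean'   -> 35.0
```

The 'rank' kind is the mean rank of the tied values; with a single match, as here, it coincides with 'weak'.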

Now when I use Excel's PERCENTRANK function with the same data, I get 0.3333. This appears to be correct as there are 3 values less than x=0.3963.

Can someone explain why I'm getting inconsistent results?

Jason Strimpel
  • I don't think 1/3 is a reasonable result for an input array of length 10. (Note that `In[6]` was probably using `kind='rank'`. I don't think you got two different results with the same parameters.) – Sven Marnach Nov 15 '11 at 15:21
  • You're right, I was using `kind='rank'` in `In[6]`. Copy/paste error. I edited the post. – Jason Strimpel Nov 15 '11 at 15:32
  • Well, as I said before: I think all the results you showed are as expected, but Excel's result is strange. Since I don't know what you did in Excel, I can't comment on the strange result. – Sven Marnach Nov 15 '11 at 15:40
  • I entered `0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963` in column A so that 0.8389 is in A1 through 0.3963 in A10. Then I did =PERCENTRANK(A1:A10,A10) which returned 0.3333. It appears that Scipy does `(number of values above X) / (total values)` which in this case is `3/10=0.3` where Excel does `(number of values above X) / (total values-1)` which in this case is `3/9=0.3333`. – Jason Strimpel Nov 15 '11 at 15:49
  • is it me, or is the first example http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html#scipy.stats.percentileofscore incorrect? It should be 2/3 if it doesn't count the value, or 1/2 if it does. Correct me if I've misunderstood – Jdog Nov 15 '11 at 16:28
  • Jdog, looks fine to me, 3 is the 3rd element of 4, 2/4. http://mail.scipy.org/pipermail/scipy-dev/2008-November/010216.html – Josef Nov 15 '11 at 17:21
  • I was too slow to edit previous comment http://en.wikipedia.org/wiki/Percentile_rank – Josef Nov 15 '11 at 17:28

2 Answers


This is a weird one. As near as I can tell, they are doing different calculations; SciPy will reproduce the Excel result if called this way:

In [1]: import numpy as np
In [2]: import scipy.stats as stats
In [3]: ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
In [4]: x = ar[-1]
In [5]: stats.percentileofscore(ar[:-1], x, kind='mean')
Out[5]: 33.333333333333336

Using any of the `kind` keywords here gives the same answer, because the value exactly equal to the query has been left out of the data. Have a look at this PercentRank algorithm in VBA, as it might offer a bit of insight.
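For benchmarking, here is a hypothetical helper (the name `excel_percentrank` is mine) that hard-codes the definition Excel documents for PERCENTRANK when the score is a member of the array, i.e. (values strictly below) / (n - 1); Excel's interpolation for scores not in the array is left out:

```python
import numpy as np

def excel_percentrank(a, score):
    # Excel-style rank for a score that appears in `a`:
    # (count of values strictly below the score) / (len(a) - 1).
    # Excel interpolates when the score is absent; that case is omitted here.
    a = np.asarray(a)
    return np.count_nonzero(a < score) / (len(a) - 1)

ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
               0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
print(round(excel_percentrank(ar, ar[-1]), 4))  # 0.3333, matching Excel
```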

So which is right? Excel or Scipy?

Brian Larsen

When I rewrote this function in scipy.stats, I found many different definitions; some of them are included as the `kind` options.

The basic example is ranking students by score. In that case the scores include all students, and percentileofscore gives the rank among all students. The main distinction then is just how to handle ties.
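A toy example of that tie handling: with two students tied at a score of 2, the four kinds spread out (values worked out from the counting definitions in the SciPy docs):

```python
import scipy.stats as stats

scores = [1, 2, 2, 3]  # two students tied at 2
print(stats.percentileofscore(scores, 2, kind='strict'))  # 25.0 (strictly below)
print(stats.percentileofscore(scores, 2, kind='weak'))    # 75.0 (at or below)
print(stats.percentileofscore(scores, 2, kind='mean'))    # 50.0 (average of strict and weak)
print(stats.percentileofscore(scores, 2, kind='rank'))    # 62.5 (mean rank of the tied values)
```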

Excel seems to compute how you would rank a student relative to an existing scale, for example the rank of a score on the historical GRE scale. I have no idea whether Excel drops one entry if the score is not in the existing list.

A similar problem in statistics is "plotting positions" for quantiles. I can't find a good reference on the internet. Here is one general formula: http://amsglossary.allenpress.com/glossary/search?id=plotting-position1 Wikipedia has only a short paragraph: http://en.wikipedia.org/wiki/Q-Q_plot#Plotting_positions

The literature contains a large number of different choices for b (or even for a second parameter a) that correspond to different approximations for different distributions. Several are implemented in scipy.stats.mstats.
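As a sketch, assuming the standard one-parameter family p_k = (k - b) / (n + 1 - 2b) for the k-th smallest of n points: b = 0 gives Weibull's k/(n+1), b = 0.5 gives Hazen's (k - 0.5)/n, and b = 1 gives (k - 1)/(n - 1), the same denominator Excel used in the question.

```python
import numpy as np

def plotting_positions(n, b=0.0):
    # p_k = (k - b) / (n + 1 - 2*b) for k = 1..n
    # b=0 -> Weibull k/(n+1); b=0.5 -> Hazen (k-0.5)/n; b=1 -> (k-1)/(n-1)
    k = np.arange(1, n + 1)
    return (k - b) / (n + 1 - 2 * b)

# With b=1, the 4th smallest of 10 values sits at 3/9 = 0.3333,
# the number Excel's PERCENTRANK returned in the question.
print(round(plotting_positions(10, b=1.0)[3], 4))  # 0.3333
```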

I don't think it's a question of which is right. It's: what do you want to use it for, and what's the common definition for your problem or your field?

Josef
  • I am using the code in a financial application. Basically I am building indicators and taking the percentileofscore of a user-defined window. The indicator then says buy or sell. The funny thing is that I am trying to replicate code in VBA, so unfortunately, unit tests are failing because of the differences between scipy and Excel. And when backtesting, the differences create large discrepancies in P/L over time. – Jason Strimpel Nov 16 '11 at 04:28
  • It's a good point that it's less of a question of which is right, but there are four methods provided in scipy. None of them replicate Excel's algo? – Jason Strimpel Nov 16 '11 at 04:29
  • I was looking for the behavior of Excel (which I don't have), but it looks like they also changed their behavior: http://office.microsoft.com/en-us/excel-help/percentrank-inc-function-HP010335694.aspx It would be possible to write a compatible function from scratch, but except for benchmarking against VBA, I'm not sure it makes much sense. – Josef Nov 16 '11 at 13:08
  • Agreed. I think it's just a matter of working with the customer so that they understand the differences. – Jason Strimpel Nov 16 '11 at 15:45