12

When I try to calculate the 75th percentile in MATLAB, I get a different value than I do in NumPy.

MATLAB:

>> x = [ 11.308 ;   7.2896;   7.548 ;  11.325 ;   5.7822;   9.6343;
     7.7117;   7.3341;  10.398 ;   6.9675;  10.607 ;  13.125 ;
     7.819 ;   8.649 ;   8.3106;  12.129 ;  12.406 ;  10.935 ;
    12.544 ;   8.177 ]

>> prctile(x, 75)

ans =

11.3165

Python + NumPy:

>>> import numpy as np

>>> x = np.array([ 11.308 ,   7.2896,   7.548 ,  11.325 ,   5.7822,   9.6343,
     7.7117,   7.3341,  10.398 ,   6.9675,  10.607 ,  13.125 ,
     7.819 ,   8.649 ,   8.3106,  12.129 ,  12.406 ,  10.935 ,
    12.544 ,   8.177 ])

>>> np.percentile(x, 75)
11.312249999999999

I've checked the answer with R too, and I'm getting NumPy's answer.

R:

> x <- c(11.308 ,   7.2896,   7.548 ,  11.325 ,   5.7822,   9.6343,
+          7.7117,   7.3341,  10.398 ,   6.9675,  10.607 ,  13.125 ,
+          7.819 ,   8.649 ,   8.3106,  12.129 ,  12.406 ,  10.935 ,
+         12.544 ,   8.177)
> quantile(x, 0.75)
     75% 
11.31225 

What is going on here? And is there any way to make Python & R's behavior mirror MATLAB's?

James
  • 2,635
  • 5
  • 23
  • 30
  • 2
    Can you tell us the formula MATLAB is using? R has [9 different ways](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html) to calculate quantiles. It seems MATLABs answer for the 75th matches `quantile(x, 0.75, type=2)` and `quantile(x, 0.75, type=5)` in R. – MrFlick Jul 15 '14 at 17:59
  • Sure -- from the MATLAB help page: (I can't comment it here because it is too long) http://www.mathworks.com/help/stats/prctile.html (You might need to expand the "algorithms" button at the bottom) – James Jul 15 '14 at 18:02

2 Answers2

13

MATLAB apparently uses midpoint interpolation by default. NumPy and R use linear interpolation by default:

In [182]: np.percentile(x, 75, interpolation='linear')
Out[182]: 11.312249999999999

In [183]: np.percentile(x, 75, interpolation='midpoint')
Out[183]: 11.3165

The understand the difference between linear and midpoint, consider this simple example:

In [187]: np.percentile([0, 100], 75, interpolation='linear')
Out[187]: 75.0

In [188]: np.percentile([0, 100], 75, interpolation='midpoint')
Out[188]: 50.0

To compile the latest version of NumPy (using Ubuntu):

mkdir $HOME/src
git clone https://github.com/numpy/numpy.git
git remote add upstream https://github.com/numpy/numpy.git
# Read ~/src/numpy/INSTALL.txt
sudo apt-get install libatlas-base-dev libatlas3gf-base
python setup.py build --fcompiler=gnu95
python setup.py install

The advantage of using git instead of pip is that it is super easy to upgrade (or downgrade) to other versions of NumPy (and you get the source code too):

git fetch upstream
git checkout master # or checkout any other version of NumPy
cd ~/src/numpy
/bin/rm -rf build
cdsitepackages    # assuming you are using virtualenv; otherwise cd to your local python sitepackages directory
/bin/rm -rf numpy numpy-*-py2.7.egg-info
cd ~/src/numpy
python setup.py build --fcompiler=gnu95
python setup.py install
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Hmm -- what version of NumPy are you using? I don't have the option to add an interpolation keyword parameter. (I've tried numpy and the scipy.stats.scoreatpercentile) – James Jul 15 '14 at 18:09
  • @James: [The interpolation parameter](http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html) was added in version 1.9.0. – unutbu Jul 15 '14 at 18:11
  • is there somewhere I could access version 1.9.0? (Or the latest one?) Pip seems to be installing 1.8.1 even if I use the --upgrade flag – James Jul 15 '14 at 18:21
  • 1
    `pip install -U numpy` should have worked AFAIK. However, I posted an alternative method using git to compile/install the latest version. – unutbu Jul 15 '14 at 18:49
  • 2
    That is not true for edge cases. Matlab uses midpoint interpolation for everything in between the percentile assigned to the lowest and highest data point. For percentiles outside of this range it assigns the minimum and maximum of the dataset. [source](http://de.mathworks.com/help/stats/prctile.html) – cpaulik Aug 18 '15 at 09:38
  • Downvoted until the info supplied by @cpaulik is included in the answer. – Oleg Aug 27 '15 at 17:42
  • 1
    This is not incomplete, it is **WRONG**. Matlab uses linear interpolation by default (see http://nl.mathworks.com/help/stats/prctile.html) and not midpoint. It is just a different variant of linear interpolation than numpy. See https://en.wikipedia.org/wiki/Percentile and https://en.wikipedia.org/wiki/Quantile for the different variants of linear interpolation. – Gaëtan de Menten Mar 02 '18 at 15:46
11

Since the accepted answer is still incomplete even after @cpaulik's comment, I'm posting here what is hopefully a more complete answer (although, for brevity reasons, not perfect, see below).

Using np.percentile(x, p, interpolation='midpoint') is only going to give the same answer for very specific values, namely when p/100 is a multiple of 1/n, n being the number of elements of the array. In the original question, this was indeed the case, since n=20 and p=75, but in general the two functions differ.

A short emulation of Matlab's prctile function is given by:

def quantile(x,q):
    n = len(x)
    y = np.sort(x)
    return(np.interp(q, np.linspace(1/(2*n), (2*n-1)/(2*n), n), y))

def prctile(x,p):
    return(quantile(x,np.array(p)/100))

This function, as Matlab's one, gives a piecewise linear output spanning from min(x) to max(x). Numpy's percentile function, with interpolation=midpoint, returns a piecewise constant function between the average of the two smallest elements and the average of the two largest ones. Plotting the two functions for the array in the original question gives the picture in this link (sorry can't embed it). The dashed red line marks the 75% percentile, where the two functions actually coincide.

P.S. The reason why this function is not actually equivalent to Matlab's one is that it only accepts a one-dimensional x, giving error for higher dimensional stuff. Matlab's one, on the other hand, accepts a higher dim x and operates on the first (non trivial) dimension, but implementing it correctly would probably take a bit longer. However, both this and Matlab's function should correctly work with higher dimensional inputs for p / q (thanks to the usage of np.interp that takes care of it).

Marco Spinaci
  • 1,750
  • 15
  • 22
  • Hi Marco, I am very interested in your answer but with your function I am getting static output. – MysterioProgrammer91 Mar 28 '17 at 01:40
  • I'm sorry, what do you mean by a static output? This function accepts as input a 1-dimensional array and a number and outputs a number, for example: `prctile(np.arange(100),1)` that outputs 0.5. It does not work for higher dimensional arrays. If you pass to it a multi dimensional q (or array-like) it returns an array of the same shape, e.g. `prctile(np.arange(100),[1, 3])` outputs [0.5, 2.5]. – Marco Spinaci Mar 28 '17 at 09:41