For a given dataset in a data frame, when I apply the describe function, I get the basic stats which include min, max, 25%, 50% etc.

For example:

import pandas as pd

data_1 = pd.DataFrame({'One': [4, 6, 8, 10]})
data_1.describe()

The output is:

             One
count   4.000000
mean    7.000000
std     2.581989
min     4.000000
25%     5.500000
50%     7.000000
75%     8.500000
max    10.000000

My question is: What is the mathematical formula to calculate the 25%?

1) Based on what I know, it is:

formula = percentile * n (n is number of values)

In this case:

25/100 * 4 = 1

So the value at the first position is 4, but according to the describe function it is 5.5.

2) Another example says that if you get a whole number, you take the average of 4 and 6, which would be 5. That still does not match the 5.5 given by describe.

3) Another tutorial says to take the difference between the two numbers, multiply it by 25%, and add the result to the lower number:

25/100 * (6-4) = 1/4*2 = 0.5

Adding that to the lower number: 4 + 0.5 = 4.5

Still not getting 5.5.

Can someone please clarify?

IanS
Gublooo
  • isn't this the `(max - min)/4`? so 10-4 = 6 then divide by 4 gives 1.5 which you then set as the interval between 4 and 10? – EdChum Sep 19 '16 at 08:02
  • I think it internally uses numpy, check the percentile code here https://github.com/numpy/numpy/blob/b91e8d8f164731bb710cc1e5173cc8ec3f8fadf5/numpy/lib/function_base.py#L3796 – Vikas Madhusudana Sep 19 '16 at 08:06
  • The beauty of open source is that you can check the code yourself. According to the [code of `describe`](https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/core/generic.py#L5181), it calls series' [`quantile` method](https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/core/series.py#L1345). The docstring has your answer. – IanS Sep 19 '16 at 08:12

2 Answers

The pandas documentation on computing quantiles refers to numpy.percentile:

Return value at the given quantile, a la numpy.percentile.

The numpy.percentile documentation, in turn, shows that the interpolation method defaults to linear:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

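For your specific case, the fractional index of the 25th percentile is 0.25 * (4 - 1) = 0.75, which lies between i = 4 (index 0) and j = 6 (index 1) with a fraction of 3/4:

res_25 = 4 + (6-4)*(3/4) = 5.5

For the 75th percentile, the index is 0.75 * (4 - 1) = 2.25, which lies between 8 and 10 with a fraction of 1/4:

res_75 = 8 + (10-8)*(1/4) = 8.5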
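
The arithmetic above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the fractional index is q * (n - 1) on the sorted data; `linear_quantile` is a hypothetical helper written for illustration, not a pandas or NumPy function. The standard library's `statistics.quantiles` with `method="inclusive"` is used only as a cross-check.

```python
import math
import statistics


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    pos = q * (len(data) - 1)   # fractional, zero-based index
    lower = math.floor(pos)     # index of i
    upper = math.ceil(pos)      # index of j
    fraction = pos - lower      # fractional part of the index
    return data[lower] + (data[upper] - data[lower]) * fraction


data = [4, 6, 8, 10]
q25 = linear_quantile(data, 0.25)
q75 = linear_quantile(data, 0.75)
print(q25, q75)  # 5.5 8.5

# Cross-check against the standard library's inclusive quartiles.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
print(quartiles)  # [5.5, 7.0, 8.5]
```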

If you set the interpolation method to "midpoint", you get the average of the two neighbouring values, (4 + 6) / 2 = 5, which is the result of the second approach in your question.
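
To see how the choice of interpolation changes the 25% value, here is a small sketch using the `interpolation` argument of `Series.quantile`. The variable names are illustrative only, and the values are converted with `float()` purely to make the printed output uniform.

```python
import pandas as pd

s = pd.Series([4, 6, 8, 10])

# Compute the 25% quantile under several interpolation modes.
results = {
    mode: float(s.quantile(0.25, interpolation=mode))
    for mode in ("linear", "lower", "higher", "midpoint")
}
print(results)
# {'linear': 5.5, 'lower': 4.0, 'higher': 6.0, 'midpoint': 5.0}
```

Here "linear" is the default that describe reports, "lower" matches the first approach in the question, and "midpoint" matches the second.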

Pierz
Nikolas Rieble

I think it's easier to understand by seeing this calculation as min+(max-min)*percentile. For evenly spaced values such as this example, it gives the same result as this function described in NumPy:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

res_25 = 4+(10-4)*percentile = 4+(10-4)*25% = 5.5
res_75 = 4+(10-4)*percentile = 4+(10-4)*75% = 8.5
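
Note that the min+(max-min)*percentile shortcut matches linear interpolation here because the values are evenly spaced. The sketch below compares the two on a made-up, unevenly spaced list; `linear_quantile` and `min_max_shortcut` are hypothetical helpers written for illustration, and the index rule q * (n - 1) is assumed for the linear method.

```python
import math


def linear_quantile(values, q):
    """Linear interpolation between the two ranks around index q * (n - 1)."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def min_max_shortcut(values, q):
    """The min + (max - min) * percentile shortcut."""
    return min(values) + (max(values) - min(values)) * q


even = [4, 6, 8, 10]    # data from the question
uneven = [1, 2, 3, 10]  # made-up data with one large gap

print(linear_quantile(even, 0.25), min_max_shortcut(even, 0.25))      # 5.5 5.5
print(linear_quantile(uneven, 0.25), min_max_shortcut(uneven, 0.25))  # 1.75 3.25
```

On the uneven list the two disagree, so the shortcut is best read as a convenient check for evenly spaced data rather than a general formula.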
Stephen Rauch
orli Zhu