I have a very simple dataframe:

import pandas as pd
df = pd.DataFrame([5,7,10,15,19,21,21,22,22,23,23,23,23,23,24,24,24,24,25], columns=['val'])

df.median() returns 23, which is right: of the 19 values in the list, 23 is the 10th (9 values before it and 9 values after it).

I tried to calculate the 1st and 3rd quartiles as:

df.quantile([.25, .75])

         val
0.25    20.0
0.75    23.5

I would have expected the 1st quartile to be 19, the middle of the 9 values below the median, but as you can see above, Python says it is 20. Similarly, for the 3rd quartile, the fifth number counting from right to left is 24, but Python shows 23.5.
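The expectation above corresponds to taking the median of each half of the sorted data, with the overall median left out. A minimal sketch of that rule, using only the standard library (the helper name `quartiles_by_halves` is made up for illustration and is not a pandas function):

```python
from statistics import median

values = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
          23, 23, 23, 23, 24, 24, 24, 24, 25]

def quartiles_by_halves(data):
    """Hypothetical helper: Q1 and Q3 as medians of the lower and upper halves.

    With an odd number of values, the overall median belongs to neither half.
    """
    ordered = sorted(data)
    half = len(ordered) // 2      # 9 values on each side when there are 19
    lower = ordered[:half]        # values below the median position
    upper = ordered[-half:]       # values above the median position
    return median(lower), median(upper)

q1, q3 = quartiles_by_halves(values)
print(q1, q3)  # 19 24
```

This reproduces the 19 and 24 expected in the question, so the difference from the pandas output comes down to which quartile rule is applied.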

How does pandas calculate quartiles?

The original question is from the following link: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule

user3225309

2 Answers

It uses linear interpolation by default. Here's how to use nearest instead:

df['val'].quantile([0.25, 0.75], interpolation='nearest')

Out:
0.25    19
0.75    24

More info from the official documentation on how the interpolation parameter works:

    This optional parameter specifies the interpolation method to use,
    when the desired quantile lies between two data points `i` and `j`:

    * linear: `i + (j - i) * fraction`, where `fraction` is the
      fractional part of the index surrounded by `i` and `j`.
    * lower: `i`.
    * higher: `j`.
    * nearest: `i` or `j` whichever is nearest.
    * midpoint: (`i` + `j`) / 2.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
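To make the five rules concrete, here is a rough pure-Python sketch of them. It is not pandas' actual implementation: the function name `quantile_sketch` is hypothetical, the position formula `(n - 1) * q` is the one given in the comments on this answer, and the tie-breaking for `nearest` is an assumption chosen because it matches the output shown above.

```python
import math

values = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
          23, 23, 23, 23, 24, 24, 24, 24, 25]

def quantile_sketch(data, q, interpolation="linear"):
    """Hypothetical re-implementation of the documented interpolation rules."""
    ordered = sorted(data)
    pos = (len(ordered) - 1) * q          # fractional, zero-based position
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = ordered[lo], ordered[hi]       # the two surrounding data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Assumption: an exact .5 tie is broken by Python's round(),
        # which rounds half to even (4.5 -> 4, 13.5 -> 14).
        return ordered[round(pos)]
    raise ValueError(f"unknown interpolation: {interpolation!r}")

for method in ("linear", "lower", "higher", "nearest", "midpoint"):
    print(method, quantile_sketch(values, 0.25, method),
          quantile_sketch(values, 0.75, method))
```

For this data the 0.25 position falls exactly halfway between 19 and 21, and the 0.75 position exactly halfway between 23 and 24, which is why `linear` and `midpoint` agree here.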

perl
  • But there shouldn't be any interpolation here, or am I missing something? – user3225309 Mar 05 '19 at 18:44
  • Let's consider 0.25 (the same logic applies to 0.75, of course): the element position should be (len(df)-1)*0.25 = (19 - 1)*0.25 = 4.5, so we're between element 4 (which is 19, since we start counting from 0) and element 5 (which is 21). So we have i = 19, j = 21, fraction = 0.5, and i + (j - i) * fraction = 20 – perl Mar 05 '19 at 18:54
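As a cross-check on the arithmetic in the comment above, the standard library's `statistics.quantiles` with `method="inclusive"` (Python 3.8+) places its cut points at the same `(n - 1) * p` positions and interpolates linearly, so it should agree with the pandas default for this data. This is only a sanity check, not a statement about how pandas computes the result internally.

```python
from statistics import quantiles

values = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
          23, 23, 23, 23, 24, 24, 24, 24, 25]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
print(q1, q2, q3)  # 20.0 23.0 23.5
```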
Python doesn't compute the quantile, pandas does. Take a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html. It actually uses numpy's percentile function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html#numpy.percentile

  • I still don't get how pandas got its values. The interpolation parameter is not important here because each quartile lies on an odd number. – user3225309 Mar 05 '19 at 18:44
  • Quantiles are inferential values: they let you understand where your values lie. In your example, you can expect values in the first quartile to be 20 or less. If you want to see the original values, you should sort them, divide them into 4 equal sets, and then look at the first set. – Lawrence Khan Mar 05 '19 at 18:56