8

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:

import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)

data
Out[90]: 
             0   1   2   3   4
2014-01-01  33  31  82   3  26
2014-01-02  46  59   0  34  48
2014-01-03  71   2  56  67  54
2014-01-04  90  18  71  12   2
2014-01-05  71  53   5  56  65
2014-01-06  42  78  34  54  40
2014-01-07  80   5  76  12  90
2014-01-08  60  90  84  55  78
2014-01-09  33  11  66  90   8
2014-01-10  40   8  35  36  98

# test for q1 values (this works)
data.quantile(0.25)
Out[111]: 
0    40.50
1     8.75
2    34.25
3    17.50
4    29.50

# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN

data.quantile(0.25)
Out[115]: 
0    42
1    11
2    34
3    12
4    26

The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.

med = data.median()
q1  = data[data<med].median()
q1
Out[119]: 
0    37.5
1     8.0
2    19.5
3    12.0
4    17.0

It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).

I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.

I would like to use quntile to calculate the rolling q1/q3 values in my dataframe, however, this will not work with NaN's present. Can anyone provide a solution to this issue?

tnknepp
  • 5,888
  • 6
  • 43
  • 57
  • did you try `df.dropna()`? – acushner Jun 04 '14 at 19:59
  • I have applied dropna(how='all'), which makes my example misleading. However, I am not willing to apply dropna(how='any'), since I do not want to lose valid data just because a NaN sits in the next column over. Your suggestion is good. That being said, there still seems to be a fundamental problem with quantile (or so I think!). – tnknepp Jun 04 '14 at 20:04
  • sorry bud, good luck! – acushner Jun 04 '14 at 20:34
  • 4
    Are you sure that result should be `data[data – alko Jun 04 '14 at 20:36
  • 1
    And this (above comment of @alko) gives indeed the same result as pandas quantile – joris Jun 04 '14 at 21:41

1 Answers1

4

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs you're essentially left with an array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column

Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) gives 42.

From the docstring:

Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

TomAugspurger
  • 28,234
  • 8
  • 86
  • 69
  • And to add, if you use `np.percentile` directly (with the NaNs), eg with `df2.apply(np.percentile, args=[25])` will you give other values, as the NaNs are not removed before calculating the quantile as with pandas. – joris Jun 04 '14 at 21:46
  • @TomAugspurger the code `np.percentile(np.array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.], 0.25)` returns 32.999, not 42. I think using `np.quantile()` can return 42, the code `np.quantile(np.array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.], 0.25)` return 42 . – Chong Onn Keat Nov 17 '21 at 12:10