how to compute quantile bins in Pandas when there are ties in the data?

Question

Consider the following simple example. I am interested in getting a categorical variable that contains categories corresponding to quantiles.

  df = pd.DataFrame({'A':'foo foo foo bar bar bar'.split(),
                       'B':[0, 0, 1]*2})

df
Out[67]: 
     A  B
0  foo  0
1  foo  0
2  foo  1
3  bar  0
4  bar  0
5  bar  1

In Pandas, qtile does the job. Unfortunately, qtile will fail here because of the ties in the data.

df['C'] = df.groupby(['A'])['B'].transform(
                     lambda x: pd.qcut(x, 3, labels=range(1,4)))

gives the classic ValueError: Bin edges must be unique: array([ 0. , 0. , 0.33333333, 1. ])

Is there another robust solution (from any other python package) that does not require to reinvent the wheel?

It has to be. I dont want to code myself my own quantile bin function. Any decent stats package can handle ties when creating quantile bins (SAS, Stata, etc).

I want to have something that is based on sound methodological choices and robust.

For instance, look here for a solution in SAS https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm.

Or here for the well known xtile in Stata (http://www.stata.com/manuals13/dpctile.pdf). Note this SO post Definitive way to match Stata weighted xtile command using Python?

What am I missing? Maybe using Scipy?

Many thanks!

score 3 · Accepted Answer · answered Jul 26 '16 at 17:42

3

IIUC, you can use numpy.digitize

df['C'] = df.groupby(['A'])['B'].transform(lambda x: np.digitize(x,bins=np.array([0,1,2])))

     A  B  C
0  foo  0  1
1  foo  0  1
2  foo  1  2
3  bar  0  1
4  bar  0  1
5  bar  1  2

answered Jul 26 '16 at 17:42

Nickil Maveli

29,155
8
82
85

thanks @NickilMaveli but it seems `numpy.digitize` does not create quantile bins, but rather linearly spaced bins – ℕʘʘḆḽḘ Jul 26 '16 at 17:58
1

In that case you can pass the output of `pd.quantile()` method to `np.digitize` function. If non-unique values are present, then it would assign the integer associated with the last quartile(here, 3). – Nickil Maveli Jul 26 '16 at 18:24
very nice suggestion indeed. unfortunately, I think putting them to the minimum quartile is more common.. maybe there is another solution out there.. – ℕʘʘḆḽḘ Jul 26 '16 at 22:17

how to compute quantile bins in Pandas when there are ties in the data?

1 Answers1