Consider the following simple example. I am interested in getting a categorical variable that contains categories corresponding to quantiles.
import pandas as pd

df = pd.DataFrame({'A': 'foo foo foo bar bar bar'.split(),
                   'B': [0, 0, 1] * 2})
df
Out[67]:
     A  B
0  foo  0
1  foo  0
2  foo  1
3  bar  0
4  bar  0
5  bar  1
In pandas, qcut does the job. Unfortunately, qcut will fail here because of the ties in the data.
df['C'] = df.groupby(['A'])['B'].transform(
    lambda x: pd.qcut(x, 3, labels=range(1, 4)))
gives the classic ValueError: Bin edges must be unique: array([ 0. , 0. , 0.33333333, 1. ])
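To make the cause of the error concrete, here is a minimal sketch that computes the tercile edges for a single group by hand. It assumes the edges are the 0, 1/3, 2/3 and 1 quantiles with pandas' default linear interpolation; the names x, edges and has_duplicate_edges are only illustrative.

```python
import numpy as np
import pandas as pd

# Values of B within one group of the example (both 'foo' and 'bar' hold 0, 0, 1).
x = pd.Series([0, 0, 1])

# Edges a 3-quantile split needs: the 0, 1/3, 2/3 and 1 quantiles
# (assumption: default linear interpolation).
edges = x.quantile([0, 1/3, 2/3, 1]).to_numpy()
print(edges)

# The tie at 0 makes the first two edges coincide, so the lowest bin has zero width.
has_duplicate_edges = len(np.unique(edges)) < len(edges)
print(has_duplicate_edges)
```

The first two edges are both 0, which is why the bin edges are reported as not unique.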
Is there another robust solution (from any other Python package) that does not require reinventing the wheel?
There has to be one. I don't want to write my own quantile-binning function. Any decent stats package (SAS, Stata, etc.) can handle ties when creating quantile bins.
I want to have something that is based on sound methodological choices and robust.
For instance, look here for a solution in SAS https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm.
Or here for the well-known xtile in Stata (http://www.stata.com/manuals13/dpctile.pdf). See also the SO post "Definitive way to match Stata weighted xtile command using Python?"
What am I missing? Maybe using SciPy?
Many thanks!