15

I have a data frame, from which I can select a column (series) as follows:

df:

            value_rank
275488          90
275490          35
275491          60
275492          23
275493          23
275494          34
275495          75
275496          40
275497          69
275498          14
275499          83
...             ...

value_rank is a previously created percentile rank from a larger data-set. What I am trying to do, is to create bins of this data set, e.g. quintile

pd.qcut(df.value_rank, 5, labels=False)


275488    4
275490    1
275491    3
275492    1
275493    1
275494    1
275495    3
275496    2
...      ...

This appears fine, as expected, but it isn't.

In fact, I have 1569 columns. The nearest number divisible by 5 bins is 1565 which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations, and one with 313 observations. Instead, I get this:

obs =  pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()

0    329
3    314
1    313
4    311
2    302

I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!

Here is a small example:

df:

            value_rank
286742               11
286835               53
286865               40
286930               31
286936               45
286955               27
287031               30
287111               36
287269               30
287310               18

pd.qcut gives this:

pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin  count
1    3
4    2
3    2
0    2
2    1

There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!

Carl
  • 598
  • 2
  • 11
  • 25

4 Answers4

14

qcut is trying to compensate for repeating values. This is earlier to visualize if you return the bin limits along with your qcut results:

In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')

In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
 array([ 11. ,  25.2,  30. ,  33. ,  41. ,  53. ]))

You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one from the expected values in the third bin and place them in the second. I'm thinking that this is just happening at a larger scale with your percentiles since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles or return percentiles that have greater precision?

Robert Rodkey
  • 423
  • 3
  • 9
  • Thanks, very helpful! In my case, I think it would be difficult to return the raw data, as the ranks are created as part of a separate screening process. The rank is also a composite rank, constructed using several other factors which are first decile ranked, and then added together in a weighted sum. With each individual component rank, we decide how to deal with extreme values/negative numbers separately, so constructing a raw-rank by doing a weighted sum of the raw numbers and then ranking here could cause extreme values in one sub-rank to dominate. – Carl Feb 04 '16 at 10:23
  • Actually, I do still have a question: why is it that qcut isn't throwing a non-unique bin edges error? surely that is the problem that you have highlighted, namely that qcut is unable to sort into equal-width bins because of ambiguity caused by overlap of the values to be ranked? – Carl Feb 04 '16 at 13:33
  • @Carl: That error is thrown when the bins that are calculated are non-unique. In the above example, you can see that 11., 25.2, 30., 33., 41., 53. are all different. An extreme example would be using a list of [1, 1, 1, 1, 1] in the above script. – Robert Rodkey Feb 10 '16 at 17:39
12

Just try with the below code :

pd.qcut(df.rank(method='first'),nbins)
AnksG
  • 488
  • 4
  • 9
4

If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.

test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]

np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data

test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)

Output:

(0    0
 1    0
 2    1
 3    2
 4    1
 5    2
 6    3
 7    3
 8    4
 9    4
 Name: value_rank, dtype: int64,
 array([ 11.37454012,  25.97573801,  30.42160255,  33.11683016,
         41.81316392,  53.70807258]))

So, now we have two 0's, two 1's, two 2's and two 4's!

Disclaimer

Obviously, use this at your discretion because results can vary based on your data; like how large your data set is and/or the spacing, for instance. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank order in the sense that there will won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of your noise accordingly. For instance if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.

scottlittle
  • 18,866
  • 8
  • 51
  • 70
1

This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:

import random

df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);

picture: https://i.stack.imgur.com/ztjzn.png

df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df

picture: https://i.stack.imgur.com/RB4TN.png

df.groupby('qcut_v1').count().reset_index()

picture: https://i.stack.imgur.com/IKtsW.png

df.groupby('qcut_v2').count().reset_index()

picture: https://i.stack.imgur.com/4jrkU.png

sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-

fulowa
  • 11
  • 2