How to qcut with non unique bin edges?

Question

My question is the same as this previous one:

however, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0, and the rest are let's say between 1 and 100, how would I categorize all the 0 values in fractile 1, and then the rest of the non-zero values in fractile labels 2 to 10 (assuming I want 10 fractiles). Could I convert the 0's to nan, qcut the remaining non nan data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because in my data set in addition to the 600 values, I also have another couple hundred which may already be nan before I would convert the 0s to nan.

Update 1/26/14:

I came up with the following interim solution. The problem with this code though, is if the high frequency value is not on the edges of the distribution, then it inserts an extra bin in the middle of the existing set of bins and throws everything a little (or a lot) off.

def fractile_cut(ser, num_fractiles):
    num_valid = ser.valid().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    while vcounts.iloc[i] > num_valid/ float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts[i]
        i += 1
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.labels + 1, index=ser.index)
    return ser_fractiles

simple example with test input and desired output would be nice — Roman Pekar, Nov 23 '13 at 09:14
so you'd like equal numbers of samples in each bin ... but if doing so makes the same value appear in more than one bin, then put that "pervasive" value into its own bin? (and do so for all such pervasive values?) — John Prior, Jan 24 '14 at 23:19
Yes, exactly. So if the sample was [13, 13, 13, 15, 18, 19], then the quartile bins would be [13, 13, 13], [15], [18], [19] — geronimo, Jan 25 '14 at 04:18
@geronimo, so is there a suggested way to do that , because i am facing a similar issue , qcut fails as my data has a lot of zero's and I need to put them in one bin. — swati saoji, Apr 01 '15 at 02:59
See https://github.com/pydata/pandas/issues/7751 for a discussion about how to handle non-unique edges. — dukebody, May 13 '15 at 11:37

luca · Answer 1 · 2020-07-29T08:47:16.833

167

The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).

The solutions are:

1 - Use pandas >= 0.20.0 that has this fix. They added an option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them, which would result in less bins than specified, and some larger (with more elements) than others.

2 - Decrease the number of quantiles. Less quantiles means more elements per quantile

3 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value to each element in the dataframe (the rank) while keeping the order of the elements (except for identical values, which will be ranked in order they appear in the array, see method='first')

Example:

pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)

4 - Specify a custom quantiles range, e.g. [0, .50, .75, 1.] to get unequal number of items per quantile

5 - Use pandas.cut that chooses the bins to be evenly spaced according to the values themselves, while pandas.qcut chooses the bins so that you have the same number of records in each bin

edited Jul 29 '20 at 08:47

answered Nov 11 '16 at 13:01

luca

7,178
7
41
55

You state: `The problem is that pandas.qcut chooses the bins so that you have the same number of records in each bin/quantile, but the same value cannot fall in multiple bins/quantiles.` but im not quite sure how this follows? 0, just like any other number can be placed into a bin? – Parsa Oct 17 '19 at 20:34
4

@par if you have too many `0` then `pd.qcut` is forced to place `0` in multiple bins/quantiles to keep equal the number of items per bin/quantile. At the same time, the same value (`0` in this case) shouldn't be present in multiple quantiles/bins, that's why `pd.qcut` fails – luca Oct 18 '19 at 09:12
1

Sorry, one question. If I decide to use ".rank(method='first')", shouldn't I sort my data in descending order first? Otherwise, I will end getting my data wrong labeled. – Abimael Domínguez Oct 10 '21 at 05:24
@AbimaelDomínguez If I interpret your question correctly, you don't need to sort the data because the `rank` method doesn't alter the position of your input – luca Mar 18 '22 at 11:14

mgoldwasser · Answer 2 · 2016-06-13T18:16:53.927

Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:

a = pd.Series(range(100) + ([0]*20))

def jitter(a_series, noise_reduction=1000000):
    return (np.random.random(len(a_series))*a_series.std()/noise_reduction)-(a_series.std()/(2*noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)

we can recreate the original error using something like this:

a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
    precision=precision, include_lowest=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([  0.        ,   0.        ,   0.        ,   3.8       ,
        11.73333333,  19.66666667,  27.6       ,  35.53333333,
        43.46666667,  51.4       ,  59.33333333,  67.26666667,
        75.2       ,  83.13333333,  91.06666667,  99.        ])

Works great, makes logical sense. Instead of doing a copy-pasta, I just wrote out the `jitter` function by hand, and missed some crucial parens, so my data exploded... but simple fix and now I can `qcut`! Thanks!! — dwanderson, Dec 15 '16 at 18:58
This solution is particularly useful if we want to randomly assign the duplicate values into bins, which can be the case in many real-life scenarios. — data-monkey, Aug 27 '21 at 09:19

score 20 · Answer 3 · answered Apr 28 '15 at 21:28

You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge where in the pandas.tools.tile.qcut function where bins are defined:

bins = algos.quantile(x, quantiles)

Which, because your data is 50% 0s, causes bins to be returned with multiple bin edges at the value 0 for any value of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s and only 0s in the first bin.

import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series

In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100

zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)

Solution 1: bin 1 contains both 0s and low values

bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)

The result is

In[61]: result.value_counts()
Out[61]: 
[0, 9.3]        323
(27.9, 38.2]     37
(9.3, 18.6]      37
(88.7, 99]       35
(57.8, 68.1]     32
(68.1, 78.4]     31
(78.4, 88.7]     30
(38.2, 48.5]     27
(48.5, 57.8]     26
(18.6, 27.9]     22
dtype: int64

Solution 2: bin1 contains only 0s

mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)

The result is:

In[133]: result.value_counts()
Out[133]: 
[0, 0.5]        300
(0.5, 11]        32
(11, 18.8]       28
(18.8, 29.7]     30
(29.7, 39]       35
(39, 50]         26
(50, 59]         31
(59, 71]         31
(71, 79.2]       27
(79.2, 90.2]     30
(90.2, 99]       30
dtype: int64

There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.

This is really helpful. How might you modify solution 2 so it includes 0s or 1s in bin1? — samthebrand, Dec 07 '15 at 19:48

score 14 · Answer 4 · answered Oct 27 '17 at 16:03

If you want to enforce equal size bins, even in the presence of duplicate values, you can use the following, 2 step process:

Rank your values, using method='first' to have python assign a unique rank to all your records. If there is a duplicate value (i.e. a tie in the rank), this method will choose the first record it comes to and rank in that order.

df['rank'] = df['value'].rank(method='first')

Use qcut on the rank to determine equal sized quantiles. Below example creates deciles (bins=10).

df['decile'] = pd.qcut(df['rank'].values, 10).codes

Sorry, one question. If I decide to use ".rank(method='first')", shouldn't I sort my data in descending order first? Otherwise, I will end getting my data wrong labeled. — Abimael Domínguez, Oct 10 '21 at 05:24

score 1 · Answer 5 · answered Apr 27 '15 at 21:33

1

I've had a lot of problems with qcut as well, so I used the Series.rank function combined with creating my own bins using those results. My code is on Github:

https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8

answered Apr 27 '15 at 21:33

ashishsingal

2,810
3
18
26

score 1 · Answer 6 · answered Apr 07 '21 at 15:09

I had this problem as well, so I wrote a small function, which only treats the non zero values and then inserts the labels where the original was not 0.

def qcut2(x, n=10):
  x = np.array(x)
  x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
  x_cut_not0 = pd.qcut(x[x > 0], n-1, labels=False) + 1
  y = np.zeros(len(x))
  y[x_index_not0] = x_cut_not0
  return y

How to qcut with non unique bin edges?

6 Answers6

Linked

Related