
I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired quantile (say, 1 value vs 2 quantiles):

>>> s = pd.Series([5, np.nan, np.nan])

When I apply .quantile() to it, it has no problem producing the 2 quantiles (with the same boundary value):

>>> s.quantile([0.5, 1])
0.5    5.0
1.0    5.0
dtype: float64

But when I apply .qcut() with an integer number of quantiles, an error is thrown:

>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5.,  5.,  5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

Even after I set the duplicates argument, it still fails:

>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0

How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)

The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:

0     (4.999, 5.000]
1                NaN
2                NaN
Zhang18
  • What's your desired output for pd.qcut(s, 2)? You only have 1 unique value, so why do you want to create more than 1 bin? – Allen Qin May 18 '17 at 18:44
    I'm extracting a very specific case to address. In reality I have thousands of Series, all of which I need to cut. But qcut() runs into problem with an outlier row like this. I modified the question with the desired output. – Zhang18 May 19 '17 at 14:43
  • surround the `qcut` with a `try-except` block to catch the faulty Series (be specific enough to only catch the ones too short) and deal with those semi-manually – Maarten Fabré May 19 '17 at 14:52
  • did you manage to resolve this? I am getting the same error and can't find a solution – jeangelj Feb 15 '18 at 21:34
  • No, no solution is known to the original problem as of 2/21/2018 – Zhang18 Feb 21 '18 at 20:57
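For what it's worth, the try/except approach from the comments can be sketched as below. `safe_qcut` is a hypothetical helper (not a pandas function); the fallback uses pd.cut with a single bin, since pd.cut widens a zero-width min/max range instead of raising:

```python
import numpy as np
import pandas as pd

def safe_qcut(s, q):
    """qcut that falls back to one bin when a series has too few
    distinct values to form q unique edges (hypothetical helper)."""
    try:
        out = pd.qcut(s, q, duplicates='drop')
    except (ValueError, IndexError):
        out = None
    # Fall back if qcut raised, or if it silently dropped non-NaN values.
    if out is None or out.isna().sum() > s.isna().sum():
        # pd.cut pads a zero-width range, so even a constant
        # (or single-valued) series gets one valid bin.
        out = pd.cut(s, 1)
    return out

binned = safe_qcut(pd.Series([5, np.nan, np.nan]), 2)
```

With the series from the question, `binned` puts the 5.0 into a single interval and keeps both NaNs in place; series with enough distinct values go through qcut unchanged.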

3 Answers


OK, this is a workaround that might work for you:

>>> pd.qcut(s, len(s.dropna()), duplicates='drop')
0    (4.999, 5.0]
1             NaN
2             NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
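For the row-wise use case in the question, the same trick can be capped at the desired number of quantiles. The frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row is one series to bin.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, np.nan, np.nan, np.nan]])

# Never ask for more quantiles than there are non-NaN values in a row.
binned = df.apply(
    lambda row: pd.qcut(row, min(2, len(row.dropna())), duplicates='drop'),
    axis=1,
)
```

Normal rows get the requested 2 bins, while the degenerate second row collapses to a single bin with its NaNs preserved.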
Allen Qin

You can try filling your object/numeric columns with an appropriate placeholder ('null' for strings, 0 for numbers):

#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)

#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
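Note that filling changes the distribution qcut sees, and the NaN positions from the question are no longer preserved, since the former NaNs become real values and get binned too. A quick sketch with the question's series:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, np.nan, np.nan])

# After filling, the former NaNs become 0 and are assigned to a bin,
# so no NaN positions survive in the output.
binned = pd.qcut(s.fillna(0), 2, duplicates='drop')
```

Whether that is acceptable depends on whether the NaNs in your data mean "missing" or genuinely zero.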
max

Use Python 3.5 instead of Python 2.7. This worked for me.

Tarun Talreja
    Can you elaborate as to why? Not every user coming to this question will be able to switch their project to 3.x, but they might be able to work around the issue if they can find out what it is. – toonarmycaptain Nov 01 '17 at 16:40