
I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired quantile (say, 1 value vs 2 quantiles):

>>> s = pd.Series([5, np.nan, np.nan])

When I apply .quantile() to it, it has no problem producing the 2 quantiles (with the same boundary value):

>>> s.quantile([0.5, 1])
0.5    5.0
1.0    5.0
dtype: float64

But when I apply .qcut() with an integer number of quantiles, an error is thrown:

>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5.,  5.,  5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

Even after I set the duplicates argument, it still fails:

>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0

How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)

The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:

0     (4.999, 5.000]
1                NaN
2                NaN
Zhang18
  • What's your desired output for pd.qcut(s, 2)? You only have 1 unique value, so why do you want to create more than 1 bin? – Allen Qin May 18 '17 at 18:44
    I'm extracting a very specific case to address. In reality I have thousands of Series, all of which I need to cut. But qcut() runs into problem with an outlier row like this. I modified the question with the desired output. – Zhang18 May 19 '17 at 14:43
  • surround the `qcut` with a `try-except` block to catch the faulty Series (be specific enough to only catch the ones too short) and deal with those semi-manually – Maarten Fabré May 19 '17 at 14:52
  • did you manage to resolve this? I am getting the same error and can't find a solution – jeangelj Feb 15 '18 at 21:34
  • No, no solution is known to the original problem as of 2/21/2018 – Zhang18 Feb 21 '18 at 20:57
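For what it's worth, the try/except approach from the comments can be sketched as below. `safe_qcut` is a hypothetical helper (not a pandas function); the fallback uses pd.cut with a single bin, since pd.cut widens a zero-width min/max range instead of raising:

```python
import numpy as np
import pandas as pd

def safe_qcut(s, q):
    """qcut that falls back to one bin when a series has too few
    distinct values to form q unique edges (hypothetical helper)."""
    try:
        out = pd.qcut(s, q, duplicates='drop')
    except (ValueError, IndexError):
        out = None
    # Fall back if qcut raised, or if it silently dropped non-NaN values.
    if out is None or out.isna().sum() > s.isna().sum():
        # pd.cut pads a zero-width range, so even a constant
        # (or single-valued) series gets one valid bin.
        out = pd.cut(s, 1)
    return out

binned = safe_qcut(pd.Series([5, np.nan, np.nan]), 2)
```

With the series from the question, `binned` puts the 5.0 into a single interval and keeps both NaNs in place; series with enough distinct values go through qcut unchanged.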

3 Answers


OK, this is a workaround that might work for you:

>>> pd.qcut(s, len(s.dropna()), duplicates='drop')
0    (4.999, 5.0]
1             NaN
2             NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
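For the row-wise use case in the question, the same trick can be capped at the desired number of quantiles. The frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row is one series to bin.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, np.nan, np.nan, np.nan]])

# Never ask for more quantiles than there are non-NaN values in a row.
binned = df.apply(
    lambda row: pd.qcut(row, min(2, len(row.dropna())), duplicates='drop'),
    axis=1,
)
```

Normal rows get the requested 2 bins, while the degenerate second row collapses to a single bin with its NaNs preserved.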
Allen Qin

You can try filling your object/numeric columns with an appropriate placeholder ('null' for strings, 0 for numbers):

#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)

#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
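Note that filling changes the distribution qcut sees, and the NaN positions from the question are no longer preserved, since the former NaNs become real values and get binned too. A quick sketch with the question's series:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, np.nan, np.nan])

# After filling, the former NaNs become 0 and are assigned to a bin,
# so no NaN positions survive in the output.
binned = pd.qcut(s.fillna(0), 2, duplicates='drop')
```

Whether that is acceptable depends on whether the NaNs in your data mean "missing" or genuinely zero.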
max

Use Python 3.5 instead of Python 2.7. This worked for me.

Tarun Talreja
    Can you elaborate as to why? Not every user coming to this question will be able to switch their project to 3.x, but they might be able to work around the issue if they can find out what it is. – toonarmycaptain Nov 01 '17 at 16:40