My question is the same as this previous one:
Binning with zero values in pandas
however, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0, and the rest are let's say between 1 and 100, how would I categorize all the 0 values in fractile 1, and then the rest of the non-zero values in fractile labels 2 to 10 (assuming I want 10 fractiles). Could I convert the 0's to nan, qcut the remaining non nan data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because in my data set in addition to the 600 values, I also have another couple hundred which may already be nan before I would convert the 0s to nan.
Update 1/26/14:
I came up with the following interim solution. The problem with this code though, is if the high frequency value is not on the edges of the distribution, then it inserts an extra bin in the middle of the existing set of bins and throws everything a little (or a lot) off.
def fractile_cut(ser, num_fractiles):
num_valid = ser.valid().shape[0]
remain_fractiles = num_fractiles
vcounts = ser.value_counts()
high_freq = []
i = 0
while vcounts.iloc[i] > num_valid/ float(remain_fractiles):
curr_val = vcounts.index[i]
high_freq.append(curr_val)
remain_fractiles -= 1
num_valid = num_valid - vcounts[i]
i += 1
curr_ser = ser.copy()
curr_ser = curr_ser[~curr_ser.isin(high_freq)]
qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
qcut_bins = qcut[1]
all_bins = list(qcut_bins)
for val in high_freq:
bisect.insort(all_bins, val)
cut = pd.cut(ser, bins=all_bins)
ser_fractiles = pd.Series(cut.labels + 1, index=ser.index)
return ser_fractiles