How to group skewed data in pandas with adaptive intervals

Question

Let's say a column in my dataframe contains data distributed like this:

>>> import pandas as pd
>>> vals = list(range(11000,12000)) + list(range(5600,6120)) + list(range(0,40,4)) + \
...        list(range(0,10000,300)) + list(range(1200,1400,3)) + list(range(0,10000,1100))
>>> df = pd.DataFrame({'freq' : vals})

I want to look at the frequency distribution. What I am doing now is simply:

>>> df.freq.value_counts(bins=20).sort_index()

(-12.0, 599.95]         13
(599.95, 1199.9]         3
(1199.9, 1799.85]       69
(1799.85, 2399.8]        3
(2399.8, 2999.75]        2
(2999.75, 3599.7]        3
(3599.7, 4199.65]        2
(4199.65, 4799.6]        3
(4799.6, 5399.55]        2
(5399.55, 5999.5]      403
(5999.5, 6599.45]      122
(6599.45, 7199.4]        3
(7199.4, 7799.35]        3
(7799.35, 8399.3]        2
(8399.3, 8999.25]        3
(8999.25, 9599.2]        2
(9599.2, 10199.15]       3
(10199.15, 10799.1]      0
(10799.1, 11399.05]    400
(11399.05, 11999.0]    600
Name: freq, dtype: int64

But as you can see, there is nothing intelligent about it. There are lots of bins with very low counts. I would like adjacent bins to be combined if their counts are under a particular threshold (e.g. 5). So what I would like to have is something like:

(-12.0, 599.95]         13
(599.95, 1199.9]         3
(1199.9, 1799.85]       69
(1799.85, 5399.55]      15
(5399.55, 5999.5]      403
(5999.5, 6599.45]      122
(6599.45, 10799.1]      16
(10799.1, 11399.05]    400
(11399.05, 11999.0]    600

I cannot think of anything suitable, because I am not comfortable with intervals. Also, if someone can suggest a better way to get a frequency distribution with intelligent spacing, that would be great as well.

NOTE: I am not looking to tune the number of bins, as that would have to be done manually, and I want to avoid that.

score 3 · Accepted Answer · answered Dec 11 '19 at 08:14

You can try qcut, which places the bin edges at quantiles so that each bin holds roughly the same number of values:

pd.qcut(df.freq, q=20).value_counts()

Output:

(-0.001, 1395.0]      83
(11835.0, 11917.0]    82
(1395.0, 5662.0]      82
(5662.0, 5743.0]      82
(5743.0, 5825.0]      82
(5825.0, 5907.0]      82
(5907.0, 5989.0]      82
(5989.0, 6070.0]      82
(6070.0, 11015.0]     82
(11015.0, 11097.0]    82
(11917.0, 11999.0]    82
(11179.0, 11261.0]    82
(11261.0, 11343.0]    82
(11343.0, 11425.0]    82
(11425.0, 11507.0]    82
(11507.0, 11589.0]    82
(11589.0, 11671.0]    82
(11671.0, 11753.0]    82
(11753.0, 11835.0]    82
(11097.0, 11179.0]    82
Name: freq, dtype: int64
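Note that qcut equalizes the counts rather than merging sparse bins. If the equal-width bins from the question should be kept and only runs of adjacent bins below the threshold merged, something along these lines might work. This is a minimal sketch, not a built-in pandas feature: `merge_sparse_bins` is a hypothetical helper name, and it assumes the counts come from `value_counts(bins=...)` sorted by index, so the index holds right-closed intervals in order.

```python
import pandas as pd

vals = (list(range(11000, 12000)) + list(range(5600, 6120)) + list(range(0, 40, 4))
        + list(range(0, 10000, 300)) + list(range(1200, 1400, 3))
        + list(range(0, 10000, 1100)))
df = pd.DataFrame({'freq': vals})


def merge_sparse_bins(counts, threshold=5):
    """Hypothetical helper: merge each run of adjacent bins with count < threshold.

    Assumes `counts` is indexed by sorted, right-closed intervals.
    """
    merged = []   # finished bins as (left, right, count)
    run = None    # sparse run currently being accumulated
    for interval, n in counts.items():
        if n < threshold:
            if run is None:
                run = [interval.left, interval.right, n]
            else:
                run[1] = interval.right   # extend the run to the right
                run[2] += n
        else:
            if run is not None:
                merged.append(tuple(run))
                run = None
            merged.append((interval.left, interval.right, n))
    if run is not None:                   # flush a trailing sparse run
        merged.append(tuple(run))
    index = pd.IntervalIndex.from_tuples(
        [(left, right) for left, right, _ in merged], closed='right')
    return pd.Series([n for _, _, n in merged], index=index, name=counts.name)


counts = df.freq.value_counts(bins=20).sort_index()
merged = merge_sparse_bins(counts, threshold=5)
print(merged)
```

On the sample data this gives nine bins with counts 13, 3, 69, 15, 403, 122, 16, 400 and 600, matching the layout requested in the question. A lone sparse bin with no sparse neighbour, such as (599.95, 1199.9], is left as it is; folding it into a neighbour would need an extra rule.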

score 2 · Answer 2 · answered Dec 11 '19 at 08:15

You can use quantile() to find cut points that distribute the items evenly across bins. For example:

>>> df.freq.quantile(0.9) # 90% of values are <= 11835
11835.0
>>> df.freq.quantile(0.5) # 50% of values are <= 11179
11179.0
>>> df.freq.quantile(0.2)  # 20% of values are <= 5825
5825.0
>>> df.freq.quantile(0.1)
5662.0

Those values can serve as bin edges that split the data evenly:

>>> df[df.freq < 5662].shape[0]
164
>>> df[(df.freq >= 5662) & (df.freq < 5825)].shape[0]
164
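The same idea can be applied to all cut points at once. The sketch below assumes ten bins are wanted (that number is an illustration, not something the answer specifies): it asks quantile() for eleven evenly spaced levels and passes the resulting edges to pd.cut. This is essentially what pd.qcut does in one step.

```python
import numpy as np
import pandas as pd

vals = (list(range(11000, 12000)) + list(range(5600, 6120)) + list(range(0, 40, 4))
        + list(range(0, 10000, 300)) + list(range(1200, 1400, 3))
        + list(range(0, 10000, 1100)))
df = pd.DataFrame({'freq': vals})

# Eleven evenly spaced quantile levels (0.0, 0.1, ..., 1.0) give ten bins.
# unique() guards against repeated edges when the data has many ties.
edges = df.freq.quantile(np.linspace(0, 1, 11)).unique()

# include_lowest=True keeps the minimum value inside the first bin.
binned = pd.cut(df.freq, bins=edges, include_lowest=True)
counts = binned.value_counts().sort_index()
print(counts)
```

Each bin should then hold roughly 164 values, in line with the counts shown above.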
