13

Can I make pandas cut/qcut function to return with bin endpoint or bin midpoint instead of a string of bin label?

Currently

pd.cut(pd.Series(np.arange(11)), bins = 5)

0     (-0.01, 2]
1     (-0.01, 2]
2     (-0.01, 2]
3         (2, 4]
4         (2, 4]
5         (4, 6]
6         (4, 6]
7         (6, 8]
8         (6, 8]
9        (8, 10]
10       (8, 10]
dtype: category

with category / string values. What I want is

0     1.0
1     1.0
2     1.0
3     3.0
4     3.0

with numerical values representing edge or midpoint of the bin.

jf328
  • 6,841
  • 10
  • 58
  • 82

3 Answers3

13

I noticed that a category has a mid property, so you can calculate the middle via an apply:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({"val":np.arange(11)})
   ...: df["bins"] = pd.cut(df["val"], bins = 5)
   ...: df["bin_centres"] = df["bins"].apply(lambda x: x.mid)
   ...: df
Out[1]:
    val          bins bin_centres
0     0  (-0.01, 2.0]       0.995
1     1  (-0.01, 2.0]       0.995
2     2  (-0.01, 2.0]       0.995
3     3    (2.0, 4.0]       3.000
4     4    (2.0, 4.0]       3.000
5     5    (4.0, 6.0]       5.000
6     6    (4.0, 6.0]       5.000
7     7    (6.0, 8.0]       7.000
8     8    (6.0, 8.0]       7.000
9     9   (8.0, 10.0]       9.000
10   10   (8.0, 10.0]       9.000
erncyp
  • 1,649
  • 21
  • 23
9

I see that this is an old post but I will take the liberty to answer it anyway.

It is now possible (ref @chrisb's answer) to access the endpoints for categorical intervals using left and right.

s = pd.cut(pd.Series(np.arange(11)), bins = 5)

mid = [(a.left + a.right)/2 for a in s]
Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

Since intervals are open to the left and closed to the right, the 'first' interval (the one starting at 0), actually starts at -0.01. To get a midpoint using 0 as the left value you can do this

mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

Or, you can say that the intervals are closed to the left and open to the right

t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
Out[38]: 
0       [0.0, 2.0)
1       [0.0, 2.0)
2       [2.0, 4.0)
3       [2.0, 4.0)
4       [4.0, 6.0)
5       [4.0, 6.0)
6       [6.0, 8.0)
7       [6.0, 8.0)
8     [8.0, 10.01)
9     [8.0, 10.01)
10    [8.0, 10.01)

But, as you see, you get the same problem at the last interval.

mortysporty
  • 2,749
  • 6
  • 28
  • 51
8

There's a work-in-progress proposal for an 'IntervalIndex' that would make this type of operation very straightforward.

But for now, you can get the bins by passing the retbins argument and calculate the midpoints.

In [8]: s, bins = pd.cut(pd.Series(np.arange(11)), bins = 5, retbins=True)

In [11]: mid = [(a + b) /2 for a,b in zip(bins[:-1], bins[1:])]

In [13]: s.cat.rename_categories(mid)
Out[13]: 
0     0.995
1     0.995
2     0.995
3     3.000
4     3.000
5     5.000
6     5.000
7     7.000
8     7.000
9     9.000
10    9.000
dtype: category
Categories (5, float64): [0.995 < 3.000 < 5.000 < 7.000 < 9.000]
chrisb
  • 49,833
  • 8
  • 70
  • 70