python bin data and return bin midpoint (maybe using pandas.cut and qcut)

Question

Can I make pandas cut/qcut function to return with bin endpoint or bin midpoint instead of a string of bin label?

Currently

pd.cut(pd.Series(np.arange(11)), bins = 5)

0     (-0.01, 2]
1     (-0.01, 2]
2     (-0.01, 2]
3         (2, 4]
4         (2, 4]
5         (4, 6]
6         (4, 6]
7         (6, 8]
8         (6, 8]
9        (8, 10]
10       (8, 10]
dtype: category

with category / string values. What I want is

with numerical values representing edge or midpoint of the bin.

score 13 · Answer 1 · answered Apr 15 '19 at 12:24

I noticed that a category has a mid property, so you can calculate the middle via an apply:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({"val":np.arange(11)})
   ...: df["bins"] = pd.cut(df["val"], bins = 5)
   ...: df["bin_centres"] = df["bins"].apply(lambda x: x.mid)
   ...: df
Out[1]:
    val          bins bin_centres
0     0  (-0.01, 2.0]       0.995
1     1  (-0.01, 2.0]       0.995
2     2  (-0.01, 2.0]       0.995
3     3    (2.0, 4.0]       3.000
4     4    (2.0, 4.0]       3.000
5     5    (4.0, 6.0]       5.000
6     6    (4.0, 6.0]       5.000
7     7    (6.0, 8.0]       7.000
8     8    (6.0, 8.0]       7.000
9     9   (8.0, 10.0]       9.000
10   10   (8.0, 10.0]       9.000

mortysporty · Accepted Answer · 2018-08-14T18:46:54.017

I see that this is an old post but I will take the liberty to answer it anyway.

It is now possible (ref @chrisb's answer) to access the endpoints for categorical intervals using left and right.

s = pd.cut(pd.Series(np.arange(11)), bins = 5)

mid = [(a.left + a.right)/2 for a in s]
Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

Since intervals are open to the left and closed to the right, the 'first' interval (the one starting at 0), actually starts at -0.01. To get a midpoint using 0 as the left value you can do this

mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

Or, you can say that the intervals are closed to the left and open to the right

t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
Out[38]: 
0       [0.0, 2.0)
1       [0.0, 2.0)
2       [2.0, 4.0)
3       [2.0, 4.0)
4       [4.0, 6.0)
5       [4.0, 6.0)
6       [6.0, 8.0)
7       [6.0, 8.0)
8     [8.0, 10.01)
9     [8.0, 10.01)
10    [8.0, 10.01)

But, as you see, you get the same problem at the last interval.

score 8 · Answer 3 · answered Sep 23 '15 at 18:54

There's a work-in-progress proposal for an 'IntervalIndex' that would make this type of operation very straightforward.

But for now, you can get the bins by passing the retbins argument and calculate the midpoints.

In [8]: s, bins = pd.cut(pd.Series(np.arange(11)), bins = 5, retbins=True)

In [11]: mid = [(a + b) /2 for a,b in zip(bins[:-1], bins[1:])]

In [13]: s.cat.rename_categories(mid)
Out[13]: 
0     0.995
1     0.995
2     0.995
3     3.000
4     3.000
5     5.000
6     5.000
7     7.000
8     7.000
9     9.000
10    9.000
dtype: category
Categories (5, float64): [0.995 < 3.000 < 5.000 < 7.000 < 9.000]

python bin data and return bin midpoint (maybe using pandas.cut and qcut)

3 Answers3

Linked

Related