
I have a column of values like below:

col
12
76
34

for which I need to generate a new column with the bucket label for each value of col, as shown below:

col      bucket-labels
12            8-16
76            64-128 
34            32-64

The values in the column, and the number of rows, may vary.

Edit: The bucket boundaries should be consecutive powers of 2 (2^n).

maninekkalapudi

2 Answers


First find the exponent of the smallest power of 2 that is not below the column maximum (two ways are shown), then create the bins with a list comprehension, the labels with zip, and pass both to the cut function:

import math
import pandas as pd

df = pd.DataFrame({'col': [12, 34, 76]})  # sample data matching the output below

# smallest exponent n with 2**n >= max, then all powers of 2 up to 2**n
a = df['col'].max()
bins = [1 << exponent for exponent in range(math.ceil(math.log2(a)) + 1)]
# another solution
# bins = [1 << exponent for exponent in range((int(a) - 1).bit_length() + 1)]
print(bins)
[1, 2, 4, 8, 16, 32, 64, 128]

# label each bin as "lower-upper"
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]

df['bucket-labels'] = pd.cut(df['col'], bins=bins, labels=labels)
print(df)
   col bucket-labels
0   12          8-16
1   34         32-64
2   76        64-128
jezrael
  • I think creating the labels may not be needed: [pd.cut(df.col,bins).astype(str).str.slice(start=1,stop=-1).str.replace(', ','-')] – Naga kiran Nov 15 '18 at 10:50
  • @jezrael: Thanks for the answer, I got exactly the results I needed. However, there is another case I forgot to mention: values less than 2, i.e. the buckets for values from 0 to 2. For example, if the value in `col` is `0.7`, the bucket range should be `0.5-1.0`. I tried the following but got `ValueError: math domain error` for the `col` value `0.7`: `bins = [1< – maninekkalapudi Nov 19 '18 at 06:04
  • @ManikanthaNekkalapudi - so values between `0-2` need multiple buckets? Like `0-0.5`, `0.5-1`, `1-1.5` and `1.5-2`? – jezrael Nov 19 '18 at 06:08
  • Yes, multiple buckets, but they should comply with my earlier condition, i.e. the bucket boundaries are powers of 2. `0.7` belongs to `(2^-1) - (2^0)`. So the ranges `1-1.5` and `1.5-2` don't fit. The exponents of `2` should be `int` – maninekkalapudi Nov 19 '18 at 06:13
  • @ManikanthaNekkalapudi - so is it necessary to add one bin like `0-1`, so that the bins are `0,1,2,4,8,...`? – jezrael Nov 19 '18 at 06:30
  • Yes, the bins should be `0,1,2,4,8,...`. Here we've used the max value as `a`; can we create another variable `b=df['col'].min()` so that the buckets are generated from the lowest to the highest range of the column values? – maninekkalapudi Nov 19 '18 at 06:35
  • @ManikanthaNekkalapudi - So you need `bins = [0] + [1< – jezrael Nov 19 '18 at 06:39

Using pd.cut with power-of-2 bins:

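import numpy as np
import pandas as pd

bins = [2**i for i in range(0, int(np.log2(df.col.max())) + 2)]
# alternative (np.ceil returns a float, so cast to int before passing it to range):
# [2**i for i in range(0, int(np.ceil(np.log2(df.col.max()))) + 1)]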
bin_labels = [f'{x}-{y}' for x, y in zip(bins[:-1], bins[1:])]
df['bucket-labels'] = pd.cut(df.col, bins=bins, labels=bin_labels)

print(df)
   col bucket-labels
0   12          8-16
1   76        64-128
2   34         32-64
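
One detail worth checking with this approach: by default `pd.cut` builds right-closed intervals, so a value that equals a bin edge goes into the bucket ending at that edge, and a value equal to the first edge is left unbinned (NaN). A small sketch of the difference, using made-up sample values that are not from the question:

```python
import pandas as pd

bins = [1, 2, 4, 8, 16, 32]
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]
s = pd.Series([1, 8, 16])  # values sitting exactly on bin edges

# Default (right=True): intervals are (lo, hi], so 1 is unbinned,
# 8 lands in 4-8 and 16 lands in 8-16.
right_closed = pd.cut(s, bins=bins, labels=labels)

# right=False: intervals are [lo, hi), so 1 lands in 1-2,
# 8 lands in 8-16 and 16 lands in 16-32.
left_closed = pd.cut(s, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"value": s, "right_closed": right_closed, "left_closed": left_closed}))
```

Which convention fits depends on whether a value like 16 should read as 8-16 or 16-32; the question does not say, so this is left as a choice.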
Space Impact