0

I am trying to create a proper bin for a timestamp interval column,

using code such as

df['Bin'] = pd.cut(df['interval_length'], bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00','00:40:00','00:50:00','00:60:00']))

The Resulting df looks like:

time_interval  |           bin
  00:17:00        (0 days 00:10:00, 0 days 00:20:00]
  01:42:00                NaN
  00:15:00        (0 days 00:10:00, 0 days 00:20:00]
  00:00:00                NaN
  00:06:00        (0 days 00:00:00, 0 days 00:10:00]

Which is a little off as the result I want is jjust the time value and not the days and also I want the upper limit or last bin to be 60 mins or inf ( or more)

Desired Output:

time_interval  |           bin
      00:17:00        (00:10:00,00:20:00]
      01:42:00        (00:60:00,inf]
      00:15:00        (00:10:00,00:20:00]
      00:00:00        (00:00:00,00:10:00]
      00:06:00        (00:00:00,00:10:00]

Thanks for looking!

Chris90
  • 1,868
  • 5
  • 20
  • 42

2 Answers2

1

In pandas inf for timedeltas not exist, so used maximal value. Also for include lowest values is used parameter include_lowest=True if want bins filled by timedeltas:

b = pd.to_timedelta(['00:00:00','00:10:00','00:20:00',
                     '00:30:00','00:40:00',
                     '00:50:00','00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'],  include_lowest=True, bins=b)
print (df)
  time_interval                                             Bin
0      00:17:00              (0 days 00:10:00, 0 days 00:20:00]
1      01:42:00  (0 days 01:00:00, 106751 days 23:47:16.854775]
2      00:15:00              (0 days 00:10:00, 0 days 00:20:00]
3      00:00:00     (-1 days +23:59:59.999999, 0 days 00:10:00]
4      00:06:00     (-1 days +23:59:59.999999, 0 days 00:10:00]

If want strings instead timedeltas use zip for create labels with append 'inf':

vals = ['00:00:00','00:10:00','00:20:00',
        '00:30:00','00:40:00', '00:50:00','00:60:00']

b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))

vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])] 

df['Bin'] = pd.cut(df['time_interval'],  include_lowest=True, bins=b, labels=labels)
print (df)
  time_interval                Bin
0      00:17:00  00:10:00-00:20:00
1      01:42:00       00:60:00-inf
2      00:15:00  00:10:00-00:20:00
3      00:00:00  00:00:00-00:10:00
4      00:06:00  00:00:00-00:10:00
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could just use labels to solve it -

df['Bin'] = pd.cut(df['interval_length'], bins=pd.to_timedelta(['00:00:00','00:10:00','00:20:00','00:30:00','00:40:00','00:50:00','00:60:00', '24:00:00']), labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]', '(00:20:00,00:30:00]', '(00:30:00,00:40:00]', '(00:40:00,00:50:00]', '(00:50:00,00:60:00]', '(00:60:00,inf]'])
E. Gertz
  • 241
  • 4
  • 13