5

I want to insert a new column called total in final_dfwhich is a cumulative sum of value in df if it occurs between the times in final_df. It sums the values if it occurs between the start and end in final_df. So for example during the time range 01:30 to 02:00 in final_df - both index 0 and 1 in df occur between this time range so the total is 15 (10+5).

I have two pandas dataframes:

df

import pandas as pd

d = {'start_time': ['01:00','00:00','00:30','02:00'], 
     'end_time': ['02:00','03:00','01:30','02:30'], 
     'value': ['10','5','20','5']}

df = pd.DataFrame(data=d)

final_df

final_df = {'start_time': ['00:00, 00:30, 01:00, 01:30, 02:00, 02:30'],
            'end_time': ['00:30, 01:00, 01:30, 02:00, 02:30, 03:00']}

final_df = pd.DataFrame(data=final_d)

output I want final_df

start_time  end_time total
00:00       00:30    5
00:30       01:00    25
01:00       01:30    35
01:30       02:00    15
02:30       03:00    10

My try

final_df['total'] = final_df.apply(lambda x: df.loc[(df['start_time'] >= x.start_time) & 
                                            (df['end_time'] <= x.end_time), 'value'].sum(), axis=1)

Problem 1

I get the error: TypeError: ("'>=' not supported between instances of 'str' and 'datetime.time'", 'occurred at index 0')

I converted the relevant columns to datetime as follows:

df[['start_time','end_time']] = df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M')
final_df[['start_time','end_time']] = final_df[['start_time','end_time']].apply(pd.to_datetime, format='%H:%M:%S')

But I don't want to convert to datetime. Is there a way around this?

Problem 2

The sum is not working properly. It's only looking for exact match for the time range. So the output is:

 start_time  end_time total
    00:00       00:30    0
    00:30       01:00    0
    01:00       01:30    0
    01:30       02:00    0
    02:30       03:00    5
thor
  • 281
  • 1
  • 6
  • 13

1 Answers1

3

One way to not use apply could be like this this.

df_ = (df.rename(columns={'start_time':1, 'end_time':-1}) #to use in the calculation later
         .rename_axis(columns='mult') # mostly for esthetic
         .set_index('value').stack() #reshape the data
         .reset_index(name='time') # put the index back to columns
      )
df_ = (df_.set_index(pd.to_datetime(df_['time'], format='%H:%M')) #to use resampling technic
          .assign(total=lambda x: x['value'].astype(float)*x['mult']) #get plus or minus the value depending start/end
          .resample('30T')[['total']].sum() # get the sum at the 30min bounds
          .cumsum() #cumulative sum from the beginning
      )
# create the column for merge with final resul
df_['start_time'] = df_.index.strftime('%H:%M')

# merge
final_df = final_df.merge(df_)

and you get

print (final_df)
  start_time end_time  total
0      00:00    00:30    5.0
1      00:30    01:00   25.0
2      01:00    01:30   35.0
3      01:30    02:00   15.0
4      02:00    02:30   10.0
5      02:30    03:00    5.0

But if you want to use apply, first you need to ensure that the columns are the good dtype and then you did the inegality in the reverse order like:

df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M')
df['end_time'] = pd.to_datetime(df['end_time'], format='%H:%M')
df['value'] = df['value'].astype(float)
final_df['start_time'] = pd.to_datetime(final_df['start_time'], format='%H:%M')
final_df['end_time'] = pd.to_datetime(final_df['end_time'], format='%H:%M')

final_df.apply(
    lambda x: df.loc[(df['start_time'] <= x.start_time) & #see other inequality
                     (df['end_time'] >= x.end_time), 'value'].sum(), axis=1)
0     5.0
1    25.0
2    35.0
3    15.0
4    10.0
5     5.0
dtype: float64
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • 1
    Nice answer @Ben.T. – Shubham Sharma Jun 12 '20 at 14:00
  • Nice answer, both solutions worked. For the apply solution, for some reason the last time range in my final_df (i.e. 23:30 to 00:00. Note not included my sample above ) summed everything in the column which shouldn't be the case. – thor Jun 12 '20 at 14:26
  • 1
    @thor hmm like this I'm not sure why, I'll try to test few things but yeah bounds effect are possible, especially when playing with intervals :) – Ben.T Jun 12 '20 at 14:39
  • @thor so after thinking a bit to your comment, it makes sense that you get the sum of all and like this I'm not sure there is an easy fix to it. if you write the inequality with this interval, then you get (df['start_time'] <= 23:30) & (df['end_time'] >= 00:00) which gives indeed all the values in df. – Ben.T Jun 12 '20 at 20:48