1

A similar question has been asked here before

But when I tried all the available solutions it was giving me error.

Code:

print sum(data['Activity_Duration'],datetime.timedelta())  
#import operator  
#print reduce(operator.add, data['Activity_Duration'])  

Error:

OverflowError

1 #print sum(data['Activity_Duration'],datetime.timedelta())
2 import operator
----> 3 print reduce(operator.add, data['Activity_Duration'])

OverflowError: long too big to convert

Am i missing something or can we come up with a more scalable solution?

Info: My data has 436746 rows.

I am working on 8 gig machine and data size is 650MB

Community
  • 1
  • 1
Vivek Kalyanarangan
  • 8,951
  • 1
  • 23
  • 42

1 Answers1

4

I think you need sum:

print (df['Activity_Duration'].sum())

Sample:

import pandas as pd

start = pd.to_datetime('2015-02-24')
end = pd.to_datetime('2016-04-25')
rng = pd.date_range(start, end, freq='6D')

start = pd.to_datetime('2015-02-26')
end = pd.to_datetime('2016-04-27')
rng1 = pd.date_range(start, end, freq='6D')

df = pd.DataFrame({'Date1': rng, 'Date2': rng1})  

df['Activity_Duration'] = df.Date2 - df.Date1
print (df)
        Date1      Date2  Activity_Duration
0  2015-02-24 2015-02-26             2 days
1  2015-03-02 2015-03-04             2 days
2  2015-03-08 2015-03-10             2 days
3  2015-03-14 2015-03-16             2 days
4  2015-03-20 2015-03-22             2 days
5  2015-03-26 2015-03-28             2 days
6  2015-04-01 2015-04-03             2 days
7  2015-04-07 2015-04-09             2 days
8  2015-04-13 2015-04-15             2 days
9  2015-04-19 2015-04-21             2 days
...
...


print (df['Activity_Duration'].sum())
144 days 00:00:00

If need output in float:

import numpy as np

df['Activity_Duration'] = (df.Date2 - df.Date1) / np.timedelta64(1, 'D')
print (df)
        Date1      Date2  Activity_Duration
0  2015-02-24 2015-02-26                2.0
1  2015-03-02 2015-03-04                2.0
2  2015-03-08 2015-03-10                2.0
3  2015-03-14 2015-03-16                2.0
4  2015-03-20 2015-03-22                2.0
...
...
...

print (df['Activity_Duration'].sum())
144.0

Another solution is dt.days - output is int:

print (df['Activity_Duration'].dt.days.sum())
144

Timedelta limitations.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I would still like the output in timedelta like your first suggestion but I am getting this error now - --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () 2 #import operator 3 #print reduce(operator.add, data['Activity_Duration']) ----> 4 data['Activity_Duration'].sum() ValueError: overflow in timedelta operation – Vivek Kalyanarangan Jun 20 '16 at 15:54
  • I think it means output is too big number. what return second suggestion? – jezrael Jun 20 '16 at 15:58
  • Sum is more as `106751`, it is max value of timedelta - [see](http://stackoverflow.com/questions/36766546/sum-overflow-timedeltas-in-python-pandas) – jezrael Jun 20 '16 at 16:01
  • okk I understand thanks for that. So is there a way I can get this computation done? – Vivek Kalyanarangan Jun 20 '16 at 16:04
  • 1
    I think solution is my second suggestion or `df['Activity_Duration'].dt.days.sum()` – jezrael Jun 20 '16 at 16:07
  • `df['Activity_Duration'].dt.days.sum()` worked like magic! thanks!!! Please edit your answer so that it includes this and I will accept :) – Vivek Kalyanarangan Jun 21 '16 at 05:47