I have two pandas dataframes, each with two columns: a measurement and a timestamp. I need to multiply the first differences of the measurements, but only if there is a time overlap between the two measurement intervals. How can I do this efficiently, as the size of the dataframes gets large? Example:
dfA
mesA timeA
0 125 2015-01-14 04:44:49
1 100 2015-01-14 05:16:23
2 115 2015-01-14 08:57:10
dfB
mesB timeB
0 140 2015-01-14 00:13:17
1 145 2015-01-14 08:52:01
2 120 2015-01-14 11:31:44
Here I would multiply (100-125)*(145-140)
since there is a time overlap between the intervals [04:44:49, 05:16:23]
and [00:13:17, 08:52:01]
, but not (100-125)
and(120-145)
, since there isn't one. Similarly, I would have (115-100)*(145-140)
but also (115-100)*(120-145)
, since both have a time overlap.
In the end I will have to sum all the relevant products in a single value, so the result need not be a dataframe. In this case:
s = (100-125)*(145-140)+(115-100)*(145-140)+(115-100)*(120-145) = -425
My current solution:
s = 0
for i in range(1, len(dfA)):
startA = dfA['timeA'][i-1]
endA = dfA['timeA'][i]
for j in range(1, len(dfB)):
startB = dfB['timeB'][j-1]
endB = dfB['timeB'][j]
if (endB>startA) & (startB<endA):
s+=(dfA['mesA'][i]-dfA['mesA'][i-1])*(dfB['mesB'][j]-dfB['mesB'][j-1])
Although it seems to work, it is very inefficient and becomes impractical with very large datasets. I believe it could be vectorized more efficiently, perhaps using numexpr
, but I still haven't found a way.
EDIT: other data
mesA timeA
0 125 2015-01-14 05:54:03
1 100 2015-01-14 11:39:53
2 115 2015-01-14 23:58:13
mesB timeB
0 110 2015-01-14 10:58:32
1 120 2015-01-14 13:30:00
2 135 2015-01-14 22:29:26
s = 125