I am wondering whether there is some vectorized trick to do this better than the trivial solution, where one implements it with a loop and lots of binary searches:
DataFrame 1 (DF1) has a time field containing arbitrary timestamps (microsecond resolution), and it is not small: a few million rows.
DataFrame 2 (DF2) also has a time field with arbitrary timestamps, but those timestamps will not usually match any timestamp in DF1. DF2 additionally contains columns with measurement values.
Now DF1 should be joined with DF2 in the following way: for each row in DF1 with timestamp t1, find all rows in DF2 whose timestamps fall in the interval t1-INTVL to t1, e.g. all timestamps within the last half hour. Aggregate the measurement columns over those rows and join the DF1 row with the aggregated columns.
Are there any methods that would do this better than just literally implementing what I described above in Python?
Since people downvoted this, perhaps because I did not provide the code for the "trivial" solution, here is my current trivial solution (DF1 and DF2 as described above):
import pandas as pd

# Step 1: for each DF1 timestamp, build a dictionary mapping the original
# timestamp to the aggregated variable series over the preceding 30 minutes
aggcols = [c for c in DF2.columns if c != "timestamp"]  # columns which can be summed

def calc_agg(df, ts_to, cols):
    # sum the measurement columns over all rows with timestamp in
    # [ts_to - 30 minutes, ts_to]
    ts_from = ts_to - pd.Timedelta(minutes=30)
    intvl_rows = df[(df["timestamp"] <= ts_to) & (df["timestamp"] >= ts_from)]
    return intvl_rows[cols].sum()

ts2ser = {ts: calc_agg(DF2, ts, aggcols) for ts in DF1["timestamp"]}

# Step 2: add the aggregated columns to the original DF1
for c in aggcols:
    DF1[c] = DF1["timestamp"].apply(lambda x: ts2ser[x][c])
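As a minor aside, even keeping the row-by-row aggregation of Step 1, Step 2 could probably be a single join instead of one .apply() per column. A sketch, reusing the ts2ser dict built in Step 1:

# turn the per-timestamp series into one frame (index = timestamp,
# columns = aggcols) and join it onto DF1 via its timestamp column
agg_df = pd.DataFrame(ts2ser).T
DF1 = DF1.join(agg_df, on="timestamp")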
Either way, this works, but Step 1 still literally processes DF1 row by row.
So the question is: am I missing anything in pandas/numpy that could make this more scalable?
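For reference, one direction that might avoid the row-by-row aggregation entirely is a cumulative-sum plus searchsorted approach instead of a pandas join. This is only a sketch I have not verified against the trivial solution; it assumes the timestamp columns are plain (tz-naive) datetime64 columns and the measurement columns are numeric:

import numpy as np

# sort DF2 once and build cumulative sums with a prepended zero row, so the
# sum over rows lo..hi-1 is simply csum[hi] - csum[lo]
# (the zero row makes the sums float even for integer inputs)
df2 = DF2.sort_values("timestamp").reset_index(drop=True)
csum = np.vstack([np.zeros((1, len(aggcols))), df2[aggcols].to_numpy().cumsum(axis=0)])

t2 = df2["timestamp"].to_numpy()
t1 = DF1["timestamp"].to_numpy()

# lo = first DF2 row with timestamp >= t1 - 30 minutes (inclusive lower bound),
# hi = first DF2 row with timestamp >  t1 (inclusive upper bound),
# matching the <= / >= comparisons in calc_agg above
lo = np.searchsorted(t2, t1 - np.timedelta64(30, "m"), side="left")
hi = np.searchsorted(t2, t1, side="right")

DF1[aggcols] = pd.DataFrame(csum[hi] - csum[lo], columns=aggcols, index=DF1.index)

Even so, I may well be missing a more idiomatic pandas way (merge_asof, rolling, resample, ...) to express the same thing, which is what I am asking about.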