
I am wondering if there is some vectorized trick to do this better than the trivial solution of implementing it in a loop with lots of binary searches:

Dataframe 1 has a time field that contains arbitrary time stamps (microsecond resolution) and it is not small, containing a few million rows.

Dataframe 2 also has a time field that contains arbitrary time stamps, but those time stamps will not usually match any time stamp in DF1. DF2 also contains columns with measurement values.

Now, DF1 should get joined with DF2 in the following way: for each row in DF1 with timestamp t1 find all the rows in DF2 which have timestamps in the interval t1-INTVL to t1, e.g. all timestamps within the last half hour. Aggregate the measurement columns over those rows and join the row in DF1 with those aggregated columns.

Are there any methods which could help doing this better than just trivially implementing literally what I described in Python code?

Since people downvoted this, perhaps because I did not provide the code for the "trivial" solution, here is my current version (in my actual code the dataframes are called dfg and dfb; below I refer to them as DF1 and DF2 as above):

# Step 1: for all rows create a dictionary from original timestamp to aggregated variable series
aggcols = [c for c in DF2.columns if c not in ["timestamp"]]  # columns which can be summed
def calc_agg(df, ts_to, cols):
    ts_from = ts_to - pd.Timedelta(minutes=30)
    intvl_rows = df[(df["timestamp"] <= ts_to) & (df["timestamp"] >= ts_from)]
    agg = intvl_rows[cols].sum()
    return agg
ts2ser = {
    ts: calc_agg(DF2, ts, aggcols) for ts in DF1["timestamp"]
}
# Step 2: add the aggregated columns to the original DF1
for c in aggcols:
    DF1[c] = DF1["timestamp"].apply(lambda x: ts2ser[x][c])

This works but it literally processes DF1 in a row-by-row fashion.
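For reference, here is a tiny synthetic setup that the snippet above can be run against (the column names value_a and value_b are made up for illustration; any summable measurement columns work the same way):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.Timestamp("2022-01-01")
# DF2: irregular microsecond timestamps plus measurement columns
DF2 = pd.DataFrame({
    "timestamp": base + pd.to_timedelta(np.sort(rng.integers(0, 3_600_000_000, 40)), unit="us"),
    "value_a": rng.random(40),
    "value_b": rng.random(40),
})
# DF1: other irregular timestamps that generally do not occur in DF2
DF1 = pd.DataFrame({
    "timestamp": base + pd.to_timedelta(np.sort(rng.integers(0, 3_600_000_000, 15)), unit="us"),
})

With this in place, Step 1 and Step 2 above run unchanged.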

So the question was: do I miss anything in pandas/numpy that could make this more scalable?

jpp1

1 Answer


You could try the following:

df1 = df1.merge(
    pd.concat(
        [df2, df1.loc[~df1["timestamp"].isin(df2["timestamp"]), ["timestamp"]]]
    )
    .fillna(0)
    .sort_values("timestamp")
    .rolling("30T", on="timestamp").sum(),
    on="timestamp", how="left"
)
  • First add rows to df2 with those timestamps of df1 that aren't already in df2. The measurement values of these injected rows are NaN; the .fillna(0) after the concat turns them into 0 so they don't affect the sums.
  • Sort the new dataframe on timestamp.
  • Take the .rolling sum on timestamp with a 30-minute window. You may have to set the keyword argument closed to "both" so that both ends of the interval are included - it depends on what you exactly want (the loop version in the question uses <= and >= on both ends).
  • Now merge the result with df1 on timestamp. A small end-to-end sketch follows this list.
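Here is a self-contained sketch of the whole pipeline on small made-up frames (the frame construction and the column names value_a/value_b are assumptions taken from the question, not part of the answer; closed="both" is the adjustment mentioned above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = pd.Timestamp("2022-01-01")
df2 = pd.DataFrame({
    "timestamp": base + pd.to_timedelta(np.sort(rng.integers(0, 3_600_000_000, 40)), unit="us"),
    "value_a": rng.random(40),
    "value_b": rng.random(40),
})
df1 = pd.DataFrame({
    "timestamp": base + pd.to_timedelta(np.sort(rng.integers(0, 3_600_000_000, 15)), unit="us"),
})

rolled = (
    pd.concat([df2, df1.loc[~df1["timestamp"].isin(df2["timestamp"]), ["timestamp"]]])
    .fillna(0)                 # injected rows get 0 so they don't change the sums
    .sort_values("timestamp")
    # closed="both" includes both interval ends, matching the <= / >= in the question
    .rolling("30T", on="timestamp", closed="both")
    .sum()
)
result = df1.merge(rolled, on="timestamp", how="left")
print(result.head())

The result has one row per df1 timestamp, with each measurement column summed over the window from t - 30 minutes to t.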

I've run this against sample dataframes with several million rows and it seems reasonably fast, way faster than your version. It also reproduces the results from your program.

Timus
    Thank you, this is a great solution! And it is indeed a LOT faster: in one of my tests, the speedup is larger than 250 times! Thank you for sharing this, especially on a question that got so many downvotes! – jpp1 Sep 23 '22 at 21:02