Resample/Slice Pandas Timeseries/Dataframe by minimum timedelta

Question

I have a pandas dataframe or series with timestamps at irregular intervals. I want to filter it so that consecutive rows are at least a minimum distance apart, for example 20ms; the distance may also be larger. The example below shows the original table on the left and, on the right, how it should be filtered based on the minimum distance.

317   2022-12-31 00:00:00.360               317   2022-12-31 00:00:00.360 
318   2022-12-31 00:00:00.364               318   
319   2022-12-31 00:00:00.368               319
320   2022-12-31 00:00:00.372               320   2022-12-31 00:00:00.372
321   2022-12-31 00:00:00.376               321
322   2022-12-31 00:00:00.380               322
323   2022-12-31 00:00:00.384               323
324   2022-12-31 00:00:00.388               324
325   2022-12-31 00:00:00.392               325   2022-12-31 00:00:00.392
326   2022-12-31 00:00:00.396      ->       326
327   2022-12-31 00:00:00.414               327   2022-12-31 00:00:00.414
328   2022-12-31 00:00:00.416               328
329   2022-12-31 00:00:00.420               329
330   2022-12-31 00:00:00.425               330
331   2022-12-31 00:00:00.428               331
332   2022-12-31 00:00:00.432               332
333   2022-12-31 00:00:00.438               333   2022-12-31 00:00:00.438

This is the code I have used so far to solve the problem; it is just a simple for loop:

# timestamps: sequence of timestamps; min_delta: minimum distance, e.g. pd.Timedelta('20ms')
res = [timestamps[0]]
# iterate over all timestamps with correct samplerate in following window
for dt in timestamps[1:]:
    # check if the difference between the current timestamp and the last timestamp in res
    # is equal to or bigger than the needed minimum of 20ms (= min_delta)
    if dt - res[-1] >= min_delta:
        # if yes, add timestamp to res
        res.append(dt)

However, my problem is that the dataframe is very large and the simple for loop is too slow.

I need a faster solution and have already experimented with .resample, .diff, vectorization and similar, but never got the desired result. Does anyone have an idea how I could solve the problem efficiently? Thanks in advance for your ideas!

score 0 · Answer 1 · answered Jun 20 '23 at 07:35

You can use the diff method to calculate the time difference between each row and then use boolean indexing to filter the dataframe based on a minimum time difference threshold. Here is an example code snippet:

import pandas as pd

# create example dataframe (evenly spaced every 5 ms here; real data would be irregular)
df = pd.DataFrame({'timestamp': pd.date_range('2022-12-31 00:00:00', periods=100, freq='5ms')})

# calculate time differences
time_diffs = df['timestamp'].diff()

# set minimum time difference threshold
min_delta = pd.Timedelta('20ms')

# filter dataframe based on time differences
filtered_df = df.loc[time_diffs >= min_delta]

In this code snippet, df['timestamp'].diff() calculates the time difference between each row of the timestamp column. Then we set a minimum delta value of 20ms using min_delta = pd.Timedelta('20ms'). Finally, we use boolean indexing to filter the dataframe based on the time differences greater than or equal to the minimum delta value using df.loc[time_diffs >= min_delta].

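Note that this compares each row only with its immediate predecessor in the original dataframe, so it keeps the rows that follow a gap of at least min_delta. It does not reproduce the loop in the question, which measures the distance from the last kept row. The first row is also dropped because its diff is NaT, and with the evenly spaced 5ms example above no row passes the filter.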
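For reference, the loop in the question compares each timestamp with the last kept one rather than with its immediate predecessor. Below is a minimal sketch of one way to reproduce that behaviour with fewer Python-level iterations, assuming the timestamps are sorted in ascending order: after keeping a row, np.searchsorted finds the first later row that is at least min_delta away. The function name filter_min_delta and the sample data are hypothetical and not taken from the question or the answer.

```python
import numpy as np
import pandas as pd


def filter_min_delta(timestamps, min_delta):
    """Return positions of rows kept by the greedy minimum-distance filter.

    Hypothetical helper; assumes `timestamps` is sorted in ascending order.
    """
    step = pd.Timedelta(min_delta)
    if step <= pd.Timedelta(0):
        raise ValueError("min_delta must be positive")
    step64 = step.to_timedelta64()
    values = pd.DatetimeIndex(timestamps).values  # NumPy datetime64 array

    kept = []
    i, n = 0, len(values)
    while i < n:
        kept.append(i)
        # Binary search for the first timestamp >= (last kept + min_delta).
        i = np.searchsorted(values, values[i] + step64, side="left")
    return np.asarray(kept, dtype=np.intp)


# Made-up sample: offsets in milliseconds from an arbitrary start time.
base = pd.Timestamp("2022-12-31 00:00:00")
ts = base + pd.to_timedelta([0, 4, 8, 20, 24, 45, 50, 66], unit="ms")

idx = filter_min_delta(ts, "20ms")
kept = ts[idx]  # for a dataframe, df.iloc[idx] would select the same rows
```

The Python loop here runs once per kept row instead of once per input row, which may help when most rows are dropped. This has not been benchmarked against the original loop, so treat it as a starting point rather than a guaranteed speed-up.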
