0

I have a dataframe df:

                       col1     col2    col3
2020-01-02 08:50:00    360.0    -131.0  -943.0
2020-01-02 08:52:01    342.0    -130.0  -1006.0
2020-01-02 08:55:04    321.0    -130.0  -997.0
... ... ... ...
2022-01-03 14:44:56    1375.0   -91.0   -728.0
2022-01-03 14:50:57    1381.0   -118.0  -692.0
2022-01-03 14:50:58    1382.0   -115.0  -697.0
2022-01-03 14:50:59    1390.0   -111.0  -684.0
2022-01-03 14:55:58    1442.0   -106.0  -691.0

I want a function that obtains the indices that:

Are NOT within a specific time (e.g., 5 minutes) of each other.

For example:

masked_df = time_mask(df.index, pd.Timedelta(minutes=5))

masked_df:

                       col1     col2    col3
2020-01-02 08:50:00    360.0    -131.0  -943.0
2020-01-02 08:55:04    321.0    -130.0  -997.0
... ... ... ...
2022-01-03 14:44:56    1375.0   -91.0   -728.0
2022-01-03 14:50:57    1381.0   -118.0  -692.0
2022-01-03 14:55:58    1442.0   -106.0  -691.0

The function time_mask should obtain the first index that is not within 5 minutes of the previously added index. Below is my iterative attempt to solve this problem:

def get_clean_ix_from_rolling(idx, time_delt):
    
    clean_ix = []
    prev_ix = idx[0]
    clean_ix.append(prev_ix)
    for i, x in enumerate(idx):
        if((x-prev_ix) >= time_delt):
            clean_ix.append(x)
            prev_ix = x

    ix = pd.to_datetime(clean_ix)
    return ix

How can I speed up my code above?

1 Answers1

0

Shift your index by one row using .shift() and subtract it by the value of next value using .sub(). Get the minute difference using astype and check if it is equal to your time_delta using .eq(). Finally mask the index and get the results:

import pandas as pd

# Create some sample data
df = pd.DataFrame(index=["2020-01-02 08:50:00",
                           "2020-01-02 08:55:04",
                           "2022-01-03 14:44:56",
                           "2022-01-03 14:50:57",
                           "2022-01-03 14:55:57",
                           "2022-01-03 14:56:57",
                           "2022-01-03 14:57:57",
                           "2022-01-03 14:58:57",
                           "2022-01-03 14:59:57"])
                                                      
df.index = pd.to_datetime(df.index)
df

Output:

2020-01-02 08:50:00
2020-01-02 08:55:04
2022-01-03 14:44:56
2022-01-03 14:50:57
2022-01-03 14:55:57
2022-01-03 14:56:57
2022-01-03 14:57:57
2022-01-03 14:58:57
2022-01-03 14:59:57

time_delta = 5
idx = df.index.to_series()
s = idx.shift(-1).sub(idx).astype('timedelta64[m]').eq(time_delta)
df[~s]

Output:

2020-01-02 08:55:04
2022-01-03 14:44:56
2022-01-03 14:55:57
2022-01-03 14:56:57
2022-01-03 14:57:57
2022-01-03 14:58:57
2022-01-03 14:59:57

Edit:

I doubt that you can do what you described in the comments without an explicit for loop, in which case you'd better use itertools() or itertuple(). They will be faster, but I don't recommend using them either. Consider a change in approach like the following:

import pandas as pd
import numpy as np
# Create some sample data
dates = ["2022-01-02 08:50:00",
                           "2022-01-02 08:55:04",
                           "2022-01-03 14:44:56",
                           "2022-01-03 14:50:57",
                           "2022-01-03 14:55:57",
                           "2022-01-03 15:00:57",
                           "2022-01-03 15:02:57",
                           "2022-01-03 15:05:57",
                           "2022-01-03 15:08:57",
                           "2022-01-03 15:10:57"]
df = pd.DataFrame(index=pd.to_datetime(dates),
                  data=np.random.randint(low=-100, high=100, size=(10, 3))).add_prefix('col')
 df

Output:

                 col0 col1 col2
2022-01-02 08:50:00 1   43  71
2022-01-02 08:55:04 42  86  20
2022-01-03 14:44:56 -47 -97 64
2022-01-03 14:50:57 -5  76  59
2022-01-03 14:55:57 67  98  -36
2022-01-03 15:00:57 -44 85  33
2022-01-03 15:02:57 -8  14  67
2022-01-03 15:05:57 85  -84 -83
2022-01-03 15:08:57 -86 46  81
2022-01-03 15:10:57 -77 29  -63

df.resample('5min').agg('first').dropna(axis=0)

Output:

                    col0    col1    col2
2022-01-02 08:50:00 1.0     43.0    71.0
2022-01-02 08:55:00 42.0    86.0    20.0
2022-01-03 14:40:00 -47.0   -97.0   64.0
2022-01-03 14:50:00 -5.0    76.0    59.0
2022-01-03 14:55:00 67.0    98.0    -36.0
2022-01-03 15:00:00 -44.0   85.0    33.0
2022-01-03 15:05:00 85.0    -84.0   -83.0
2022-01-03 15:10:00 -77.0   29.0    -63.0
ali bakhtiari
  • 1,051
  • 4
  • 23
  • This is incorrect. I don't want just the next row to be 5 minutes away, but any row after that. So, the above output should not have 14:55 onwards – Vanillihoot Jan 03 '23 at 19:29
  • Resampling every 5 minutes and taking the first also does not work. Imagine having an index at 14:49 and another at 14:51. If 14:49 is the only index in that minute, it will violate the time rule. – Vanillihoot Jan 04 '23 at 14:54