1

I would like to drop 30 minutes of data after column is -1.

I have a large dataset and I need to remove 30 minutes from the dataframe after the column 'diff' == -1

I have the following code:


df12_3.head() 

   index           Timestamp  Status_flag  diff
0      0 2020-03-02 10:40:00            0  -1.0
1      1 2020-03-02 10:41:00            0   0.0
2      2 2020-03-02 10:42:00            0   0.0
3      3 2020-03-02 10:43:00            0   0.0
4      4 2020-03-02 10:44:00            0   0.0

idx =df12_3['Timestamp'][df12_3['diff'] == -1]

idx.head(1)
0      2020-03-02 10:40:00

halfhour = datetime.timedelta(minutes=30)

idx2=df12_3['Timestamp'][df12_3['diff'] == -1]+halfhour

idx2.head(1)

0   2020-03-02 11:10:00

df12_3.loc[(df12_3.index < idx.iloc[0] ) | (df12_3.index > idx2.iloc[0])]

This removes the first 30 minute range. Is there a way to set this up so I can remove every 30minutes from when column diff == -1. I would be also happy to remove by number of rows e.g. when diff==-1 remove the next 30 rows.

user19077881
  • 3,643
  • 2
  • 3
  • 14
lourew
  • 65
  • 1
  • 1
  • 7

1 Answers1

1

You can use a groupby.transform operation and boolean indexing:

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
               .transform('first').add(pd.Timedelta('30min'))
               .le(df['Timestamp'])
            ]

Intermediates (with slightly different input for clarity):

   index           Timestamp  Status_flag  diff  cumsum               first              +30min  >Timestamp
0      0 2020-03-02 10:40:00            0  -1.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
1      1 2020-03-02 10:41:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
2      2 2020-03-02 10:42:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
3      3 2020-03-02 11:03:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
4      4 2020-03-02 11:14:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00        True

Alternative:

out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
               .transform('first').rsub(df['Timestamp'])
               .gt(pd.Timedelta('30min'))
            ]
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Hi thank you the code identifies the diff at -1 and the following 30 minutes. I tried to use df.drop to remove the sets of 30 minutes of data but error: KeyError: "['Timestamp' 'index' 'Status_flag' 'diff'] not found in axis". – lourew Jun 20 '23 at 12:45
  • @lourew yes, sorry you need to invert the mask, Use `.le(df['Timestamp'])` rather than `.gt(df['Timestamp'])`. See update – mozway Jun 20 '23 at 12:50
  • thank you so much! it works. I will have a look into what this all means but if you have any links that would be great but no worries if not I can find them :) – lourew Jun 20 '23 at 12:55
  • Hi @mozway. is there a way to edit this code to get the 5 minutes before diff == 1 e.g. here is 5 minutes after: `out = df.loc[df.groupby(df['diff'].eq(1).cumsum())['Timestamp'] .transform('first').rsub(df['Timestamp']) .le(pd.Timedelta('5min')) ]` – lourew Aug 14 '23 at 13:33