
I have the following dataset that I'm hoping to apply some custom logic to:

import pandas as pd

data = pd.DataFrame({
    'ID': ['A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Date': ['2018-07-02T02:21:12.000+0000', '2018-07-02T02:28:29.000+0000',
             '2018-07-02T02:28:31.000+0000', '2018-07-02T02:30:58.000+0000',
             '2018-07-02T02:31:01.000+0000', '2018-07-02T02:42:46.000+0000',
             '2018-07-02T02:41:47.000+0000'],
    'Action': ['Start', 'Start', 'Start', 'Stop', 'Stop', 'Start', 'Start'],
    'Group': [5, 13, 13, 19, 19, 2, 2],
    'Value': [100, 110, 110, 95, 95, 280, 280]
})

Rows 1:2, 3:4, and 5:6 are identical except for the values in the Date column, which differ by a matter of seconds. Is there a way to remove duplicates when 1) the Date timedelta between otherwise-identical rows is less than 1 minute, and 2) all other columns are identical?

The result should look like the following:

result = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D'],
    'Date': ['2018-07-02T02:21:12.000+0000', '2018-07-02T02:28:29.000+0000',
             '2018-07-02T02:30:58.000+0000', '2018-07-02T02:42:46.000+0000'],
    'Action': ['Start', 'Start', 'Stop', 'Start'],
    'Group': [5, 13, 19, 2],
    'Value': [100, 110, 95, 280]
})
Vadim Kotov
Dfeld
  • This can be ambiguous. Say you have multiple rows within a 2-minute period. Which 1-minute slot do you choose to group by? – jpp Jul 19 '18 at 15:29
  • jpp makes a very good point. The solution I provided cascades, meaning that if you had 10 measurements all 59 seconds apart, you would still keep only the first measurement and consider the other 9 duplicates. Perhaps this is what you want, but if not, you will need to perform some other type of bucketing. If you only ever have 2 dates for each group it won't really be an issue. – ALollz Jul 19 '18 at 16:43
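A minimal sketch of the fixed-window bucketing the comments allude to (the `bucket` column and `key` list are illustrative names, not from the original post): floor each timestamp to a 1-minute window and keep the earliest row per window. Note how it answers jpp's ambiguity differently than the cascading approach — pairs that straddle a minute boundary survive even when they are only seconds apart.

```python
import pandas as pd

data = pd.DataFrame({
    'ID': ['A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Date': ['2018-07-02T02:21:12.000+0000', '2018-07-02T02:28:29.000+0000',
             '2018-07-02T02:28:31.000+0000', '2018-07-02T02:30:58.000+0000',
             '2018-07-02T02:31:01.000+0000', '2018-07-02T02:42:46.000+0000',
             '2018-07-02T02:41:47.000+0000'],
    'Action': ['Start', 'Start', 'Start', 'Stop', 'Stop', 'Start', 'Start'],
    'Group': [5, 13, 13, 19, 19, 2, 2],
    'Value': [100, 110, 110, 95, 95, 280, 280]
})
data['Date'] = pd.to_datetime(data['Date'])

key = ['ID', 'Action', 'Group', 'Value']
# Assign each row to a fixed 1-minute bucket, then keep the
# earliest row per (key columns, bucket) combination
data['bucket'] = data['Date'].dt.floor('1min')
bucketed = (data.sort_values('Date')
                .drop_duplicates(subset=key + ['bucket'])
                .drop(columns='bucket'))
```

On this sample it keeps both C rows (02:30:58 and 02:31:01 land in different buckets despite being 3 seconds apart) and both D rows, so the bucketed result has 6 rows rather than 4 — which is exactly the trade-off being discussed.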

1 Answer


It seems the logic you want can be accomplished with a self-merge using pd.merge_asof, which allows a combination of matching on exact keys and on the nearest key.

You exclude exact matches and use the default backward direction, which means that when two (or more) entries fall within the specified time window, you always keep only the earliest one.

import pandas as pd

data['Date'] = pd.to_datetime(data.Date)
data = data.sort_values('Date')

# Self-merge: for each row, find the nearest *earlier* row ('backward')
# whose other columns all match, excluding exact-timestamp matches
merged = pd.merge_asof(
    data,
    data.rename(columns={'Date': 'Date2'}),
    by=[x for x in data.columns if x != 'Date'],
    left_on='Date', right_on='Date2',
    direction='backward',
    allow_exact_matches=False)

# Keep rows with no earlier match at all, or whose nearest
# earlier match is more than 1 minute away
merged[(merged.Date2.isnull()) | ((merged.Date - merged.Date2) > pd.Timedelta(minutes=1))].drop(columns='Date2')

Outputs:

  ID                Date Action  Group  Value
0  A 2018-07-02 02:21:12  Start      5    100
1  B 2018-07-02 02:28:29  Start     13    110
3  C 2018-07-02 02:30:58   Stop     19     95
5  D 2018-07-02 02:41:47  Start      2    280
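Since the backward self-merge only ever compares each row to its immediate predecessor within a group, the same cascading behaviour can be sketched with a per-group diff instead of merge_asof (variable names `key` and `gap` are illustrative, not from the answer):

```python
import pandas as pd

data = pd.DataFrame({
    'ID': ['A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Date': ['2018-07-02T02:21:12.000+0000', '2018-07-02T02:28:29.000+0000',
             '2018-07-02T02:28:31.000+0000', '2018-07-02T02:30:58.000+0000',
             '2018-07-02T02:31:01.000+0000', '2018-07-02T02:42:46.000+0000',
             '2018-07-02T02:41:47.000+0000'],
    'Action': ['Start', 'Start', 'Start', 'Stop', 'Stop', 'Start', 'Start'],
    'Group': [5, 13, 13, 19, 19, 2, 2],
    'Value': [100, 110, 110, 95, 95, 280, 280]
})
data['Date'] = pd.to_datetime(data['Date'])

key = ['ID', 'Action', 'Group', 'Value']
d = data.sort_values('Date')
# Gap to the previous row within each group of identical key columns;
# the first row of each group has a NaT gap and is kept explicitly
gap = d.groupby(key)['Date'].diff()
result = d[gap.isna() | (gap > pd.Timedelta(minutes=1))]
```

On the sample data this produces the same 4 rows as the merge_asof solution, including D at 02:41:47 (the earliest of the two D timestamps once sorted).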
ALollz