
I'm trying to remove near-duplicate rows from my data. I'm looking for a way to detect, for each user, the trips whose `edited_at` timestamps are closest together, without losing information.

So I want to solve this problem by calculating the difference between successive timestamps and removing the group with the minimum difference (Step-1 below; a rough sketch of both steps follows the Step-2 output).

I am open to other suggestions.

Note:

Don't use the drop_duplicates() function.

Dataframe:

data = [[111, 121, "2019-10-22 05:00:00", 0],
        [111, 121, "2019-10-22 05:00:00", 1],
        [111, 123, "2019-10-22 07:10:00", 0], 
        [111, 123, "2019-10-22 07:10:00", 1], 
        [111, 123, "2019-10-22 07:10:00", 2],
        [111, 124, "2019-10-22 07:20:00", 0],
        [111, 124, "2019-10-22 07:20:00", 1],
        [111, 124, "2019-10-22 07:20:00", 2],
        [111, 124, "2019-10-22 07:20:00", 3],
        [111, 125, "2019-10-22 19:20:00", 0], 
        [111, 125, "2019-10-22 19:20:00", 1],
        [222, 223, "2019-11-24 06:00:00", 0], 
        [222, 223, "2019-11-24 06:00:00", 1], 
        [222, 244, "2019-11-24 06:15:00", 0],
        [222, 244, "2019-11-24 06:15:00", 1],
        [222, 255, "2019-11-24 18:15:10", 0],
        [222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"]) 

df['edited_at'] = pd.to_datetime(df['edited_at'])

Step-1:

111, 121, "2019-10-22 05:00:00", 0, somthing, 
111, 121, "2019-10-22 05:00:00", 1, somthing, 
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720, 
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0,   0, 
222, 223, "2019-11-24 06:00:00", 1,   0, 
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720

Step-2:

111, 121, "2019-10-22 05:00:00", 0,  somthing,
111, 121, "2019-10-22 05:00:00", 1,  somthing, 
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720, 
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Adil Blanco
  • Neither Step-1 nor Step-2 reads as a clear specification to me. Aren't the latest date and the last row the same in chronological order? If that is the case, you can get it with `groupby().last()`. – r-beginners Jul 17 '20 at 05:38
  • I have updated my question to make it clearer, with a possible approach to explore – Adil Blanco Jul 17 '20 at 20:48
  • why is 7:10 zero if you have an earlier time of 05:00? – Scott Boston Jul 19 '20 at 21:37
  • @Scott Boston sorry, that was an oversight; is it clear now? The goal is to eliminate the `prompt_uuid` groups that are almost identical – Adil Blanco Jul 19 '20 at 21:50
  • Okay, now explain the logic: why are you eliminating the 7:10 group? And in group 2 you are eliminating the first group, at 6:00. – Scott Boston Jul 19 '20 at 21:53
  • Imagine you are using a tracking application: you went through the points [0, 1, 2], then realized that wasn't right and corrected your points to [0, 1, 2, 3]. The only column that shows this difference is `edited_at`, and most of the time the values are very close together. This is the case for `user_id: 111 & prompt_uuid: 123, 124` – Adil Blanco Jul 19 '20 at 22:04
  • @ALollz I must keep this `prompt_uuid`; the `prompt_uuid` groups that I want to delete often differ very little (sometimes by milliseconds). I created this dataframe to make that clear, but apparently it isn't. If you check my question, this is just a suggestion – Adil Blanco Jul 19 '20 at 23:46

2 Answers


Because your DataFrame is duplicated with respect to ['user_id', 'prompt_uuid'], taking a simple diff does not give the time difference between successive groups. First drop_duplicates (on a copy, so the original rows are untouched), then calculate the time difference within each 'user_id'. You can then filter this to find the smallest time difference for each user:

# One row per (user_id, prompt_uuid) group
s = df.drop_duplicates(['user_id', 'prompt_uuid']).copy()
# Absolute gap between each group and the next group of the same user
s['time_diff'] = s.groupby('user_id')['edited_at'].diff(-1).abs()
# Keep only each user's smallest gap
s = s[s['time_diff'] == s.groupby('user_id')['time_diff'].transform('min')]

#    user_id  prompt_uuid           edited_at  prompt_num time_diff
#2       111          123 2019-10-22 07:10:00           0  00:10:00
#11      222          223 2019-11-24 06:00:00           0  00:15:00

Now, if you want to further subset this to rows where the time difference falls within some small threshold (i.e. you're fine keeping a group whose minimum time difference is, say, 4 hours...):

# Doesn't alter `s` in this example as both min_diffs are < 1 hour.
min_time = '1 hour'
s = s[s['time_diff'].le(min_time)]
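
(The '1 hour' string is coerced to a Timedelta for the comparison; passing pd.Timedelta('1 hour') explicitly does the same thing.)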

Now s represents the unique ['user_id', 'prompt_uuid'] groups that you want to remove from the DataFrame. We can accomplish this with an outer excluding merge (an anti-join), using the indicator argument:

keys = ['user_id', 'prompt_uuid']
# Outer merge with an indicator, then keep only the rows that appear in df alone
df = (df.merge(s[keys], on=keys, how='outer', indicator=True)
        .query('_merge == "left_only"')
        .drop(columns='_merge'))

    user_id  prompt_uuid           edited_at  prompt_num
0       111          121 2019-10-22 05:00:00           0
1       111          121 2019-10-22 05:00:00           1
5       111          124 2019-10-22 07:20:00           0
6       111          124 2019-10-22 07:20:00           1
7       111          124 2019-10-22 07:20:00           2
8       111          124 2019-10-22 07:20:00           3
9       111          125 2019-10-22 19:20:00           0
10      111          125 2019-10-22 19:20:00           1
13      222          244 2019-11-24 06:15:00           0
14      222          244 2019-11-24 06:15:00           1
15      222          255 2019-11-24 18:15:10           0
16      222          255 2019-11-24 18:15:10           1
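
If you'd rather avoid the merge, a membership test on a MultiIndex is an equivalent anti-join (a sketch; pd.MultiIndex.from_frame needs pandas 0.24+):

keys = ['user_id', 'prompt_uuid']
# Build the set of group keys to drop, then keep every row whose key is absent
drop_idx = pd.MultiIndex.from_frame(s[keys])
df = df[~df.set_index(keys).index.isin(drop_idx)]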
ALollz

I may not have understood all the requirements, but I've worked from the example output I expected to see (the 'resp' column below comes from an earlier revision of the question). Split the 'resp' column to get the status, take the first row of each status group with groupby().first(), and then fix the column names and column order.

# Split the comma-separated 'resp' column into new columns 0, 1, ... and drop the original
df1 = pd.concat([df, df['resp'].str.split(',', expand=True)], axis=1).drop('resp', axis=1)
# Take the first row of each split-status group (column 1), sorted chronologically
df1 = df1.groupby(1, as_index=False).first().sort_values('edited_at', ascending=True)
df1.drop(0, axis=1, inplace=True)
# Restore readable column names and the original column order
df1.columns = ['resp', 'prompt_uuid', 'displayed_at', 'edited_at', 'latitude', 'longitude', 'prompt_num', 'uuid']
df1 = df1.iloc[:, [1, 0, 2, 3, 4, 5, 6, 7]]

df1
prompt_uuid resp    displayed_at    edited_at   latitude    longitude   prompt_num  uuid
1   ab123-9600-3ee130b2c1ff foot    2019-10-22 22:39:57 2019-10-22 23:15:07 44.618787   -72.616841  0   4248-b313-ef2206755488
2   ab123-9600-3ee130b2c1ff metro   2019-10-22 22:50:35 2019-10-22 23:15:07 44.617968   -72.615851  1   4248-b313-ef2206755488
4   ab123-9600-3ee130b2c1ff work    2019-10-22 22:59:20 2019-10-22 23:15:07 44.616902   -72.614793  2   4248-b313-ef2206755488
3   zw999-1555-8ee140b2w1aa shopping    2019-11-23 08:01:35 2019-10-23 08:38:07 44.617968   -72.615851  1   4248-b313-ef2206755488
0   zw999-1555-8ee140b2w1bb bike    2019-11-23 07:39:57 2019-10-23 08:45:24 44.618787   -72.616841  0   4248-b313-ef2206755488
r-beginners
  • Thanks for the reply, but that's not what I'm looking for. Please check again; I added more explanation. I apologize if I was not clear – Adil Blanco Jul 17 '20 at 05:04