I'm trying to remove some data almost duplicates. I'm looking for a way to detect the closest (edited_at
) trip made by the user without losing informations.
So I want to solve this problem by calculating the difference between succesive timestamps and I remove minimum difference (zero in this example: step-1).
I am open for other suggestions
Note:
Don't use drop_duplicates()
function.
Dataframe:
data = [[111, 121, "2019-10-22 05:00:00", 0],
[111, 121, "2019-10-22 05:00:00", 1],
[111, 123, "2019-10-22 07:10:00", 0],
[111, 123, "2019-10-22 07:10:00", 1],
[111, 123, "2019-10-22 07:10:00", 2],
[111, 124, "2019-10-22 07:20:00", 0],
[111, 124, "2019-10-22 07:20:00", 1],
[111, 124, "2019-10-22 07:20:00", 2],
[111, 124, "2019-10-22 07:20:00", 3],
[111, 125, "2019-10-22 19:20:00", 0],
[111, 125, "2019-10-22 19:20:00", 1],
[222, 223, "2019-11-24 06:00:00", 0],
[222, 223, "2019-11-24 06:00:00", 1],
[222, 244, "2019-11-24 06:15:00", 0],
[222, 244, "2019-11-24 06:15:00", 1],
[222, 255, "2019-11-24 18:15:10", 0],
[222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"])
df['edited_at'] = pd.to_datetime(df['edited_at'])
Step-1:
111, 121, "2019-10-22 05:00:00", 0, somthing,
111, 121, "2019-10-22 05:00:00", 1, somthing,
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0, 0,
222, 223, "2019-11-24 06:00:00", 1, 0,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Step-2:
111, 121, "2019-10-22 05:00:00", 0, somthing,
111, 121, "2019-10-22 05:00:00", 1, somthing,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720