I have an SQL table, which contains some duplicate records that I want to remove. Removal should happen under 2 conditions together:
- Records have same value under
score
column - Records happened within 8 hours from each other.
The removed records should be the ones that have an older date, among all matching records, so only the most recent record among matching records should be present in the new query result.
So far, I've managed only to create a code that removes such duplicate, only if the records happened on the same day of the month, so it's missing any records that span over 2 consecutive days - How to solve this?
Original DB looks like:
user_id score visited_at visit_id
------- ---------------- ------------------- ----------
22 75.0 2018-05-14 23:39:14 169
22 75.0 2018-05-14 18:36:26 168
22 75.0 2018-05-13 02:04:46 166
2 55.0 2018-05-12 18:38:24 165
22 78.0 2018-05-12 18:14:34 164
22 75.0 2018-05-12 18:45:12 164
22 55.0 2018-05-08 12:36:12 161
SQL command to partly remove duplicates:
SELECT COUNT(*) AS ct
, it.user_id
, it.score
, UNIX_TIMESTAMP(CONVERT_TZ(it.visited_at,'+00:00',@@global.time_zone)) DIV 86400 AS diff
, it.visited_at
, it.visit_id
FROM `vw_items` it
GROUP
BY user_id
, score
, diff
ORDER
BY visited_at DESC
Result:
ct user_id score diff visited_at visit_id
------ ------- ---------------- ------ ------------------- ----------
2 22 75.0 17665 2018-05-14 23:39:14 169
1 22 75.0 17664 2018-05-13 02:04:46 166
1 2 55.0 17663 2018-05-12 18:38:24 165
1 22 78.0 17663 2018-05-12 18:14:34 164
1 22 75.0 17663 2018-05-12 18:45:12 164
1 22 55.0 17659 2018-05-08 12:36:12 161
But I need a command that will also remove record:
1 22 75.0 17663 2018-05-12 18:45:12 164
Because it has the same score as another record, that is more recent, which occurred within 8 hours from that record:
1 22 75.0 17664 2018-05-13 02:04:46 166