I have an SQL table that contains some duplicate records I want to remove. A record should be removed only when both of these conditions hold:

  1. The records have the same value in the score column.
  2. The records occurred within 8 hours of each other.

Among matching records, the older ones should be removed, so that only the most recent record of each matching group appears in the query result.

So far, I've only managed to write a query that removes such duplicates when the records fall on the same calendar day, so it misses matching records that span 2 consecutive days. How can I solve this?

The original table looks like:

user_id  score             visited_at           visit_id  
-------  ----------------  -------------------  ----------
     22  75.0              2018-05-14 23:39:14         169
     22  75.0              2018-05-14 18:36:26         168
     22  75.0              2018-05-13 02:04:46         166
      2  55.0              2018-05-12 18:38:24         165
     22  78.0              2018-05-12 18:14:34         164
     22  75.0              2018-05-12 18:45:12         164
     22  55.0              2018-05-08 12:36:12         161

SQL query that partly removes the duplicates:

SELECT COUNT(*) AS ct
     , it.user_id
     , it.score
     , UNIX_TIMESTAMP(CONVERT_TZ(it.visited_at,'+00:00',@@global.time_zone)) DIV 86400 AS diff
     , it.visited_at
     , it.visit_id
  FROM `vw_items` it
 GROUP 
    BY user_id
     , score
     , diff 
 ORDER 
    BY visited_at DESC

Result:

    ct  user_id            score    diff  visited_at           visit_id  
------  -------  ----------------  ------  -------------------  ----------
     2       22  75.0               17665  2018-05-14 23:39:14         169
     1       22  75.0               17664  2018-05-13 02:04:46         166
     1        2  55.0               17663  2018-05-12 18:38:24         165
     1       22  78.0               17663  2018-05-12 18:14:34         164
     1       22  75.0               17663  2018-05-12 18:45:12         164
     1       22  55.0               17659  2018-05-08 12:36:12         161

But I need a query that will also remove this record:

     1       22  75.0               17663  2018-05-12 18:45:12         164

That's because it has the same score as another, more recent record that occurred within 8 hours of it:

     1       22  75.0               17664  2018-05-13 02:04:46         166
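
To make the gap concrete, here is a small Python sketch (not part of the query above) that reproduces the day-bucket arithmetic for these two rows. The helper names `day_bucket` and `hours_apart` are made up for illustration, and the timestamps are assumed to be UTC, which is consistent with the `diff` values shown in the result.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"

def parse_utc(visited_at: str) -> datetime:
    # Assumption: visited_at is stored in UTC.
    return datetime.strptime(visited_at, FMT).replace(tzinfo=timezone.utc)

def day_bucket(visited_at: str) -> int:
    # Same idea as UNIX_TIMESTAMP(...) DIV 86400: whole days since 1970-01-01.
    return int(parse_utc(visited_at).timestamp()) // 86400

def hours_apart(older: str, newer: str) -> float:
    # Real elapsed time between the two visits, in hours.
    return (parse_utc(newer) - parse_utc(older)).total_seconds() / 3600

older = "2018-05-12 18:45:12"
newer = "2018-05-13 02:04:46"

print(day_bucket(older), day_bucket(newer))  # 17663 17664
print(hours_apart(older, newer) < 8)         # True
```

The two rows are less than 8 hours apart, yet they land in different buckets, so grouping by `diff` can never merge them.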
Strawberry
rockyraw
  • Would a row on the same day, but, say, 9 hours earlier, be left in place? – Sloan Thrasher Jun 04 '18 at 14:55
  • Yes, such a row shouldn't get removed. – rockyraw Jun 04 '18 at 15:06
  • See http://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query – Strawberry Jun 04 '18 at 15:06
  • Should the row that's kept be the max for the day? If there are multiple rows during the same day with multiple overlapping 8-hour periods, how do you determine which are to be kept? For instance, if you have rows at 8am, 10am, 12pm, 2pm, 4pm, and 8pm, which rows would be kept (assuming the same scores)? If you select the 8pm row to keep, that would leave the 10am row, but the 2pm row would eliminate the 10am row. – Sloan Thrasher Jun 04 '18 at 15:36
  • Yes, in such a scenario everything should be eliminated except 8pm. – rockyraw Jun 04 '18 at 17:39

1 Answer

I believe what you're looking for is the DATE_SUB function:

DATE_SUB(it.visited_at, INTERVAL 8 HOUR)

This will create a datetime that you can compare against to find records within 8 hours of a given record. I'd write a fuller answer, but it looks like that's the only piece of the puzzle you're missing.
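
For what it's worth, here is a minimal sketch of how that kind of comparison could be applied. It is one possible reading of the requirement, not a tested MySQL solution: a row is kept only if no newer row with the same user_id and score exists within 8 hours after it. Matching on user_id as well as score is an assumption taken from the GROUP BY in the question. The sketch runs against an in-memory SQLite table from Python purely so that it is self-contained, which means the time bound uses SQLite's `datetime(..., '+8 hours')`; in MySQL that bound would be written with `DATE_ADD` (or, equivalently, `DATE_SUB` on the newer row). The `dedupe` helper is hypothetical.

```python
import sqlite3

# Sample rows copied from the question: (user_id, score, visited_at, visit_id).
ROWS = [
    (22, 75.0, "2018-05-14 23:39:14", 169),
    (22, 75.0, "2018-05-14 18:36:26", 168),
    (22, 75.0, "2018-05-13 02:04:46", 166),
    (2, 55.0, "2018-05-12 18:38:24", 165),
    (22, 78.0, "2018-05-12 18:14:34", 164),
    (22, 75.0, "2018-05-12 18:45:12", 164),
    (22, 55.0, "2018-05-08 12:36:12", 161),
]

# Keep a row only if there is no newer row for the same user_id and score
# within the 8 hours that follow it.
# MySQL spelling of the last condition (not executed here):
#   newer.visited_at <= DATE_ADD(it.visited_at, INTERVAL 8 HOUR)
QUERY = """
SELECT it.user_id, it.score, it.visited_at, it.visit_id
  FROM vw_items it
 WHERE NOT EXISTS (
         SELECT 1
           FROM vw_items newer
          WHERE newer.user_id = it.user_id
            AND newer.score = it.score
            AND newer.visited_at > it.visited_at
            AND newer.visited_at <= datetime(it.visited_at, '+8 hours')
       )
 ORDER BY it.visited_at DESC
"""

def dedupe(rows):
    """Load rows into a throwaway SQLite table and return the rows that survive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vw_items "
        "(user_id INTEGER, score REAL, visited_at TEXT, visit_id INTEGER)"
    )
    conn.executemany("INSERT INTO vw_items VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

kept = dedupe(ROWS)
for row in kept:
    print(row)
```

On the sample data this drops the 2018-05-14 18:36:26 row and the 2018-05-12 18:45:12 row, including the pair that straddles midnight. Because each row is only compared with newer rows, a chain of visits spaced less than 8 hours apart collapses to its latest row, which matches the 8am to 8pm case discussed in the comments on the question.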

D. Kendall
  • The function for adding is, unsurprisingly, DATE_ADD() --- https://stackoverflow.com/questions/589652/add-2-hours-to-current-time-in-mysql – D. Kendall Jun 04 '18 at 15:00
  • This won't help, since it just gives the datetime 8 hours before the actual visit time. Currently I first convert visited_at to a Unix timestamp, which gives the total number of seconds since 1970-01-01. I then divide that by 86400 (the number of seconds in 24 hours), ignoring the fractional part, so that rows from the same day get the same diff value, and I group by that diff value. – rockyraw Jun 04 '18 at 17:42