0

I have a table consisting of the following integer columns:

group_id, person_id, sequence

Ideally, the primary key would be (group_id, person_id), but there are some duplicates I need to remove first. Whenever there are duplicates, I want to keep only the lowest sequence value.

I figured out a query to select some of the errant rows, but it only gets the highest sequence number in the duplicate set.

SELECT COUNT(*) AS num, group_id, person_id, MAX(sequence)
FROM my_table
GROUP BY group_id, person_id
HAVING COUNT(*) > 1;

I'm sure I'm missing something simple. Is there an easy way to remove these duplicates?

Thanks.

Rachie
  • 433
  • 1
  • 6
  • 17

2 Answers2

0

All column should be duplicate. So group by apply all columns like this

select * from my_table where not EXISTS (
    SELECT group_id, person_id, min(sequence)
    FROM my_table
    GROUP BY group_id, person_id
    HAVING COUNT(*) > 1);
Mukesh Kalgude
  • 4,814
  • 2
  • 17
  • 32
0

Try writing a query that returns the rows you want to delete. Assuming that the combination of (group_id,person_id,sequence) is UNIQUE, and you don't have NULL values...

 SELECT t.* 
   FROM my_table t
   JOIN ( SELECT o.group_id
               , o.person_id
               , MAX(o.sequence) AS max_sequence
            FROM my_table o
           GROUP BY o.group_id, o.person_id
          HAVING COUNT(*) > 1
        ) d
    ON d.group_id      = t.group_id
   AND d.person_id     = t.person_id
   AND d.max_sequence  = t.sequence

We can convert that into a DELETE statement by replacing the SELECT keyword with the DELETE keyword.

Or, when I'm deleting rows with statements similar to this, I will usually create a table as a "backup" of the rows I'm going to delete.

Just precede the SELECT with CREATE TABLE some_new_table_name AS.

Then, we can reference the "saved" rows in a DELETE query

DELETE t.*
  FROM my_table t
  JOIN some_new_table_name d
    ON d.group_id      = t.group_id
   AND d.person_id     = t.person_id
   AND d.max_sequence  = t.sequence

This approach only gets "one" of the duplicates. If the original query had count values larger than 2, then we'd need to repeat this enough times, each time deleting the highest sequence value, repeating that until there are no count values greater than 1.

If there are a lot of duplicates to remove, we could use a slightly different pattern to get them in one fell swoop.

Instead of returning MAX(sequence) (the row we want to delete), we can instead return MIN(sequence), the row we want to keep. And we'd change the predicate,

    AND d.max_sequence  = t.sequence

to be

    AND d.min_sequence  <> t.sequence

So that we delete all rows for that group_id, person_id EXCEPT for the one with the minimum value.

I strongly recommend you write this as a SELECT first, before converting it into a DELETE statement. And I also recommend you have a good backup of the table and/or "save" copies of the rows you are going to delete. Just in case you need to restore some rows.

spencer7593
  • 106,611
  • 15
  • 112
  • 140