Delete of duplicate records

Question

I have a table where I would like to identify duplicate records based on two columns(id and role) and I use a third column (unit) to select a subset of records to analyze and do the deletion within. Here comes the table and a few rows a example data:

id | role | unit
----------------
946| 1001 |   1
946| 1002 |   1
946| 1003 |   1
946| 1001 |   2 
946| 1002 |   2
900| 1001 |   3
900| 1002 |   3
900| 1001 |   3

An analysis of unit 1 and 2 should identify two rows to delete 946/1001 and 946/1002. It doesn't matter if I delete the rows labeled unit 1 or 2. In a subsequent step I will update all records labeled unit=2 to unit=1.

I have a select statement capable to identify the rows to delete:

SELECT * FROM (SELECT 
        unit, 
        id, 
        role,  
        ROW_NUMBER() OVER (
            PARTITION BY 
                id, 
                role 
            ORDER BY 
                id, 
                role
        ) row_num
     FROM thetable WHERE unit IN (1,2)  ) as x
WHERE row_num > 1;

This query will give this result:

id | role | unit
----------------
946| 1001 |   2 
946| 1002 |   2

Now I would like to combine this with DELETE to delete the identified records. I have come pretty close (I believe) with this statement:

DELETE FROM thetable tp1 WHERE EXISTS 

(SELECT 
        unit, 
        id, 
        role,  
        ROW_NUMBER() OVER (
            PARTITION BY 
                id, 
                role 
            ORDER BY 
                id, 
                role
        ) as row_num
     FROM 
        thetable tp2 
        WHERE unit IN (1,2) AND 
        tp1.unit=tp2.unit AND 
        tp1.role=tp2.role AND 
        tp1.id=tp2.id AND row_num >1
)

However, the row_num is not recognized as column. So how should I modify this statement to delete the two identified records?

You cannot use ROW_NUMBER() in where clause neither the alias defined on SELECT on the same level [Why no windowed functions in where clauses?](https://stackoverflow.com/questions/13997177/why-no-windowed-functions-in-where-clauses/33522288#33522288). — Lukasz Szozda, Dec 05 '20 at 15:09

score 2 · Accepted Answer · answered Dec 05 '20 at 15:21

2

It is very simple with EXISTS:

DELETE FROM thetable t
WHERE t.unit IN (1,2)
AND EXISTS (
  SELECT 1 FROM thetable
  WHERE (id, role) = (t.id, t.role) AND unit < t.unit
)

See the demo.
Results:

>  id | role | unit
> --: | ---: | ---:
> 946 | 1001 |    1
> 946 | 1002 |    1
> 946 | 1003 |    1
> 900 | 1001 |    3
> 900 | 1002 |    3
> 900 | 1001 |    3

answered Dec 05 '20 at 15:21

forpas

160,666
10
38
76

1

Very nice proposal. And it works! Thanks for setting the demo case. That´s really helpful. – ubee Dec 06 '20 at 05:51

score 1 · Answer 2 · answered Dec 05 '20 at 15:09

You could phrase this as:

delete from thetable t 
where t.unit > (
    select min(t1.unit)
    from thetable t1
    where t1.id = t.id and t1.role = t.role
)

This seems like a simple way to solve the assignment, basically phrasing as: delete rows for which another row exists with a smaller unit and the same id and role.

As for the query you wanted to write, using row_number(), I think that would be:

delete from thetable t
using (
    select t.*, row_number() over(partition by id, role order by unit) rn
    from mytable t
) t1
where t1.id = t.id and t1.role = t.role and t1.unit = t.unit and t1.rn > 1

Delete of duplicate records

2 Answers2