0

I have a list of duplicate rows in the DB with unique GUID types for IDs. Somethig like this:

ID                                      Date                       Name
C3A19F77-0A99-4CA6-9375-A5062051BB86    2017-09-01 04:57:34.000    John
5252C4B7-FA62-4BDE-90F8-A5064051D801    2017-09-01 04:57:34.000    John
73EB45AE-8E69-468E-A6E5-A50670E55C0E    2019-09-01 13:50:53.000    Steve
5D0A3F64-6D55-4EFC-8961-A50620E4369B    2019-09-01 13:50:53.000    Steve
E2533E75-6E97-4966-A3B8-A50841055236    2016-09-03 15:51:26.000    Jill
134FC9F4-B86F-4864-943D-A50821064F83    2016-09-03 15:51:26.000    Jill

I'm trying to get rid of duplicates and grab one instance of GUID/ID for each unique row. Turn this into something like this:

  ID                                      Date                       Name             
  C3A19F77-0A99-4CA6-9375-A5062051BB86  2017-09-01 04:57:34.000    John
  73EB45AE-8E69-468E-A6E5-A50670E55C0E  2019-09-01 13:50:53.000    Steve
  E2533E75-6E97-4966-A3B8-A50841055236  2016-09-03 15:51:26.000    Jill

I found a pretty funky way to get distinct IDs by returning MAX(ID) on a row partitioned by date. It works but is using MAX() function on a GUID type to work as FIRST() normal?

SELECT Id, MAX(ID) OVER ( partition by sentDate) as DistinctId

Are there any unexpected issues I need to be aware of with this approach?

InspiredBy
  • 4,271
  • 6
  • 40
  • 67
  • You should probably be adding Name into the partition too (i.e. MAX(ID) OVER (PARTITION BY sentDate, Name)) - Overall, you may face query performance issues by using 'PARTITION BY' vs using GROUP BY as PARTITION BY will only execute after assigning row numbers (see discussion: https://stackoverflow.com/questions/11233125/is-there-a-performance-difference-in-using-a-group-by-with-max-as-the-aggregat) – MTay Aug 15 '20 at 02:18

2 Answers2

1

I think you want aggregation:

select max(id), date, name
from t
group by date, name;

Or if you want a random matching row:

select id, date, name
from (select t.*,
             row_number() over (partition by date, name order by newid()) as seqnum
      from t
     ) t
where seqnum = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

What are you doing with the results? Why do you want the ID if it doesn't really matter which one you show? Are you copying the results to another table? Do you eventually want to delete the duplicates?

You can use this to get just one:

SELECT A.ID, 
    A.Date, 
    A.Name
FROM [tablename] A
  INNER JOIN
(
 SELECT *, 
        RANK() OVER(PARTITION BY Date, 
                                 Name
        ORDER BY id) rank
 FROM [tablename]
) B ON A.ID = B.ID
WHERE rank = 1;

If you want to see the duplicates, and if there are more than duplicates, you can just change the WHERE clause to:

WHERE rank > 1;

If you eventually want to delete the duplicates, you can use the same technique:

DELETE A
    FROM [tablename] A
         INNER JOIN
    (
        SELECT *, 
               RANK() OVER(PARTITION BY Date, 
                                        Name
               ORDER BY id) rank
        FROM [tablename]
    ) B ON A.ID = B.ID
    WHERE rank > 1;
Mike Taylor
  • 161
  • 4