0

This is a "toy" example of a table that has many columns and 100s of thousands of rows.

I want FILTER OUT any rows containing the same AcctNo, CustomerName and CustomerContact, but KEEP the ID for ONE of the duplicates (so i can access the record later).

  • Example:

    ID  AcctNo  CustomerName  CustomerContact
    1   1111    Acme Foods    John Smith
    2   1111    Acme Foods    John Smith
    3   1111    Acme Foods    Judy Lawson
    4   2222    YoyoDyne Inc  Thomas Pynchon
    5   2222    YoyoDyne Inc  Thomas Pynchon
    <= I want to save IDs 2, 3, and 5
    
  • Fiddle: https://www.db-fiddle.com/f/bEECHi6XnvKAeXC4Xthrrr/1

Q: What SQL do I need to accomplish this?

FoggyDay
  • 11,962
  • 4
  • 34
  • 48

2 Answers2

3
select MAX(ID) as KeepID,AcctNo,CustomerName,CustomerContact 
from test
GROUP BY AcctNo,CustomerName,CustomerContact
Robert Sheahan
  • 2,100
  • 1
  • 10
  • 12
1

So basically what you want is, partition your table by AcctNo, CustomerName and CustomerContact. It's unclear in the question how you want select which ID you need to keep, but for that you need to modify the the following query. But this should give you a starting point.

SELECT * 
FROM   test 
       JOIN (SELECT id, 
                    Row_number() 
                      OVER ( 
                        partition BY acctno, customername, customercontact) rn 
             FROM   test) A 
         ON test.id = A.id 
WHERE  A.rn = 1 

This should return something like this:

ID AcctNo CustomerName CustomerContact id rn
1 11111 Acme Foods John Smith 1 1
3 11111 Acme Foods Judy Lawson 3 1
4 22222 Yoyodyne Inc. Thomas Pynchon 4 1

What this is doing is basically first calculating row num based on the partition criteria and then picking only one row per partition.

Pradatta
  • 3,000
  • 1
  • 18
  • 22
  • 2
    Please don't use images for data... use formatted/tabular text. – Dale K Apr 07 '21 at 23:20
  • @Pradatta - 1) In answer to your comment above, I tried many things ... but I didn't want to "clutter" my question with a bunch of failed attempts. 2) Regarding your reply: I *considered* partitions and row_number(), but I would have preferred a simpler solution. Like [Robert Sheahan's](https://stackoverflow.com/a/66995135/3135317). 3) Do you know which approach would be more "efficient" (for large-ish datasets)? – FoggyDay Apr 08 '21 at 00:11
  • 1
    @FoggyDay The MAX with Group By is more efficient I think. Here's a nice explanation why: https://stackoverflow.com/questions/11233125/is-there-a-performance-difference-in-using-a-group-by-with-max-as-the-aggregat – Pradatta Apr 08 '21 at 09:06
  • 1
    Excellent citation - Thank you. – FoggyDay Apr 08 '21 at 18:59