PROC SQL to remove duplicate pairs of observations?

Question

enter image description here I need to remove observations with duplicate pairs (ie. the original observation and it's duplicate). I have more than hundred columns, but sometimes for a given ID I am getting a pair of different Load_Date and a pair of Contactor columns. The below code I am using to remove all cases when I have duplicate pairs as described above enter image description here :

Could you let me know if I am doing it the right way? From the described data, I need to keep records for ID value C and D only. I did check my output and it looks like it worked, still not sure because I am new to proc sql. Thanks!

Do you want to completely remove duplicates, or do you want to keep one of them? — FrankPl, Jan 30 '14 at 21:41
I want to completely remove the duplicates, but here we aren't just talking about keeping one of the dups. I want to remove the pair of observations that had the duplicate and it's original. — user2993827, Jan 30 '14 at 21:44
Just thinking it's easier. But if you could offer a better way I would appreciate it. — user2993827, Jan 30 '14 at 21:52
Also, for non-sas SQL folks, SAS will allow selecting non-group by fields (differing from many implementations); it will perform the GROUP BY with those fields and then join back to the original data (thus likely returning more rows than the `group by` would). — Joe, Jan 30 '14 at 21:52
As far as load date and contractor, are you removing records with *both* duplicated, or any pair of record duplicating *either* field? Sample data would be helpful. — Joe, Jan 30 '14 at 21:53
I just added a sample dataset. I need an output with ID values C and D. Yes, the load date and contractor are always different. Thank you. — user2993827, Jan 30 '14 at 22:06

Joe · Answer 1 · 2014-01-31T15:46:54.293

3

SAS solution, assuming you want only want to remove records duplicating ID.

proc sort data=have;
by ID;
run;

data want;
set have;
by ID;
if first.ID and last.ID;
run;

This removes records that are identical for ID. If VAR1 and VAR2 are also relevant, you could add those as well to the sort; the FIRST and LAST should be the rightmost variable in the by statement.

edited Jan 31 '14 at 15:46

answered Jan 30 '14 at 21:59

Joe

62,789
6
49
67

Unfortunately this code doesn't give me ID's C and D as an output. – user2993827 Jan 30 '14 at 22:29
Left out `by` statement, sorry - try now. – Joe Jan 31 '14 at 15:47

score 1 · Answer 2 · answered Jan 31 '14 at 17:24

1

And a proc SQL solution. Same output, but one step and no explicit sorting.

proc sql; 
    create table WANT as
    SELECT *
    FROM have
    GROUP BY ID
    /* Put your criteria here, can use any COUNT DISTINCT*/
    HAVING COUNT(*)=1
;quit;

answered Jan 31 '14 at 17:24

Tim Sands

948
4
8

PROC SQL to remove duplicate pairs of observations?

2 Answers2