0

enter image description hereI need to remove observations with duplicate pairs (ie. the original observation and it's duplicate). I have more than hundred columns, but sometimes for a given ID I am getting a pair of different Load_Date and a pair of Contactor columns. The below code I am using to remove all cases when I have duplicate pairs as described aboveenter image description here:

Could you let me know if I am doing it the right way? From the described data, I need to keep records for ID value C and D only. I did check my output and it looks like it worked, still not sure because I am new to proc sql. Thanks!

user2993827
  • 73
  • 1
  • 2
  • 8
  • Do you want to completely remove duplicates, or do you want to keep one of them? – FrankPl Jan 30 '14 at 21:41
  • I want to completely remove the duplicates, but here we aren't just talking about keeping one of the dups. I want to remove the pair of observations that had the duplicate and it's original. – user2993827 Jan 30 '14 at 21:44
  • Any particular reason you want to use PROC SQL for this? – Joe Jan 30 '14 at 21:50
  • Just thinking it's easier. But if you could offer a better way I would appreciate it. – user2993827 Jan 30 '14 at 21:52
  • Also, for non-sas SQL folks, SAS will allow selecting non-group by fields (differing from many implementations); it will perform the GROUP BY with those fields and then join back to the original data (thus likely returning more rows than the `group by` would). – Joe Jan 30 '14 at 21:52
  • As far as load date and contractor, are you removing records with *both* duplicated, or any pair of record duplicating *either* field? Sample data would be helpful. – Joe Jan 30 '14 at 21:53
  • I just added a sample dataset. I need an output with ID values C and D. Yes, the load date and contractor are always different. Thank you. – user2993827 Jan 30 '14 at 22:06

2 Answers2

3

SAS solution, assuming you want only want to remove records duplicating ID.

proc sort data=have;
by ID;
run;

data want;
set have;
by ID;
if first.ID and last.ID;
run;

This removes records that are identical for ID. If VAR1 and VAR2 are also relevant, you could add those as well to the sort; the FIRST and LAST should be the rightmost variable in the by statement.

Joe
  • 62,789
  • 6
  • 49
  • 67
1

And a proc SQL solution. Same output, but one step and no explicit sorting.

proc sql; 
    create table WANT as
    SELECT *
    FROM have
    GROUP BY ID
    /* Put your criteria here, can use any COUNT DISTINCT*/
    HAVING COUNT(*)=1
;quit;
Tim Sands
  • 948
  • 4
  • 8