How to remove duplicates in a SAS DATA step?

data uscpi;
   input year month cpi;
   datalines;
1990  6 129.9
1990  7 130.4
1990  8 131.6
1990  9 132.7
1991  4 135.2
1991  5 135.6
1991  6 136.0
1991  7 136.2
;
run;

PROC SORT DATA = uscpi OUT = uscpi_dist NODUPKEY; 
 BY year ; 
 RUN; 

I can do this with a PROC step, but how do I remove the duplicates in a DATA step? Thanks in advance.

zellus
santosh315345
  • Here's a way to do it with hash objects: http://stackoverflow.com/a/5705176/17743 – cmjohns Feb 14 '14 at 17:17
  • Which ones do you want to keep? Just doing it by 'year' will randomly remove records. I don't think that is what you want. – Victor Nov 13 '15 at 19:27

1 Answer

You can use the first. and last. automatic variables that SAS creates during BY-group processing. They give you more control over which row is treated as the duplicate. See the SAS documentation on BY-group processing in the DATA step for details.

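 /* uscpi_sorted is not defined in the question; this is one way to create it. */
 proc sort data=uscpi out=uscpi_sorted;
    by year month;
 run;

 data uscpi_dedupedByYear;
    set uscpi_sorted;
    by year;
    if first.year;      /* keep only the first occurrence of each distinct year */
    /* if last.year; */ /* alternatively, keep only the last occurrence of each distinct year */
 run;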

A lot depends on how your input dataset is sorted. For example, if the input is sorted by year and month and you use if first.year; then only the earliest month of each year is kept. If the input is instead sorted by year and descending month, if first.year; keeps the latest month of each year.
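To make the effect of sort order concrete, here is a minimal sketch using the uscpi dataset from the question. The output names uscpi_desc and uscpi_latest are made up for illustration.

```sas
/* Sort so that the latest month comes first within each year. */
proc sort data=uscpi out=uscpi_desc;
   by year descending month;
run;

/* The data is still in ascending year order, so BY year is valid here. */
/* first.year now flags the latest month of each year.                  */
data uscpi_latest;
   set uscpi_desc;
   by year;
   if first.year;
run;
```

With the sample data this should keep 1990 month 9 and 1991 month 7. Using if last.year; on a dataset sorted by year and ascending month would select the same rows.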

This behaviour differs from NODUPKEY, which simply keeps the first observation it encounters for each BY value and offers no equivalent of last.year.
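If sorting first is not an option, a comment on the question mentions hash objects. The following is only a sketch of that general idea and not necessarily the code from the linked answer. The names seen and uscpi_hash_dedup are hypothetical. It keeps the first row encountered for each year, in input order.

```sas
data uscpi_hash_dedup;
   set uscpi;

   /* Create the hash object once, on the first iteration. */
   if _n_ = 1 then do;
      declare hash seen();
      seen.defineKey('year');
      seen.defineDone();
   end;

   /* check() returns 0 when the current year is already in the hash. */
   /* A nonzero return means this year has not been seen yet.         */
   if seen.check() ne 0 then do;
      seen.add();   /* remember this year */
      output;       /* keep the row */
   end;
run;
```

Because there is no BY statement, the input does not need to be sorted. The trade-off is that every distinct key value is held in memory.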
