0

I have two data sets :

  • "mothers" - 5,512 observations where the variable "MOTHER" = 1
  • "all_women" - 2,336,750 observations where the variable "MOTHER" = 0

I combine the two as follows:

data combined;
set mothers all_women;

Now as the mothers are already in the datset all_women, I want to delete the repeated entries - with the condition that I keep the observations where "MOTHER"=1.

I tried the following:

proc sort data=combined; by ID DESCENDING MOTHER; run;

proc sort data=combined nodupkeys; by ID; run;

yet I lose some of the mothers because I am left with only 5458 observations where "MOTHER"=1. What have I done to introduce this error?

user2568648
  • 3,001
  • 8
  • 35
  • 52
  • Have you confirmed that "mothers" has 5,512 unique ids? – Jeff Oct 16 '14 at 13:22
  • It seems weird to SET the two data sets together and expect the same number of records as the first data set, I would have expected a merge instead of SET in your combined data set. – Reeza Oct 16 '14 at 14:03

2 Answers2

2

Instead of using NODUPKEY, use FIRST./LAST. processing.

proc sort data=combined; 
  by ID DESCENDING MOTHER; 
run;

data want;
  set combined;
  by ID descending mother;
  if not (first.ID) and (mother=0) then delete;
run;

That would keep any ID that had mother=0 only (keep 1 record per), and would keep all mother=1 rows.

Joe
  • 62,789
  • 6
  • 49
  • 67
1

Have you checked whether there were any duplicate IDs in the mothers dataset? The second proc sort would have eliminated those rows.

You can check like so:

proc sort data = mothers nodupkey out = mothers_dedup dupout = mothers_dups;
  by ID;
run;

If mothers_dups contains more than 0 observations, this might account for the problem.

user667489
  • 9,501
  • 2
  • 24
  • 35
  • Just checked - the source of the problem is that there are duplicate IDs in the mothers dataset. However I need to keep these duplicates as they correspond to multiple births. Is there a way to only eliminate the duplicates where "MOTHER" = 0 ? – user2568648 Oct 16 '14 at 13:34