1

Why does the result of the following two alternative ways of stacking two datasets differ?

data work.a; 
    length ds $1;
    ds = 'A';
    do i = 1 to 3;
        output;
    end;
run;
data work.b; 
    length ds $1;
    ds = 'B';
    do i = 1 to 3;
        do j = 1 to 3;
            output;
        end;
    end;
run;

*- ALTERNATIVE 1 -*;
data work.c;
    set work.a work.b;
    if j = . then j = i;
run;

*- ALTERNATIVE 2 -*;
data work.d;
    set work.a work.b;
run;
data work.d;
    set work.d;
    if j = . then j = i;
run;

My guess would be that both dataset c and d have j = i where ds = 'A'.

1 Answers1

2

The reason is the automatic RETAIN on variables which are read in via the SET statement. All variables read in via the set statement are retained in the PDV, until the next iteration of the SET, in which new values (from the datasets in the set statement) are replaced.

During compilation, all variables in the datasets in the SET statement are created in the PDV.
In the first iteration of execution, the first dataset in the SET statement (work.a) is read, and the value 1 is given to i. The variable j is null (.) so the logic condition is true, and j becomes 1. In the second iteration, i is replaced with 2 as per the value in dataset work.a, and j is not replaced (as work.b is not read) - so it is retained, and so it is not null, and so the logic is not executed.

For a further example, see example 3 of this sugi paper. It shows you can fix this problem by simply creating a new variable as follows:

data work.c;
    set work.a work.b;
    if j = . then XXX = i;
run;

The RETAIN statement (by contrast) is necessary to retain NEW variables that are not compiled from the datasets in the SET statement.

Allan Bowe
  • 12,306
  • 19
  • 75
  • 124