1

I have a data set in SAS that I need to transpose. It has the form id date type value and I need to convert it into id date valueoftype1 valueoftype2 ...

Is there any efficient way of accomplishing this? My data is huuuge.

For example;

data one; 
input ID date type $ value; 

cards; 
1 2001 A 2
1 2002 A 4
1 2001 B 3
2 2001 B 1
2 2002 A 5
2 2002 C 2
2 2003 C 5
3 2001 B 6
4 2002 B 8
4 2003 B 4
4 2001 A 2
;

I wish to convert it in to following form; (last three columns are valA, valB, valC)

1 2001 2 3 .
1 2002 4 . .
2 2001 . 1 .
2 2002 5 . 2
2 2003 . . 5
3 2001 . 6 .
4 2001 2 . .
4 2002 . 8 .
4 2003 . 4 .
zaldir
  • 19
  • 3
  • Are you aware of `proc transpose`? I'd be surprised if there's a more efficient way than a `proc` designed for the job. – Michael Richardson May 29 '13 at 13:04
  • I know of `proc transpose` but is there any other way? Because my data set has over a billion observations, (millions of IDS and almost a thousand of dates for almost each of the IDs along with 8 different types). Is there any shortcut using other functions such as retain and etc? Thanks. – zaldir May 29 '13 at 13:29
  • 2
    Based upon my understanding of SAS efficiency, almost anything that can be done in a `proc` should be done in a `proc`. Using a data step instead is usually MUCH less efficient. – Michael Richardson May 29 '13 at 13:39

3 Answers3

2

PROC TRANSPOSE will do this very, very efficiently, I'd venture to say equal to or better than the most efficient method of any other DBMS out there. Your data is already beautifully organized for that method, also. You just need a sort by ID DATE, unless you already have an index for that combination (which if you have billions of records is a necessity IMO). No other solution will come close, unless you have enough memory to put it all in memory - which would be rather insane for that size dataset (even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1 byte ID; i'd guess 25-30 GB or more.)

proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

A naive test on my system, with the following:

data one; 
do id = 1 to 1e6;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;
proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, took 11 minutes to sort, and took 45 minutes to PROC TRANSPOSE, writing a ~100GB compressed file. I'd guess that's the best you can do; of those 45 minutes, over 20 were likely writing out (5x bigger dataset will take over 5x the time to write out, plus compression overhead); I was also doing other things at the time, so the CPU time was probably inflated some as it didn't get my entire processor (this is my desktop, a 4 core i5). I don't think this is particularly unreasonable processing time at all.

You might consider looking at your needs, and perhaps a transpose isn't really what you want - do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis/etc.) without transposing the entire dataset.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • As a side note; if you're using SPDE, you can skip the sort, and probably can get very good performance from this, far better than I got here. – Joe May 29 '13 at 15:04
0
if first.date then a=.;b=.;c=.;d=.; 

has to be replaced with:

if first.date then do;
    a=.;b=.;c=.;d=.;
end;

or

if first.date then call missing(a,b,c,d);

Also instead of

if last.date then do; output; a=.;b=.;c=.;d=.; end;

now, it should be enough :

if last.date then output;

I guess a datastep will always be more efficient then PROC TRANSPOSE on big data. The limit is you have to find out of distinct values of the transposed variable and create new variables for those. I think that's the overhead of PROC TRANSPOSE - it first finds out the values. (I'm sorry I edited your own answer, so it might not be clear now what was the problem.)

vasja
  • 4,732
  • 13
  • 15
0

An alternative data step approach (DOW-loop):

proc sort data = one;
  by ID date;
run;

data two;
  do _n_ = 1 by 1 until(last.date);
    set one;
    by ID DATE;
    if type = "A" then valA = value;
    else if type = "B" then valB = value;
    else if type = "C" then valC = value;
  end;
  drop value;
run;

On my system, using a dataset 1/10 of the size of the one Joe used, it took 2 minutes to sort and 9 minutes 40 seconds to use proc transpose. The DOW loop did the same thing in 7 minutes 4 seconds. In this particular scenario, it isn't terribly impressive, but it has one big advantage over proc transpose: you can use it to transpose multiple variables in a single pass. Here's the code I used:

data one; 
do id = 1 to 1e5;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;

data two;
  do _n_ = 1 by 1 until(last.DATE);
    set one;
    array vals[26] val65-val90;
    by ID DATE;
    do i = 1 to 26;
      if type = byte(64 + i) then vals[i] = value;
    end;
  end;
  drop value i;
run;

Renaming all 26 transposed type variables dynamically is a bit tricky, but this can be done via call execute:

data _null_;
  call execute('proc datasets lib = work nolist;');
  call execute('modify two;');
  call execute('rename');
  do i = 1 to 26;
    call execute(compress('val' || i + 64) || ' = ' || compress('val' || byte(64+i)));
  end;
  call execute(';');
  call execute('run;');
  call execute('quit;');
run;
user667489
  • 9,501
  • 2
  • 24
  • 35