Michael:
The question is describing a PIVOT operation, also known as TRANSPOSE in SAS lingo, Paste/Special Transpose or PIVOT table in Excel.
If you stick with Proc SQL statements there is no PIVOT operator. SQL Server and other data bases do have PIVOT operators. But suppose you do stick with SAS Proc SQL. You are correct that you will need those many CASE statements in order to create the across variables.
There are numerous ways to pivot data in SAS. Here are six ways:
Sample data
data have;
do row = 1 to 500;
cost_cat = ceil(100 * ranuni(123));
cost = 10 + floor(50 * ranuni(123));
units_sold = floor (20 * ranuni(123));
output;
end;
run;
Way 1 - Proc TRANSPOSE : Pivot for presentation only
Class variables are used in the table statement to layout the rows and columns.
proc tabulate data=have;
class cost_cat units_sold;
var cost;
table units_sold, cost_cat*cost*sum / nocellmerge;
run;
Way 2 - Proc REPORT : Pivot for presentation only
cost category and cost columns are stacked. Cost
does not have a define
statement and will default to display sum
. The sum is performed for cost over values in each group * across:
proc report data=have;
columns units_sold (cost_cat, cost) ;
define units_sold / group;
define cost_cat / across;
run;
Way 3 - Proc MEANS + Proc TRANSPOSE : Pivot for data
Transpose will create a data set with columns 'out of order' because the columns are created in the order in which the id values appear as you step through the units_solds.
This can be prevented by adding extra data to have
. The data would have units_sold = -1 and there would be a row for each cost_cat value. The extra group is removed as part of the TRANSPOSE out= data set options -- for example: (... where=(units_sold ne -1))
proc means noprint data=have;
class units_sold cost_cat;
var cost;
ways 2;
output sum=sum out=haveMeans ;
run;
proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
by units_sold;
var sum;
id cost_cat;
;
run;
Way 4 - SQL `wallpaper` code generated by Macro : Specific to one data set
The macro is simpler because it is specific to the data set in question. For a more general case the salient aspects of the statement generation can be abstracted and further macro-ized (see Way 5)
%macro pivot_across;
%local i;
proc sql;
create table wantAcross2 as
select units_sold
%do i = 1 %to 100; %* codegen many sql select expressions;
, sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
%end;
from have
group by units_sold;
quit;
%mend;
%pivot_across;
Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely.
Way 5 - SQL `wallpaper` code generated by Macro : Any data set
Well not quite any data set. This macro in its current form handles id variables that are numeric and whose values are expressible exactly as a perceived numeric literal emitted by cats()
. A more robust version would examine the type of the id variable and quote the id values compared to in the generated CASE statements. The most robust version would have a code gen'd CASE statement that checked id values per put(..., RB8.)
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);
/*
* CASE statement codegener will need tweaking to handle character id variables (i.e. QUOTE of the &id)
* CASE statement codegener will need tweaking to handle numeric id variables that have non-integer values
* inexpressible as a simple source code numeric literal. (i.e. may need to compare data when formnatted as RB4.);
*/
%local case_statements;
proc sql noprint;
select
"&aggregate_function ("
|| "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"
|| ") as &prefix" || cats(idValues.&id) || "&suffix"
into :case_statements
separated by ','
from (select distinct &id from &data) as idValues
order by &id
;
%*put NOTE: %superq(case_statements);
create table &out as
select &by, &case_statements
from &data
group by &by;
quit;
%mend;
%sql_transpose
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);
Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely. Special attention would need to be made for gathering the data behind the case_statements
.
Way 6 - Hash table : Numerically indexed pivot columns
If you are a hashaholic this code might not seem extravagant.
data _null_;
if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;
* hash for tracking id values;
declare hash ids(ordered:'a');
ids.defineKey('cost_cat');
ids.defineDone();
* hash for tracking sums
* NOTE: The data node has a sum variable instead of using
* argument tags suminc: and keysum: This was done because HITER NEXT() does not
* automatically load the keysum value into its PDV variable (meaning
* another lookup via .SUM() would have to occur in order to obtain it);
call missing (cost_sum);
declare hash sums(ordered:'a');
sums.defineKey('units_sold', 'cost_cat');
sums.defineData('units_sold', 'cost_cat', 'cost_sum');
sums.defineDone();
* scan the data - track the id values and sums for pivoted output;
do while (not done);
set have(keep=units_sold cost_cat cost) end=done;
ids.ref();
if (0 ne sums.find()) then cost_sum = 0;
cost_sum + cost;
sums.replace();
end;
* create a dynamic output target;
* a pool of pdv host variables is required for target;
array cells cost_sum_1 - cost_sum_10000;
call missing (of cost_sum_1 - cost_sum_10000);
* need to iterate across the id values in order to create a
* variable name that will be part of the wanted data node;
declare hiter across('ids');
declare hash want(ordered:'a');
want.defineKey('units_sold');
want.defineData('units_sold');
do while (across.next() = 0);
want.defineData(cats('cost_sum_',cost_cat)); * sneaky! ;
end;
want.defineDone();
* populate target;
* iterate through the sums filling in the PDV variables
* associated with the dynamically defined data node;
declare hiter item('sums');
prior_key1 = .; prior_key2 = .;
do while (item.next() = 0);
if units_sold ne prior_key1 then do;
* when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
if prior_key1 ne . then do;
key1_hold = units_sold;
units_sold = prior_key1;
want.add(); * save 'row' to hash;
units_sold = key1_hold;
call missing (of cells(*));
end;
end;
cells[cost_cat] = cost_sum;
prior_key1 = units_sold;
end;
want.add();
* output target;
want.output (dataset:'wantAcross4');
stop;
run;
Verification
Proc COMPARE
will show that all the want
outputs are the same.
proc compare nomissing
noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
id units_sold;
run;
proc compare
noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
id units_sold;
run;
proc compare nomissing
noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
id units_sold;
run;