How can I write a DATA step that will drop all variables from the input dataset except the ones that I explicitly define within the dataset?

Question

I want to generate a new SAS dataset using table foo as the input and one-to-one correspondence with records in the output dataset bar. I wand to drop variables from foo by default but I also require all of the fields of foo be available (to derive new variables) and also that some variables from foo to be kept (if explicitly indicated).

I'm currently managing an explicit list of variables to drop= but it results in long and unwieldy syntax in the data-set-option declaration.*

DATA bar (drop=id data_value2);
set foo;

new_id                 = id;
data_value1            = data_value1;     /* Explicitly included for clarity */
new_derived_data_value = data_value2 * 2; /* etc. */

format new_id                  $fmt_id.
       data_value1             $fmt_dat.
       new_derived_data_value  $fmt_ddat.
;
RUN;

The output table I want should have only fields data_value1, new_data and new_derived_data_value.

I'm looking for the most syntactically succinct way of reproducing the same effect as :

SELECT 
  id AS new_id
  ,data_value1
  ,data_value2 * 2 AS new_derived_data_value 
FROM foo

How can I write a DATA step that will drop all variables from the input dataset except the ones that I explicitly define within the dataset?

* Update: I could use aaa--hhh type notatation but even this can be unwieldy if the ordering of the variables changes over time or I later decide I'd like to keep variable ddd.

Then I would have to maintain a list of the variables I would like to keep. I'm looking for syntax to drop all varialbes from `foo` whilst keeping them available during the DATA step. — JustinJDavies, Sep 11 '14 at 12:46
The DATA step equivalent of `SELECT id AS new_id, data_value1 AS new_data, data_value2 * 2 AS new_derived_data_value FROM foo` — JustinJDavies, Sep 11 '14 at 12:47
So there is no way to implicitly `keep` variables I define explicitly and `drop` variables that come along with the `set` statement? It seems that SQL is strictly superior in terms of maintainability of code for this type of operation. — JustinJDavies, Sep 11 '14 at 12:49
Another option is don't rename variables while in the step (just create any new ones and format the existing ones) then use an explicit output statement — Jon Clements, Sep 11 '14 at 12:51
Jon, could you illustrate with a code example? I'm not familiar with the `output;` syntax except for `if true then output;` as an alternative to the SQL `WHERE` syntax. I thought output meant "write out a record once the DATA step cursor is at this point". — JustinJDavies, Sep 11 '14 at 12:55
@JustinJDavies ughgh... ignore me - it's been years since I've done any SAS... the `output` statement I'm thinking of is from another language, not SAS - my bad :) — Jon Clements, Sep 11 '14 at 12:59
`output` is not remotely related to the `where` syntax. It can be used to selectively output to different datasets if you create multiple tables in one pass, or to output a row multiple times. (But it also doesn't help with this problem.) The `if ... then output` isn't very common - more common is subsetting if (`if ... ;` which halts execution of that row if false). — Joe, Sep 11 '14 at 14:43
@JustinJDavies How is SQL superior in terms of maintainability? SQL makes a different choice than SAS here, but you're used to it, so you don't complain. In general it's easier to maintain SAS programs in this regards IMO - you have more flexibility. Hardcoded KEEP/DROP lists aren't the only way to go. — Joe, Sep 11 '14 at 14:52
@Joe - what I mean by maintainability is that as I develop the process, and the variables in `foo` increase, with the `DATA` step I outline above I have to edit code to drop the new variables. If I write the SQL statement, it will implicitly ignore any unused variables in `foo` and I don't need to explicitly `drop` them anywhere. — JustinJDavies, Sep 11 '14 at 16:23
You could write a KEEP statement in SAS that is no more or less maintainable than SQL (either way you have to hardcode the variables). DROP is less maintainable, agreed, but KEEP is not. — Joe, Sep 11 '14 at 16:24
Except with `keep` I have to both `keep` the variable (so I can calculate using it) and then also `drop` it if I don't want it in the final output. Furthermore, I have to type the variable name to use it in a calculation - the variable will appear three times. In the SQL version it is always available, and implicitly dropped if I don't explicitly ask for it. — JustinJDavies, Sep 11 '14 at 16:26
@Joe the `if (true) then output;` can be used to the same result as a `WHERE (true)` clause at the end of a SQL statement, if some logical expression replaces `(true)`. Observations that don't evaluate as `true` in the `if` statement simply won't generate a record, likewise for the SQL `WHERE`. Am I missing something? — JustinJDavies, Sep 11 '14 at 17:32
`if (true) then output;` operates very differently than SQL `where`. It's also a statement combination a SAS programmer probably might use a few times in their lives. `if true;` is subsetting if, and works similarly to SQL `where`, except it occurs wherever in the data step- so not exactly like where. `WHERE (true)` is the most similar to SQL where (and in fact is almost exactly the same, including some SQL specific syntax being allowed). `if true then output;` doesn't subset rows nor does it stop the data step on that row - it solely controls what gets output to the output dataset — Joe, Sep 11 '14 at 18:37
which is an extremely important distinction. Almost any time you might be able to use `if (true) then output;` in a single dataset data step, you should instead use `if (true);`, since that short circuits execution, making it faster, and otherwise having exactly the same effect for less code. — Joe, Sep 11 '14 at 18:38

score 3 · Accepted Answer · answered Sep 11 '14 at 14:41

3

I would store the variable names in a macro list, obtained from the DICTIONARY tables. You can then drop them all easily in a data step. e.g.

proc sql noprint;
select name into :vars separated by ' '
from dictionary.columns
where libname = 'SASHELP' and memname='CLASS';
quit;

data want (drop=&vars.);
set sashelp.class;
name1=name;
age1=age;
run;

answered Sep 11 '14 at 14:41

Longfish

7,582
13
19

This is exactly the way I'd do it - simple and straightforward. – Joe Sep 11 '14 at 14:43

score 1 · Answer 2 · answered Sep 11 '14 at 14:48

1

Keith's solution is the best production solution, but a quick alternative assuming you know the first and last variables in the dataset:

data want;
  set class;
  drop name--weight;
  name1=name;
  age1=age;
run;

answered Sep 11 '14 at 14:48

Joe

62,789
6
49
67

Will this result in eg `name` being output work if I also define `name=name` after the `drop` statement? – JustinJDavies Sep 11 '14 at 16:21
No, that wouldn't have any effect. – Joe Sep 11 '14 at 16:23
Is there any way to output a variable named `name` given the drop statement in your answer? At risk breaking SO rules and asking another question in the comments - can I override a `drop` instruction? If I put `drop name; keep name;` which would take precedence? (does the order matter?) – JustinJDavies Sep 11 '14 at 16:28
Drop wins that conflict. The real options you have are the data set option drop/keep vs. the data step drop/keep statement. There you could keep a larger set of things in the data step statement then narrow in the data set option (ie, `data want(keep=...);`) Doesn't help in your particular situation, though, as `DROP` in data step would prevent it from making it to the data set. – Joe Sep 11 '14 at 18:33
dataset options drop, keep, rename, where etc are processed in alphabetical order – Longfish Sep 12 '14 at 08:21

How can I write a DATA step that will drop all variables from the input dataset except the ones that I explicitly define within the dataset?

2 Answers2