
I could not find information about this problem, or could not specify the question correctly.

Let me ask the question with code:
Is this operation

data work.tmp;
    set work.tmp;
    * some changes to data here;
run;

or especially

proc sort data = work.tmp out = work.tmp;
    by x;
run;

dangerous in any way, or considered a bad practice in SAS? Note the same input and output dataset names, which is my main point. Does SAS handle this situation correctly so there would be no ambiguous results with running this kind of data step/procedure?

Matek

2 Answers


The latter, sorting a dataset into itself, is done fairly frequently. Since SORT only rearranges the observations, it does no permanent harm to the dataset (unless you depend on the prior order, or you use a WHERE clause to filter the dataset, or RENAME/KEEP/DROP options), so it's not considered bad practice, as long as tmp is in WORK (or a libname intended to be used as a working area). SAS creates a temporary file to do the sort, and only when the sort succeeds does it delete the old file and rename the temporary one, so there is no substantial risk of corruption.
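To make the parenthetical caveat concrete, here is a sketch (dataset and variable names are illustrative) of the safe in-place sort next to the risky variant:

```sas
/* Safe: same observations, same variables, just reordered */
proc sort data = work.tmp;
    by x;
run;

/* Risky: with no separate OUT= dataset, the WHERE clause
   permanently deletes the filtered-out rows from work.tmp */
proc sort data = work.tmp;
    by x;
    where x > 0;
run;
```

Omitting OUT= is equivalent to `out = work.tmp`; the danger comes entirely from combining that with a filtering option.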

The former, setting a dataset to itself in a data step, is usually not considered good practice. That's because a data step often does something irreversible, i.e., running it once has a different result than running it twice. Thus, you risk not knowing what state your dataset is in; and while with SORT you can usually rely on knowing, because most of the time you get an obvious error if the dataset is not properly sorted, with a data step you might never know. As such, each data step should generally produce a new dataset (at least, new to that thread). There are times when overwriting is necessary, or at least would be substantially wasteful to avoid - perhaps a macro that sometimes runs a long data step and sometimes doesn't - but usually you can program around it.

It's not dangerous in the sense that the file system will get confused, though; similar to sort, SAS will simply create a temporary file, fill the new dataset, then delete the old one and rename the temporary file.

(I leave aside mention of things like modify which must set a dataset to itself, as that has an obvious answer...)
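For completeness, a minimal sketch of that MODIFY case (the update logic is hypothetical): MODIFY edits observations in place, so the input and output dataset are necessarily the same.

```sas
/* MODIFY updates observations in the existing dataset rather
   than building a new one, so data and modify name the same set */
data work.tmp;
    modify work.tmp;
    x = x + 1;   /* hypothetical in-place update */
    replace;     /* write the changed observation back */
run;
```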

Joe
  • Space permitting, generation datasets can be useful where datasets are being overwritten. Probably also worth mentioning NOT to put a `where` clause in a `proc sort` if a new dataset is not specified (unless you definitely want to sort and remove rows at the same time!) – Longfish Aug 17 '15 at 16:08
  • I don't consider generation datasets a useful solution to the above; the problem I have with re-using dataset names is largely not *knowing* whether you have the before or after dataset. Generation wouldn't really help that (much). – Joe Aug 17 '15 at 16:38
  • Thank you for a very nice answer Joe. May I ask about "as long as the dataset is in the working directory"? Would SAS not create a tmp dataset and then rename it (just like you said) in a non-WORK library? Could performing operations like this on a non-WORK library be harmful? – Matek Aug 17 '15 at 19:00
  • 2
    @Matek Oh, no, it would work that way either way - WORK has no special meaning outside of the automatic clearing out and the defaulting when no libname is specified. But I would not, usually, sort a libname in a permanent location as it shouldn't be necessary: either sort it before you put it out (you can specify `out=perm.dsname` for example) or if you're using it, sort it coming in. Once a permanent dataset is put out, it should be left alone - so other users can expect it to be consistent and not change its sort order just because you wanted it in a different order. – Joe Aug 17 '15 at 19:03
  • Got it :) thank you for these didactic explanations. – Matek Aug 18 '15 at 07:34

Some examples of why this is not considered good practice. Say you're working interactively, and you have created the following dataset named tmp:

data tmp;
  set sashelp.class;
run;

If you were to run the code below twice, it would run fine the first time, but on the second run you would receive a warning because the variable age no longer exists in that dataset:

data tmp;
  set tmp;
  drop age;
run;

In this case, it's a pretty harmless example, and you are lucky enough that SAS is simply giving a warning. Depending on what the data step was doing, though, it could just as easily have been something that generates an error, e.g.:

data tmp;
  set tmp (rename=(age=blah));
run;

Or even worse, it may generate no ERROR or WARNING at all and silently change the results, as in the code below:

data tmp;
  set tmp;
  weight = log(weight);
run;

Our intention is to apply a simple log transformation to the weight variable in preparation for modeling, but if we accidentally run the step a second time, we end up calculating log(log(weight)). No warnings or errors will be given, and it will not be immediately obvious from looking at the dataset that anything is wrong.

IMO, you are much better off creating iterative datasets, i.e., tmp1, tmp2, tmp3, and so on, for every process that updates the dataset in some way. Space is much cheaper than time spent debugging.
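Reusing the examples above, such a pipeline might look like this: each step writes a new dataset, so accidentally re-running any single step is harmless.

```sas
/* Step 1: copy the source and drop a variable */
data tmp1;
  set sashelp.class;
  drop age;
run;

/* Step 2: log-transform weight; re-running this step still
   reads tmp1, so weight is never double-transformed */
data tmp2;
  set tmp1;
  weight = log(weight);
run;

/* Step 3: sort into yet another new dataset */
proc sort data = tmp2 out = tmp3;
  by name;
run;
```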

Robert Penridge