8

I got a short question - If we are creating a SAS dataset say - Sample.sas7bdat which already exists, will the code take more time to execute (because here the code has to overwrite the existing dataset) than the case when this dataset was not already there?

data sample;
.....
.....
run;

I did some reasearch on the internet but could not find a satisfactory answer. To me it seems like the code should take a little bit extra time, though not sure how much of impact it would make on a 10GB of dataset.

in_user
  • 1,948
  • 1
  • 15
  • 22
  • Could you update your question with Operating System information, and which SAS version you are using? – mjsqu Dec 08 '14 at 11:45

3 Answers3

5

You could test this yourself fairly easily. A few caveats:

  • Make sure you have a large enough dataset such that you won't miss the differences in simple random cpu activity. 100+MB is usually a good target.
  • Make sure you perform the test multiple times - the more the better, with no time in between if possible. One test will always be insufficient and will always tend to show the first dataset as faster, because it benefits from write caching (basically the OS saying that it's done writing when it's not, but simply has the write queued up in memory).

Here's an example of my test. This is a 100 million row dataset with two 8 byte numerics, so 1.6 GB.

First, the results. I see a few second difference. Why? SAS takes a few operations when replacing a dataset:

Write dataset to temporary file
Delete the old dataset
Rename temporary dataset to new dataset

On some OSs this seems to be faster than others; I've found Windows desktop to be fairly slow about this, compared to unix or even Windows Server OS which is pretty quick. I'm guessing Windows is more careful about deleting than simply changing a file system pointer, but I don't really know. It's certainly not copying the whole file over from the utility directory (it's not nearly enough time for that). I also suspect write caching is still giving a bit of a boost to the new datasets, particularly as time for all datasets is growing as I write. The difference is probably only about a second or so - the difference between _REP iteration 2 and _NEW iteration 3 seems the most reasonable to me.

Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044

Notice the first iteration new is far faster than the others, and overall time increases as you go (as the write caching is less and less able to keep up). I suspect if you allow it to continue (or use a still larger file, which I don't have time for right now) you might see even more consistent times. I'm also not sure what happens with write caching when a file that is write cached is deleted; it's possible it has to wait for the write caching to write out to disk before doing the delete op or something similar. You could perform a test where you waited 30 seconds between _NEW and _REP to verify that.

The code:

%macro test_me(iter=1);
%do _i=1 %to &iter.;
%let start = %sysfunc(time());
data test&_i.;
  do x = 1 to 1e8;
    y=x**2;
    output;
  end;
run;
%let mid=%sysfunc(time());
data test&_i.;
  do x = 1 to 1e8;
    y=x**2;
    output;
  end;
run;
%let end=%sysfunc(time());
%let _new = %sysevalf(&mid.-&start.);
%let _rep = %sysevalf(&end.-&mid.);

%put Iteration &_i. &=_new. &=_rep.;
%end;

proc datasets nolist kill;
quit;
%mend test_me;

options nosource nonotes nomprint nosymbolgen;

%test_me(iter=5);
Joe
  • 62,789
  • 6
  • 49
  • 67
  • Thanks for sharing this!!So is it safe to say that the extra time goes in copying the dataset from the "temp(work directory in UNIX)" folder to the main library? – in_user Dec 08 '14 at 17:14
  • It shouldn't actually be moving a file. It should be changing a file system pointer, primarily. I'm not sure why it's a second or two or three; that seems long to me, hence I suspect something funny with write caching. It's definitely not taking long enough to write the whole dataset out twice. – Joe Dec 08 '14 at 18:13
3

There are more file operations involved when you are overwriting. After creating the table, SAS will delete the old table and rename the new. In my tests this took 0.2 seconds extra time.

Stig Eide
  • 1,052
  • 7
  • 14
  • I am just wondering what are other extra file operations in case of overwriting. – in_user Dec 08 '14 at 15:31
  • 1
    Delete and Rename shouldn't be (very) dependent on file size, they're primarily file system pointer operations (no actual file contents are moved). This is somewhat OS dependent. – Joe Dec 08 '14 at 16:40
1

In a brief test, my 800Mb dataset took 4 seconds to create new and 10-15 seconds to overwrite. I'm assuming this is because SAS has to preserve the existing dataset until the datastep completes executing so as to preserve data-integrity. That's why you might get the following message in the log:

WARNING: Data set dset was not replaced because this step was stopped.

Overwrite test

NOTE: The data set WORK.SAMPLE has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
       real time           10.06 seconds
       user cpu time       3.08 seconds
       system cpu time     1.48 seconds
       memory              1506.46k
       OS Memory           26268.00k
       Timestamp           08/12/2014 11:43:06 AM
       Step Count                        42  Switch Count  38
       Page Faults                       0
       Page Reclaims                     155
       Page Swaps                        0
       Voluntary Context Switches        190
       Involuntary Context Switches      288
       Block Input Operations            0
       Block Output Operations           1588496

New data test

 NOTE: The data set WORK.SAMPLE1 has 100000000 observations and 1 variables.
 NOTE: DATA statement used (Total process time):
       real time           3.94 seconds
       user cpu time       3.14 seconds
       system cpu time     0.80 seconds
       memory              1482.18k
       OS Memory           26268.00k
       Timestamp           08/12/2014 11:43:10 AM
       Step Count                        43  Switch Count  38
       Page Faults                       0
       Page Reclaims                     112
       Page Swaps                        0
       Voluntary Context Switches        99
       Involuntary Context Switches      294
       Block Input Operations            0
       Block Output Operations           1587464

The only difference between the log messages is the real time, which to me would indicate SAS is processing filesystem operations on the dataset files.

N.B. I have tested this on SAS (r) Proprietary Software Release 9.4 TS1M2, which I'm running through SAS Studio online. I think it's a Linux operating system, results could vary depending on your operating system.

mjsqu
  • 5,151
  • 1
  • 17
  • 21
  • This test is somewhat flawed as I'll explain in my answer; testing just one replace will lead to a dramatic apparent speed difference due to write-caching. – Joe Dec 08 '14 at 16:28
  • I did repeat the test a number of times – mjsqu Dec 08 '14 at 19:08