5

I need some assistance with splitting a large SAS dataset into smaller datasets.

Each month I'll have a dataset containing a few million records. This number will vary from month to month. I need to split this dataset into multiple smaller datasets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset then I need the end result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.

From what I've been looking at it appears this will require using macros. Unfortunately I'm fairly new to SAS (unfamiliar with using macros) and don't have a lot of time to accomplish this. Any help would be greatly appreciated.

user2941280
  • 285
  • 3
  • 9
  • 16

6 Answers6

4

Building on Joe's answer, maybe you could try something like this :

%MACRO SPLIT(DATASET);

%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET__ID., NLOBS));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);

DATA 
  %DO I=1 %TO &NB_DATASETS.;
    WANT&I. 
  %END;;

  SET WANT;

  %DO I=1 %TO &NB_DATASETS.;

    %IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;

  %END;
RUN;
%MEND SPLIT;    
ftrotter
  • 3,066
  • 2
  • 38
  • 52
Karim L
  • 90
  • 4
  • This is very similar to what I ended up doing in the end. Thanks. – user2941280 Feb 03 '14 at 12:56
  • 2
    Since my edit was not approved for changing the answer too much, I'd like to point out here that there are some minor improvements/corrections to be made: 1) It would be a good idea to also close the dataset to prevent it from locking later. 2) SET WANT; should be SET &DATASET; 3) (Nice to have) Both the number of obs per dataset and the names of those new datasets could be additional macro parameters. – Yoh Jul 04 '14 at 06:40
  • @Yoh, you should write a new answer – LuizZ Feb 24 '22 at 02:33
1

You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.

data want1 want2 want3 want4 want5 want6 want7 want8 want9;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;

Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • True, I could do it something like that. Unfortunately I won't know how many records to account for...it could be 1 million or 10 million or something else entirely. – user2941280 Jan 30 '14 at 12:13
1

You can do it without macros, using CALL EXECUTE(). It creates SAS-code as text strings and then executes it, after your "manually written" code completed.

data _null_;
    if 0 then set have nobs=n;
    do i=1 to ceil(n/250000);
        call execute (cats("data want",i)||";");
        call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
        call execute ("run;");
    end;
run;
Dmitry Shopin
  • 1,753
  • 10
  • 11
1

The first result on Google is from the SAS User Group International (SUGI) These folks are your friends.

The article is here: http://www2.sas.com/proceedings/sugi27/p083-27.pdf

The code is:

%macro split(ndsn=2); 
data %do i = 1 %to &ndsn.; dsn&i. %end; ; 
 retain x; 
 set orig nobs=nobs; 
 if _n_ eq 1 
 then do; 
 if mod(nobs,&ndsn.) eq 0 
 then x=int(nobs/&ndsn.); 
 else x=int(nobs/&ndsn.)+1; 
 end; 
 if _n_ le x then output dsn1; 
 %do i = 2 %to &ndsn.; 
 else if _n_ le (&i.*x) 
 then output dsn&i.; 
 %end; 
 run; 
%mend split; 

%split(ndsn=10);

All you need to do is replace the 10 digit in "%split(ndsn=10);" with the number you require. In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.

Hey presto!

ApostlesPoem
  • 81
  • 1
  • 4
1

A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:

data in_data;
  do recid = 1 to 1.000001e7;
    datavar = 1;
    output;
  end;
run;


data _null_;
  if 0 then set in_data;
  declare hash h_out();
  h_out.defineKey('_n_');
  h_out.defineData('recid','datavar');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.

Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:

data _null_;
  if 0 then set in_data;
  declare hash h_out(dataset:'in_data(obs=0)');
  h_out.defineKey('recid');
  h_out.defineData(all:'y');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.

Joe
  • 62,789
  • 6
  • 49
  • 67
1

Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.

* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;


data output2;
set df;
if 5 <= _N_ < 10 then output;
run;


data output3;
set df;
if 10 <= _N_ < 15 then output;
run;


data output4;
set df;
if 15 <= _N_ < 22 then output;
run;
Rasmus Larsen
  • 5,721
  • 8
  • 47
  • 79