Splitting a SAS dataset into mutual-exclusive by row in parallel?

Question

There is SO question on split a large dataset into smaller one but with the advent of proc ds2 there must a way to do this using threads?

I have written the below data step to split a dataset into &chunks. chunks. I tried to write the same in proc ds but it just fails. I am quite new to proc ds2 so a simple explanation for someone with good understanding of data step would be ideal.

Data stepcode

%macro output_chunks(in, out, by, chunks);
data %do i = 1 %to &chunks.;
    &out.&i.(compress=char drop = i)
    %end;
;
    set &in.;
    by &by.;
    retain i 0;
    if first.&by. then do;
        i = i + 1;      
        if i = &chunks.+1 then i = 1;
    end;

    %do i = 1 %to &chunks.;
        if i = &i. then do;
            output  &out.&i.;
        end;
    %end;
run;
%mend;

proc ds2 code

proc ds2; 
  thread split/overwrite=yes; 
    method run(); 
      set in_data; 
      thisThread=_threadid_; 
      /* can make below into macro but I can't seem to get it to work */
      if thisThread = 1 then do;
        output ds1;
      end;
      else if thisThread = 2 then do;
        output ds2;
      end;
    end; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
    end;
  endthread; 
  run; 
quit;

Just to clarify: do you want to specifically discuss a [tag:Sas-ds2] solution here? Or do you want to find a general solution to the issue that's more efficient than a regular data step? (There is for example a way to do this with hash tables that's much more efficient than my suggested solution in that thread.) — Joe, Nov 17 '17 at 15:49
Do you have SAS/CONNECT installed? That might be another possible approach. — user667489, Nov 17 '17 at 21:30

score 2 · Answer 1 · answered Nov 17 '17 at 18:21

So, you're right in one sense that DS/2 may be helpful here. However, I suspect it's a bit more complicated.

DS/2 will happily thread data steps, but what is going to be more challenging is writing to several different datasets. That's because there's not a great way to structure the output dataset name without using the macro language, which won't play with the threading very well as far as I can tell (though I'm no expert here).

Here's an example of it using threading:

PROC DS2;

     thread in_thread/overwrite=yes;
     dcl bigint count;
     drop count;
        method init();
           count=0;
        end;
         method run();
             set in_data;
             count+1;
             output;             
         end;
         method term();      
           put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data out_data/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           output;
         end;
     enddata;
     run;
quit;

But this just writes one dataset out, and if you change threads=4 to 1, it doesn't actually take any longer. Both are okay speed-wise, though actually slower than the regular data step (about 1.8x the speed for me). DS/2 uses a much, much slower method to access data under the hood than SAS's base data step when accessing SAS datasets; DS/2's speed gains really come into play when you're working in RDBMSs via SQL or similar.

However, there's no good way to drive the output in parallel. Here's the version of the above turned into making 4 datasets. Notice that the actual selection of where to output is in the main, non-threaded data step...

PROC DS2;

     thread in_thread/overwrite=yes;
     dcl bigint count;
     dcl bigint thisThread;
     drop count;
        method init();
           count=0;
        end;
         method run();
             set in_data;
             count+1;
             thisThread = _threadid_;
             output;

         end;
         method term();      
           put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data a b c d/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           select(thisThread);
             when (1) output a;
             when (2) output b;
             when (3) output c;
             when (4) output d;
             otherwise;
           end;
         end;
     enddata;
     run;
quit;

So it's actually quite a lot slower than in the non-threaded version. Oops!

Really, your issue here is that disk i/o is the main problem, not CPU. Your CPU does virtually no work here. DS/2 might be able to help in some edge cases where you have a really fast SAN that allows tons of simultaneous writes, but ultimately it takes X amount of time to read those million records and same X amount of time to write a million records, based on your i/o constraint, and odds are parallelizing that won't help.

Hash tables will add a lot more I suspect, and could certainly be used here with DS/2; see my new answer on the other question linked in OP for the data step version. DS/2 probably won't make that solution any faster, more likely slower; but you could implement roughly the same thing in DS/2 if you wanted, and then the sub-thread would be able to output on its own without involving the master thread.

Where DS/2 would be helpful would be if you're doing this in Teradata or something, where you can use SAS's in-database accelerator to execute this code database-side. That would make things a lot more efficient. Then you could use something similar to my code above, or better yet a hash solution.

score 1 · Answer 2 · answered Nov 17 '17 at 20:18

Example of a User-Defined DS2 package to split a data set using HoH method, the big downside is the inability to name the dataset by the keys without a LOT of fudgery due to very limited utility of variable lists in DS2, as a result I opt for a simpler naming convension:

data cars;
set sashelp.cars;
run;

proc ds2;

package hashSplit / overwrite=yes;

declare package hash  h  ();
declare package hash  hs ();
declare package hiter hi;

/**
  * create a child multidata hash object
  */
private method mHashSub(varlist k, varlist d) returns package hash;
  hs = _new_ [this] hash();
  hs.keys(k);
  hs.data(d);
  hs.multidata('Y');
  hs.defineDone();
  return hs;
end;

/**
  * constructor, create the parent and child hash objects
  */
method hashSplit(varlist k);
  h = _new_ [this] hash();
  h.keys(k);
  h.definedata('hs');
  h.defineDone();
end;

/**
  * adds key values to parent hash, if necessary
  * adds key values and data values to child hash
  * consilidates the FIND, ADD and nested ADD methods
  */
method add(varlist k, varlist d);
  declare double rc;

  rc = h.find();
  if rc ^= 0 then do;
    hs = mHashSub(k, d);
      h.add();
  end;
  hs.add();
end;

/**
  * outputs the child hashes to data sets with a fixed naming convention
  *
  * SAS needs to add more support for using variable lists with functions/methods besides hash
  */
method output();
  declare double rc;
  declare int i;

  hi = _new_ hiter('h');

  rc = hi.first();
  do i = 1 to h.num_items by 1 while (rc = 0);
    hs.output(catx('_', 'hashSplit', i));
      rc = hi.next();
  end;
end;

endpackage;
run;
quit;

/**
  * example of using the hashSplit package
  */
proc ds2;
data _null_;
varlist k [origin];
varlist d [_all_];
declare package hashSplit split(k);

method run();
  set cars;
  split.add(k, d);
end;

method term();
  split.output();
end;
enddata;
run;
quit;

Does this actually use threads? And if so does it improve performance? — Tom, Nov 17 '17 at 21:39

Splitting a SAS dataset into mutual-exclusive by row in parallel?

2 Answers2