I have a dataset with 5 groups and I want to use the DS2 procedure in SAS to concurrently compute group means.

Simulated dataset:

data sim;
    call streaminit(7);
    do group = 1 to 5;
        do pt = 1 to 500;
            x = rand('ERLANG', group);
            output;
        end;
    end;
run;

How I envision it working is that each of 5 threads receives a subset of the data corresponding to a particular group. The mean of x is calculated on each subset like so:

proc ds2;
    thread t / overwrite=yes;
        dcl double n sum mean;

        method init();
            n = 0;
            sum = 0;
            mean = .;
        end;

        method run();
            set sim;    /* Or perhaps a subsetted dataset */
            sum + x;
            n + 1;
        end;

        method term();
            mean = sum / n;
            output;
        end;
    endthread;

    ...
quit;

The problem is, if you call a thread that processes a dataset like below, rows are sent to the 5 threads all willy-nilly (i.e. irrespective of groups).

    data test / overwrite=yes;
        dcl thread t t_instance;
        method run();
            set from t_instance threads=5;
        end;
    enddata;

How can I tell SAS to subset the data by group and pass each subset to its own thread?

Alex A.
  • From [the documentation](http://support.sas.com/documentation/cdl/en/ds2ref/67313/HTML/default/viewer.htm#p1polmk2yv18uvn15rp9wcdvwpay.htm#n124iu5iqujfi7n1tyf5jrc1qa2l) it seems you need a `by` statement to specify the grouping. Further discussion [here](http://support.sas.com/documentation/cdl/en/ds2ref/67313/HTML/default/viewer.htm#n0t6d2pt7pbu2wn1b3ezzecslk4a.htm). However, from my superficial reading it isn't clear if this only relates to In Database Processing. – SRSwift Feb 03 '15 at 19:29
  • @SRSwift: I'm familiar with BY group processing and I've looked at it for this situation. I assume I'll need a `by` statement somewhere, but I haven't been able to figure out how to use it to spawn a thread for each group. – Alex A. Feb 03 '15 at 19:38
  • I don't have access to DS2, but I would assume that it follows the `set` statement as in base SAS. See [here](http://support.sas.com/documentation/cdl/en/ds2ref/67313/HTML/default/viewer.htm#n0aloisf2pdqw2n13sq0vc7qwee3.htm). – SRSwift Feb 03 '15 at 21:08
  • @SRSwift: I know how to use the `by` statement in general. It does indeed go beneath `set` as in a data step. My issue is utilizing the groups for a specific task. I've scoured the SAS DS2 docs and came up with nothing directly relevant to my needs, that's why I posted to SO. – Alex A. Feb 03 '15 at 21:23
  • [This](http://support.sas.com/documentation/cdl/en/proc/67327/HTML/default/n0ox2hnyx7twb2n13200g5hqqsmy.htm#p0wpqsvxdw1ffpn1vuer2fn7ct4s) seems relevant, but I'm not sure how helpful it is. – user667489 Feb 03 '15 at 21:59
  • @user667489: I looked at that but I'm not using in-database processing, I'm using regular SAS datasets. – Alex A. Feb 03 '15 at 22:15
  • Sorry that this doesn't fit the question exactly, but as a solution have you considered using arrays of `sum` and `n` of length 5 and using group as an index, then processing individually at the end? – SRSwift Feb 03 '15 at 22:41
  • @SRSwift: The goal is to compute the means in parallel--would I be able to do that using arrays? – Alex A. Feb 03 '15 at 22:54
  • http://support.sas.com/resources/papers/proceedings14/SAS329-2014.pdf by the way has some discussion of this - it's not perfect as it mixes in-db processing with regular SAS stuff, but it was helpful to understand the issue here. – Joe Feb 03 '15 at 23:17
  • [tag:sas-DS2] tag created - please feel free to improve the wiki. – Joe Feb 03 '15 at 23:22

1 Answer

I believe you have to add the `by` statement inside the `run()` method, and then add code to handle the BY group (i.e., if you want one output row per group, output on `last.group` and then reset the totals). DS2 is supposed to be smart enough to use one thread per BY group (or, at least, to process each BY group entirely within a single thread). I'm not sure you'll see a great improvement if you're reading from disk, since the time saved by threading is probably small compared with the disk read time, but who knows.

The only changes below are in `run()` (with `term()` emptied, since the output now happens per group), plus a `PROC MEANS` step to check the results.

data sim;
    call streaminit(7);
    do group = 1 to 5;
        do pt = 1 to 500;
            x = rand('ERLANG', group);
            output;
        end;
    end;
run;

proc ds2;
    thread t / overwrite=yes;
        dcl double n sum mean ;

        method init();
            n = 0;
            sum = 0;
            mean = .;
        end;

        method run();
            set sim;
            by group;
            sum + x;
            n + 1;
            if last.group then do;
                mean = sum / n;
                output;
                n=0;
                sum=0;
            end;
        end;

        method term();
        end;
    endthread;
    run;

    data test / overwrite=yes;
        dcl thread t t_instance;
        method run();
            set from t_instance threads=5;
        end;
    enddata;
    run;
quit;

/* Cross-check the DS2 group means against PROC MEANS */
proc means data=sim;
    class group;
    var x;
run;
Joe
  • @Alex I didn't change your call at all - it's still in the code above (scroll down). I just moved it inside the `PROC DS2` block for organizational purposes. – Joe Feb 04 '15 at 17:30
  • Ugh, right. Scrolling down is useful. Sorry about that. – Alex A. Feb 04 '15 at 17:47
  • If you execute this and look at the computed means from DS2, they're all 1. – Alex A. Feb 04 '15 at 18:47
  • Hmm, that wasn't the case for me - they matched the PROC MEANS means perfectly. Let me look to see if I changed something by mistake. – Joe Feb 04 '15 at 18:49
  • @Alex I don't get that. I get means of 1.003105, 1.998972, 2.946343, 3.94346119, and 5.13233902, which perfectly match the `PROC MEANS` output. (Note, I set a random seed in your first data step to 7, so you should be able to identically replicate the results). This is with 9.4TS1M2; if you have an earlier 9.4 it's possible this doesn't work properly in that version? – Joe Feb 04 '15 at 20:29
  • I have 9.4 TS1M1. I had set the seed too but I forgot to include that in my post. Regardless, I don't know what I did the first time I ran your code, but I tried it again and it gets the right answer now. Is there any way to verify that it is indeed split properly by group? – Alex A. Feb 04 '15 at 20:38
  • @Alex I don't know; in particular, I only have a 4 core machine so 5 threads wouldn't be very interesting anyway. One thing I did was re-add an "output" to the term() routine; when I did that it was interesting. The first time I ran it, 3 groups were output, then a zero row (representing the term() output row from one thread), then a group, then a zero row, then the last group, then another zero row, then another zero row. That sounds like the first 3 groups all ended up in one thread. – Joe Feb 04 '15 at 20:40
  • @Alex But then I did it another time and got 5 separate groups with 5 zero rows between each, suggesting each had one thread and no threads were wasted. (You can also tell by what's in the group/pt/x variables - those are entirely missing if a thread didn't process any rows, I think). It's possible my CPU usage was different, suggesting a different thread usage - I'm afraid I just don't know DS2 well enough to know. Someone @ communities.sas.com might? – Joe Feb 04 '15 at 20:41
  • Upon further reading, it looks like SAS In-Database Code Accelerator separates `by` groups by thread and SAS Embedded Process for Teradata does not. I'm using neither of these; it remains to be seen what it's _supposed_ to do for base SAS datasets. In the meantime, your answer works and has been accepted. While there may not always be one thread per group, the groups do appear to be complete within threads, which is good enough. Thanks so much for your help! – Alex A. Feb 04 '15 at 21:27
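
On the question raised in the comments of how to verify which thread handled each group: one possible check is to copy the DS2 automatic variable `_threadid_` into an ordinary column before each `output`, so every group's row records the thread that produced it. The following is only a sketch under that assumption. The names `t_check`, `check`, and `thread_id` are hypothetical and not part of the original answer, and the rest mirrors the accepted code.

```sas
proc ds2;
    thread t_check / overwrite=yes;
        dcl double n sum mean;
        dcl int thread_id;   /* hypothetical column holding the thread number */

        method init();
            n = 0;
            sum = 0;
            mean = .;
        end;

        method run();
            set sim;
            by group;
            sum + x;
            n + 1;
            if last.group then do;
                mean = sum / n;
                /* _threadid_ is an automatic variable and is not written to */
                /* the output itself, so copy it into a regular variable.    */
                thread_id = _threadid_;
                output;
                n = 0;
                sum = 0;
            end;
        end;
    endthread;
    run;

    data check / overwrite=yes;
        dcl thread t_check t_instance;
        method run();
            set from t_instance threads=5;
        end;
    enddata;
    run;
quit;

/* One row per group; groups sharing a thread_id ran in the same thread */
proc print data=check;
    var group thread_id n mean;
run;
```

If each group appears exactly once with `n` equal to 500, the groups were kept whole within a thread. Repeated `thread_id` values would simply mean that several groups were assigned to the same thread, which matches what was observed in the comments above.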