
I was trying to use proc ds2 to get a performance increase over the normal data step by using its multithreaded capability.
fred.testdata is an SPDE dataset containing 5 million observations. My code is below:

proc ds2; 
   thread home_claims_thread / overwrite = yes;
   /*declare char(10) producttype;
   declare char(12) wrknat_clmtype;
   declare char(7) claimtypedet;
   declare char(1) event_flag;*/
   /*declare date week_ending having format date9.;*/
   method run();
      /*declare char(7) _week_ending;*/
      set fred.testdata;
      if claim = 'X' then claimtypedet= 'ABC';
      else if claim = 'Y' then claimtypedet= 'DEF';
      /*_week_ending = COMPRESS(exposmth,'M');
    week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
   end;
   endthread;

data home_claims / overwrite = yes;
   declare thread home_claims_thread t;  
   method run();
      set from t threads=8;
   end;
enddata;
run;
quit;

I didn't include all the IF statements, only a few; otherwise it would have taken up a few pages (you should get the idea hopefully). As the code currently stands it runs quite a bit faster than the normal data step; however, significant performance issues arise when either of the following happens:

  1. I uncomment any of the declare statements
  2. I include any numeric variables in fred.testdata (even without performing any calculations on the numeric variables)

My questions are:

  1. Is there any way to introduce numeric variables into fred.testdata without significant slowdowns that make DS2 far slower than the normal data step? (For this small table of 5 million rows, including numeric column/s pushes the real time to about 1 min 30 for DS2 versus 20 seconds for the normal data step; the actual full table is closer to 600 million rows.) For example, I would like to be able to do that week_ending conversion without it introducing a 5x performance penalty in run times. Run time for DS2 WITHOUT the declare statements and numeric variables is about 7 seconds.
  2. Is there any way to compress the table in ds2 without having to do an additional data step to compress it?

Thank you

reecec
  • Would utilizing **Proc HPds2** be viable? (Usually HP versions are a lot more efficient.) – pinegulf Nov 28 '17 at 06:49
  • Did you try adding new empty fields to fred.testdata first (alter table...), then updating it via proc sql update? With such a huge table it seems you're I/O bound. What is the number of existing columns and new columns to be added? – vasja Nov 28 '17 at 09:18
  • The main problem is that threading in DS2 uses a single reader rather than multiple parallel reads. HPDS2 may be able to read in parallel. There is a workaround I've used in the past involving rsubmit and multiple SAS sessions to read the data in parallel chunks; it tends to be fairly effective. – Stu Sztukowski Nov 28 '17 at 19:45
  • What kind of SAS appliance is this on - a grid, a server, or a desktop - or CAS/Viya? When you say it's stored in SPDE, is that SPDE on a distributed set of disks or is it SPDE but on a single disk? – Joe Nov 28 '17 at 22:20
  • Thanks everyone for the comments. – reecec Nov 28 '17 at 23:49
  • pinegulf: yes, proc hpds2 is an option; I've never used it before but will try the suggestion in the post below on how to use it. Stu: yes, I would have liked the rsubmit approach but we don't have the SAS/CONNECT licence available, unfortunately. Joe: it's a desktop provided by AWS. The data is stored in SPDE format on a single NVMe drive; the transfer rate is reasonably good at about 700-800 MB/sec (from monitoring the data steps using nmon) – reecec Nov 28 '17 at 23:56

1 Answer


Two methods to try: use proc hpds2 to have SAS handle parallel execution, or take a more manual approach. Note that neither method is guaranteed to preserve row order.

Method 1: PROC HPDS2

HPDS2 performs massively parallel processing of data. In single-machine mode, it runs one parallel process per core, then puts the data back together. You only need to make a few slight modifications to your code to run it.

HPDS2 requires you to name the input and output tables in the data= and out= options on the proc statement. Inside the program, your data and set statements always use the fixed identifiers DS2GTF.out and DS2GTF.in:

    data DS2GTF.out;
        method run();
            set DS2GTF.in;
            <code>;
        end;
    enddata;

Knowing that, we can modify your code to run on HPDS2:

proc hpds2 data=fred.test_data
           out=home_claims; 

   data DS2GTF.out;
   /*declare char(10) producttype;
   declare char(12) wrknat_clmtype;
   declare char(7) claimtypedet;
   declare char(1) event_flag;*/

   /*declare date week_ending having format date9.;*/
   method run();

      /*declare char(7) _week_ending;*/
      set DS2GTF.in;

      if claim = 'X' then claimtypedet= 'ABC';
      else if claim = 'Y' then claimtypedet= 'DEF';

      /*_week_ending = COMPRESS(exposmth,'M');
    week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/

   end;
   enddata;

run;
quit;

Method 2: Split the data using rsubmit and append

The code below uses rsubmit and direct observation access to read the data in chunks, then appends the chunks together at the end. This can work especially well if you have your data set up for block I/O.

options sascmd='!sascmd'
        autosignon=yes
        noconnectwait
        noconnectpersist
        ;

%let cpucount = %sysfunc(getoption(cpucount));

%macro parallel_execute(data=, out=, threads=&cpucount);

    /* Get total obs from data */
    %let dsid = %sysfunc(open(&data.));
    %let n    = %sysfunc(attrn(&dsid., nlobs));
    %let rc   = %sysfunc(close(&dsid.));

    /* Run &threads rsubmit sessions */
    %do i = 1 %to &threads;

        /* Determine the records that each worker will read */
        %let firstobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i+1)+1, floor);
        %let lastobs  = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.), floor);

        /* Check for an input libname */
        %if(%scan(&data., 2, .) NE) %then %do;
            %let inlib = %scan(&data., 1, .);
            %let indsn = %scan(&data., 2, .);
        %end;
        %else %do;
            %let inlib = workdir;
            %let indsn = &data.;
        %end;

        /* Check for an output libname */
        %if(%scan(&out., 2, .) NE) %then %do;
            %let outlib = %scan(&out., 1, .);
            %let outdsn = %scan(&out., 2, .);
        %end;
        %else %do;
            %let outlib = workdir;
            %let outdsn = &out.;
        %end;

        /* Work library location of this session, to be inherited by the parallel session */
        %let workdir = %sysfunc(getoption(work));
        libname workdir "&workdir."; /* assign locally so it can be inherited and cleared later */

        /* Sign on to a remote session and send over all user-made macro variables */
        %syslput _USER_ / remote=worker&i.;

        /* Run code on remote session &i */
        rsubmit remote=worker&i. inheritlib=(&inlib.);

             libname workdir "&workdir.";

             data workdir._&outdsn._&i.;
                 set &inlib..&indsn.(firstobs=&firstobs. obs=&lastobs.);
/*               <PUT CODE HERE>;*/
             run;
        endrsubmit;

    %end;

    /* Wait for everything to complete */
    waitfor _ALL_;

    /* Append all of the chunks together */
    proc datasets nolist;
        delete &out.;

        %do i = 1 %to &threads.;
            append base=&out.
                   data=_&outdsn._&i.
                   force
            ;
        %end;

/* Optional: remove all temporary data */
/*      delete _&outdsn._:;*/
    quit;

    libname workdir clear;
%mend;
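
To see how the firstobs=/lastobs= arithmetic partitions the table, here is a small standalone sketch (not part of the macro above) that applies the same two formulas with hypothetical defaults of 1,020 rows and 3 threads, matching the test below, and prints the chunk boundaries to the log:

    %macro show_bounds(n=1020, threads=3);
        %do i = 1 %to &threads.;
            %let firstobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.+1)+1, floor);
            %let lastobs  = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.), floor);
            %put Chunk &i.: firstobs=&firstobs. lastobs=&lastobs.;
        %end;
    %mend show_bounds;

    %show_bounds()

    /* Log:
       Chunk 1: firstobs=1 lastobs=340
       Chunk 2: firstobs=341 lastobs=680
       Chunk 3: firstobs=681 lastobs=1020 */

Because %sysevalf floors the boundaries, the chunks are contiguous and non-overlapping even when the row count does not divide evenly by the thread count.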

You can test its functionality with the below code:

data pricedata;
    set sashelp.pricedata;
run;

%parallel_execute(data=pricedata, out=test, threads=3);

If you look at the temporary files in your WORK directory, you'll see that the dataset was split evenly among the 3 parallel processes, and that the chunk counts add up to the original total.

_test_1 = 340
_test_2 = 340
_test_3 = 340
TOTAL   = 1020

pricedata = 1020
Stu Sztukowski
  • I think the second method is the way to go (both from a simplicity standpoint and an effectiveness standpoint), though the HPDS2 answer is interesting. – Joe Nov 28 '17 at 22:18
  • Thanks Stu for the suggestion. I would have liked to use rsubmit but unfortunately we do not have a SAS CONNECT licence. – reecec Nov 28 '17 at 23:58
  • Ok, I have tried testing hpds2 and here are my findings. It outputs the desired results (and I can see all 8 cores at about 70% utilization), but the performance is slower than the normal data step. Some stats (this is without introducing numeric variables, so I expect more performance drops once they're introduced): 53 seconds real time using hpds2, 8 seconds using proc ds2, 18 seconds using the normal data step. I quite like hpds2 as the coding is a bit simpler, but unfortunately it seems a lot slower than proc ds2 even though it's using multiple cores. – reecec Nov 29 '17 at 08:06
  • In that case, you may want to do some additional performance monitoring to see where the bottlenecks are. Use `options msglevel=i` to help diagnose the slowest part of the process. Good white papers on ETL performance tuning: http://support.sas.com/resources/papers/ETLperformance07.pdf and http://analytics.ncsu.edu/sesug/2009/AD011.Chandler.pdf If you see real time much greater than CPU time, the likely bottleneck is disk I/O. Things like adjusting `bufno`, selecting only the columns that are needed, and reducing column sizes can help a lot. – Stu Sztukowski Nov 29 '17 at 14:47
  • options msglevel=I (or N) doesn't seem to make any difference to what is put in my log. Below is all I get (note that in the normal data step the real time and CPU time are almost exactly equal, which is why I tried going down this multithreaded path): NOTE: The HPDS2 procedure is executing in single-machine mode. NOTE: There were 5000000 observations read from the data set FRED.TEST_DATA. NOTE: The data set WORK.HOME_CLAIMS_HPDS2 has 5000000 observations and 182 variables. NOTE: PROCEDURE HPDS2 used (Total process time): real time 51.21 seconds cpu time 2:51.80 – reecec Nov 30 '17 at 00:13