
In my project I am combining three separate input sources to generate one score, using this formula:

Integrated score = weight_1 * Score_1 + weight_2 * Score_2 + weight_3 * Score_3

To do this, I have used the following code:

    DATA w_matrix_t;
        /* Create a row count to identify the model weight combination */
        RETAIN model_combination;
        model_combination = 0;

        DO n_1 = 0 TO 100 BY 1;
            DO n_2 = 0 TO 100 BY 1;
                IF (100 - n_1 - n_2) GE 0 AND (100 - n_1 - n_2) LE 100 THEN DO;
                    n_3 = 100 - n_1 - n_2;
                    model_combination + 1;
                    OUTPUT;
                END;
            END;
        END;
    RUN;

    DATA w_matrix;
        SET w_matrix_t;
        w_1 = n_1/100;
        w_2 = n_2/100;
        w_3 = n_3/100;
        /*Drop the old variables*/
        DROP n_1 n_2 n_3;
    RUN;

PROC SQL; 
    CREATE TABLE weights_added AS 
    SELECT
          w.model_combination
        , w.w_1
        , w.w_2
        , w.w_3
        , fit.name
        , fit.logsalary
        , (
            w.w_1*fit.crhits    + 
            w.w_2*fit.natbat    + 
            w.w_3*fit.nbb
        )                               AS y_hat_int
    FROM
        work.w_matrix                   AS w
    CROSS JOIN
        sashelp.baseball                AS fit
    ORDER BY
        model_combination;
QUIT;

My question is: is there a more efficient way of making this join? The purpose is to create a large table that contains the entire sashelp.baseball data set duplicated for every combination of weights.

In my live data, I have three input sources of 46,000 observations each, and that cross join takes 1 hour. I also have three input sources of 465,000 observations each; I imagine that will take a very long time.

The reason I do it this way is that I calculate my Somers' D using PROC FREQ with BY-group processing (BY model_combination).
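For reference, that Somers' D step might look something like the sketch below. This is an illustration only, not code from the project: the binary outcome variable `target` and the output data set `somers_d` are placeholder names, and in practice the continuous y_hat_int would probably need a format or rounding to keep the crosstabulation manageable.

    /* Hypothetical sketch only: Somers' D per weight combination.
       Assumes a binary outcome named target exists on weights_added. */
    PROC FREQ DATA=weights_added NOPRINT;
        BY model_combination;
        TABLES y_hat_int * target / MEASURES;
        OUTPUT OUT=somers_d SMDCR;   /* Somers' D (C|R) for each BY group */
    RUN;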

78282219
  • Sounds like a very blind approach to seeking some desired objective. Is your comb too fine? Going from a 1% weight lattice to a 5% one drops the number of combinations (the number of times the data set is stacked on itself) from 5,151 to 231; at a 10% weight step, there are 66. Maybe you can compute at much larger weight steps and determine a smaller target region for the next iteration using a finer comb; the target region would be determined by examining the variation in D at the wider lattice points (a sketch of a coarser lattice follows these comments). Do you have SAS/OR? – Richard Mar 19 '19 at 16:35
  • I have the option to reduce the boundaries by specifying max and min weights for each model. I was just wondering whether there are more efficient ways of completing this query (someone mentioned that they would be able to do it quicker in a DATA step, so it got me wondering). – 78282219 Mar 19 '19 at 18:54
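To illustrate the coarser-lattice idea from the first comment, here is a minimal sketch (not from the original thread; the macro variable `step` and the data set name `w_matrix_coarse` are made up) that generates the weight grid at a configurable step:

    /* Illustration only: build the weight lattice at a coarser step
       (5% here, giving 231 combinations instead of 5,151). */
    %LET step = 5;

    DATA w_matrix_coarse (KEEP=model_combination w_1 w_2 w_3);
        DO n_1 = 0 TO 100 BY &step;
            DO n_2 = 0 TO (100 - n_1) BY &step;
                model_combination + 1;
                w_1 = n_1 / 100;
                w_2 = n_2 / 100;
                w_3 = (100 - n_1 - n_2) / 100;
                OUTPUT;
            END;
        END;
    RUN;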

1 Answer


5,000 copies of a 500,000-row table will be a pretty big table, with 2.5B rows.

Here is an example of DATA step stacking: one copy of the have data set per row of weights. The example reads each weight row with SET weights inside an outer DO WHILE loop, and uses SET have POINT= / OUTPUT inside an explicit inner loop. The inner loop copies the data while it computes the weighted sum.

data have;
  set sashelp.baseball (obs=200);  * keep it small for demonstration;
run;

data weights (keep=comboId w1 w2 w3);
  do i = 0 to 100;
    do j = 0 to 100;
      if (i + j) <= 100 then do;
        comboId + 1;
        w1 = i / 100;
        w2 = j / 100;
        w3 = (100 - i - j) / 100;
        output;
      end;
    end;
  end;
run;


data want (keep=comboid w1-w3 name logsalary y_hat_int);
  /* outer loop: read one weight row per iteration */
  do while (not endOfWeights);
    set weights end = endOfWeights;

    /* inner loop: copy every row of have, scoring it with the current weights */
    do row = 1 to RowsInHave;
      set have (keep=name logsalary crhits natbat nbb) nobs = RowsInHave point = row;
      y_hat_int = w1 * crhits + w2 * natbat + w3 * nbb;
      output;
    end;
  end;

  /* nothing above reads past end-of-file, so end the step explicitly */
  stop;
run;

proc freq data=want noprint;
  by comboId;
  table y_hat_int / out=freqout ;
  format y_hat_int 4.;
run;

proc contents data=want;
run;

Off the cuff, a single table containing 5,151 copies of a 200-row extract from baseball is nominally 72.7 MB, which works out to roughly 70 bytes per row. So expect 5,151 copies of a 465K-row table to have ~2.4 billion rows and occupy ~170 GB of disk. On a disk spinning at 7,200 RPM and sustaining maximum performance throughout, you're looking at, at best, 20 minutes of just writing, and probably much more.
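For anyone who wants to reproduce that estimate, here is a quick back-of-the-envelope check (my own arithmetic, not part of the original answer):

/* Back-of-the-envelope check of the size estimate above. */
data _null_;
  bytes_per_row = 72.7e6 / (5151 * 200);           /* ~70 bytes per row */
  rows_full     = 5151 * 465000;                   /* ~2.4 billion rows */
  gb_full       = rows_full * bytes_per_row / 1e9; /* ~170 GB           */
  put bytes_per_row= 8.1 rows_full= comma16. gb_full= 8.1;
run;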

Richard
  • The POINT= statement is something I've been trying to wrap my head around. I'll give this script a go, but I'll run it on the university server rather than on local disk. – 78282219 Mar 20 '19 at 07:33
  • The tricky idea is that the `SET` statement can populate automatic variable `RowsInHave` from NOBS= **before** the SET statement is reached. The details of how that happens can be found in SAS SET documentation **"At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you can refer to the NOBS= variable before the SET statement. The variable is available in the DATA step but is not added to any output data set."** – Richard Mar 20 '19 at 10:11
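To see the compile-time NOBS= behaviour that last comment describes, here is a tiny standalone sketch (my own illustration, not from the thread):

/* Illustration only: NOBS= is assigned at compile time, so nObs is
   usable before the SET statement ever executes. */
data _null_;
  put 'Rows in sashelp.baseball: ' nObs=;  /* runs before SET */
  stop;                                    /* never actually read a row */
  set sashelp.baseball nobs=nObs;
run;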