How to rank multiple variables in a large data set?

Question

I have a data set of around 50 million records with around 30 variables(columns). I need to rank the dataset for each variable.

Proc rank does not work since it required lot of memory for this large dataset.

To give rank manually, I have to sort the dataset on the respective variable column and then give rank by using a formula. But the problem is we have to sort the dataset 30 times on 30 variables which will take very very long time and not feasible.

What alternates can we use in this case?

kstats9pt3 · Answer 1 · 2016-10-27T14:31:35.607

You're in a tough spot without many options. If you're sorting and keeping all 30 variables each time, that will significantly increase your processing times. If I were you, I'd only keep the variable you want to rank and a sequence number to apply your formula, then merge it all back together at the end. This would require you to loop over each variable in your dataset then merge it all back together. See example below and if it would help decrease your processing times:

** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
    SELECT DISTINCT(NAME)
    INTO :VARS SEPARATED BY " "
    FROM DICTIONARY.COLUMNS
    WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;

%PUT &VARS.;

** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH; SET SASHELP.FISH; 
    SEQ=_N_;
RUN;

** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
    %DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
        %LET VAR = %SCAN(&VARS,&I.);
        DATA FISH_&I.; SET FISH (KEEP=SEQ &VAR.);
        RUN;

        /* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
        PROC SORT DATA = FISH_&I.;  
            BY &VAR.;
        RUN;

        DATA FISH1_&I.; SET FISH_&I.;
            BY &VAR.;
            RANK_&VAR = _N_;
        RUN;

        /* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
        PROC SORT DATA = FISH1_&I.;
            BY SEQ;
        RUN;
    %END;
%MEND;

%LOOP_OVER(&VARS.);

** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
    MERGE FISH1_:;
    BY SEQ;
    DROP SEQ;
RUN;

You can further increase performance by combining the first two steps in the %DO I into one PROC SORT data=fish(keep=seq &var) out=fish_&i; etc. And also by using a VIEW for the data step to calculate the ranks and use that as the input to the last PROC SORT. — data _null_, Oct 27 '16 at 15:33

score 0 · Answer 2 · answered Oct 27 '16 at 15:55

If you just need to rank into deciles / percentiles etc rather than a complete ranking from 1 to 50m across all 50m rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via proc summary, using qmethod=P2 and specifying a suitable qmarkers setting.

This approach uses the P-squared algorithm: http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

Johannes Bleher · Answer 3 · 2016-10-31T01:26:45.613

I am not sure, whether it is a good idea: But you may want to use a Hash object. The object is loaded into your RAM. Assuming that you have 30 Mio of numerical observations, you will need around (2*8bytes)*50 mio = 800MB of RAM -- if I am not mistaken.

The code could look like this (using Foxers Macro to loop over the variables, a little helper macro to get the list of variables from a dataset and a small test dataset with two variables):

%Macro GetVars(Dset) ;
 %Local VarList ;
 /* open dataset */
 %Let FID = %SysFunc(Open(&Dset)) ;
 /* If accessable, process contents of dataset */
 %If &FID %Then %Do ;
    %Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
    %Let VarList= &VarList %SysFunc(VarName(&FID,&I));
 %End ;
 /* close dataset when complete */
 %Let FID = %SysFunc(Close(&FID)) ;
 %End ;
 &VarList
%Mend ;

data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;


%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
    %LET var = %SCAN(&VARS,&I.);
    data out&i.(keep=rank&i.);
        if 0 then set dsn;
        if _N_ =1 then
        do;
          dcl hash hh(ordered:'A');
          dcl hiter hi('hh');
          hh.definekey("&var.");
          hh.definedata("&var.","rank&i.");
          hh.definedone();
        end;

        /*Get unique combination variable and point in dataset*/
        do while(not last);
          set dsn end=last;
          hh.ref();
        end;

        /*Assign ranks within hash object*/
        rc=hi.first();
         k = 1;
        do while(rc=0);
          rank&i.=k;
          hh.replace();
          k+1;
          rc=hi.next();
        end;

        /*Output rank to new dataset in original order of observations*/
        do while(not theend);
          set dsn end=theend;
          hh.find();
          output;
        end;
      /*If data can be sorted according to the rank (with no duplicates) use:
      hh.output("out&i.");

      &outi. will then have variables &var. and rank&i.
      However, the merging below may not be sensible anymore 
      as correspondence between variables is not preserved.
      There will also be no duplicates in the dataset.
      */

    run;
%END;

%MEND LOOP_OVER;

%LOOP_OVER(%GetVars(dsn));


/*Merge all rank datasets to one large*/
data all;
merge out:;
run;

How to rank multiple variables in a large data set?

3 Answers3