1

At work I have a 200M rows (300+ columns, ~200GB) claims table that is indexed by member number (memno) and date of service (dos).

Twice a month I have to pull 50,000 member number's claims from a fixed date range (say 1/1/2017 to 5/1/2018), where I only need a limited columns from claims.

MyInputList only has 1 column (memno).

proc sql;
    create table myClaims as
    select a.claimno, a.dos, a.memno
    from s.claims a inner join myInputList b
       on a.memno = b.memno
    where a.dos between '01Jan2017'd and '01May2018'd;
quit;

It usually takes about 3-4 hours to run by PROC SQL. The data themselves are not hosted by a RDMS, I read many SAS essay that PROC SQL does a cartesian product and since I don't need all 300 columns per record, I wonder if it will be better using hash table.

My question: Can I give "hints" to the hash table so it can leverage the indexed columns (memno, dos)?

data myClaimsTest (drop=rc);
if 0 then set myInputList;

declare hash vs(hashexp:7, dataset:'myInputList');
vs.definekey('memno');
vs.definedata();
vs.definedone();

do until (eof);
   set s.claims (keep=claimno dos) end=eof;
   if vs.find()=0 then output;
end;

stop;
run;

New section (added by Richard)

Run this code to get a listing of the variables and the indices.

dm "clear output"; ods listing; ods noresults; options nocenter; title;
proc contents varum data=all_claims;
run;
dm "output" output; ods results;

Copy and paste the bottom of the output here. Replace this example with your actual listing.

        Variables in Creation Order

#    Variable      Type    Len    Format

1    claim_id      Num       8
2    member_id     Num       8
3    claim_date    Num       8    YYMMDD10.


       Alphabetic List of Indexes and Attributes

                          # of
              Unique    Unique
#    Index    Option    Values    Variables

1    PICK     YES       333338    member_id claim_date
Richard
  • 25,390
  • 3
  • 25
  • 38
George
  • 4,514
  • 17
  • 54
  • 81
  • Index and hash are totally different concepts. one is disk resident(index), while another is memory resident. The way they accessed are also very different. So indexed columns hints to hash object does not make sense. Did you already Hash technique, what was the result. – Kiran Sep 29 '18 at 13:46
  • @Kiran Maybe that is my problem, my Hash never returned (on a test input of size 10), it ran for over an hour and I have to terminated it. – George Sep 29 '18 at 14:01
  • "I read many SAS essay", can you recall and cite any in the question ? Did you read ["Two Guys on Hash"](https://analytics.ncsu.edu/sesug/2010/HOW05.Dorfman.pdf) Dorfman and Eberhardt (2010) ? Did you Turn on INFO logging with `options MSGLEVEL=I;` Info messages will tell you if an index if being used. – Richard Sep 29 '18 at 15:51
  • Is the BIG tables index unique ? ( I presume no ). If no, is there a tertiary variable that could be added to the composite index to make it unique ? , and if tertiary yes is the domain of the associated values knowable apriori and thus iterable at the lookup level for each of the 50K selectors ? – Richard Sep 29 '18 at 16:27
  • there is notable difference i see you with hash query it does not include where a.dos between '01Jan2017'd and '01May2018'd – Kiran Sep 29 '18 at 17:11
  • @Richard I will try on Monday with the info logging and confirm. And you are correct that the index are no unique. The claim table has 7-8 column that are indexed, but I don't think I know what values should be for the rest these 6 columns. – George Sep 30 '18 at 12:59
  • Pls add to your question the output from `PROC CONTENTS`. I have edited the question to have section where you can add the output. If each column is indexed individually there are different processing considerations -- the best scenario is that **is indexed by member number (memno) and date of service (dos)** means you have a composite index. – Richard Sep 30 '18 at 17:12
  • Sorry for so many comments... Is each combination of member_id and claim_date unique ? If no, this means a member can have multiple claims on the same day (probably the situation for real world data) – Richard Sep 30 '18 at 17:23
  • @Richard You are again correct, the member_id and claim_date are not unique as each member can have multiple claims on the same date (billing different procedures....) – George Oct 03 '18 at 02:48
  • @Richard, I can't easily post the output of PROC contents because the work machine access to stackoverflow. But I am preparing a respond, some of your tips does work and provided great insight on performance. -- Pending edit. – George Oct 03 '18 at 02:58

2 Answers2

2

Suppose BIG is your 200GB indexed SAS table and SMALL is your 50K row selection criteria row.

The BIG index (key) is likely not being used because the SMALL data does not have enough information (variables) to complete the composite key that would match to BIG.

There are two varieties of processing

  1. Full scan BIG, test each record via lookup into SMALL, or
  2. Process each record of SMALL, perform indexed retrieval from BIG

The hash code in your question corresponds to #1 and the SQL join is #2, although SQL might be choosing to go the way of #1.

Here is a sample data maker

%let MEMBER_N = 1e5;
%let CLAIM_RATE = 0.00125;
%let MEMBER_SAMPLE_N = 1e2;
%let STUDY_PROPORTION = 0.001;

data ALL_CLAIMS
( label = "BIG"
  index=
  ( 
    PICK = (member_id claim_date) / unique
  )
);
  retain claim_id 0 member_id 0 claim_date 0 member_n 0;
  format claim_date yymmdd10.;

  do member_id = 1e7 by 1;
    claim_n = 1;
    do claim_date = '01jan2012'd to '31dec2018'd;
      if ranuni(123) > &CLAIM_RATE then continue;
      claim_id + 1;
      if claim_n = 1 then member_n + 1;
      output;
      claim_n + 1;
    end;

    if member_n = &MEMBER_N then leave;
  end;

  stop;

  drop member_n claim_n;
run;

%put note: sample population is %sysevalf(5e4/200e6*100)% of all claims;
%put note: or ~%sysevalf(5e4/200e6*1e6) rows in this example;

data STUDY_MEMBERS(keep=member_id label="SMALL");
  * k / n selection method, Proc SURVEYSELECT is better but not always available;
  * an early sighting on SAS-L would be https://listserv.uga.edu/cgi-bin/wa?A2=ind9909c&L=SAS-L&P=173979
  * Re: Random Selection (Sep 20, 1999);

  retain 
    k %sysevalf(&MEMBER_N*&STUDY_PROPORTION, FLOOR) 
    n &MEMBER_N
  ;

  set ALL_CLAIMS;
  by member_id;

  if first.member_id;

  if ranuni(123) < k/n then do;
    output;
    k + (-1);
  end;

  n + (-1);

  if n=0 then stop;
run;

and processing code

options msglevel=i;

proc sql;
  create table ALL_STUDY_SUBSET as
  select ALL.claim_id, ALL.claim_date, ALL.member_id
  from ALL_CLAIMS ALL inner join STUDY_MEMBERS STUDY
    on ALL.member_id = STUDY.member_id
  where ALL.claim_date between '01Jan2017'd and '01May2018'd
  ;
quit;

* extend study data with a date variable that matches the data variable in the ALL index; 

data STUDY_MEMBERS_WITH_ITERATED_DATE;
  set STUDY_MEMBERS;
  do claim_date = '01Jan2017'd to '01May2018'd;
    output;
  end;
run;

* join on both variables in ALL key;

proc sql;
  create table ALL_STUDY_SUBSET2 as
  select ALL.claim_id, ALL.claim_date, ALL.member_id
  from ALL_CLAIMS ALL inner join STUDY_MEMBERS_WITH_ITERATED_DATE STUDY
    on ALL.member_id = STUDY.member_id
   and ALL.claim_date = STUDY.claim_date
  ;
quit;

* full scan with hash based match;

data ALL_STUDY_SUBSET3;
  SET ALL_CLAIMS;

  if _n_ = 1 then do;
    declare hash study (dataset:'STUDY_MEMBERS');
    study.defineKey('member_id');
    study.defineDone();
  end;

  if '01jan2017'd <= claim_date <= '01may2018'd;

  if study.find() = 0; 
run;

* SMALL scan with iterated dates to complete info to allow BIG index (key)
* to be used;

data ALL_STUDY_SUBSET4;
  set STUDY_MEMBERS;

  do claim_date = '01jan2017'd to '01may2018'd;
    set ALL_CLAIMS key=pick;
    if _iorc_ = 0 then output;
  end;

  _error_ = 0;
run;
Richard
  • 25,390
  • 3
  • 25
  • 38
  • Thanks for the thoughtfully written sample code and test case, I will try to implement them on Monday and report back with the result. (Servers are refreshing). – George Sep 30 '18 at 13:03
  • You brought up an important note. The Claim table has 8 separately indexed columns and I am only using 2 namely memno and dos (the rest of indexed values are somewhat claims specific). – George Sep 30 '18 at 13:06
1

I have added this second answer with slightly different problem conditions -- a member can have multiple claims on a given day, and there are only simple single variable indices.

Make data with multiple claims per member/day

%let MEMBER_N = 1e5;
%let CLAIM_RATE = 0.00125;
%let MULTI_CLAIM_RATE = 0.05; %* iterative rate at which another claim is made on same day a claim is made;
%let STUDY_PROPORTION = 0.001;

data ALL_CLAIMS
( label = "BIG"
  index=
  ( 
/*    PICK = (member_id claim_date) / unique (not happening) */
    member_id
    claim_id
  )
);
  retain claim_id 0 member_id 0 claim_date 0 member_n 0;
  format claim_date yymmdd10.;

  do member_id = 1e7 by 1;
    claim_n = 1;
    do claim_date = '01jan2012'd to '31dec2018'd;
      if ranuni(123) > &CLAIM_RATE then continue;
      if claim_n = 1 then member_n + 1;

      do multi_n = 0 by 1 until (ranuni(123) > &MULTI_CLAIM_RATE); 
        claim_id + 1;
        output;
      end;

      if multi_n > 1 then put 'NOTE: ' member_id= claim_date= multi_n 'claims';

      claim_n + 1;
    end;

    if member_n = &MEMBER_N then leave;
  end;

  stop;

  drop member_n claim_n;
run;

Using claim_date index for preliminary selection of candidate claims is probably not helpful -- you could have tens of thousands of claims on a disastrous day, your processing would have to iterate date over date range, set claims by matching date and do a hash lookup (SMALL:member_id) for each claim_id record occurring on those matching dates. You would have to experiment to see if this anti-intuitive approach might actually work out better for your particular ALL and SMALL.

If you examine the log for the SQL, you will see the query optimizer chooses to use the member_id index (and internally will iterate found rows for application of where clause). The undocumented Proc SQL options _method and _tree can show you what it will be doing -- see "The SQL Optimizer Project: _Method and _Tree in SAS®9.1" Lavery (SUGI 30).

proc sql _method _tree;
  create table ALL_STUDY_SUBSET as
  select ALL.claim_id, ALL.claim_date, ALL.member_id
  from ALL_CLAIMS ALL inner join STUDY_MEMBERS STUDY
    on ALL.member_id = STUDY.member_id
  where ALL.claim_date between '01Jan2017'd and '01May2018'd
  ;
quit;

Log excerpt

INFO: Index member_id of SQL table WORK.ALL_CLAIMS (alias = ALL) selected for SQL WHERE clause
      (join) optimization.

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjndx
              sqxsrc( WORK.STUDY_MEMBERS(alias = STUDY) )
              sqxsrc( WORK.ALL_CLAIMS(alias = ALL) )

Tree as planned.
                               /-SYM-V-(ALL.claim_id:1 flag=0001)
                     /-OBJ----|
                    |         |--SYM-V-(ALL.claim_date:3 flag=0001)
                    |          \-SYM-V-(ALL.member_id:2 flag=0001)
           /-JOIN---|
          |         |                              /-SYM-V-(STUDY.member_id:1 flag=0001)
          |         |                    /-OBJ----|
          |         |          /-SRC----|
          |         |         |          \-TABL[WORK].STUDY_MEMBERS opt=''
          |         |--FROM---|
          |         |         |                    /-SYM-V-(ALL.claim_id:1 flag=0001)
          |         |         |          /-OBJ----|
          |         |         |         |         |--SYM-V-(ALL.claim_date:3 flag=0001)
          |         |         |         |          \-SYM-V-(ALL.member_id:2 flag=0001)
          |         |          \-SRC----|
          |         |                   |--TABL[WORK].ALL_CLAIMS opt=''
          |         |                   |          /-NAME--(claim_date:3)
          |         |                    \-IN-----|
          |         |                             |                    /-LITN(20820) DATE.
          |         |                             |          /-RANB---|
          |         |                             |         |          \-LITN(21305) DATE.
          |         |                              \-SET----|
          |         |--empty-
          |         |          /-SYM-V-(STUDY.member_id:1)
          |          \-CEQ----|
          |                    \-SYM-V-(ALL.member_id:2)
 --SSEL---|

and a DATA Step equivalent

data ALL_STUDY_SUBSET5(label="Presuming a preponderance of members file few claims over their all_claims lifetime");
  set STUDY_MEMBERS;

  do until (_iorc_);
    set ALL_CLAIMS key=member_id;
    if _iorc_ = 0 and '01jan2017'd <= claim_date <= '01may2018'd then do;
      OUTPUT;
    end;
  end;

  _error_ = 0;
run;

Still slow?

When the situation arises that best effort and best practice programming outcomes are not fast enough you will have to look into improvement through the containing system resources:

  • Can you add a new composite index using (member_id claim_date) ?
  • Can you move the data set table to a faster drive ? Such as:
    • Ensure no network interactions data-wise
    • Defragmentation
    • Replace 5400 RPM drive with 15000 RPM
    • Replace rotating drive with SSD SATA
    • Replace rotating or SSD SATA with NVMe
    • SASFILE
      • Workstation/Server with vast RAM (>200GB)
    • Move data to cloud/remote/big iron data warehouse solution with greater potential for automatic parallelization and on-demand resource bump-ups
Richard
  • 25,390
  • 3
  • 25
  • 38