0

I have an sql script on SAS, which defines time overlap. Because table is very big, first I created smaller copy of that table and defined some new variables. to make lookup into table easier.

data data.orders_small(drop=VALID_TO VALID_FROM DATE_OPEN DATE_CLOSE);
   set data.orders(keep=ID_ACCOUNT ID_CLIENT VALID_TO VALID_FROM DATE_OPEN DATE_CLOSE);
   length VALID_FROM_DAY_S VALID_FROM_DAYM1_S VALID_TO_DAY_S VALID_TO_DAYP1_S 8.;
   format VALID_FROM_D_S VALID_FROM_DM1_S VALID_TO_D_S VALID_TO_DP1_S date9.;
   VALID_FROM_D_S=intnx('day',VALID_FROM,+0,'S');
   VALID_FROM_DM1_S=intnx('day',VALID_FROM,-1,'S');
   VALID_TO_D_S=intnx('day',VALID_TO,+0,'S');
   VALID_TO_DP1_S=intnx('day',VALID_TO,+1,'S');
run;

/* Checks existence of previous interval with valid_to = to the valid_from of the currently checked interval 
 where currently checked interval is not the first one */

proc sql noprint;
   create table orders_test_1 as
   select id_client, id_account, valid_from, valid_to
   from data.orders a 
   where a.VALID_FROM GT a.DATE_OPEN
   and not exists( select 1
                   from data.accounts_test_14_small(&xobstxt.) b 
                   where a.id_account EQ b.id_account 
                     and a.cod_mandant EQ b.cod_mandant 
                     and a.DAT_VALID_FROM EQ b.DAT_VALID_TO_DP1_S );
quit;

/* Checks existence of following interval with valid_from = to the valid_to of the currently checked interval 
 where currently checked interval is not the last one */

proc sql noprint;
   create table orders_test_2 as
   select id_client, id_account, valid_from, valid_to
   from data.orders a 
   where VALID_TO NE '31DEC3000'd
     and VALID_TO LT DATE_CLOSE
     and not exists ( select 1
                      from data.orders_small b 
                      where a.id_account EQ b.id_account 
                        and a.id_client EQ b.id_client 
                        and a.VALID_TO EQ b.VALID_FROM_DM1_S);

quit;

/* Existence of overlaps */
proc sql;
   create table orders_test_3 as
   select id_client, id_account
   from data.orders a 
   where exists ( select 1
              from data.orders_small b 
              where a.id_account EQ b.id_account 
                and a.id_client EQ b.id_client 
                and ((a.VALID_FROM between b.VALID_FROM_D_S and b.VALID_TO_D_S) 
                 or (a.VALID_TO between b.VALID_FROM_D_S and b.VALID_TO_D_S))
                and not (a.VALID_FROM EQ b.VALID_FROM_D_S and a.VALID_TO EQ b.VALID_TO_D_S) );
quit;

Sample of the data:

data data.orders;
Length ID_ACCOUNT 4. ID_CLIENT 4. VALID_FROM 8. VALID_TO 8. DAT_OPEN 8. DAT_CLOSE 8.
 ;
informat VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE date9.;
format VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE date9.;
input ID_ACCOUNT ID_CLIENT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE;
datalines;

001 001 01MAR1993 31DEC3000 01MAR1993 31DEC3000
002 002 01MAR1997 15MAY2001 01MAR1997 31DEC3000
002 002 16MAY2001 25JUN2011 01MAR1997 31DEC3000
002 002 24JUN2001 16JUL2012 01MAR1997 31DEC3000
002 002 16MAY2001 16JUL2011 01MAR1997 31DEC3000
;

run;

Running this code take a lot of time, so I thought to use hashing in SAS and make it a bit faster, but couldn't make it work. Any ideas, how could this code be transferred to hash???

Thank you in advance!

Noza
  • 35
  • 1
  • 7
  • Please could you provide some sample data? Please edit your question and post it there, not as a response to this comment. – user667489 Jul 19 '16 at 10:41
  • Done! In the example of the data 4th and 5th records are wrong records, that script should identify. – Noza Jul 19 '16 at 10:49
  • Your code refers to other datasets and macro variables which you have not provided here - please could you follow the guidelines [here](http://www.stackoverflow.com/help/mcve) to produce a minimal, verifiable and complete example? – user667489 Jul 19 '16 at 13:37

2 Answers2

0

Can you just move ahead (lag) the VALID_TO variable and compare that with the VALID_FROM variable in a datastep rather than using a hash? If the lag is greater than the next VALID_FROM then you have an overlap.

DATA ORDERS;
LENGTH ID_ACCOUNT 4. ID_CLIENT 4. VALID_FROM 8. VALID_TO 8.;
INFORMAT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE DATE9.;
FORMAT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE DATE9.;
INPUT ID_ACCOUNT ID_CLIENT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE;
DATALINES;

001 001 01MAR1993 31DEC2000 01MAR1993 31DEC2000
002 002 01MAR1997 15MAY2001 01MAR1997 31DEC2000
002 002 16MAY2001 25JUN2011 01MAR1997 31DEC2000
002 002 24JUN2001 16JUL2012 01MAR1997 31DEC2000
002 002 16MAY2001 16JUL2011 01MAR1997 31DEC2000
;

RUN;

PROC EXPAND DATA = ORDERS OUT = ORDERS_LAG METHOD=NONE;
    BY ID_ACCOUNT ID_CLIENT;
    CONVERT VALID_TO = VT_LAG / TRANSFORMOUT=(LAG 1);
RUN;

DATA WANT; SET ORDERS_LAG;
    IF VT_LAG > VALID_FROM THEN OVERLAP = 1;
        ELSE OVERLAP = 0;
RUN;
kstats9pt3
  • 799
  • 2
  • 8
  • 28
  • Thank you Foxer, it looks very simple, but my system is not licensed to use EXPAND. Do you know how to make it the same way simple, but using some other statements? – Noza Jul 20 '16 at 07:40
0

Since you are not licensed for PROC EXPAND then try the following:

DATA WANT; SET ORDERS;
    BY ID_ACCOUNT ID_CLIENT;
        LAG_VT = LAG(VALID_TO);
    IF FIRST.ID_ACCOUNT OR FIRST.ID_CLIENT THEN LAG_VT = .;
    IF LAG_VT NE . AND LAG(VALID_TO) > VALID_FROM THEN OVERLAP = 1;
        ELSE OVERLAP = 0;
RUN;

It performs the same lag, but you have to account for following rows containing new accounts and/or clients. PROC EXPAND just performs it a little cleaner and adds a count variable for each unique group.

kstats9pt3
  • 799
  • 2
  • 8
  • 28