1

I have a dataset that lists, say, purchase dates for some products. And a separate dataset lists cut-dates for different phases of a marketing campaign for each product. I want to assign each purchase date a phase number (1 through n), according to which phase of the marketing campaign the purchase date occurs in. Each product has its own campaign with different cutdates.

In my head, I think of what I want as "For each purchase event date, look up the cut-dates for that product's marketing campaign and see how many cut-dates had passed as of the purchase date, and add 1 to compute the Phase number."

So have data like:

data have;  *Purchase events;
  input product $1 Date mmddyy10.;
  format date mmddyy10.;
  cards;
A 1/1/2015
A 3/1/2015
A 3/1/2015
A 6/1/2015
A 9/1/2015
B 1/1/2015
B 3/1/2015
B 6/1/2015
B 9/1/2015
C 1/1/2015
;
run;

data cut; *cut dates for marketing campaign;
input product $1 CutDate mmddyy10. ;
format cutdate mmddyy10.;
cards;
A 2/1/2015
B 4/1/2015
B 7/1/2015
;
run;

And want:

product          Date    Phase
   A       01/01/2015      1
   A       03/01/2015      2
   A       03/01/2015      2
   A       06/01/2015      2
   A       09/01/2015      2
   B       01/01/2015      1
   B       03/01/2015      1
   B       06/01/2015      2
   B       09/01/2015      3
   C       01/01/2015      1

I've been playing with a correlated sub-query approach which seems to work, but I feel like there must be a better way.

proc sql;
  create table want as
  select h.*
        ,coalesce
          (
           (select count(*)
            from cut c
            where h.product=c.product and c.cutdate<=h.date
            group by product
            )
           ,0
          )+1 as Phase 
    from have as h
  ;
quit;

My real data has hundreds of products, with between 0 and 4 cut-dates for each product, and millions of purchase events.

Quentin
  • 5,960
  • 1
  • 13
  • 21
  • I think it can be done in data step with hash table. – fl0r3k Aug 12 '16 at 13:31
  • Thanks @fl0r3k, can you give me more of a hint about what you are thinking? I can imagine load the cut dataset into a multidata hash table (key is product, multiple records for each key). Then read through the HAVE dataset, and for each record, look up the product in the hash table, and iterate through the records for that key, counting how many dates have passed. Is that what you're thinking? – Quentin Aug 12 '16 at 13:36
  • You have my frinds answer below :) – fl0r3k Aug 12 '16 at 14:05

1 Answers1

3

Solution of this problem using hash tables:

data want;
    if 0 then set cut;

    if _N_ = 1 then do;
        declare hash h(dataset: 'cut', multidata: 'y');
        h.defineKey('product');
        h.defineData('CutDate');
        h.defineDone();
        call missing(product, CutDate);
    end;

    set have;

    Phase = 1;
    rc = h.find();
    if rc = 0 then do;
        do while(Date > CutDate and rc = 0);
            rc = h.find_next();
            Phase = Phase + 1;
        end;
    end;
    drop rc;
run;
Robert Soszyński
  • 1,178
  • 1
  • 7
  • 12
  • Thanks that looks nice. Suggest moving `Phase = 1;` outside of the DO loop, so that products with no cut dates (like product C) have Phase=1. – Quentin Aug 12 '16 at 14:10