I have a dataset that lists, say, purchase dates for some products. And a separate dataset lists cut-dates for different phases of a marketing campaign for each product. I want to assign each purchase date a phase number (1 through n), according to which phase of the marketing campaign the purchase date occurs in. Each product has its own campaign with different cutdates.
In my head, I think of what I want as "For each purchase event date, look up the cut-dates for that product's marketing campaign and see how many cut-dates had passed as of the purchase date, and add 1 to compute the Phase number."
So have data like:
data have; *Purchase events;
input product $1 Date mmddyy10.;
format date mmddyy10.;
cards;
A 1/1/2015
A 3/1/2015
A 3/1/2015
A 6/1/2015
A 9/1/2015
B 1/1/2015
B 3/1/2015
B 6/1/2015
B 9/1/2015
C 1/1/2015
;
run;
data cut; *cut dates for marketing campaign;
input product $1 CutDate mmddyy10. ;
format cutdate mmddyy10.;
cards;
A 2/1/2015
B 4/1/2015
B 7/1/2015
;
run;
And want:
product Date Phase
A 01/01/2015 1
A 03/01/2015 2
A 03/01/2015 2
A 06/01/2015 2
A 09/01/2015 2
B 01/01/2015 1
B 03/01/2015 1
B 06/01/2015 2
B 09/01/2015 3
C 01/01/2015 1
I've been playing with a correlated sub-query approach which seems to work, but I feel like there must be a better way.
proc sql;
create table want as
select h.*
,coalesce
(
(select count(*)
from cut c
where h.product=c.product and c.cutdate<=h.date
group by product
)
,0
)+1 as Phase
from have as h
;
quit;
My real data has hundreds of products, with between 0 and 4 cut-dates for each product, and millions of purchase events.