As per the title, I am looking to do a cross join with a table which performs an aggregation function and filters on a couple of variables within the table.
I have similar data to the following:
library(dplyr)
library(data.table)
library(sqldf)
sales <- data.frame(salesx = c(3000, 2250,850,1800,1700,560,58,200,965,1525)
,week = seq(from = 1, to = 10, by = 1)
,uplift = c(0.04)
,slope = c(100)
,carryover = c(.35))
spend <- data.frame(spend = seq(from = 1, to = 50000, by = 1))
tempdata <- merge(spend,sales,all=TRUE)
tempdata$singledata <- as.numeric(1)
And here is an example of what I am trying to accomplish via my sql based solution:
newdata <- sqldf("select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from tempdata a, tempdata b
where a.spend = b.spend
group by a.spend,a.week")
This provides the results I want, but it is a little slow, particularly with my real dataset of around 1 million records. It would be great to have some advice on a) how to speed up the sqldf function; or b) using a more efficient data.table/dplyr approach (I can't get my head around the cross join/aggregation/filter trifecta problem).
Clarity on non-equi join solution below:
I had a couple of questions about the non-equi join solution – output is fine and very quick. In looking to understand how the code worked, I broke it down like this:
breakdown <- setDT(tempdata)[tempdata, .(spend, uplift, slope,carryover,salesx, singledata, week, i.week,x.week, i.salesx,x.salesx, x.spend, i.spend), on=.(spend, week > week)]
Based on the breakdown, in order to be consistent with the original calculation, it should be:
x.salesx*(uplift*(1.0-exp(-(`^`(singledata,x.week-week)/slope))))/i.spend
The reason why this isn’t apparent is because with the example I used the ‘power’ part of the equation wasn’t really doing anything (always 1). The actual calc used is (adding a carryover variable to data):
SQL
b.salesx*(b.uplift*(1-exp(-(power((b.singledata*b.carryover),b.week-a.week)/b.slope))))/b.spend (sql)
My data.table solution
sum(salesx.y*(uplift.y*(1-exp(-((singledata.y*adstock.y)^(week.y-week.x)/slope.y))))/spend), by=list(spend, week.x)
However, I am unable to get this working with the non equi join solution when adding the ‘carryover’ variable ie.
x.salesx*(uplift*(1.0-exp(-(`^`((singledata*carryover),x.week-week)/slope))))/i.spend