Background
I am trying to predict the sales for a product line (y_test in the sample at the end). Its sales in a given time period depend on all previous sales of another product (x_test) and on how many of those previously sold units are still in use. It is not possible to measure directly how many of those previously sold products remain in use, though, so a survival curve needs to be inferred.
For instance, if you make accessories for a particular smartphone model, the accessory sales are at least partly based on the number of those smartphones still being used. (This is not homework, BTW.)
Details
I have some time series data and would like to fit a regression model using glm or something similar. The relationship between the dependent and independent variables is:

y[p] = c0 + c1 * sum(i = 1 to p) of x[i] * (1 - integral from p-i to p-i+1 of F(t) dt) + e[p]

where p is the time period, y[p] is the dependent variable, x[p] is the independent variable, c0 and c1 are the regression coefficients, F is a cumulative distribution function (such as pgamma), and e[p] are the residuals.
Over the first three time periods, the model expands to something like this:
#y[1] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[2] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[3] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 2, 3)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[3]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
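For any period p, the expansion above can be written compactly as follows (a small sketch; y_hat and cf are just illustrative names, with cf standing for c(c0, c1, c2, c3)):

# Fitted value for period p: each x[i] is weighted by 1 minus the integral of the
# gamma CDF (shape cf[3], rate cf[3]/cf[4]) over the unit interval [p-i, p-i+1]
y_hat <- function(p, x, cf) {
  w <- sapply(1:p, function(i) {
    1 - integrate(function(q) pgamma(q, cf[3], cf[3]/cf[4]), p-i, p-i+1)$value
  })
  cf[1] + cf[2] * sum(x[1:p] * w)
}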
So, I have historical data for x[p] and y[p], and I want to find the values of the coefficients/parameters c0, c1, c2, and c3 that minimize the residuals.
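Concretely, assuming an ordinary least-squares criterion (one natural reading of "minimize the residuals"), the fit would choose c0, c1, c2, and c3 to minimize the sum over the observed periods p of (y[p] - y_hat[p])^2, where y_hat[p] is the model value above.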
I think the solution is to use glm and create a custom family, but I'm not sure how to do that. I looked at the code for the Gamma family but didn't get very far. I have been able to do the optimization "manually" using nlminb, but I would much prefer the simplicity and usefulness (e.g. predict and others) offered by glm or similar functions.
Here is some example data:
# Survival function (the integral part):
fsurv <- function(q, par) {
  # For each q: 1 minus the integral of the gamma CDF (shape par[1], rate par[1]/par[2]) over [q-1, q]
  out <- vapply(seq_along(q), function(i) {
    1 - integrate(function(x) pgamma(x, par[1], par[1]/par[2]), q[i]-1, q[i])$value
  }, FUN.VALUE=0)
  return(out)
}
# Sum up the products:
frevsumprod <- function(x, y) {
  # For each i: x[1]*y[i] + x[2]*y[i-1] + ... + x[i]*y[1]
  out <- vapply(seq_along(y), function(i) sum(x[1:i]*rev(y[1:i])), FUN.VALUE=0)
  return(out)
}
# Sample data:
p<-1:24 # Number of periods
x_test<-c(1188, 2742, 4132) # Sample data
y_test<-c(82520, 308910, 749395, 801905, 852310, 713935, 624170, 603960, 640660, 553600, 497775, 444140) # Sample data
c<-c(-50.161147,128.787437,0.817085,13.845487) # Coefficients and parameters, from another method that fit the data
# Pad the data to the correct length:
pad <- function(p, v, padval=0) {
  l <- length(p)
  padv <- l - length(v)
  if (padv > 0) v <- c(v, rep(padval, padv))
  return(v)
}
x_test<-pad(p,x_test)
y_test<-pad(p,y_test,NA)
y_fitted<-c[0+1]+c[1+1]*frevsumprod(x_test,fsurv(p,c[(2:3)+1])) # Fitted values from regression
library(ggplot2)
ggplot(data.frame(p,y_test,y_fitted))+geom_point(aes(p,y_test))+geom_line(aes(p,y_fitted)) # Plot actual and fit
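For reference, here is a minimal sketch of the "manual" nlminb fit mentioned above, reusing fsurv, frevsumprod, p, x_test, and y_test from the example; it assumes an ordinary least-squares objective, and the starting values and bounds below are arbitrary guesses that may need tuning:

# Sum of squared residuals over the observed (non-NA) periods
sse <- function(par) {
  yhat <- par[1] + par[2]*frevsumprod(x_test, fsurv(p, par[3:4]))
  sum((y_test - yhat)^2, na.rm=TRUE)
}
fit <- nlminb(start=c(0, 1, 1, 10), objective=sse, lower=c(-Inf, -Inf, 1e-6, 1e-6)) # Bounds keep the gamma parameters positive
fit$par # Estimated c0, c1, c2, c3
y_pred <- fit$par[1] + fit$par[2]*frevsumprod(x_test, fsurv(p, fit$par[3:4])) # Values for all periods, including the unobserved ones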