Background
I am trying to predict the sales for a product line (y_test in the sample at the end). Its sales in a given time period depend on all previous sales of another product (x_test) and on how many of those previously sold units are still in use. It is not possible to measure directly how many of those previously sold products remain in use, though, so a survival curve needs to be inferred.
For instance, if you make accessories for a particular smartphone model, the accessory sales are at least partly based on the number of those smartphones still being used. (This is not homework, BTW.)
Details
I have some time series data and would like to fit a regression model using glm or something similar. The relationship between the dependent and independent variables is:

y[p] = c0 + c1 * sum(i = 1 to p) of x[i] * (1 - integral from p-i to p-i+1 of F(t) dt) + e[p]

where p is the time period, y[p] is the dependent variable, x[p] is the independent variable, c0 and c1 are the regression coefficients, F is a cumulative distribution function (such as pgamma), and e[p] are the residuals.
Over the first three time periods, the model expands to something like this:
#y[1] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[2] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[3] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 2, 3)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[3]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
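For any period p, the expansion above can be written compactly as follows (a small sketch; y_hat and cf are just illustrative names, with cf standing for c(c0, c1, c2, c3)):

# Fitted value for period p: each x[i] is weighted by 1 minus the integral of the
# gamma CDF (shape cf[3], rate cf[3]/cf[4]) over the unit interval [p-i, p-i+1]
y_hat <- function(p, x, cf) {
  w <- sapply(1:p, function(i) {
    1 - integrate(function(q) pgamma(q, cf[3], cf[3]/cf[4]), p-i, p-i+1)$value
  })
  cf[1] + cf[2] * sum(x[1:p] * w)
}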
So, I have historical data for x[p] and y[p], and I want to find the values of the coefficients/parameters c0, c1, c2, and c3 that minimize the residuals.
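Concretely, assuming an ordinary least-squares criterion (one natural reading of "minimize the residuals"), the fit would choose c0, c1, c2, and c3 to minimize the sum over the observed periods p of (y[p] - y_hat[p])^2, where y_hat[p] is the model value above.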
I think the solution is to use glm and create a custom family, but I'm not sure how to do that. I looked at the code for the Gamma family but didn't get very far. I have been able to do the optimization "manually" using nlminb, but I would much prefer the simplicity and usefulness (e.g. predict and others) offered by glm or similar functions.
Here is some example data:
# Survival function (the integral part):
fsurv <- function(q, par) {
  # For each q: 1 minus the integral of the gamma CDF (shape par[1], rate par[1]/par[2]) over [q-1, q]
  out <- vapply(seq_along(q), function(i) {
    1 - integrate(function(x) pgamma(x, par[1], par[1]/par[2]), q[i]-1, q[i])$value
  }, FUN.VALUE=0)
  return(out)
}
# Sum up the products:
frevsumprod <- function(x, y) {
  # For each i: x[1]*y[i] + x[2]*y[i-1] + ... + x[i]*y[1]
  out <- vapply(seq_along(y), function(i) sum(x[1:i]*rev(y[1:i])), FUN.VALUE=0)
  return(out)
}
# Sample data:
p<-1:24 # Number of periods
x_test<-c(1188, 2742, 4132) # Sample data
y_test<-c(82520, 308910, 749395, 801905, 852310, 713935, 624170, 603960, 640660, 553600, 497775, 444140) # Sample data
c<-c(-50.161147,128.787437,0.817085,13.845487) # Coefficients and parameters, from another method that fit the data
# Pad the data to the correct length:
pad <- function(p, v, padval=0) {
  l <- length(p)
  padv <- l - length(v)
  if (padv > 0) v <- c(v, rep(padval, padv))
  return(v)
}
x_test<-pad(p,x_test)
y_test<-pad(p,y_test,NA)
y_fitted<-c[0+1]+c[1+1]*frevsumprod(x_test,fsurv(p,c[(2:3)+1])) # Fitted values from regression
library(ggplot2)
ggplot(data.frame(p,y_test,y_fitted))+geom_point(aes(p,y_test))+geom_line(aes(p,y_fitted)) # Plot actual and fit
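For reference, here is a minimal sketch of the "manual" nlminb fit mentioned above, reusing fsurv, frevsumprod, p, x_test, and y_test from the example; it assumes an ordinary least-squares objective, and the starting values and bounds below are arbitrary guesses that may need tuning:

# Sum of squared residuals over the observed (non-NA) periods
sse <- function(par) {
  yhat <- par[1] + par[2]*frevsumprod(x_test, fsurv(p, par[3:4]))
  sum((y_test - yhat)^2, na.rm=TRUE)
}
fit <- nlminb(start=c(0, 1, 1, 10), objective=sse, lower=c(-Inf, -Inf, 1e-6, 1e-6)) # Bounds keep the gamma parameters positive
fit$par # Estimated c0, c1, c2, c3
y_pred <- fit$par[1] + fit$par[2]*frevsumprod(x_test, fsurv(p, fit$par[3:4])) # Values for all periods, including the unobserved ones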