
I am trying to maximize the log-likelihood function of a conditional logit model to estimate its coefficients. I have a big data frame with about 9M rows (300k choice sets) and about 40 parameters to be estimated. It looks like this:

ChoiceSet Choice  SKU Price Caramel etc.
        1      1 1234   1.0       1  ...
        1      0  145   2.0       1  ...
        1      0 5233   2.0       0  ...
        2      0 1432   1.5       1  ...
        2      0 5233   2.0       0  ...
        2      1 8320   2.0       0  ...
        3      0 1234   1.5       1  ...
        3      1  145   1.0       1  ...
        3      0 8320   1.0       0  ... 

ChoiceSet identifies the set of products available in the store at the moment of purchase, and Choice = 1 marks the SKU that was chosen.
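For reference, what I am computing is the standard conditional logit log-likelihood (my notation: x_j is the attribute row of alternative j, C_s is the s-th choice set, and y_s is its chosen alternative):

    \ell(\beta) = \sum_{s} \Big( x_{y_s}^\top \beta - \log \sum_{j \in C_s} \exp\big(x_j^\top \beta\big) \Big)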

Since the choice sets vary in composition, I use the following log-likelihood function:

library(data.table)
library(foreach)   # a parallel backend (e.g. doParallel) must be registered for %dopar%

clogit.ll <- function(beta, X) {   #### This is the function to be maximized
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)

  # sum of x'beta over the chosen alternative of every choice set ...
  sum(as.matrix(X[J(unique(X$ChoiceSet), 1), 3:ncol(X), with = FALSE]) %*% beta) -
    # ... minus the log-sum-exp term of every choice set
    sum(foreach(chset = unique(X$ChoiceSet), .combine = 'c',
                .packages = 'data.table') %dopar% {
      Z  <- as.matrix(X[J(chset, 0:1), 3:ncol(X), with = FALSE])
      Zb <- Z %*% beta
      log(sum(exp(Zb)))
    })
}
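One idea I am considering, if profiling shows that the per-set foreach loop is the main cost: the same quantity can be computed in one grouped data.table pass instead of launching 300k parallel tasks. This is only a rough sketch, not tested on the full data; clogit.ll.fast is a placeholder name and I assume the column layout shown above (ChoiceSet, Choice, then the attributes):

clogit.ll.fast <- function(beta, X) {
  DT <- as.data.table(X)
  attr_cols <- setdiff(names(DT), c("ChoiceSet", "Choice"))
  # x'beta for every row in a single matrix product
  DT[, xb := as.vector(as.matrix(.SD) %*% beta), .SDcols = attr_cols]
  # per choice set: utility of the chosen row minus a max-shifted log-sum-exp
  grp <- DT[, .(chosen = sum(xb[Choice == 1]),
                lse    = {m <- max(xb); m + log(sum(exp(xb - m)))}),
            by = ChoiceSet]
  sum(grp$chosen - grp$lse)
}

A single grouped pass like this should scale better than one foreach task per choice set, where the scheduling overhead usually dominates.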

I create a new data frame without SKU (it is not needed) and a vector of zero starting values:

X0 <- Data[, -3]            # drop the SKU column
b0 <- rep(0, ncol(X0) - 2)  # one starting value per attribute column

I maximize this function with the help of the maxLik package, supplying an analytic gradient to make the calculation faster.
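For reference, the gradient I am computing is (same notation as above, with p_{sj} the within-set choice probability of alternative j):

    \nabla \ell(\beta) = \sum_{s} \Big( x_{y_s} - \sum_{j \in C_s} p_{sj} \, x_j \Big),
    \qquad p_{sj} = \frac{\exp(x_j^\top \beta)}{\sum_{k \in C_s} \exp(x_k^\top \beta)}

The gradient function: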

grad.clogit.ll <- function(beta, X) {   ### gradient of the log-likelihood
  X <- as.data.table(X)
  setkey(X, ChoiceSet, Choice)

  colSums(foreach(chset = unique(X$ChoiceSet), .combine = 'rbind',
                  .packages = 'data.table') %dopar% {
    Z  <- as.matrix(X[J(chset, 0:1), 3:ncol(X), with = FALSE])  # all alternatives in the set
    xc <- as.matrix(X[J(chset, 1),   3:ncol(X), with = FALSE])  # the chosen alternative
    p  <- exp(Z %*% beta)
    p  <- p / sum(p)                                            # within-set choice probabilities
    as.vector(xc) - as.vector(t(Z) %*% p)                       # x_chosen - sum_j p_j x_j
  })
}
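The gradient could be vectorized the same way, using the fact that summing p_{sj} x_j over all sets is just a probability-weighted column sum over all rows. Again only a sketch under the same assumptions (grad.clogit.ll.fast is a placeholder name):

grad.clogit.ll.fast <- function(beta, X) {
  DT <- as.data.table(X)
  attr_cols <- setdiff(names(DT), c("ChoiceSet", "Choice"))
  Z  <- as.matrix(DT[, attr_cols, with = FALSE])
  xb <- as.vector(Z %*% beta)
  # within-set choice probabilities (max-shifted softmax per ChoiceSet)
  DT[, p := {e <- exp(xb[.I] - max(xb[.I])); e / sum(e)}, by = ChoiceSet]
  # sum over sets of x_chosen minus the probability-weighted sum over all rows
  colSums(Z[DT$Choice == 1, , drop = FALSE]) - colSums(Z * DT$p)
}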

The maximization call is the following:

fit <- maxLik(logLik = clogit.ll, grad = grad.clogit.ll, start = b0, X = X0, method = "NR", tol = 1e-6, iterlim = 100)
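One variant I have been thinking about (untested): since I pass no hess argument, method = "NR" has to approximate the Hessian numerically at every iteration, which seems expensive with ~40 parameters, so a quasi-Newton method such as BFGS might help, combined with the vectorized sketches above:

fit <- maxLik(logLik = clogit.ll.fast, grad = grad.clogit.ll.fast,
              start = b0, X = X0, method = "BFGS", iterlim = 200)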

It works fine for small samples, but takes too long for big ones:

Number of choice sets    Duration of computation
                  300                    4.5 min
                  400                   10.5 min
                 1000                     25 min

But when I run it for 5,000+ choice sets, R terminates the session.

So (if you are still reading) how can I maximize this function when I have 300,000+ choice sets and 1.5 weeks to finish my course work? Please help, I have no idea how to proceed.

Vitaliy Poletaev
  • Have you profiled your code? I would use package data.table for faster subsetting (to avoid all these vector scans). – Roland May 29 '16 at 19:09
  • @lmo No, they probably don't need more computing power. They need to improve their code. – Roland May 29 '16 at 19:24
  • @Roland, what do you mean by improving the code? Is it about theoretical mistakes connected with the conditional logit model, or about technical issues with the coding? – Vitaliy Poletaev May 30 '16 at 11:54
  • I mean that you need to improve your code logic to avoid/optimize wasteful and slow operations. You should not subset so much, and you should use data.table's binary search instead of the vector scan `X[,1]==chset`. And of course, you need to profile your code to identify further bottlenecks. – Roland May 30 '16 at 12:14
  • @Roland, I followed your advice about data.table, and it now works faster (thanks for that), but still too slow. The new code is in the question. As for subsetting too much, I have no idea how to avoid it because of the model specification. Any other suggestions? – Vitaliy Poletaev Jun 02 '16 at 10:02
  • Profile your code. – Roland Jun 02 '16 at 10:10
