1

I was trying to obtain the expected utility for each individual using R's survival package (clogit function) and I was not able to find a simple solution such as mlogit's logsum.

Below I set an example of how one would do it using the mlogit package. It is pretty straight forward: it just requires regressing the variables with the mlogit function, save the output and use it as an argument in the logsum function -- if needed, there is a short explanation in this vignette. And what I want is to know the similar method for clogit. I've read the package's manual but I have failed to grasp what would be the most adequate function to perform the analsysis.

Note1: My preference for a function like mlogit's is related to the fact that I might need to perform tons of regressions later on and being able to perform the correct estimation in different scenarios would be helpful.

Note2: I do not intend that the dataset created below be representative of how data should behave. I've set the example solely for the purpose of perfoming the function after the logit regressions.

**

library(survival)
library(mlogit)

#creating a dataset

df_test=data.frame(id=rep(1:20,each=4),
                   choice=rep(c("train","car","plane","boat")),
                   distance=c(rnorm(80)*10),
                   )

f=function(x,y,z) {
    
  v=round(rnorm(x,y,z))
    
    while(sum(v)>1 | sum(v)==0) {
      
      v=round(rnorm(x,y,z))
      
    }
  
return(v)
    
}

result1=c()

for (i in 1:20) {
  
  result=f(4,0.5,0.1)
  
  result1=c(result,result1)
  
}

df_test$distance=ifelse(df_test$distance<0,df_test$distance*-1,df_test$distance)
df_test$price = 0
df_test$price[df_test$choice=="plane"] = rnorm(20, mean = 300, sd=30)
df_test$price[df_test$choice=="car"] = rnorm(20, mean = 50, sd=10)
df_test$price[df_test$choice=="boat"] = rnorm(20, mean = 100, sd=15)
df_test$price[df_test$choice=="train"] = rnorm(20, mean = 120, sd=25)

df_test$choice2=result1
           
mlog=mlogit(choice2 ~ distance + price , data = df_test)

#the function logsum generates expected utility for each individual

logsum(mlog)

#so what would be adequate alternative with survival's clogit? I set an exemple below of
#of what i would like to regress and then perform something like logsum()

clog=clogit(choice2 ~ distance + price + as.factor(choice), strata(id), data = df_test)

**

John Doe
  • 212
  • 1
  • 9
  • 28

1 Answers1

2

The vignette you offer says the logsum is calculated as:

enter image description here

To my reading that is similar to the calculation used to construct the "linear predictor". the lp is t(coef(clog)) %*% Xhat. If I'm correct on that interpretation, then that is stored in the clog-object:

 clog["linear.predictor"]

So you could just take:

log(sum( exp(clog[["linear.predictors"]]) ))
[1] 4.286592

If you need to segregate this by df_test$choice then (since the length of the 'linear.predictors' element of clog is the same as the length of the 'choice' column of df_test) it's just:`

 tapply(clog$linear.predictors, df_test$choice, function(x){ 
               log(sum( exp(x) ))})
 #--------------------------------------
    boat      car    plane    train 
2.824502 3.506756 2.825004 1.734258 

If you need to aggregate this by id, you would do it like this:

tapply(clog$linear.predictors, df_test$id, function(x){ 
  log(sum( exp(x) ))})

       1        2        3        4        5        6        7        8        9 
1.405896 1.506152 1.454507 1.539060 1.467082 1.428482 1.393582 1.521877 1.480670 
      10       11       12       13       14       15       16       17       18 
1.466490 1.416338 1.500581 1.528075 1.488253 1.398405 1.445014 1.483623 1.404041 
      19       20 
1.460672 1.452709 
John Doe
  • 212
  • 1
  • 9
  • 28
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I think you're partially correct. But we need to index this to each id - each 'iv' has an 'i'. – John Doe Jun 03 '21 at 03:25
  • 1
    OK. Eazy-peezy. See above. – IRTFM Jun 03 '21 at 15:06
  • I will award you the bounty, but you need sum the values of each choice by each ID -- the index J refers to choices, the 'i' to the IDs of the individuals. What I meant is that you need to provide a 'logsum' of each ID. – John Doe Jun 04 '21 at 18:15
  • could you please show how you would get the linear predictors as in `t(coef(clog)) %*% Xhat`? I've haven't been able to obtain the value with the estimated coefficients. – John Doe Jun 06 '21 at 02:38
  • I think you might not be taking into account the conditional structure on the calculations. – IRTFM Jun 06 '21 at 21:15
  • If you could just provide a reference on how the linear predictors are calculated, that would be great :) I was trying find out how the `lp` option is calculated with the `predict` function, but that is still not clear to me. – John Doe Jun 06 '21 at 22:28
  • But still. I will award you the bounty if you just correct your function to sum the value through each individual - if the `lp` option is indeed correct. One possible way to this -- and using only the `clogit` object -- might be to get the length of the `xlevels`, divide by `dim(df_teste)[1]` and recreate the ID. Also, if I am not mistaken, there is no way to get the original dataset stored in the object created by `clogit`. Right? – John Doe Jun 06 '21 at 22:42
  • `?clogit`:"It turns out that the loglikelihood for a cond. logistic regression model = loglik from a Cox model with a particular data structure. Proving this is a nice homework exercise for a PhD statistics class; not too hard, but the fact that it is true is surprising. [snipped one sentence] In detail, a stratified Cox model with each c-c group assigned to its own stratum, time set to a constant, status of 1=case 0=control,.." and then 2 references are given. – IRTFM Jun 06 '21 at 22:46
  • To your request for a calculation with both choice and id grouping, that would lead to an 80 cell table for a dataset with 80 cases and there would be exactly 1 linear predictor per cell, so summation would be redundant. – IRTFM Jun 06 '21 at 22:51
  • Not to group by choice. Just ID. That is what the `logsum` function provides for the `mlogit` object. – John Doe Jun 06 '21 at 23:30
  • Do you think we could do this solely with the information produced by the `clogit` object (`clog`) and not using the dataset of the data.frame? Such as not using `df_test$id`, but something that is stored in the `clog` object. – John Doe Jun 07 '21 at 00:32
  • I didn’t see sufficient information. The Surv object was the but not the pairing details. The coxph function allows storage of both the x and y components but not clog it. – IRTFM Jun 07 '21 at 05:04