I'm having trouble estimating a multinomial probit model in R. I've found two packages, but I haven't gotten either to give satisfactory results. Is there a bug in my code? Am I using the packages incorrectly?
A little example:
A consumer faces 3 choices, plus the outside option of not taking any of the choices. The utility of the outside option is normalized to zero:
u_i0 = 0
u_i1 = -20 + 1*age_i + epsilon_i1
u_i2 = 0 + epsilon_i2
u_i3 = 15 - 1*age_i + epsilon_i3
(Here i indexes consumers.)
Code where (assuming no bug) age is uniform on 11:50 and the epsilons are iid Normal(0, 1), independently of age:
library(MNP) # Multinomial probit
library(mlogit) # Has a probit option
n <- 1000
df <- data.frame(age=sample(11:50, replace=TRUE, size=n))
constant <- c(-20, 0, 15)
coefficients <- rbind(c(1, 0, -1))
epsilon <- matrix(rnorm(n*3), nrow=n, ncol=3)
utility <- (matrix(rep(constant, n), nrow=1000, ncol=3, byrow=TRUE) +
as.matrix(df) %*% coefficients + epsilon)
isTRUE(all.equal(utility[1, ], as.vector(constant + coefficients * df$age[1] +
epsilon[1, ]))) # True as expected
df$choice <- max.col(utility)
max.utility <- apply(utility, 1, max)
df$choice[max.utility < 0] <- 0 # Take outside option when all product utilities < 0
df$choice <- factor(df$choice)
table(df$choice)
model.mnp <- mnp(choice ~ age, data=df, burnin=100)
summary(model.mnp) # Many of the 95% intervals don't contain the true value
model.mlogit <- mlogit(choice ~ 0 | age, data=df,
varying=NULL, shape="wide", probit=TRUE)
summary(model.mlogit)
I'd like the models to recover the coefficients, but mnp's estimates seem to be off (or are they just very noisy?), and mlogit gives me an error saying the system is computationally singular.
What should I try?
Edit: this does work (probit=FALSE):
model.mlogit <- mlogit(choice ~ 0 | age, data=df, varying=NULL, shape="wide", probit=FALSE)
summary(model.mlogit)
It gives constants of roughly -30, 0, 22 and age coefficients of 1.5, 0, -1.4. The code runs and it gives reasonable estimates -- but they aren't exactly correct, because the data are generated with normal errors, whereas for the logit to be correctly specified the errors would have to be extreme value (see http://en.wikipedia.org/wiki/Logistic_regression#As_a_latent-variable_model)