This question is about how to perform variable selection for a probit model with marginal effects in R (either coded directly or by calling some pre-existing package).

I'm running a small probit regression of the effects of free and commercial availability of films on the level of piracy of those films, for a TLAPD-related blog post.

The easy way to run a probit in R is through glm, i.e.:

probit <- glm(y ~ x1 + x2, data = data, family = binomial(link = "probit"))

but that's problematic for interpretation because it doesn't supply marginal effects.
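
(As an aside, since a pre-existing package is also on the table: the mfx package wraps this fit-then-scale pattern and reports marginal effects directly. A minimal sketch, assuming mfx is installed and the same y ~ x1 + x2 specification:)

# a minimal sketch using the mfx package (assumed installed);
# probitmfx fits the probit and reports marginal effects in one call
# (at the mean of the covariates by default)
library(mfx)
probitmfx(formula = y ~ x1 + x2, data = data)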

Typically, if I want marginal effects from a probit regression, I define this function (I don't recall the original source, but it's a popular function that gets re-posted a lot):

mfxboot <- function(modform, dist, data, boot = 500, digits = 3){
  x <- glm(modform, family = binomial(link = dist), data = data)
  # average marginal effects: mean density of the link times the coefficients
  pdf <- ifelse(dist == "probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf*coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA, boot*length(coef(x))), nrow = boot)
  set.seed(1111)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1], dim(data)[1], replace = TRUE), ]
    x1 <- glm(modform, family = binomial(link = dist), data = samp1)
    # note: evaluate the density on the refitted model x1, not the original x
    # (the commonly re-posted version has a bug here)
    pdf1 <- ifelse(dist == "probit",
                   mean(dnorm(predict(x1, type = "link"))),
                   mean(dlogis(predict(x1, type = "link"))))
    bootvals[i, ] <- pdf1*coef(x1)
  }
  res <- cbind(marginal.effects,
               apply(bootvals, 2, sd),
               marginal.effects/apply(bootvals, 2, sd))
  if(names(x$coefficients[1]) == "(Intercept)"){
    # drop the intercept row before formatting
    res1 <- res[2:nrow(res), ]
    res2 <- matrix(as.numeric(sprintf(paste("%.", digits, "f", sep = ""), res1)),
                   nrow = dim(res1)[1])
    rownames(res2) <- rownames(res1)
  } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.", digits, "f", sep = ""), res)),
                   nrow = dim(res)[1])
    rownames(res2) <- rownames(res)
  }
  colnames(res2) <- c("marginal.effect", "standard.error", "z.ratio")
  return(res2)
}

Then run the regression like this:

mfxboot(modform = "y ~ x1 + x2",
        dist = "probit",
        data = piracy)

but with that approach I don't see how to run variable selection algorithms like forward, backward, or stepwise selection.

What's the best way to solve this problem? Is there a better way of running probits in R that reports marginal effects and also allows for automated model selection? Or should I focus on using mfxboot and doing variable selection with that function?

Thanks!

Hack-R
  • The answer to this question requires a substantial statistical component. If you migrate this to stats.SE or ask a new question there, I will be happy to answer this question. – tchakravarty Oct 21 '14 at 21:25
  • @fgnu Thanks, though I tried asking a very similar/related question on Cross Validated/stats.SE and they sent me here, saying that it's too specific to R. Thanks for your answer below. I will upvote and comment. – Hack-R Oct 22 '14 at 01:33

1 Answer


It is not clear why there is a problem. Model (variable) selection and computing the marginal effects for a given model are sequential steps, and there is no reason to try to combine the two.

Here is how you might go about computing marginal effects and their bootstrapped standard errors after model (variable) selection:

  1. Perform variable selection using your preferred model selection procedure (including bootstrap model selection techniques as discussed below, not to be confused with the bootstrap you will use to compute the standard errors of the marginal effects for the final model).

    Here is an example on the dataset supplied in this question. Note also that this is in no way an endorsement of the use of stepwise variable selection methods.

#================================================
# read in data, and perform variable selection for
#   a probit model
#================================================
dfE = read.csv("ENAE_Probit.csv")
formE = emploi ~ genre + 
  filiere + satisfaction + competence + anglais
glmE = glm(formula = formE, 
           family = binomial(link = "probit"),
           data = dfE)

# perform model (variable) selection
glmStepE = step(object = glmE)
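
By default, step() performs backward selection from the fitted model. If you want forward or bidirectional search instead, supply a scope; a minimal sketch (glmNullE is an illustrative name, not from the original answer):

# forward selection from an intercept-only model; the scope bounds
# the search between the null model and the full formula formE
glmNullE = glm(formula = emploi ~ 1,
               family = binomial(link = "probit"),
               data = dfE)
glmFwdE = step(object = glmNullE,
               scope = list(lower = ~ 1, upper = formE),
               direction = "forward")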

  2. Now pass the selected model to a function that computes the marginal effects.

#================================================
# function: compute marginal effects for logit and probit models
# NOTE: this assumes that an intercept has been included by default
#================================================
fnMargEffBin = function(objBinGLM) {
  stopifnot(objBinGLM$family$family == "binomial")
  # average marginal effects: mean of the link density times each
  # coefficient, dropping the intercept column
  vMargEff = switch(objBinGLM$family$link,
                    probit = colMeans(outer(dnorm(predict(objBinGLM,
                                                          type = "link")),
                                            coef(objBinGLM))[, -1]),
                    logit = colMeans(outer(dlogis(predict(objBinGLM,
                                                          type = "link")),
                                           coef(objBinGLM))[, -1])
  )
  return(vMargEff)
}

# test the function
fnMargEffBin(glmStepE)

Here is the output:

> fnMargEffBin(glmStepE)
     genre    filiere 
0.06951617 0.04571239
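
Since ENAE_Probit.csv is not reproduced here, here is a self-contained check of the same select-then-compute pipeline on simulated data (illustrative only; x3 is pure noise, so step() should drop it):

# self-contained check on simulated data (illustrative only)
set.seed(42)
dfSim = data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500))
dfSim$y = rbinom(500, 1, pnorm(0.5*dfSim$x1 - 0.25*dfSim$x2))
glmSim = glm(y ~ x1 + x2 + x3,
             family = binomial(link = "probit"), data = dfSim)
glmStepSim = step(object = glmSim, trace = 0)
fnMargEffBin(glmStepSim)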
  3. To get at the standard errors of the marginal effects, you can bootstrap the marginal effects using, for example, the Boot function from the car package, since it provides a neat interface for bootstrapping statistics derived from glm estimates.
#================================================
# compute bootstrap std. errors for the marginal effects
#================================================
library(car)
# note: labels come from the selected model glmStepE, so they
# line up with the marginal effects returned by fnMargEffBin
margEffBootE = Boot(object = glmStepE, f = fnMargEffBin,
                    labels = names(coef(glmStepE))[-1], R = 100)
summary(margEffBootE)
summary(margEffBootE)

Here is the output:

> summary(margEffBootE)
          R original  bootBias   bootSE  bootMed
genre   100 0.069516 0.0049706 0.045032 0.065125
filiere 100 0.045712 0.0013197 0.011714 0.048900
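
If you also want interval estimates, the object returned by Boot supports confint; percentile intervals are a reasonable choice at R = 100, since BCa intervals (the default) can be unstable with so few replicates:

# bootstrap percentile confidence intervals for the marginal effects
confint(margEffBootE, level = 0.95, type = "perc")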

Appendix:

As a matter of theoretical interest, there are two ways to interpret your request for bootstrapped variable selection.

  1. You can perform model selection (variable selection) using a bootstrap model-fit criterion as the measure of fit. The theory for this is outlined in Shao (1996), and requires a subsampling approach.
    You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.

  2. You can perform variable selection on multiple bootstrap samples, and arrive either at one best model by looking at the variables retained across the multiple bootstrap model selections, or use a model-averaging estimator. The theory for this is discussed in Austin and Tu (2004); a minimal sketch of the retention-frequency variant appears after this list.
    You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
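
As promised, a minimal sketch of the retention-frequency variant of the second approach, reusing dfE and formE from above (B, vars, and retained are illustrative names; 50 resamples is purely for demonstration):

# run stepwise selection on B bootstrap resamples and tabulate
# how often each variable survives selection
B = 50
vars = attr(terms(formE), "term.labels")
retained = matrix(0, nrow = B, ncol = length(vars),
                  dimnames = list(NULL, vars))
for (b in 1:B) {
  dfB = dfE[sample(nrow(dfE), replace = TRUE), ]
  glmB = glm(formE, family = binomial(link = "probit"), data = dfB)
  stepB = step(glmB, trace = 0)
  retained[b, ] = vars %in% attr(terms(stepB), "term.labels")
}
# proportion of bootstrap samples in which each variable is retained
colMeans(retained)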

tchakravarty
  • Thanks very much for this detailed answer. I will +1 and, if there are no better answers, I will mark it as the solution after a while. However, I actually did want to do selection + bootstrap all in one step, for a couple of reasons, and I should have clarified that better. The reasons you mentioned at the end are a big part of it; another is that I'm commonly dealing with Big Data, where running model selection takes a long time and bootstraps take even longer, so since I figured there was some way to do it, I thought it might be faster to just estimate everything in one function/step. – Hack-R Oct 22 '14 at 01:39