How can I estimate variable influence of the factors in my model rather than just factor levels?

Question

I created a regression model And i want to estimate an influence analysis for each factor.

Meaning take the square Wald-estimation (z-value) for a specific factor and divide it by sum of squares of their Wald-estimation. and that how I estimate the influence of specific factor.

My problem is that the factors are divided by their levels.

I will give an example:

model<-glm(formula = form,
            family = binomial("logit"),
            data   = Train)
View(summary(model)$coefficients)

In the table we can see that the factor dom_time_Colnames is divided into 4 levels. Same thing happened with first_byte_downdload_Colnames.

I want to take the factors z-values and not their levels z-values.

How I do it? anova() is a good idea but it doesn't stop running for me. I search for creative solution that give me output like the the z-value in glm summary or deviance in anova for the all factor and not for theirs levels.

Here is a reproducible example:

Data<-data.frame(Species=iris$Species)
for(i in 1:ncol(iris)){
  if(is.numeric(iris[,i])){
    result=quantile(x = iris[,i],probs = seq(0,1,0.1)) 
    out<-cut(iris[,i], breaks = unique(result),include.lowest = TRUE)
    Data<-data.frame(Data,out)
    colnames(Data)[length(Data)]<-colnames(iris)[i]
  } else {
    next()
  }
}
Data$y<-rbinom(n = nrow(Data),size = 1,prob = 0.1)
form<-formula(y~.)
model<-glm(formula = form,
            family = binomial("logit"),
            data = Data)
View(summary(model)$coefficients)

we can see that the factor sepal or Petal is divided to its levels.

What do you mean by "`anova()` is a good idea but it don't stop running for me. "? The thing is that factors are transformed in a set of dummy variables (which I think makes sense). If you want to see if the whole factor variable is significant than I would recommend an F-test. But that is more a statistical than a programming question. — Alex, Jun 26 '16 at 12:29
When you run the reproducible example do you also get the warning `Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred `? By the way, it looked like you were trying to add some non-functioning HTML tags to your question. I fixed it. Just FYI the StackOverflow text editor is more of a WYSIWYG editor, asides from a few special markdowns. http://stackoverflow.com/editing-help — Hack-R, Jun 28 '16 at 13:58
@Alex when you running anova you can get deviance for your factors. I can use them instead of z-values. and compute the influence-analysis. But the problem is that it doesn't stop running. — Dima Ha, Jun 28 '16 at 14:46
@Alex anova(model), I tried anova(model,"chisq") i tried evrything. — Dima Ha, Jun 28 '16 at 14:50
Does it have to be this particular variable importance measure, or are you open to other solution? — Hack-R, Jun 28 '16 at 15:00
I think you are confusing "variable importance" with "statistical significance". The Wald test, or likelihood ratio test (which is asymptotically equivalent), will test nested-models hypotheses. But they will not measure "variable importance".... **what are you trying to quantify-- importance to prediction accuracy? model fit (eg. likelihood)? something else?** — alexwhitworth, Jun 29 '16 at 00:02
@Hack-R Yes, of course, am open to any solution that will give me any importance measure. It doesn't have to be deviance or z-values. — Dima Ha, Jun 29 '16 at 08:13
@Alex Yes Alex, you are right. i am using "statistical significance" to measure "variable importance". — Dima Ha, Jun 29 '16 at 08:16
To digress a little, I have one concern. In your example, you treated levels as `"factor"`, but I think they are `"ordered" "factor"` and this makes a difference. (but I can't decide this comment apply to your data.) — cuttlefish44, Jun 30 '16 at 02:42

Alex · Answer 1 · 2016-06-28T19:05:21.930

As I already mentioned is R converting factor variables into dummy variables when using glm. This makes I think intuitive sense because otherwise it would use arbitrary numeric values which are used for the coding of the factor levels.

You can use an F-test to look if multiple variables together are significant. Here we want to test whether all dummies together are significant (in other words the whole factor variable). This can be done by first fitting one model with the factor included and one where you skipped the factor variable. In a second step you can pass both models to the test function (here we use lmtest::waldtest).

Here is my suggestion:

One remark: I changed the probability of the y variable from 0.1 to 0.3. Otherwise the P-value almost always becomes 1.

library(lmtest)

## The full model containing all variables
full_formula      <- formula(y~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
full_model        <- glm(formula = full_formula,
                         family = binomial("logit"),
                         data = Data)

## Reduced models (leaving out each variable once)
reduced_models <- list()
for(i in 1:(ncol(Data)-1)) {
  reduced_formulas  <- as.formula(paste0("y ~ ", paste(names(Data)[c(-i ,-ncol(Data))], collapse= " + " )))
  reduced_models[[i]] <- glm(formula = reduced_formulas,
                         family = binomial("logit"),
                         data = Data)
}
## Test the full model against all reduced models
result        <- lapply(reduced_models, waldtest, full_model)

##Add the name of the tested variable
names(result) <- names(Data[ ,-ncol(Data)])

The first three results:

> result[1:3]
$Species
Wald test

Model 1: y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Model 2: y ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
  Res.Df Df      F Pr(>F)
1    115                 
2    113  2 0.1116 0.8945

$Sepal.Length
Wald test

Model 1: y ~ Species + Sepal.Width + Petal.Length + Petal.Width
Model 2: y ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
  Res.Df Df      F Pr(>F)
1    122                 
2    113  9 0.7051 0.7031

$Sepal.Width
Wald test

Model 1: y ~ Species + Sepal.Length + Petal.Length + Petal.Width
Model 2: y ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
  Res.Df Df      F Pr(>F)
1    121                 
2    113  8 0.1743 0.9939

The first list entry tested whether the factor variable species is significant, the second whether Sepal.Length is significant and so on.

It is no surprise that all tests have high P-values since the y variable was completely random.

*edit

You can also use anova():

result_anova        <- lapply(reduced_models, anova, full_model)
names(result_anova) <- names(Data[ ,-ncol(Data)])

Only the result for the first variable:

> result_anova[[1]]
Analysis of Deviance Table

Model 1: y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Model 2: y ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
  Resid. Df Resid. Dev Df Deviance
1       115     149.16            
2       113     146.87  2   2.2914

So, to make this into a measure of relative variable influence/importance, you'd make this into a loop that programmatically tests each factor as the one dropped from the model and then compare the F values? — Hack-R, Jun 28 '16 at 15:24
I do not know what is required. But this could of course be done. I think I will add it to the answer. But what do you think about this approach? — Alex, Jun 28 '16 at 15:29
I think it makes sense if it fits with the OP's intention. I asked her in a comment if we could use other measures of variable importance, but she hasn't replied yet. I think this measures significance rather than magnitude, though right? or is it both? — Hack-R, Jun 28 '16 at 15:33
Yes, it measures only significance. But if I understood correctly this was the question. — Alex, Jun 28 '16 at 15:41
It may have been, I'm not sure. I'm a little unclear on what if anything is the distinction between importance and influence. For variable importance the magnitude is often the attribute of interest because they look at things like how much a variable contributes to a model, e.g. mean decrease in GINI coefficient being used to calculate the % that a variable contributes to a model, etc. OTOH if I'm doing feature selection for logit I care more about significance (or both). Both significance and magnitude are important, it just depends which question the OP is trying to answer. — Hack-R, Jun 28 '16 at 15:44
Hey Alex, This is exactly what i looked for!!! I understand from the comments that i was not so clear, I am sorry for that. I am a data-scientist and in my work i create a model. when we show to the management they don't understand what is it "statistical significance". so we tell them that 30% of the people left because 'Sepal.Length' (just for the explanation) 20% because of Petal.Width. Usually i took the deviance or the z-values and calculated this "variable importance" (this percentage of influence). — Dima Ha, Jun 29 '16 at 08:30
So did my answer do what you wanted? Or is there still something not clear? — Alex, Jun 29 '16 at 11:04
@AnnaBokovskaya And most of what is shown in more detail in the two answers I had already told you in comments which were wiped because you considered them "rude". And what you tell your management is so wrong that famous statisticians start rotating in their graves. — Roland, Jun 30 '16 at 10:33
@Roland Of course famous statisticians start rotating in their graves. But this is not for FDA drug submission. the business analyst doesn't know statistics!! especially anova or significant. But still he need tools for decision. — Dima Ha, Jun 30 '16 at 11:47

score 0 · Answer 2 · answered Jun 29 '16 at 15:58

Given your clarification in the comments that what you mean by "variable importance" is actually statistical significance, I suggest you use a likelihood ratio test. For simple-to-simple hypotheses, these are uniformly most powerful (UMP) tests and for composite hypotheses are still ideal.

The Wald test option, which other @Alex suggests above, is another option. The LRT and Wald test are asymptotically equivalent. Although, for simple v simple hypotheses, the LRT is the UMP and thus preferred.

# using your Data argument, here are nested models:
model1<-glm(y ~ ., family = binomial("logit"), data = Data)
model2 <- glm(y ~ Sepal.Length + Sepal.Width + Petal.Length, 
              family= binomial("logit"), data= Data)

# LRT model1 vs the null hypothesis
pchisq(model1$null.deviance - model1$deviance, 
       model1$df.null - model1$df.residual, lower= FALSE)
[1] 0.1721306

# nested model LRT
pchisq(model2$deviance - model1$deviance, 
       model2$df.residual - model1$df.residual, lower= FALSE)
[1] 0.05313724

In both cases, since pchisq > 0.05 (a standard default significance level) we do not reject the null hypothesis that the additional covariates are needed. For this dataset, we would choose the null model. More relevant to your question: for the nested model LRT, the additional terms (eg. model1) are not needed.

Again, this is not a metric of "variable importance" (eg. Breiman 2001). It's a metric assessing goodness of fit.

How can I estimate variable influence of the factors in my model rather than just factor levels?

2 Answers2