
For numerical/continuous data, we detect collinearity between predictor variables with Pearson's correlation coefficient, and we check that predictors are not correlated among themselves but are correlated with the response variable.
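
For example, with numeric predictors that check might look something like this (a small sketch with a made-up data frame `df`; the variable names are only illustrative):

set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 2 * df$x1 - df$x2 + rnorm(50)       # made-up response
cor(df[, c("x1", "x2", "x3")])              # predictor vs predictor: want these low
cor(df[, c("x1", "x2", "x3")], df$y)        # predictor vs response: want these higher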


But how can we detect multicollinearity if we have a dataset where the predictors are all categorical? Here is a small dataset where I am trying to find out whether the predictor variables are correlated:


> A (Response Variable)   B     C     D
> Yes                     Yes   Yes   Yes
> No                      Yes   Yes   Yes
> Yes                     No    No    No

How can I do the same when the predictors are categorical?

  • http://stats.stackexchange.com/questions/108007/correlations-with-categorical-variables – Alex Oct 28 '15 at 17:46
  • This question needs to be migrated to CV, as I have flagged it. It's off topic for SO. – alexwhitworth Oct 29 '15 at 13:30
  • If the questioner was asking for R code to detect collinearity or multicollinearity (which I am suggesting is well done via calculation of the variance inflation factor or the tolerance level of a data matrix), then CV.com may not be the correct venue. They generally refer people over to SO when the question is "how to do X in R?" – IRTFM Oct 30 '15 at 05:00

1 Answer


Collinearity can be, but is not always, a property of just a pair of variables, and this is especially true when dealing with categorical variables. So although a high pairwise correlation coefficient is enough to signal that collinearity might be a problem, a collection of low-to-medium pairwise correlations is not a sufficient test for the absence of collinearity. The usual method for continuous, mixed, or categorical collections of variables is to look at the variance inflation factors (which my memory tells me are proportional to the eigenvalues of the variance-covariance matrix). At any rate, this is the code for the vif function in package:rms:

vif <- function(fit) 
{
    # covariance matrix of the estimated regression coefficients
    v <- vcov(fit, regcoef.only = TRUE)
    nam <- dimnames(v)[[1]]
    # drop the intercept row(s)/column(s), if any
    ns <- num.intercepts(fit)
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    # scale to a correlation matrix, invert it, and take the diagonal
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}

The reason that categorical variables have a greater tendency to generate collinearity is that three-way or four-way tabulations often form linear combinations that lead to complete collinearity. Your example is an extreme case of collinearity, but you can also get collinearity with:

A B C D
1 1 0 0
1 0 1 0
1 0 0 1

Notice that this is collinear because A == B+C+D in all rows. None of the pairwise correlations would be high, but the system as a whole is completely collinear.
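
A quick way to see this in R (a small sketch that just extends the 0/1 pattern above to twelve rows so the rank check is meaningful):

B <- rep(c(1, 0, 0), 4)
C <- rep(c(0, 1, 0), 4)
D <- rep(c(0, 0, 1), 4)
A <- B + C + D          # A == B + C + D in every row (here A is identically 1)
X <- cbind(A, B, C, D)
qr(X)$rank              # 3, not 4: the four columns are linearly dependent
cor(cbind(B, C, D))     # off-diagonal pairwise correlations are only -0.5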

After putting your data into an R object and running lm() on it, it becomes apparent that there is another way to detect collinearity with R: lm drops factor terms from the results when they are "aliased", which is just another term for being completely collinear.
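
As a small self-contained sketch of that behaviour (the data frame `d` below is made up; `alias()` from base R reports which terms were dropped as aliased):

set.seed(42)
d <- data.frame(y = rnorm(20), A = factor(sample(1:3, 20, replace = TRUE)))
d$B <- d$A                      # an exact copy, so B's dummy columns are aliased with A's
fit <- lm(y ~ A + B, data = d)
summary(fit)                    # B's levels show NA: "not defined because of singularities"
alias(fit)                      # the Complete component spells out the linear dependencies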

Here is an example for @Alex demonstrating highly collinear data and the output of vif in that situation. Generally you hope to see variance inflation factors below 10.

> set.seed(123)
> dat2 <- data.frame(res = rnorm(100), A=sample(1:4, 1000, repl=TRUE)
+ )
> dat2$B<-dat2$A
> head(dat2)
          res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3  1.55870831 3 3
4  0.07050839 3 3
5  0.12928774 2 2
6  1.71506499 4 4
> dat2[1,2] <- 2   
#change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <-  lm( res ~ A+B, dat2) 
> summary(mod)

Call:
lm(formula = res ~ A + B, data = dat2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41139 -0.58576 -0.02922  0.60271  2.10760 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.10972    0.07053   1.556    0.120
A           -0.66270    0.91060  -0.728    0.467
B            0.65520    0.90988   0.720    0.472

Residual standard error: 0.9093 on 997 degrees of freedom
Multiple R-squared:  0.0005982, Adjusted R-squared:  -0.001407 
F-statistic: 0.2984 on 2 and 997 DF,  p-value: 0.7421

> vif ( mod )
       A        B 
1239.335 1239.335 

If you make a fourth variable "C" that is independent of the first two predictors (admittedly a bad name for a variable, since C is also an R function), you get a more desirable result from vif:

> dat2$C <- sample(1:4, 1000, repl=TRUE)
> vif( lm( res ~ A + C, dat2) )
       A        C 
1.003493 1.003493 

Edit: I realized that I had not actually created an R representation of a "categorical variable", despite sampling from 1:4. The same sort of result occurs with factor versions of that sample:

>  dat2 <- data.frame(res = rnorm(100), A=factor( sample(1:4, 1000, repl=TRUE) ) )
>  dat2$B<-dat2$A
>  head(dat2)
          res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3  1.55870831 3 3
4  0.07050839 3 3
5  0.12928774 2 2
6  1.71506499 4 4
>  dat2[1,2] <- 2   
> #change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
>  mod <-  lm( res ~ A+B, dat2) 
>  summary(mod)


Call:
lm(formula = res ~ A + B, data = dat2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.43375 -0.59278 -0.04761  0.62591  2.12461 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.11165    0.05766   1.936   0.0531 .
A2          -0.67213    0.91170  -0.737   0.4612  
A3           0.01293    0.08146   0.159   0.8739  
A4          -0.04624    0.08196  -0.564   0.5728  
B2           0.62320    0.91165   0.684   0.4944  
B3                NA         NA      NA       NA  
B4                NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9099 on 995 degrees of freedom
Multiple R-squared:  0.001426,  Adjusted R-squared:  -0.002588 
F-statistic: 0.3553 on 4 and 995 DF,  p-value: 0.8404

Notice that two of the B factor levels are omitted from the calculation of coefficients because they are completely collinear with the corresponding A levels. So if you want to see what vif returns for factor variables that are almost, but not completely, collinear, you need to change a few more values:

> dat2[1,2] <- 2   
> dat2[2,2] <-2; dat2[3,2]<-2; dat2[4,2]<-4
>  mod <-  lm( res ~ A+B, dat2) 
>  summary(mod)

Call:
lm(formula = res ~ A + B, data = dat2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.42819 -0.59241 -0.04483  0.62482  2.12461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.11165    0.05768   1.936   0.0532 .
A2          -0.67213    0.91201  -0.737   0.4613  
A3          -1.51763    1.17803  -1.288   0.1980  
A4          -0.97195    1.17710  -0.826   0.4092  
B2           0.62320    0.91196   0.683   0.4945  
B3           1.52500    1.17520   1.298   0.1947  
B4           0.92448    1.17520   0.787   0.4317  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9102 on 993 degrees of freedom
Multiple R-squared:  0.002753,  Adjusted R-squared:  -0.003272 
F-statistic: 0.4569 on 6 and 993 DF,  p-value: 0.8403
#--------------
> library(rms)

> vif(mod)
      A2       A3       A4       B2       B3       B4 
192.6898 312.4128 308.5177 191.2080 312.5856 307.5242 
  • Categorical variables cannot be co**linear**. They do not represent **linear** measures in Euclidean space.... A chi-square test can be used to test for independence of categorical variables. – alexwhitworth Oct 29 '15 at 13:33
  • R factor variables are represented as integers and they _may_ be collinear in the situations such as I described, since it is the invertibility of the matrix formed by the data cross-product that determines whether there is collinearity. – IRTFM Oct 29 '15 at 16:37
  • No. This is a definitional thing in statistics and mathematics. Colinear (def-geometry): points are said to be colinear if they lie on a single line. Colinearity (def-statistics): the linear relationship between two variables.... You have provided an empirical example using VIF that has no substantive meaning statistically. I.e., VIF does *calculate* in R because of the integer coding you describe, but that doesn't mean that it has a statistically valid interpretation. – alexwhitworth Oct 29 '15 at 17:02
  • You are not trying to generate interpretations, but rather to prevent errors in analysis. BTW, you are misspelling the term that you are trying to "protect". – IRTFM Oct 29 '15 at 17:07
  • Thanks for the typo catch... Agreed--let's prevent errors in analysis. It's an error in analysis to think categorical variables can be collinear. – alexwhitworth Oct 31 '15 at 16:24
  • Whatever you call it, using `vif` on model output is an effective method for assessing the lack of joint independence of categorical variables. Failing to use effective methods to assess such correlation results in errors in analysis. – IRTFM Oct 31 '15 at 19:38