In the stroke prediction dataset, I've created dummy variables for all the categorical variables, e.g. gender_Male and gender_Female, smoking_status_smokes and smoking_status_never smoked, and so on. To check for multicollinearity across all the variables (numerical and dummy), I've applied the variance_inflation_factor function from statsmodels:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors only: every column except the target 'stroke'
X = new_df.drop(columns='stroke')

# One VIF per predictor column
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data

The output that I get is below:

    feature                         VIF
0   age                             2.836394
1   hypertension                    1.111484
2   heart_disease                   1.113943
3   avg_glucose_level               1.107552
4   bmi                             1.342729
5   gender_Female                   inf
6   gender_Male                     inf
7   ever_married_No                 inf
8   ever_married_Yes                inf
9   work_type_Govt_job              inf
10  work_type_Never_worked          inf
11  work_type_Private               inf
12  work_type_Self-employed         inf
13  work_type_children              inf
14  Residence_type_Rural            inf
15  Residence_type_Urban            inf
16  smoking_status_formerly smoked  inf
17  smoking_status_never smoked     inf
18  smoking_status_smokes           inf

Can somebody please explain why the VIFs of the dummy variables are infinite? Is there a better way to check for multicollinearity? Thanks.

IndigoChild
  • Dummy variable trap. The design matrix has perfectly collinear columns and reduced rank. – Josef Mar 06 '22 at 20:22
  • If one column is an exact linear combination of the others (which happens when you keep a full set of dummy variables), the regression used to compute its VIF has an R² of 1, so the VIF is 1/(1-1) = infinity. Also, when you create dummy variables from a categorical variable, you should drop one of the columns so that you don't introduce collinearity into the model. If you have a variable Gender (M/F -> 0/1), use only one of the columns, because col1 + col2 = 1. – rehaqds Mar 06 '22 at 20:27
  • Thanks both of you! It was silly of me to miss this – IndigoChild Mar 07 '22 at 03:58
