PyCaret doesn't handle multicollinearity well

Question

I have a pandas DataFrame df that I pass as input to the PyCaret library. The df has:

3 categorical variables:
    LIB_SOURCE  : values: 'arome_001', 'gfs_025' and 'arpege_01'
    MonthNumber : values from 1 to 12
    origine     : 'Sencrop' and 'Visiogreen' values

3 continuous variables:

    TEMPERATURE_PREDITE  DIFF_HOURS  TEMPERATURE_OBSERVEE

I let PyCaret encode the categorical features as 0/1 and manage multicollinearity:

from pycaret.regression import setup

regression = setup(data = dataset_predictions_meteo, 
                   target = 'TEMPERATURE_PREDITE', 
                   categorical_features = ['MonthNumber' , 'origine' , 'LIB_SOURCE'],
                   numeric_features = ['DIFF_HOURS' , 'TEMPERATURE_OBSERVEE'],  
                   session_id=123,
                   train_size=0.8, 
                   normalize=True, 
                   #transform_target=True,
                   remove_perfect_collinearity = True
                  )

But as you can see in the screenshot above, PyCaret doesn't handle multicollinearity well: it should remove one of the 3 columns 'arome_001', 'gfs_025' and 'arpege_01' by itself (see get_config('X')), but it keeps all 3 columns.

Why doesn't PyCaret remove one of the 3 columns? Thanks.

What is your question? You must explicitly state your question. — Jeong Kim, May 17 '22 at 22:16
Because PyCaret is managing multicollinearity, PyCaret should remove by itself 1 of 3 columns 'arome_001', 'gfs_025' and 'arpege_01' (get_config('X')) — Theo75, May 18 '22 at 06:16
So your question is why PyCaret doesn't remove one of 3 columns? — Jeong Kim, May 18 '22 at 06:37

Alper Yilmaz · Answer 1 · 2022-12-02T15:48:50.893

Multicollinearity means that two or more features are strongly linearly related. For a pair of features, this shows up as a correlation coefficient close to +1.0 or -1.0: when one changes, the other changes with it. This can hurt a model, most visibly by making the coefficients of linear models unstable and hard to interpret. PyCaret manages multicollinearity internally to achieve well-performing models.
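To make this concrete for the three source labels in the question, here is a minimal sketch in plain Python. The rows are made up, the one-hot encoding is written by hand, and none of it reflects PyCaret's internals. It shows that dummy columns from a single categorical variable always sum to 1 in every row, so they are linearly dependent as a group, while no pair of them has a correlation near +1.0 or -1.0. A check that only looks at pairwise correlations (an assumption about the check, not something confirmed here) would therefore have no reason to flag them.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy data: each source appears twice (made-up rows, not the asker's dataset).
sources = ["arome_001", "gfs_025", "arpege_01"] * 2
levels = sorted(set(sources))

# One-hot encode by hand: one 0/1 column per level.
dummies = {lvl: [1 if s == lvl else 0 for s in sources] for lvl in levels}

# Pairwise correlations between distinct dummy columns.
pairwise = {
    (a, b): pearson(dummies[a], dummies[b])
    for i, a in enumerate(levels)
    for b in levels[i + 1:]
}

# Row sums: the three columns add up to exactly 1 in every row.
row_sums = [sum(dummies[lvl][i] for lvl in levels) for i in range(len(sources))]

for pair, r in pairwise.items():
    print(pair, round(r, 3))
print("row sums:", row_sums)
```

With these balanced toy rows every pair correlates at -0.5, far from -1.0, even though any one column can be computed exactly from the other two.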

In the case of multicollinearity, PLS (Partial Least Squares regression) and PCA (Principal Component Analysis) can be used to remove correlation among the features. PLS regression reduces the features to a smaller set of latent components, built with the target in mind, rather than dropping individual features. PCA likewise replaces the original features with new, uncorrelated ones, but it builds them without looking at the target.

I am not sure why you think one of the 3 columns 'arome_001', 'gfs_025' and 'arpege_01' should be removed; my guess is that PyCaret works as expected.
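If you do want only two dummy columns for the three sources, one option is to do the encoding yourself with pandas before modelling. The sketch below uses a made-up toy frame that borrows the column names from the question. How PyCaret's setup would then treat the pre-encoded columns is not covered here, so take it as an illustration of the encoding step only.

```python
import pandas as pd

# Toy frame with the asker's column names; the values are made up for illustration.
df = pd.DataFrame({
    "LIB_SOURCE": ["arome_001", "gfs_025", "arpege_01", "gfs_025"],
    "DIFF_HOURS": [1.0, 3.0, 6.0, 12.0],
})

# drop_first=True drops the first level (in sorted order) of each encoded column,
# so k levels become k - 1 dummy columns.
encoded = pd.get_dummies(df, columns=["LIB_SOURCE"], drop_first=True)

print(list(encoded.columns))
```

Here 'arome_001' becomes the implicit reference level: a row with 0 in both remaining dummy columns is an 'arome_001' row.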

Essegn · Answer 2 · score 0 · answered Nov 21 '22 at 20:53

I suppose that collinearity is only being calculated for float and integer columns, whereas these columns are actually categorical.
