1

I have a pandas DataFrame df that I pass as input to the PyCaret library. The df has:

3 categorical variables:
    LIB_SOURCE  : values: 'arome_001', 'gfs_025' and 'arpege_01'
    MonthNumber : values from 1 to 12
    origine     : 'Sencrop' and 'Visiogreen' values

3 continuous variables:

    TEMPERATURE_PREDITE  DIFF_HOURS  TEMPERATURE_OBSERVEE

I let PyCaret encode the categorical features as 0/1 columns and manage multicollinearity:

from pycaret.regression import setup

regression = setup(data=dataset_predictions_meteo,
                   target='TEMPERATURE_PREDITE',
                   categorical_features=['MonthNumber', 'origine', 'LIB_SOURCE'],
                   numeric_features=['DIFF_HOURS', 'TEMPERATURE_OBSERVEE'],
                   session_id=123,
                   train_size=0.8,
                   normalize=True,
                   # transform_target=True,
                   remove_perfect_collinearity=True
                  )

[Two screenshots omitted]

But as you can see in the screenshot above, PyCaret does not handle multicollinearity well: it should drop one of the three columns 'arome_001', 'gfs_025' and 'arpege_01' on its own (I checked with get_config('X')), yet it keeps all three.

Why doesn't PyCaret remove one of the three columns? Thanks.

Theo75

2 Answers

1

Multicollinearity means that two or more features are strongly linearly related; for a pair of features, that shows up as a correlation coefficient close to +1.0 or -1.0. Correlated features change together: when one changes, the other tends to change too. This can hurt model performance, particularly for linear models, whose coefficient estimates become unstable. PyCaret manages multicollinearity internally to achieve well-performing models.
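
As a minimal sketch of what a correlation "close to +1.0" looks like, the snippet below computes the Pearson coefficient by hand. The column names are borrowed from the question, but every value is invented for illustration, and `temperature_fahrenheit` is a hypothetical extra column that does not exist in the original data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values, for illustration only.
temperature_observee = [10.0, 12.0, 15.0, 18.0, 21.0]
# Hypothetical column: the same temperatures in Fahrenheit (an exact linear copy).
temperature_fahrenheit = [t * 9 / 5 + 32 for t in temperature_observee]
diff_hours = [3.0, 1.0, 4.0, 1.0, 5.0]

# An exact linear copy is perfectly collinear with its source.
r_copy = pearson(temperature_observee, temperature_fahrenheit)
# An unrelated column gives a coefficient well away from +1 or -1.
r_other = pearson(temperature_observee, diff_hours)

print(f"copy:  {r_copy:.3f}")
print(f"other: {r_other:.3f}")
```

Only the first pair would count as perfectly collinear; the second pair is correlated to some degree but nowhere near the +1.0 / -1.0 limit.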

When features are multicollinear, PLS (Partial Least Squares Regression) and PCA (Principal Component Analysis) can be used to remove the correlation among them. PLS regression reduces the features to a smaller set of uncorrelated latent components, chosen for how strongly they covary with the target. PCA likewise replaces the original features with new, uncorrelated ones (the principal components), but builds them without looking at the target.
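
To illustrate the PCA half of that statement, here is a rough sketch for the special case of two features, where the principal axes can be found with a single rotation angle instead of a full eigendecomposition. The data is invented and the helper names are hypothetical; in practice one would use a library implementation rather than this hand-rolled version, and PLS is not shown here.

```python
from math import atan2, cos, sin


def center(values):
    """Subtract the mean so the values are centered on zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def covariance(xs, ys):
    """Sample covariance of two already-centered sequences."""
    return sum(x * y for x, y in zip(xs, ys)) / (len(xs) - 1)


# Two invented, strongly correlated features.
x = center([10.0, 12.0, 15.0, 18.0, 21.0])
y = center([11.0, 12.5, 14.0, 19.0, 20.5])

# For a 2x2 covariance matrix, this angle gives the direction of the
# first principal axis (the direction of largest variance).
theta = 0.5 * atan2(2 * covariance(x, y), covariance(x, x) - covariance(y, y))

# Rotate the centered data onto the principal axes.
pc1 = [cos(theta) * a + sin(theta) * b for a, b in zip(x, y)]
pc2 = [-sin(theta) * a + cos(theta) * b for a, b in zip(x, y)]

print("covariance before:", covariance(x, y))
print("covariance after: ", covariance(pc1, pc2))
```

The original features have a clearly positive covariance, while the two rotated components have a covariance of zero up to floating-point error, which is the sense in which PCA's new features are uncorrelated.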

It is not clear to me why you think one of the three columns 'arome_001', 'gfs_025' and 'arpege_01' should be removed; my guess is that PyCaret works as expected.
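
One possible reading of the question is that the three 0/1 columns always sum to 1, so any one of them is determined by the other two. The sketch below, on invented rows, shows that this is true and yet no pair of columns is perfectly correlated. If `remove_perfect_collinearity` only looks for pairs of features with a correlation of exactly 1 (an assumption about PyCaret's behaviour that I have not verified against its source), it would have no reason to drop any of them.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


levels = ["arome_001", "gfs_025", "arpege_01"]
# Invented rows, two per level, for illustration only.
lib_source = ["arome_001", "gfs_025", "arpege_01",
              "arome_001", "gfs_025", "arpege_01"]

# Manual one-hot encoding: one 0/1 column per level.
one_hot = {level: [1 if value == level else 0 for value in lib_source]
           for level in levels}

# Every row has exactly one 1, so the three columns always sum to 1 ...
row_sums = [sum(column[i] for column in one_hot.values())
            for i in range(len(lib_source))]

# ... which means any column can be rebuilt from the other two.
rebuilt = [1 - a - g for a, g in zip(one_hot["arome_001"], one_hot["gfs_025"])]

# Yet no pair of columns is perfectly correlated.
pairwise = {(a, b): pearson(one_hot[a], one_hot[b])
            for a, b in combinations(levels, 2)}

print(row_sums)
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

With balanced levels like these, each pairwise correlation is -0.5, far from the -1.0 or +1.0 that a pairwise check would need to flag a column.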

Alper Yilmaz

0

I suppose that collinearity is only calculated for float and integer columns, whereas these columns are categorical.

Essegn