I am working on a regression problem with around 13 numerical features and 38 categorical features, and I need to predict a target that is a discrete numerical variable.

Is there a method to determine which of the 38 categorical features are most useful for predicting the target? Or should I label-encode all the categorical features, check the data for skewness, and perform factor analysis?

My dataset has shape (6403320, 51), i.e., 6,403,320 rows and 51 columns.

What is the best approach for a dataset with this many categorical features?

SID

1 Answer

Typically I recommend using theory or hypotheses to limit the features that you include when attempting to model the data.

However, other approaches (which are not mutually exclusive) could be to:

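  • Consider using a penalized model (e.g., LASSO regression, ridge regression). LASSO performs variable selection automatically by shrinking some coefficients to exactly zero, and how many are dropped depends on the penalty weight. Ridge regression only shrinks coefficients toward zero, so it regularizes the model but does not remove variables.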
  • Create latent variables that 'group' together some of the categorical variables (i.e., reducing the number of variables in the model) if there is reason to argue that they are related (e.g., via Principal Component Analysis applied to the numerically encoded variables).
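
To make the penalty idea in the first bullet concrete, here is a minimal sketch of LASSO on one-hot encoded categorical features, written in plain Python with coordinate descent. Everything in it is an assumption for illustration: the toy data, the feature names (color, shape), the penalty alpha=0.1, and the helper functions one_hot, soft_threshold and lasso. On a dataset with millions of rows you would use a library implementation (for example scikit-learn's Lasso) rather than this loop.

```python
def one_hot(rows, columns):
    """Expand each categorical column into 0/1 indicator columns."""
    names, matrix = [], [[] for _ in rows]
    for col in columns:
        for level in sorted({row[col] for row in rows}):
            names.append(f"{col}={level}")
            for out, row in zip(matrix, rows):
                out.append(1.0 if row[col] == level else 0.0)
    return names, matrix


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values inside the band become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, alpha, sweeps=200):
    """Minimise RSS / (2n) + alpha * sum(|coef|) on centred data."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - col_means[j] for row in X] for j in range(p)]
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residuals with all coefficients at 0
    coef = [0.0] * p
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(c * (r + coef[j] * c) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha) / z
            delta = new - coef[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
            coef[j] = new
    return coef


# Hypothetical toy data: color has a strong effect, shape a very weak one.
effect = {"red": 3.0, "green": 1.0, "blue": 0.0}
rows, y = [], []
for _ in range(10):
    for color in effect:
        for shape in ("circle", "square"):
            rows.append({"color": color, "shape": shape})
            y.append(effect[color] + (0.05 if shape == "circle" else -0.05))

names, X = one_hot(rows, ["color", "shape"])
coef = lasso(X, y, alpha=0.1)
selected = [name for name, c in zip(names, coef) if c != 0.0]
print(selected)  # indicator columns that survive the penalty
```

With this penalty the weak shape indicators are set to exactly zero, while at least one color indicator is kept. A larger alpha drops more columns and a smaller one keeps more, so the penalty weight itself is something to tune (e.g., by cross-validation).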

A sensitivity analysis could then be run to see how the number of variables included affects the model's residual sum of squares.
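
One simple way to sketch such a sensitivity analysis is to add categorical features one at a time and record the residual sum of squares after each step. The example below is only an illustration under stated assumptions: it uses a cell-means model (predict the mean target within each combination of the included features) instead of a penalized regression, and the toy data, the feature order, and the function name rss_with_features are all hypothetical.

```python
from collections import defaultdict


def rss_with_features(rows, y, features):
    """RSS when predicting the mean target within each combination of features."""
    groups = defaultdict(list)
    for row, target in zip(rows, y):
        groups[tuple(row[f] for f in features)].append(target)
    rss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss


# Hypothetical toy data: color matters a lot, shape a little, batch not at all.
effect = {"red": 3.0, "green": 1.0, "blue": 0.0}
rows, y = [], []
for rep in range(10):
    for color in effect:
        for shape in ("circle", "square"):
            rows.append({"color": color, "shape": shape,
                         "batch": "a" if rep % 2 == 0 else "b"})
            y.append(effect[color] + (0.05 if shape == "circle" else -0.05))

order = ["color", "shape", "batch"]  # assumed order in which features are added
curve = [rss_with_features(rows, y, order[:k]) for k in range(len(order) + 1)]
for k, rss in enumerate(curve):
    print(k, "features:", round(rss, 4))
```

The point at which the curve flattens suggests how many features are worth keeping. On real data the RSS should be computed on held-out rows, since in-sample RSS can never increase as features are added.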

ethanknights