I have a dataset with a large number of categorical variables. Currently it has 37 of them, so if I apply one-hot encoding (or any other encoding), the number of columns will explode and the overall column count will increase by 100.

Is there an efficient way to first select the best 5 or 10 features from among all the categorical variables?

1 Answer

There are many possible solutions to your problem. I will give one that is very basic and easy to implement, and another that is a bit better. The first is to fit a simple linear regression, y = ax + b, for each feature separately. After fitting, check the "a" value (the slope) for each feature to see how much that feature changes "y". I suggest this mainly as a way to learn the basics of feature selection (keep in mind that large negative "a" values are just as important as large positive ones). The other is Pearson correlation between each feature and the target. There are many ways to check correlation, but you can give these two a try.
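
Here is a minimal sketch of both ideas in plain Python. It assumes each categorical feature has already been turned into numbers (here, hypothetical 0/1 indicator columns) and that the target "y" is numeric. The feature names, the toy data, and the helper names `fit_line`, `pearson`, and `rank_features` are made up for illustration. Slopes are only comparable when features share the same scale, so the ranking below uses the absolute correlation instead.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares fit of y = a*x + b for one feature; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx else 0.0  # a constant feature carries no signal
    return a, mean_y - a * mean_x


def pearson(x, y):
    """Pearson correlation coefficient between one feature and the target."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def rank_features(features, y, top_k=5):
    """Score every feature, then keep the top_k by absolute correlation."""
    scores = {}
    for name, x in features.items():
        a, _ = fit_line(x, y)
        scores[name] = (a, pearson(x, y))  # (slope, correlation)
    # abs() so that strongly negative features rank as high as positive ones
    ranked = sorted(scores, key=lambda name: abs(scores[name][1]), reverse=True)
    return ranked[:top_k], scores


# Toy data (assumption): three already-encoded 0/1 features and a numeric target.
y = [1, 2, 3, 4, 5, 6]
features = {
    "f_strong": [0, 0, 0, 1, 1, 1],
    "f_weak": [0, 1, 0, 1, 0, 1],
    "f_negative": [1, 1, 1, 0, 0, 0],
}

top, scores = rank_features(features, y, top_k=2)
for name in top:
    slope, r = scores[name]
    print(f"{name}: a = {slope:.2f}, r = {r:.2f}")
```

With real data you would pass in your own encoded columns and set `top_k` to 5 or 10. Treat the resulting ranking as a rough first filter, not a final answer.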