
I am developing a model that predicts whether an employee keeps their job or leaves the company.

The features are as below

  • satisfaction_level
  • last_evaluation
  • number_projects
  • average_monthly_hours
  • time_spend_company
  • work_accident
  • promotion_last_5years
  • Department
  • salary
  • left (boolean)

During feature analysis, I tried two approaches, and each of them ranked the features differently, as shown in the image here

When I plot a correlation heatmap, satisfaction_level shows a negative correlation with left.
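For reference, the heatmap values come from `df.corr()`; below is a minimal sketch. The data here is a synthetic stand-in (the column names match my dataset, but the values are made up to mimic the pattern I see):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the HR dataset: low satisfaction -> more likely to leave
rng = np.random.default_rng(0)
n = 1000
satisfaction = rng.uniform(0, 1, n)
left = (satisfaction + rng.normal(0, 0.3, n) < 0.45).astype(int)
df = pd.DataFrame({"satisfaction_level": satisfaction, "left": left})

# Pairwise Pearson correlation of all numeric columns
corr = df.corr()
print(corr.loc["satisfaction_level", "left"])  # negative
```

The heatmap itself is just a visualization of this matrix, e.g. `seaborn.heatmap(corr, annot=True)`.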

On the other hand, if I just use pandas for the analysis, I get results like this

In the above image, it can be seen that satisfaction_level is quite important, since employees with a higher satisfaction_level keep their jobs.

In the case of time_spend_company, however, the heatmap suggests it is important, while the difference in the second image is not very pronounced.
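The second analysis is essentially a group comparison: the mean of each feature for employees who stayed versus those who left. A sketch with synthetic stand-in data (column names match my dataset, values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: employees with low satisfaction tend to leave
rng = np.random.default_rng(1)
n = 1000
satisfaction = rng.uniform(0, 1, n)
left = (satisfaction + rng.normal(0, 0.3, n) < 0.45).astype(int)
df = pd.DataFrame({"satisfaction_level": satisfaction, "left": left})

# Mean satisfaction per group: left=0 (stayed) vs left=1 (left)
means = df.groupby("left")["satisfaction_level"].mean()
print(means)  # stayers have a higher mean satisfaction
```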

Now I am confused about whether to include this as one of my features, and which approach I should use for feature selection.

Could someone please help me with this?

BTW I am doing ML in scikit-learn and the data is taken from here.

Ishaan

1 Answer


Correlation between features has little to do with feature importance. Your heatmap is correctly showing correlation. In fact, in most cases, when you talk about feature importance you must provide the context of the model you are using: different models may rank different features as important. Moreover, many models assume the data is IID (independent and identically distributed), so correlation between features close to zero is actually desirable.

For example, for a linear or logistic regression in scikit-learn, you can examine the fitted model's coef_ attribute to get an estimate of feature importance.
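A minimal sketch of that, using synthetic data and a logistic regression (a classifier fits the stay/leave target better than plain linear regression; scale the features first so the coefficient magnitudes are comparable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, but only the first one drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (-2.0 * X[:, 0] + rng.normal(0, 0.5, 500) > 0).astype(int)

# Standardize so coefficient magnitudes are on a common scale
model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
print(model.coef_)  # the first coefficient dominates in magnitude
```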

Farseer
  • So what is the best way to always get the best result between the above two? – Ishaan Jul 23 '19 at 06:47
  • If you are using pandas, go with `corr = df.corr()` and then print/plot your correlation as desired. Look at the documentation of the method here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html – Farseer Jul 23 '19 at 06:58
  • @StukedCoder Yes. However, you should use cross-validation to check whether the feature selection is useful. Note that techniques like `Recursive feature elimination` require the model to provide a `coef_` or `feature_importances_` attribute. – Farseer Jul 23 '19 at 07:21
  • So ultimately one should go with all the methods and draw a conclusion from them to select the best features; please correct me if I'm wrong. – Ishaan Jul 23 '19 at 07:27
  • I think that trying all methods is not feasible. One should cross-validate the most promising methods and choose the one that performs best on the cross-validation data set (not on the test set). – Farseer Jul 23 '19 at 07:45
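The workflow described in the comments (recursive feature elimination, then cross-validation to check whether the selection helps) can be sketched as follows; the dataset here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data: 8 features, only 3 of them informative
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

# RFE repeatedly drops the feature with the smallest |coef_|
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=3).fit(X, y)
print(selector.support_)  # boolean mask of the kept features

# Compare cross-validated accuracy with and without the selection
full_score = cross_val_score(estimator, X, y, cv=5).mean()
reduced_score = cross_val_score(estimator, X[:, selector.support_], y, cv=5).mean()
print(full_score, reduced_score)
```

Whichever option scores better on the cross-validation folds is the one to keep; the test set stays untouched until the very end.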