
I'm supposed to perform feature selection on my dataset (independent variables: some aspects of a patient; target variable: patient ill or not) using a decision tree. After that, with the selected features, I have to implement a different ML model.

My doubt is: when implementing the decision tree, is it necessary to have a train and a test set, or should I just fit the model on the whole data?

lorenzlorg

1 Answer


It's necessary to split the dataset into train and test sets, because otherwise you would measure performance on the same data used for training and could end up over-fitting.

Over-fitting is when the training error keeps decreasing while the generalization error increases, where generalization error means the model's ability to correctly classify new (never-before-seen) samples.
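A minimal sketch of that workflow with scikit-learn. The dataset here is synthetic (`make_classification` standing in for the patient data), and the split ratio and random seeds are arbitrary choices, not anything prescribed by the question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient dataset (hypothetical)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out a test set so performance is measured on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# An unconstrained tree typically fits the training set almost perfectly;
# the gap between the two scores is the over-fitting the answer describes
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```

Comparing the two scores makes the point concrete: a large train/test gap is exactly the over-fitting you cannot detect if you evaluate on the training data.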

Frego
  • ok thanks, you're right! So in the end, after the split and fit, I have to use `feature_importances_` to understand which variables are important (these features will be the features of the main model), right? – lorenzlorg Jun 16 '22 at 08:23
    yes, by using `feature_importances_` (from sklearn, I suppose) you will obtain the importance of each feature; from there you can select the most important features and train a new model on only that subset. Hope that helps – Frego Jun 16 '22 at 08:36
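The selection step described in the comments could look like this. Everything below is an illustrative sketch: the synthetic data, the choice of keeping the top 4 features, and `LogisticRegression` as the "main model" are all assumptions, not part of the original thread:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient dataset (hypothetical)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the selector tree on the training split only
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Keep the k features the tree found most important (k=4 is arbitrary here)
k = 4
top = np.argsort(tree.feature_importances_)[::-1][:k]

# Train the main model on the selected subset only
clf = LogisticRegression(max_iter=1000).fit(X_train[:, top], y_train)
print("test accuracy with selected features:", clf.score(X_test[:, top], y_test))
```

Note that the tree is fitted on the training split only, so the held-out test set stays unseen by both the selector and the final model.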