Remove features from dataset

Question

I'm conducting an experiment on blood test results data, trying to predict the probability that a patient has a certain disease. Using the blood test results I have reached over 2000 features, and I'm trying to find a good way to eliminate the features that don't help. Is there a more general way to find the unnecessary features? I'm using XGBoost and HistGradientBoosting models for the prediction.

I've tried using feature importance, but as I increase the number of patients in the dataset the important features change. I've heard about a package called SHAP, but my computer has no internet access and getting the package will take time.

I would look for features that highly correlate with each other, and only keep one of each. — Franciska, Feb 20 '23 at 12:56

score 0 · Accepted Answer · answered Feb 20 '23 at 13:18

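Use a correlation analysis to find features that are highly correlated with each other and keep only one from each group. Alternatively, PCA can compress the features into a smaller set of components that capture most of the variance. Keep in mind that PCA components are combinations of the original features rather than a selected subset, and that PCA does not look at the target.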
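As a rough illustration of the correlation idea, here is a minimal sketch that greedily keeps a feature only if it is not too strongly correlated with one already kept. The function name `drop_correlated_features`, the 0.95 threshold and the toy data are my own assumptions, not something from the question, and the sketch assumes a numeric matrix with no missing values (blood test data often needs imputation first).

```python
import numpy as np

def drop_correlated_features(X, threshold=0.95):
    """Return the column indices to keep after greedy correlation pruning.

    X is a 2-D array (rows = patients, columns = features) without NaNs.
    Constant columns are dropped; any other column is dropped when its
    absolute Pearson correlation with an already-kept column exceeds
    `threshold` (an assumed cut-off, tune it for your data).
    """
    X = np.asarray(X, dtype=float)
    # Constant columns carry no signal and have undefined correlation.
    candidates = np.flatnonzero(X.std(axis=0) > 0)
    if candidates.size == 0:
        return []
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, candidates], rowvar=False)))
    kept_pos = []
    for pos in range(candidates.size):
        # Keep the column if it is not too correlated with any kept column.
        if not kept_pos or corr[pos, kept_pos].max() <= threshold:
            kept_pos.append(pos)
    return [int(candidates[p]) for p in kept_pos]

# Toy data: column 1 is almost a copy of column 0, column 3 is constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_demo = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b, np.ones(200)])
kept = drop_correlated_features(X_demo, threshold=0.95)
print(kept)
```

Which member of a correlated group survives depends on column order here, so if some tests are cheaper or more reliable to measure, put those columns first.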

Regarding the issue with feature importance changing as you increase the number of patients in the dataset, this is a common problem with some feature importance methods. SHAP is one way to address this issue, as it tends to give a more consistent estimate of feature importance by averaging each feature's contribution over all possible feature combinations.
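If SHAP can't be installed right away, one alternative that needs only NumPy is permutation importance measured on patients the model was not trained on: shuffle one column at a time and see how much the score drops. The sketch below is an assumption-laden illustration, not a definitive implementation. `holdout_permutation_importance`, the toy `predict` function and the negative Brier score are hypothetical stand-ins; in practice `predict` would wrap your fitted model so that it returns the positive-class probability.

```python
import numpy as np

def holdout_permutation_importance(predict, X_val, y_val, score, n_repeats=5, seed=0):
    """Mean drop in `score` when each column of the held-out set is shuffled.

    predict: callable mapping a 2-D array to predictions.
    score:   callable score(y_true, y_pred), higher is better.
    """
    rng = np.random.default_rng(seed)
    X_val = np.asarray(X_val, dtype=float)
    baseline = score(y_val, predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Break the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - score(y_val, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Toy stand-in for a fitted model: it only ever looks at column 0.
rng = np.random.default_rng(1)
X_val = rng.normal(size=(300, 3))
y_val = (X_val[:, 0] > 0).astype(float)
predict = lambda X: 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))
neg_brier = lambda y, p: -np.mean((y - p) ** 2)

imp = holdout_permutation_importance(predict, X_val, y_val, neg_brier)
print(imp)
```

Features whose importance stays near zero across repeats and across different validation splits are candidates for removal. scikit-learn ships a ready-made version as `sklearn.inspection.permutation_importance`, and XGBoost can return SHAP-style contributions itself through `pred_contribs=True` on `Booster.predict`, without the separate shap package.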

Hope that helps

SHAP importance is an in-sample measure. Depending on the scientific context, this might be what you want - or not at all. — Michael M, Feb 20 '23 at 19:53
