Random Forest and SHAP values with few features for feature selection

Question

I have several datasets, each with 4 features and between 100 and 300 observations, and I would like to use them for classification. The target variable has 3 possible labels. I have trained a Random Forest and, because interpreting the result and selecting features matter more to me than the predictive performance itself, I have also calculated SHAP values.

I felt comfortable using them, but I fear that the model is too simple for such an advanced XAI method. Since I am still a beginner in ML, I would like to ask your opinion: would you suggest a different model, or a different approach to explaining the model and selecting the most important features? Thanks a lot in advance.

EDIT: Some more details about my problem. I applied a cluster analysis and identified three clusters in the data. The data set has other features as well, but I ran the cluster analysis on only two numerical features. It is important that only these two are used, because they lead to a result that the users of this analysis can easily understand. Now I want to figure out why these three classes exist. I have therefore fitted a random forest with the cluster label as the dependent variable and the remaining features as the independent variables. By looking at the predictive ability of the random forest and at the SHAP values, I hope to explain which variables are important in predicting the class, and thus why the three classes exist. Does this approach seem reasonable?

Thank you Sergey for your answer to my question. One thing is not clear to me: what is the difference between explaining the model and explaining the data? I have given you some more details about my problem by editing my question. I would appreciate your opinion. Thanks a lot in advance — Pachita, May 07 '23 at 12:20
Which model you use to explain the outcome, linear regression (for a few data points) or a neural network (for many data points), is up to you. SHAP will explain any of them. Note, however, that your model will only be as good as your data (unless you make an effort). — Sergey Bushmanov, May 07 '23 at 17:03
I suggest you google "true to model or true to data" or study the link https://arxiv.org/abs/2006.16234 — Sergey Bushmanov, May 07 '23 at 17:04


0 Answers