I have a dataset of force-displacement curves. It is heavily imbalanced: the negative class has 29,000 samples and the positive class only 100. After extracting features from the curves with tsfresh, I tried several approaches: undersampling, oversampling, adjusting class weights (e.g., scale_pos_weight for xgboost), and adjusting the decision threshold applied to predict_proba. None of them improved precision. Undersampling gave a relatively good recall, but precision stayed at essentially 0 throughout. I have attached some images to this post.
The plot shows how recall and precision change across decision thresholds.
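For reference, a minimal sketch of how precision and recall at a given threshold can be computed. The function name `precision_recall_at` and the toy data are illustrative assumptions, not part of the actual pipeline; in practice the scores would be the positive-class column of predict_proba.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Precision is undefined when nothing is predicted positive;
    # report 0.0 by convention in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 2 positives among 8 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.8, 0.5]

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision only if the positives actually receive higher scores than most negatives; if the scores do not separate the classes, precision stays low at every threshold.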
I have also created plots for other attempts. Unfortunately, since I am not getting meaningful results from the plots, I cannot determine which method (e.g., which undersampling method or feature selection method) is best suited for my dataset.
Note that I resample only the training data; the test data is left untouched.
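To make that setup concrete, here is a minimal sketch of splitting first and then undersampling only the training part. The helper names, the 20% test fraction, and the 5:1 negative-to-positive ratio are assumptions chosen for illustration; only the class counts come from the description above.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Split indices per class so both parts keep the original class ratio."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def undersample_majority(indices, labels, ratio, rng):
    """Keep all positives and at most `ratio` random negatives per positive."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    n_keep = min(len(neg), ratio * len(pos))
    return pos + rng.sample(neg, n_keep)


rng = random.Random(0)
labels = [0] * 29000 + [1] * 100  # class counts from the question

# Split first, then resample the training indices only.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_idx = undersample_majority(train_idx, labels, ratio=5, rng=rng)

n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_pos)
```

With these settings the test part keeps its original imbalance (20 positives among 5,820 samples), so precision measured on it reflects the real class distribution rather than the resampled one.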
I have tried:
- oversampling methods (SMOTE and its variants)
- undersampling methods (NearMiss, RandomUnderSampler, etc.)
- feature selection (mRMR methods)
- adjusting class weights (e.g., scale_pos_weight for xgboost)
- adjusting the threshold of predict_proba
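For scale, a small worked calculation based on the class counts above. It shows the commonly used heuristic for XGBoost's class-weight parameter (`scale_pos_weight`, negatives divided by positives) and the precision a classifier would reach by flagging samples at random, which equals the share of positives. The variable names are illustrative, and the chance-level figure assumes the evaluation data has the same class ratio as the full dataset.

```python
n_neg, n_pos = 29_000, 100  # class counts from the question

# Common heuristic for XGBoost's scale_pos_weight: negatives / positives.
scale_pos_weight = n_neg / n_pos

# Expected precision of a classifier that flags samples at random:
# it equals the share of positives in the data.
baseline_precision = n_pos / (n_neg + n_pos)

print(f"scale_pos_weight ~ {scale_pos_weight:.0f}")
print(f"chance-level precision ~ {baseline_precision:.4f}")
```

Any precision well above this chance level already indicates that the model has learned something, even if the absolute value looks small.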