
I have a dataset of force-displacement curves. The dataset is heavily imbalanced: the negative class has 29,000 samples and the positive class only 100. After transforming the force-displacement curves with tsfresh, I tried several approaches, such as undersampling, oversampling, adjusting class weights (e.g., scale_pos_weight for XGBoost), and adjusting the threshold applied to predict_proba. However, none of these approaches improved precision. Although I achieved relatively good recall after undersampling, precision remained consistently at 0. I have attached some images to this post.
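For context, this is roughly the kind of setup I mean (a simplified sketch; the file paths, column names, and split are placeholders, not my actual code):

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# force-displacement curves in long format: one row per measured point
# (placeholder column names: curve_id, displacement, force)
curves = pd.read_csv("curves_long.csv")
labels = pd.read_csv("labels.csv", index_col="curve_id")["label"]

# tsfresh turns each curve into one fixed-length feature vector
X = extract_features(curves, column_id="curve_id",
                     column_sort="displacement", column_value="force")
X = impute(X)                     # replace NaN/inf produced by some extractors
X = X.loc[labels.index]

X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

# scale_pos_weight ~ n_negative / n_positive to counter the 29,000:100 imbalance
clf = XGBClassifier(
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())
clf.fit(X_train, y_train)
```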

This plot shows how recall and precision change at different thresholds.

This plot shows the precision on the training and validation sets for different values of the sampling_strategy parameter of the NearMiss undersampler (x-axis: sampling_strategy values; y-axis: precision).
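That kind of sweep can be reproduced roughly like this (a sketch on dummy data, not my actual code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from imblearn.under_sampling import NearMiss
from xgboost import XGBClassifier

# dummy imbalanced data standing in for the tsfresh feature matrix
X, y = make_classification(n_samples=29100, weights=[29000 / 29100], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

train_prec, val_prec = [], []
ratios = np.linspace(0.05, 1.0, 20)        # candidate sampling_strategy values
for ratio in ratios:
    # resample only the training split
    X_res, y_res = NearMiss(sampling_strategy=ratio).fit_resample(X_train, y_train)
    clf = XGBClassifier().fit(X_res, y_res)
    train_prec.append(precision_score(y_res, clf.predict(X_res), zero_division=0))
    val_prec.append(precision_score(y_val, clf.predict(X_val), zero_division=0))
```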

I have also created plots for my other attempts. Unfortunately, since the plots do not show meaningful results, I cannot determine which method (e.g., which undersampling or feature selection method) is best suited for my dataset.

Note that I am only sampling the training data.
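In practice that just means the resampler is fit on the training split only; with cross-validation, an imblearn Pipeline keeps it that way automatically (again a sketch on dummy data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

X, y = make_classification(n_samples=29100, weights=[29000 / 29100], random_state=0)

# an imblearn Pipeline resamples only the training fold of each CV split;
# the validation fold keeps the original class distribution
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", XGBClassifier())])
scores = cross_val_score(pipe, X, y, cv=5, scoring="precision")
```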

I have tried:

  • oversampling methods (SMOTE and its variants)
  • undersampling methods (NearMiss, RandomUnderSampler, etc.)
  • feature selection (mRMR methods)
  • adjusting class weights (e.g., scale_pos_weight for XGBoost)
  • adjusting the threshold applied to predict_proba (see the sketch after this list)
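For the last point, the threshold adjustment looks roughly like this, reusing clf, X_val, and y_val from the first sketch above (simplified; picking the threshold by best F1 is just one possible criterion):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# probabilities for the positive class on the untouched validation set
proba = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)

# precision/recall have one more entry than thresholds; drop the last point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = int(np.argmax(f1))
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precision[best]:.3f}  recall={recall[best]:.3f}")
```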