
I have 77,000 text samples, of which about 4,900 are positive and about 72,000 are negative (binary classification). Each sample is a sentence with a maximum length of 15 tokens. Not only are the data imbalanced, but the positive and negative samples are also very similar; the features of both classes are almost identical. The model I used was a bidirectional LSTM combined with a GRU and an attention layer (after preprocessing the data, of course). Despite balancing the data with SMOTE and the Tomek link method, precision and recall remain low. It seems clear that the similarity between the samples is the main problem.
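One commonly suggested alternative (or complement) to SMOTE/Tomek resampling for this kind of imbalance is cost-sensitive learning: weighting each class inversely to its frequency and passing those weights to the loss (e.g. via Keras's `class_weight` argument to `model.fit`). A minimal sketch, assuming the counts from the question; the helper function name is hypothetical:

```python
# Hypothetical sketch: inverse-frequency class weights as an alternative
# to resampling. Counts taken from the question (4,900 positive,
# ~72,100 negative out of 77,000 samples).
def class_weights(counts):
    """Return weight n_samples / (n_classes * n_class_samples) per class."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

weights = class_weights({"positive": 4900, "negative": 72100})
# The minority class gets a proportionally larger weight, so each
# misclassified positive sample contributes more to the loss.
```

This is the same heuristic `sklearn.utils.class_weight.compute_class_weight(class_weight="balanced", ...)` implements, so the resulting dict can be fed directly to a Keras or scikit-learn model without modifying the data.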

Is there any way to solve this problem?

Best regards,

soheila
  • This seems more like a question for [Mathematics](https://math.stackexchange.com/). If not, then please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with example code and some data that manages to show the issue you report. – Arc Feb 01 '22 at 19:59
  • Thanks for your attention. The samples are sentences describing drug-drug interactions, with structures like this: Drug_name relation Gene_name relation Drug_name. In this format, the relations between drugs and genes come from a limited set of words. – soheila Feb 05 '22 at 15:58

0 Answers