
I am trying to train a binary positive/negative classifier using an SVM in Encog. The data set is highly unbalanced, with negative examples outnumbering positive examples roughly 30:1.

When training the model, I deliberately undersample the negative cases to roughly balance the positive and negative examples given to the model, an approach that has worked well for me on other problems. Here, however, the resulting model has an unacceptably high false positive rate: when tested on an unbalanced test set, false positives outnumber true positives.

Any suggestions for how to train so as to reduce the false positive rate? Training with unbalanced data (or with a closer-to-observed balance) reduces the overall number of positive predictions, but it doesn't seem to improve the ratio of true positives to false positives.

Adam
  • This has been answered in another post. http://stackoverflow.com/questions/18078084/how-should-i-teach-machine-learning-algorithm-using-data-with-big-disproportion/18088148#18088148 – Yakku Jan 22 '15 at 12:25

1 Answer


It sounds like your data set is not separable; in that case, an unbalanced set can result in poor performance. In libsvm you can assign a higher weight to labels with little representation.
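
For illustration, here is a minimal sketch of per-class weighting using libsvm's Java API directly. The `nr_weight` / `weight_label` / `weight` fields of `svm_parameter` are standard libsvm; how you reach this object from Encog's SVM wrapper depends on your Encog version, so treat that part as an assumption:

```java
import libsvm.svm_parameter;

public class WeightedSvmParams {
    public static svm_parameter makeParams() {
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1.0;    // illustrative values; tune for your data
        param.gamma = 0.5;

        // Penalize errors on the rare positive class (label +1) more heavily.
        // Starting near the observed ~30:1 imbalance ratio is a common heuristic.
        param.nr_weight = 2;
        param.weight_label = new int[] { 1, -1 };
        param.weight = new double[] { 30.0, 1.0 };
        return param;
    }
}
```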

First, I would suggest keeping all negatives: the feature space of the negative class is probably much larger and is more likely to be covered if all samples are kept. Second, you have to decide what to optimize for, e.g. (TP+TN)/(TP+TN+FP+FN). Then run training and evaluation with different weight values for your positive labels to find the maximum performance according to your definition (see the sketch below). The final performance depends on the separability of your data.
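
A sketch of that weight sweep, assuming a hypothetical trainAndEvaluate() helper standing in for your Encog/libsvm training and confusion-matrix code:

```java
public class WeightSweep {
    // Hypothetical helper: trains with the given positive-class weight and
    // returns the confusion matrix {TP, TN, FP, FN} on a held-out set.
    // Replace the body with your actual training and evaluation code.
    static int[] trainAndEvaluate(double positiveWeight) {
        throw new UnsupportedOperationException("plug in your training code");
    }

    public static void main(String[] args) {
        double bestWeight = 1.0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double w : new double[] { 1, 2, 5, 10, 20, 30, 50 }) {
            int[] cm = trainAndEvaluate(w);
            // Accuracy: (TP + TN) / (TP + TN + FP + FN)
            double score = (double) (cm[0] + cm[1])
                    / (cm[0] + cm[1] + cm[2] + cm[3]);
            if (score > bestScore) {
                bestScore = score;
                bestWeight = w;
            }
        }
        System.out.println("best weight = " + bestWeight + ", score = " + bestScore);
    }
}
```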

stefan