
I am working with a data set where if an example is labeled positive, it is definitely positive. Unfortunately, the same cannot be said for the negative class: an example labeled negative could turn out to belong to the positive class. Also, the number of examples marked negative is far greater than the number of examples marked positive. I am trying to learn a classification model on this training set. I was wondering what techniques can be used in such cases, where the labels of one particular class might be noisy.

vkmv

2 Answers


Noisiness of the labels is not the problem; most classifiers assume that some data is mislabeled (like the SVM with its soft margin). What is interesting here is that the correctness of the labels differs between the two classes. This can be approached in a few ways:

  • Use a class-weighting scheme and attach a proportionally bigger weight to the positive class: due to its "correctness" you should be more concerned about classifying it correctly, while you can tolerate more misclassified elements of the negative class (this also addresses the disproportion in class sizes).
  • While fitting parameters, use a custom metric that weights the positives over the negatives (so you care more about TP and FP, while not really caring about TN and FN). The simplest case is the precision metric, which simply ignores TN and FN, but you could also use the F-beta measure, which balances precision and recall; in your case you should select a small beta (perhaps inversely proportional to the ratio of positive/negative correctness). In general, the beta parameter expresses how many times more you care about recall than precision.
  • Use novelty detection instead of binary classification, and focus on detecting the positive samples. There are many possible models for such a task, one of which is the one-class SVM.
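A minimal sketch of these three options in scikit-learn (the toy data, class-weight values, and beta value below are made up purely for illustration; tune them for your own data):

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Toy data: a few reliable positives (1) and many noisy negatives (0).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (500, 2))])
y = np.array([1] * 50 + [0] * 500)

# 1) Class weighting: attach a bigger weight to the trustworthy positive class.
clf = SVC(class_weight={1: 10, 0: 1}).fit(X, y)

# 2) Parameter fitting with a custom metric: F-beta with beta < 1
#    weights precision over recall.
scorer = make_scorer(fbeta_score, beta=0.5)
search = GridSearchCV(SVC(class_weight={1: 10, 0: 1}),
                      {"C": [0.1, 1, 10]}, scoring=scorer, cv=3).fit(X, y)

# 3) Novelty detection: fit a one-class SVM on the positives only.
ocsvm = OneClassSVM(nu=0.1).fit(X[y == 1])
pred = ocsvm.predict(X)  # +1 = looks like a positive, -1 = outlier
```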
lejlot

You can also try to fix the labels in your dataset: if the dataset really is too noisy, it can harm classifier performance (as evaluated on a hypothetical gold-standard test set with no noise).

You can use your classifier's output to help you label. If you are using scikit-learn, some models, such as SGDClassifier(loss='log'), can give you class-assignment probabilities via the predict_proba method. You can thus:

1. train a first model on the noisy development set
2. compute the class-assignment probabilities on this dataset
3. assuming the classifier did not completely overfit the noise (which is unlikely for a linear model if you have a lot of real negative examples), rank the violations by probability so that the most offending classification errors come first: they are the most likely badly labeled examples
4. manually inspect those violations in order and update the labels accordingly

Then iterate until you are satisfied with the quality of your data.

ogrisel