
I have set up a bagging classifier in PySpark, in which a binary classifier trains on the positive samples plus an equal number of randomly sampled unlabeled samples (labeled 1 for positive and 0 for unlabeled). The model then predicts on the out-of-bag samples, and this process repeats, after which I plan to take the average prediction per sample.
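For context, a minimal sketch of one such bagging round, assuming `positives` and `unlabeled` DataFrames with `id` and `features` columns and logistic regression as the base classifier (all of those names are placeholders for my actual setup):

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

# Draw an unlabeled bag of (approximately) the same size as the positive set.
n_pos = positives.count()
bag = unlabeled.sample(fraction=min(1.0, n_pos / unlabeled.count()), seed=42)

# Label positives 1 and the sampled unlabeled points 0, then train.
train = (positives.withColumn("label", F.lit(1.0))
         .unionByName(bag.withColumn("label", F.lit(0.0))))
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# Score only the out-of-bag unlabeled samples; `probability` is the
# per-class vector shown below.
oob = unlabeled.join(bag.select("id"), on="id", how="left_anti")
scored = model.transform(oob)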

My question concerns the model's output in PySpark: the prediction comes back as a probability column holding a vector of per-class probabilities. For binary classification the output looks like:

model.transform(test_data).show()
+-----+--------------------+
|label|         probability|
+-----+--------------------+
|    0|          [0.2, 0.8]|
|    1|        [0.65, 0.35]|
+-----+--------------------+

To perform positive-unlabeled learning with a binary classifier that outputs this, do I need to drop the probabilities predicted for the negative class and use only the predicted probability that each unlabeled sample is positive?
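For concreteness, extracting just the positive-class probability could look like the sketch below, assuming Spark 3.0+ for vector_to_array (index 1 of the vector is P(label = 1), since PySpark orders the probability vector by label index):

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

scored = model.transform(test_data)
# Element 1 of the probability vector is P(label = 1), the positive class.
scored = scored.withColumn(
    "pos_prob", vector_to_array(F.col("probability")).getItem(1)
)

# On Spark < 3.0, a UDF does the same job:
# from pyspark.sql.types import DoubleType
# get_pos = F.udf(lambda v: float(v[1]), DoubleType())
# scored = scored.withColumn("pos_prob", get_pos("probability"))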

LN3

1 Answer


Yes. The probability the model outputs for each unlabeled point is the probability, as the model has learned it, that the point is positive. Take that positive-class probability and average it across iterations.
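A minimal sketch of the averaging step, assuming each bagging round produced a DataFrame with a stable sample identifier id and the extracted positive-class score pos_prob (both column names are hypothetical), collected in a list per_round_scores:

from functools import reduce
from pyspark.sql import functions as F

# Stack the per-round out-of-bag scores and average per sample.
all_rounds = reduce(lambda a, b: a.unionByName(b), per_round_scores)
avg_scores = (all_rounds.groupBy("id")
              .agg(F.avg("pos_prob").alias("avg_pos_prob")))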

Baktaawar