I have set up a bagging classifier in PySpark in which a binary classifier is trained on the positive samples plus an equal number of randomly sampled unlabeled samples (labeled 1 for positive and 0 for unlabeled). Each model then scores the out-of-bag unlabeled samples, this process repeats across bags, and I plan to take the average prediction per sample.
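For context, here is a rough sketch of the bagging loop I described. The positives/unlabeled DataFrames, the sample_id key, and RandomForestClassifier as the base learner are all placeholders for my actual setup:

import pyspark.sql.functions as F
from pyspark.ml.classification import RandomForestClassifier

n_pos = positives.count()
bag_predictions = []

for i in range(n_bags):
    # Draw roughly n_pos unlabeled samples to pair with the positives.
    bag = unlabeled.sample(fraction=n_pos / unlabeled.count(), seed=i)
    train = positives.withColumn("label", F.lit(1.0)).unionByName(
        bag.withColumn("label", F.lit(0.0))
    )
    model_i = RandomForestClassifier(
        labelCol="label", featuresCol="features"
    ).fit(train)

    # Score only the out-of-bag unlabeled samples with this bag's model.
    oob = unlabeled.join(bag.select("sample_id"), on="sample_id", how="left_anti")
    bag_predictions.append(model_i.transform(oob))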
My question concerns the model output: in PySpark the prediction is a probability column containing a vector of per-class probabilities. For binary classification the output looks like:
model.transform(test_data).show()
+-----+--------------------+
|label|         probability|
+-----+--------------------+
|    0|      [0.294, 0.706]|
|    1|        [0.65, 0.35]|
+-----+--------------------+
To perform positive-unlabeled learning from a binary classifier with this output, do I need to drop the probability predicted for the negative class and use only the probability that each unlabeled sample is positive (the second element of the vector)?
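For concreteness, this is roughly what I have in mind for extracting only the positive-class probability and averaging it per sample, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array) and the same hypothetical sample_id key as above:

from pyspark.ml.functions import vector_to_array  # available in Spark 3.0+
import pyspark.sql.functions as F

# probability[1] is P(class = 1), i.e. the score that a sample is positive.
scored = model.transform(test_data).withColumn(
    "p_positive", vector_to_array("probability").getItem(1)
)

# Average the positive-class score per sample across all bags.
avg_scores = scored.groupBy("sample_id").agg(
    F.avg("p_positive").alias("pu_score")
)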