How to implement the predict_proba(X) -equivalent of Scikit-Learn in MLlib

Question

In Python I prefer .predict_proba(X) to .decision_function(X), since its results are easier for me to interpret. As far as I can see, the latter is already implemented in Spark (well, in version 0.9.2, for example, I have to compute the dot product on my own, otherwise I get 0 or 1), but the former is not implemented (yet!). What should I do, or how can I implement it in Spark as well? What are the required inputs, and what does the formula look like?

score 0 · Answer 1 · answered Apr 11 '15 at 20:02

In Spark/MLlib version 1.3, it seems that the predict function can return the probability once the threshold is cleared. From this page: https://spark.apache.org/docs/1.3.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification

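>>> # sc is assumed to be an existing SparkContext
>>> from pyspark.mllib.classification import LogisticRegressionWithSGD
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = [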
...     LabeledPoint(0.0, [0.0, 1.0]),
...     LabeledPoint(1.0, [1.0, 0.0]),
... ]
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data))
>>> lrm.predict([1.0, 0.0])
1
>>> lrm.predict([0.0, 1.0])
0
>>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
[1, 0]
>>> lrm.clearThreshold()
>>> lrm.predict([0.0, 1.0])
0.123...

The predict function call in the source says just that: https://spark.apache.org/docs/1.3.0/api/python/_modules/pyspark/mllib/classification.html#LogisticRegressionModel.predict

if self._threshold is None:
    return prob
else:
    return 1 if prob > self._threshold else 0
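As for the formula: for binary logistic regression, the probability is the logistic (sigmoid) function applied to the margin, i.e. the dot product of the weights and the features plus the intercept. If you are stuck on an older version without clearThreshold(), a minimal sketch of doing this by hand could look like the following. It is plain Python, not MLlib code; the function name predict_proba and the example weights and intercept are made up for illustration, and in practice the values would be read from your trained model.

```python
import math


def predict_proba(weights, intercept, x):
    """Return P(label = 1 | x) for a binary logistic regression model."""
    # Margin: the dot product of weights and features, plus the intercept.
    margin = sum(w * xi for w, xi in zip(weights, x)) + intercept
    # Logistic function, written so that exp() never gets a large positive argument.
    if margin > 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


# Hypothetical values standing in for a trained model's weights and intercept.
weights = [1.5, -2.0]
intercept = 0.25

p = predict_proba(weights, intercept, [1.0, 0.0])
# scikit-learn's predict_proba returns one column per class, i.e. [P(0), P(1)].
row = [1.0 - p, p]
```

The required inputs are therefore just the weight vector, the intercept, and the feature vector; thresholding p at 0.5 gives the same 0/1 answer as checking whether the margin is positive.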

I hope that helps.
