In Python, I prefer .predict_proba(X) over .decision_function(X) because its results are easier for me to interpret. As far as I can see, the latter is already available in Spark (although in version 0.9.2, for example, I have to compute the dot product myself, otherwise I get 0 or 1), but the former is not implemented (yet!). How can I implement it in Spark as well? What are the required inputs, and what does the formula look like?

zero323
user706838

1 Answer

In Spark/MLlib version 1.3, it seems that the predict function can return the probability once the threshold is cleared. From this page: https://spark.apache.org/docs/1.3.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification

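>>> from pyspark.mllib.classification import LogisticRegressionWithSGD
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = [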
...     LabeledPoint(0.0, [0.0, 1.0]),
...     LabeledPoint(1.0, [1.0, 0.0]),
... ]
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data))
>>> lrm.predict([1.0, 0.0])
1
>>> lrm.predict([0.0, 1.0])
0
>>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
[1, 0]
>>> lrm.clearThreshold()
>>> lrm.predict([0.0, 1.0])
0.123...

The predict function call in the source says just that: https://spark.apache.org/docs/1.3.0/api/python/_modules/pyspark/mllib/classification.html#LogisticRegressionModel.predict

if self._threshold is None:
    return prob
else:
    return 1 if prob > self._threshold else 0
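
For older versions without clearThreshold(), the probability can probably be computed by hand. As far as I know, the formula is the logistic (sigmoid) function applied to the margin, i.e. the dot product of the weights and the features plus the intercept. Below is a minimal sketch in plain Python; the helper names sigmoid and predict_proba are my own (hypothetical, not part of Spark), and I am assuming the trained model exposes its coefficients as weights and intercept:

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, intercept, features):
    """Probability of the positive class for one feature vector.

    Hypothetical helper, not a Spark API: weights and intercept are
    assumed to be taken from an already trained model.
    """
    # Margin = w . x + b, i.e. what decision_function would return.
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return sigmoid(margin)
```

Assuming lrm is the trained model from above, something like predict_proba(lrm.weights, lrm.intercept, [0.0, 1.0]) should then give the probability for a single point, and the same function could be mapped over an RDD of feature vectors.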

I hope that helps.

farmi