python-wise I am preferring .predict_proba(X) instead of .decision_function(X) since it is easier for me to interpret the results. as far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2 for example I have to compute the dot product on my own otherwise I get 0 or 1) but the former is not implemented (yet!). what should I do \ how to implement that one in Spark as well? what are the required inputs here and how does the formula look like?
Asked
Active
Viewed 1,310 times
1 Answers
0
In Spark/Mlib version 1.3 I seem that the predict function can return the probability, by clearing the threshold. From this page: https://spark.apache.org/docs/1.3.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification
>>> data = [
... LabeledPoint(0.0, [0.0, 1.0]),
... LabeledPoint(1.0, [1.0, 0.0]),
... ]
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data))
>>> lrm.predict([1.0, 0.0])
1
>>> lrm.predict([0.0, 1.0])
0
>>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
[1, 0]
>>> lrm.clearThreshold()
>>> lrm.predict([0.0, 1.0])
0.123...
The predict function call in the source says just that: https://spark.apache.org/docs/1.3.0/api/python/_modules/pyspark/mllib/classification.html#LogisticRegressionModel.predict
if self._threshold is None:
return prob
else:
return 1 if prob > self._threshold else 0
I hope that helps.

farmi
- 343
- 2
- 7