
I want to optimize the hyperparameters of a PySpark Pipeline using a ranking metric (MAP@k). I have seen in the documentation how to use the metrics defined in the Evaluation module (Scala), but I need to define a custom evaluator class because MAP@k is not implemented yet. So I need to do something like:

model = Pipeline(stages=[indexer, assembler, scaler, lg])
paramGrid_lg = ParamGridBuilder() \
    .addGrid(lg.regParam, [0.001, 0.1]) \
    .addGrid(lg.elasticNetParam, [0, 1]) \
    .build()

crossval_lg = CrossValidator(estimator=model,
                             estimatorParamMaps=paramGrid_lg,
                             evaluator=MAPkEvaluator(),
                             numFolds=2)

where MAPkEvaluator() is my custom evaluator. I've seen a similar question, but I couldn't find an answer there.

Is there any example or documentation available for this? Does anyone know if it is possible to implement it in PySpark? What methods should I implement?

– Amanda
    You should be able to accomplish this by extending from the `Evaluator` (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.evaluation) base class and providing your custom metric implementation. – jarandaf Jul 19 '18 at 08:06

3 Answers


@jarandaf answered the question in the first comment, but for clarity I'll show how to implement a basic example with a random metric:

import random
from pyspark.ml.evaluation import Evaluator

class RandomEvaluator(Evaluator):
    """Dummy evaluator: returns a random score so the tuning pipeline can be tested."""

    def __init__(self, predictionCol="prediction", labelCol="label"):
        super().__init__()  # initialize the underlying Params machinery (uid, param maps)
        self.predictionCol = predictionCol
        self.labelCol = labelCol

    def _evaluate(self, dataset):
        """
        Returns a random number.
        Implement the true metric here.
        """
        return random.randint(0, 1)

    def isLargerBetter(self):
        return True

Now the following code should work:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid_lg = ParamGridBuilder() \
    .addGrid(lg.regParam, [0.01, 0.1]) \
    .addGrid(lg.elasticNetParam, [0, 1]) \
    .build()

crossval_lg = CrossValidator(estimator=model,
                             estimatorParamMaps=paramGrid_lg,
                             evaluator=RandomEvaluator(),
                             numFolds=2)

cvModel = crossval_lg.fit(train_val_data_)
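
For the MAP@k metric from the original question, a rough, untested sketch along the same lines could look like the code below. It makes several assumptions: each row of the evaluated DataFrame is expected to hold, for one user, a ranked list of predicted items and the list of truly relevant items; the column names `recommended_items` and `relevant_items` are placeholders; and `RankingMetrics.meanAveragePrecisionAt` requires Spark 3.0+.

from pyspark.ml.evaluation import Evaluator
from pyspark.mllib.evaluation import RankingMetrics


class MAPkEvaluator(Evaluator):
    """Sketch of a MAP@k evaluator; column names and schema are assumptions."""

    def __init__(self, k=10, predictionCol="recommended_items", labelCol="relevant_items"):
        super().__init__()
        self.k = k
        self.predictionCol = predictionCol
        self.labelCol = labelCol

    def _evaluate(self, dataset):
        # one row per user: a ranked list of predicted items and the list of relevant items
        prediction_and_labels = (
            dataset.select(self.predictionCol, self.labelCol)
                   .rdd
                   .map(lambda row: (list(row[0]), list(row[1])))
        )
        metrics = RankingMetrics(prediction_and_labels)
        # meanAveragePrecisionAt(k) is available from Spark 3.0;
        # on older versions you could fall back to metrics.meanAveragePrecision
        return metrics.meanAveragePrecisionAt(self.k)

    def isLargerBetter(self):
        return True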
– Amanda

@Amanda answered the question very well, but let me point out something to watch out for as well. If you check the help of the `Evaluator` class by running:

help(Evaluator)

you'll see a method defined there:

isLargerBetter(self)
 |      Indicates whether the metric returned by :py:meth:`evaluate` should be maximized
 |      (True, default) or minimized (False).
 |      A given evaluator may support multiple metrics which may be maximized or minimized.
 |      
 |      .. versionadded:: 1.5.0

Now, if your metric needs to be minimized, you need to override this method as:

def isLargerBetter(self):
    return False

The default return value of this method is True.
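
As an illustration, here is a minimal sketch of an evaluator whose metric should be minimized: a simple mean absolute error, with illustrative column names.

import pyspark.sql.functions as F
from pyspark.ml.evaluation import Evaluator


class MAEEvaluator(Evaluator):
    """Illustrative evaluator for a metric that should be minimized."""

    def __init__(self, predictionCol="prediction", labelCol="label"):
        super().__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol

    def _evaluate(self, dataset):
        # mean absolute error between label and prediction
        return dataset.select(
            F.avg(F.abs(F.col(self.labelCol) - F.col(self.predictionCol)))
        ).first()[0]

    def isLargerBetter(self):
        # lower error is better, so CrossValidator should minimize this metric
        return False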

– igorkf
Thank you @igorkf for the clarification, I forgot to include this valuable information in the answer! – Amanda May 19 '20 at 07:23

Adding an actual example to @Amanda's clear answer, the following code can be used to create a custom Evaluator that computes the F1-score in a binary classification task. It may not be optimized (I actually don't know if there is a more efficient way to implement the metric), but it gets the job done.

import pyspark.sql.functions as F
from pyspark.ml.evaluation import Evaluator

class MyEvaluator(Evaluator):
    """F1-score evaluator for binary classification."""

    def __init__(self, predictionCol='prediction', labelCol='label'):
        super().__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol

    def _evaluate(self, dataset):
        # count true positives, false positives and false negatives
        tp = dataset.filter((F.col(self.labelCol) == 1) & (F.col(self.predictionCol) == 1)).count()
        fp = dataset.filter((F.col(self.labelCol) == 0) & (F.col(self.predictionCol) == 1)).count()
        fn = dataset.filter((F.col(self.labelCol) == 1) & (F.col(self.predictionCol) == 0)).count()
        denominator = 2 * tp + fp + fn
        # guard against division by zero when there are no positive labels or predictions
        f1 = (2 * tp) / denominator if denominator > 0 else 0.0
        return f1

    def isLargerBetter(self):
        return True
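
As a quick sanity check, the evaluator can be used directly on a small DataFrame of labels and predictions (the values below are purely illustrative), or passed to CrossValidator exactly like the evaluators in the other answers.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# tiny illustrative DataFrame with binary labels and predictions
predictions = spark.createDataFrame(
    [(1, 1.0), (1, 0.0), (0, 1.0), (0, 0.0), (1, 1.0)],
    ["label", "prediction"]
)

evaluator = MyEvaluator()
print(evaluator.evaluate(predictions))  # 2*tp / (2*tp + fp + fn) = 4/6 ≈ 0.67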
– Ric S