How do you specify the "positive class" in Spark ML (binary) classification? (Or perhaps: how does a MulticlassClassificationEvaluator determine which class is the "positive" one?)
Suppose we are training a model to optimize precision in a binary classification problem, like this:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (IndexToString, OneHotEncoder,
                                StringIndexer, VectorAssembler)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# df_spark, dff, cat_features and num_features are defined elsewhere
label_idxer = StringIndexer(inputCol="response",
                            outputCol="label").fit(df_spark)
# we fit so we can get the "labels" attribute to inform reconversion stage
feature_idxer = StringIndexer(inputCols=cat_features,
                              outputCols=[f"{f}_IDX" for f in cat_features],
                              handleInvalid="keep")
onehotencoder = OneHotEncoder(inputCols=feature_idxer.getOutputCols(),
                              outputCols=[f"{f}_OHE" for f in feature_idxer.getOutputCols()])
assembler = VectorAssembler(inputCols=(num_features + onehotencoder.getOutputCols()),
                            outputCol="features")
rf = RandomForestClassifier(labelCol=label_idxer.getOutputCol(),
                            featuresCol=assembler.getOutputCol(),
                            seed=123456789)
label_converter = IndexToString(inputCol=rf.getPredictionCol(),
                                outputCol="prediction_label",
                                labels=label_idxer.labels)
# an unfitted pyspark.ml.Pipeline (fitting it yields a PipelineModel)
pipeline = Pipeline(stages=[label_idxer, feature_idxer, onehotencoder,
                            assembler,
                            rf,
                            label_converter])
crossval = CrossValidator(estimator=pipeline,
                          # CrossValidator needs a param grid; an empty builder
                          # yields a single map that keeps the defaults
                          estimatorParamMaps=ParamGridBuilder().build(),
                          evaluator=MulticlassClassificationEvaluator(
                              labelCol=rf.getLabelCol(),
                              predictionCol=rf.getPredictionCol(),
                              metricName="weightedPrecision"),
                          numFolds=3)
(train_u, test_u) = dff.randomSplit([0.8, 0.2])
model = crossval.fit(train_u)
I know that...
Precision = TP / (TP + FP)
...but how do you specify a particular class label as the "positive class" for precision? (As it stands, I don't know which response value is actually treated as the positive class during training, nor how to find out.)
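To make the question concrete, here is a minimal plain-Python sketch (no Spark required) of the two quantities involved. It assumes the usual definition of `weightedPrecision`: the per-class precision averaged over all classes, weighted by each class's share of the true labels, so that no single class plays the role of "positive". The helper names `precision_by_label` and `weighted_precision` and the toy data are made up for illustration; they are not part of the PySpark API.

```python
from collections import Counter

def precision_by_label(labels, predictions, target):
    """Precision with `target` treated as the positive class: TP / (TP + FP)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if p == target and y == target)
    fp = sum(1 for y, p in pairs if p == target and y != target)
    # Return 0.0 when the class was never predicted, to avoid dividing by zero.
    return tp / (tp + fp) if (tp + fp) else 0.0

def weighted_precision(labels, predictions):
    """Per-class precision averaged with weights = class share of true labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(precision_by_label(labels, predictions, c) * n / total
               for c, n in counts.items())

# Toy indexed labels, shaped like StringIndexer output (hypothetical data).
labels      = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
predictions = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]

p0 = precision_by_label(labels, predictions, 0.0)  # class 0.0 as "positive"
p1 = precision_by_label(labels, predictions, 1.0)  # class 1.0 as "positive"
wp = weighted_precision(labels, predictions)       # no single positive class

print(p0, p1, wp)
```

In the pipeline above, `label_idxer.labels` lists the original `response` values in index order (by default the most frequent value gets index 0.0), so it shows which response each indexed label stands for. In Spark 3.0 and later, `MulticlassClassificationEvaluator` also accepts `metricName="precisionByLabel"` together with `metricLabel` (for example `metricLabel=1.0`), which appears to be the closest equivalent of choosing a positive class.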