org.apache.spark.SparkException: Unseen label with TrainValidationSplit

Question

I was searching for this error but I haven't found anything related to TrainValidationSplit. So I wanna do parameter tuning and doing so with TrainValidationSplit gives the following error: org.apache.spark.SparkException: Unseen label.

I understand why this happens and increasing the trainRatio mitigates the problem but does not completely solve it. For that matter, this is (part of) the code:

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]

assemblerInputs = [x+"Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6])
             .addGrid(dt.maxBins, [20,40])
             .build())

pipeline = Pipeline(stages=stages)

trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, trainRatio=0.95)

model = trainValidationSplit.fit(train_dataset)
train_dataset= model.transform(train_dataset)

I have seen this answer but I am not sure whether it also applies to my case and I am wondering if there's a more appropriate solution. Please, help?

Remember, you should split your data to train/test "before" doing feature normalization. Otherwise you will experience "Data Leakage". — mah65, Sep 26 '20 at 14:08

score 2 · Answer 1 · answered Apr 28 '17 at 03:20

The Unseen label exception is usually associated with StringIndexer.

You split the data into training (95%) and validation (5%) dataset. I think there are some category values (in the categoricalCol columns) that appear in the training data but do not appear in the validation set.

Therefore, during the string indexing stage in the validation process, the StringIndexer sees an unseen label and throws that exception. By increasing the training ratio, you increase the chance that category values in the training set are a superset of that in the validation set, but this is only a workaround since there is no guarantee.

One possible solution: fit the StringIndexer with train_dataset first, and add the resulting StringIndexerModel to the pipeline stages. This way the StringIndexer would see all the possible category values.

for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    strIndexModel = stringIndexer.fit(train_dataset)
    stages += [strIndexModel]

org.apache.spark.SparkException: Unseen label with TrainValidationSplit

1 Answers1