
I am using pyspark.ml.RandomForestClassifier and one of the steps involves running StringIndexer on the training data's target variable to convert it into labels:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol=target_variable_name, outputCol='label').fit(df)
df = indexer.transform(df)

After fitting the final model I am saving it using mlflow.spark.log_model(), roughly like this (the artifact path and registered model name are just placeholders):
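import mlflow
import mlflow.spark

# log and register the fitted Spark model (placeholder names)
with mlflow.start_run():
    mlflow.spark.log_model(rfModel, "RandomForest_model",
                           registered_model_name="RandomForest_model")

So, when applying the model to a new dataset in the future, I just load the model again and apply it to the new data: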

rfModel = mlflow.spark.load_model("models:/RandomForest_model/None")
predictions = rfModel.transform(new_data)

On new_data the predictions will come out as label indices, not the original values. So, to get the original values back I have to use IndexToString:

from pyspark.ml.feature import IndexToString

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=indexer.labels)
predictions = labelConverter.transform(predictions)

So, the question is: my model doesn't save indexer.labels, as only the model itself gets saved. How do I save and reuse the indexer.labels from my training dataset on any new dataset? Can this be saved and retrieved in mlflow?

Apologies if I am sounding naïve here, but getting back the original values in the new dataset is really confusing me.

Deb

2 Answers


Hope you got the answer already; in case you haven't, here's a solution. StringIndexerModel has save and load methods, so you can save the fitted string indexer model and reuse it later.

E.g.: stringIndexerModel.save("path")

Source: StringIndexerModel — PySpark 3.3.1 documentation (apache.org)
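A minimal sketch, assuming the indexer was fitted as in the question (the path and variable name here are only examples):

from pyspark.ml.feature import StringIndexerModel

# save the fitted indexer alongside the model artifacts
stringIndexerModel.save("target_indexer")

# load it back in the scoring job and reuse its labels with IndexToString
stringIndexerModel = StringIndexerModel.load("target_indexer")
print(stringIndexerModel.labels)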

I was searching for a quick answer but couldn't find one; on going through the documentation I found save as an option.

Kotana Sai

StringIndexerModel is a model fitted by StringIndexer.

What you can do is save it to disk:

from pyspark.ml.feature import StringIndexer, StringIndexerModel

# fit the indexer on the training data and persist it
indexer = StringIndexer(inputCol=target_variable_name, outputCol='label').fit(df)
indexer.save("string_indexer")
# later, load it back
indexer = StringIndexerModel.load("string_indexer")
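
With the reloaded indexer you can then map the numeric predictions back to the original values, reusing the IndexToString step from the question:

from pyspark.ml.feature import IndexToString

# decode the numeric predictions using the labels learned at training time
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=indexer.labels)
predictions = labelConverter.transform(predictions)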

Alternatively, instead of saving to disk, you can log the indexer to MLflow:

import mlflow

# log the fitted indexer to the current MLflow run
mlflow.spark.log_model(indexer, "string_indexer")
# later, load it back from the run that logged it
logged_model = 'runs:/your_run_id/string_indexer'
indexer = mlflow.spark.load_model(logged_model)

Hope this helps.