
I am using pyspark.ml.RandomForestClassifier and one of the steps involves running StringIndexer on the training data's target variable to convert it into labels:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol=target_variable_name, outputCol='label').fit(df)
df = indexer.transform(df)

After fitting the final model I am saving it using mlflow.spark.log_model(), roughly like this (the artifact path and registered model name are just placeholders):
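import mlflow
import mlflow.spark

# log and register the fitted Spark model (placeholder names)
with mlflow.start_run():
    mlflow.spark.log_model(rfModel, "RandomForest_model",
                           registered_model_name="RandomForest_model")

So, when applying the model to a new dataset in the future, I just load the model again and apply it to the new data: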

rfModel = mlflow.spark.load_model("models:/RandomForest_model/None")
predictions = rfModel.transform(new_data)

On new_data the predictions will come out as label indices, not the original values. So, to get the original values back I have to use IndexToString:

from pyspark.ml.feature import IndexToString

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=indexer.labels)
predictions = labelConverter.transform(predictions)

So, the question is: my model doesn't save indexer.labels, as only the model itself gets saved. How do I save and reuse the indexer.labels from my training dataset on any new dataset? Can this be saved and retrieved in mlflow?

Apologies if I am sounding naïve here, but getting back the original values in the new dataset is really confusing me.

Deb

2 Answers


Hope you got the answer already; in case you haven't, here's a solution. StringIndexerModel has save and load methods, so you can save the fitted string indexer model and reuse it later.

E.g.: stringIndexerModel.save("path")

Source: StringIndexerModel — PySpark 3.3.1 documentation (apache.org)
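A minimal sketch, assuming the indexer was fitted as in the question (the path and variable name here are only examples):

from pyspark.ml.feature import StringIndexerModel

# save the fitted indexer alongside the model artifacts
stringIndexerModel.save("target_indexer")

# load it back in the scoring job and reuse its labels with IndexToString
stringIndexerModel = StringIndexerModel.load("target_indexer")
print(stringIndexerModel.labels)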

I was searching for a quick answer but couldn't find one; on going through the documentation I found save as an option.

Kotana Sai

StringIndexerModel is a model fitted by StringIndexer.

What you can do is save it to disk:

from pyspark.ml.feature import StringIndexer, StringIndexerModel

# fit the indexer on the training data and persist it
indexer = StringIndexer(inputCol=target_variable_name, outputCol='label').fit(df)
indexer.save("string_indexer")
# later, load it back
indexer = StringIndexerModel.load("string_indexer")
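
With the reloaded indexer you can then map the numeric predictions back to the original values, reusing the IndexToString step from the question:

from pyspark.ml.feature import IndexToString

# decode the numeric predictions using the labels learned at training time
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=indexer.labels)
predictions = labelConverter.transform(predictions)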

Alternatively, instead of saving to disk, you can log the indexer to MLflow:

import mlflow

# log the fitted indexer to the current MLflow run
mlflow.spark.log_model(indexer, "string_indexer")
# later, load it back from the run that logged it
logged_model = 'runs:/your_run_id/string_indexer'
indexer = mlflow.spark.load_model(logged_model)

Hope this helps.