I am using pyspark.ml.RandomForestClassifier and one of the steps here involves StringIndexer on the training data target variable to convert it into labels.
indexer = StringIndexer(inputCol = target_variable_name, outputCol = 'label').fit(df)
df = indexer.transform(df)
After fitting the final model I am saving it using mlflow.spark.log_model(). So, when applying the model on a new dataset in future, I just load the model again and apply to the new data:
model = mlflow.sklearn.load_model("models:/RandomForest_model/None")
predictions = rfModel.transform(new_data)
In the new_data the prediction will come as labels and not in original value. So, if I have to get the original values I have to use IndexToString
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",labels=indexer.labels)
predictions = labelConverter.transform(predictions)
So, the question is, my model doesn't save the indexer.labels as only the model gets saved. How do, I save and use the indexer.labels from my training dataset on any new dataset. Can this be saved and retrived in mlflow ?
Apologies, if Iam sounding naïve here . But, getting back the original values in the new dataset is really getting me confused.