First I create two ML models and save them to two separate files. Note that both models are trained on the same dataframe; features_1 and features_2 are different sets of features extracted from the same dataset.
import sys
from pyspark.ml.classification import RandomForestClassifier
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)
model_1.save(sys.argv[1])
model_2.save(sys.argv[2])
Then, when I later want to use the models, I have to load both of them from their respective paths, providing the paths e.g. via sys.argv.
import sys
from pyspark.ml.classification import RandomForestClassificationModel
model_1 = RandomForestClassificationModel.load(sys.argv[1])
model_2 = RandomForestClassificationModel.load(sys.argv[2])
What I want is an elegant way to save these two models together, as one, under a single path. I want this mainly so that the user does not have to keep track of two separate path names every time they save and load. These two models are closely connected and will generally always be created and used together, so they are effectively one model.
Is this the kind of thing pipelines are intended for?