I have broadcast the training dataset to all partitions. Now I want to send 10 different hyperparameter/model configurations to 10 different partitions and train them independently. How can I share this model/hyperparameter information?
Is this the right approach?
broadcast_train = spark.sparkContext.broadcast(df_train.toPandas())  # broadcast a local (driver-side) copy of the data

models = [LogisticRegression(), Ridge(), Lasso()]  # ... 10 such models
model_list = spark.sparkContext.parallelize(list(enumerate(models)), num_models)  # key each model by its index
model_list = model_list.partitionBy(num_models, UniquePartitioner(num_models))  # partitionBy needs a pair RDD
results_rdd = model_list.mapPartitions(run_model_on_partition)

def run_model_on_partition(models_iter):
    # read the broadcast training data, fit each (index, model) pair from
    # this partition's iterator on it, and yield the results
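For reference, here is a fuller, runnable sketch of what I have in mind. It assumes scikit-learn estimators, a training set small enough to collect to the driver, and a hypothetical "label" column; the index-as-key trick replaces the custom UniquePartitioner:

from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression, Ridge, Lasso

spark = SparkSession.builder.appName("parallel-model-training").getOrCreate()

# 1. Collect the training set to the driver and broadcast the *local* copy.
#    Broadcasting the Spark DataFrame handle itself would not ship the rows.
pdf_train = df_train.toPandas()          # df_train: the Spark DataFrame above
broadcast_train = spark.sparkContext.broadcast(pdf_train)

models = [LogisticRegression(), Ridge(), Lasso()]  # ... up to 10 models
num_models = len(models)

# 2. Key each model by its index so partitionBy (defined only on pair RDDs)
#    can route exactly one model to each partition.
keyed = spark.sparkContext.parallelize(list(enumerate(models)), num_models)
model_rdd = keyed.partitionBy(num_models, lambda key: key)

# 3. mapPartitions hands each task an *iterator* of (index, model) pairs,
#    not a single model.
def run_models_on_partition(pairs):
    pdf = broadcast_train.value
    X = pdf.drop(columns=["label"])      # "label" is an assumed column name
    y = pdf["label"]
    for idx, model in pairs:
        model.fit(X, y)
        yield idx, model.score(X, y)     # e.g. return the training score

results = model_rdd.mapPartitions(run_models_on_partition).collect()

With one model per partition, each task trains its model independently against the broadcast data, and only the small per-model results are collected back to the driver.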