I have broadcast the training dataset to all partitions. Now I want to send 10 different hyperparameter/model configurations to 10 different partitions and train them independently. How can I share this model/hyperparameter information?
Is this the right approach?
broadcast_train = spark.sparkContext.broadcast(df_train.toPandas())  # broadcast a local (driver-side) copy of the data

models = [LogisticRegression(), Ridge(), Lasso()]  # ... 10 such models
model_list = spark.sparkContext.parallelize(list(enumerate(models)), num_models)  # key each model by its index
model_list = model_list.partitionBy(num_models, UniquePartitioner(num_models))  # partitionBy needs a pair RDD
results_rdd = model_list.mapPartitions(run_model_on_partition)

def run_model_on_partition(models_iter):
    # read the broadcast training data, fit each (index, model) pair from
    # this partition's iterator on it, and yield the results
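For reference, here is a fuller, runnable sketch of what I have in mind. It assumes scikit-learn estimators, a training set small enough to collect to the driver, and a hypothetical "label" column; the index-as-key trick replaces the custom UniquePartitioner:

from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression, Ridge, Lasso

spark = SparkSession.builder.appName("parallel-model-training").getOrCreate()

# 1. Collect the training set to the driver and broadcast the *local* copy.
#    Broadcasting the Spark DataFrame handle itself would not ship the rows.
pdf_train = df_train.toPandas()          # df_train: the Spark DataFrame above
broadcast_train = spark.sparkContext.broadcast(pdf_train)

models = [LogisticRegression(), Ridge(), Lasso()]  # ... up to 10 models
num_models = len(models)

# 2. Key each model by its index so partitionBy (defined only on pair RDDs)
#    can route exactly one model to each partition.
keyed = spark.sparkContext.parallelize(list(enumerate(models)), num_models)
model_rdd = keyed.partitionBy(num_models, lambda key: key)

# 3. mapPartitions hands each task an *iterator* of (index, model) pairs,
#    not a single model.
def run_models_on_partition(pairs):
    pdf = broadcast_train.value
    X = pdf.drop(columns=["label"])      # "label" is an assumed column name
    y = pdf["label"]
    for idx, model in pairs:
        model.fit(X, y)
        yield idx, model.score(X, y)     # e.g. return the training score

results = model_rdd.mapPartitions(run_models_on_partition).collect()

With one model per partition, each task trains its model independently against the broadcast data, and only the small per-model results are collected back to the driver.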