I have a question about the best way to implement the following.
I have an LGBM model on my driver. I need to run this model against a very large dataset that is distributed over the executors.
In order to run the model, I need to convert the dataset from a PySpark DataFrame to a pandas DataFrame. AFAIK, once converted, the pandas DataFrame lives entirely on the driver.
Right now I shard the dataset and run the model on chunks of it, but the problem is that everything runs sequentially, entirely on the driver. After each run of the model, I union the results back into a new distributed PySpark DataFrame (see the sketch below).
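This is roughly what my current approach looks like (simplified; `big_df`, `lgbm_model`, `feature_cols` and the shard count are placeholders for my actual setup):

```python
from functools import reduce

n_shards = 10  # arbitrary number of chunks

# Split the big PySpark DataFrame into shards (equal weights -> roughly equal splits)
shards = big_df.randomSplit([1.0] * n_shards, seed=42)

result_dfs = []
for shard in shards:
    pdf = shard.toPandas()  # each chunk is collected onto the driver
    pdf["prediction"] = lgbm_model.predict(pdf[feature_cols])  # scoring runs on the driver
    result_dfs.append(spark.createDataFrame(pdf))

# Union the per-chunk results back into one distributed DataFrame
predictions_df = reduce(lambda a, b: a.unionByName(b), result_dfs)
```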
Is there a way to ship the model to the executors so that each executor runs the model on the part of the dataset it holds? (Keeping in mind that each part needs to be converted to a pandas DataFrame first.)
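For context, I was wondering whether something like `mapInPandas` with a broadcast model might be what I'm looking for, but I'm not sure it is the right/idiomatic approach or whether the per-partition conversion to pandas behaves the way I hope. A rough sketch of what I imagine (names are placeholders from my setup above, and I have not verified this end-to-end):

```python
import pandas as pd
from typing import Iterator
from pyspark.sql.types import DoubleType, StructField, StructType

# Broadcast the driver-side model so every executor gets a copy
model_bc = spark.sparkContext.broadcast(lgbm_model)

def predict_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each executor receives its partitions as pandas DataFrames and scores them locally
    model = model_bc.value
    for pdf in batches:
        pdf["prediction"] = model.predict(pdf[feature_cols])
        yield pdf

# Output schema = input schema + the prediction column
output_schema = StructType(big_df.schema.fields + [StructField("prediction", DoubleType())])

predictions_df = big_df.mapInPandas(predict_partition, schema=output_schema)
```

Is something along these lines the recommended way, or is there a better pattern for this?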
If any further information is needed, please let me know.