I have a question about the best way to implement the following.
I have an LGBM model on my driver. I need to run this model against a very large dataset that is distributed over the executors.
In order to run the model, I need to convert the dataset from a PySpark DataFrame to a pandas DataFrame. AFAIK, once converted, the pandas DataFrame lives entirely on the driver.
Right now I shard the dataset and run the model on chunks of it, but the problem is that everything runs sequentially, entirely on the driver. After each run of the model, I union the results back into a new distributed PySpark DataFrame (see the sketch below).
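This is roughly what my current approach looks like (simplified; `big_df`, `lgbm_model`, `feature_cols` and the shard count are placeholders for my actual setup):

```python
from functools import reduce

n_shards = 10  # arbitrary number of chunks

# Split the big PySpark DataFrame into shards (equal weights -> roughly equal splits)
shards = big_df.randomSplit([1.0] * n_shards, seed=42)

result_dfs = []
for shard in shards:
    pdf = shard.toPandas()  # each chunk is collected onto the driver
    pdf["prediction"] = lgbm_model.predict(pdf[feature_cols])  # scoring runs on the driver
    result_dfs.append(spark.createDataFrame(pdf))

# Union the per-chunk results back into one distributed DataFrame
predictions_df = reduce(lambda a, b: a.unionByName(b), result_dfs)
```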
Is there a way to ship the model to the executors so that each executor runs the model on the part of the dataset it holds? (Keeping in mind that each part needs to be converted to a pandas DataFrame first.)
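For context, I was wondering whether something like `mapInPandas` with a broadcast model might be what I'm looking for, but I'm not sure it is the right/idiomatic approach or whether the per-partition conversion to pandas behaves the way I hope. A rough sketch of what I imagine (names are placeholders from my setup above, and I have not verified this end-to-end):

```python
import pandas as pd
from typing import Iterator
from pyspark.sql.types import DoubleType, StructField, StructType

# Broadcast the driver-side model so every executor gets a copy
model_bc = spark.sparkContext.broadcast(lgbm_model)

def predict_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each executor receives its partitions as pandas DataFrames and scores them locally
    model = model_bc.value
    for pdf in batches:
        pdf["prediction"] = model.predict(pdf[feature_cols])
        yield pdf

# Output schema = input schema + the prediction column
output_schema = StructType(big_df.schema.fields + [StructField("prediction", DoubleType())])

predictions_df = big_df.mapInPandas(predict_partition, schema=output_schema)
```

Is something along these lines the recommended way, or is there a better pattern for this?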
If any further information is needed, please let me know.