
Problem:

I’ve been working on distributing a cross-validation process with PySpark and the Spark ML library so that it takes less time than regular sequential computation (e.g. with scikit-learn). However, I’m running into issues. Concretely, when I start the job I continuously get the message “Broadcasting large task binary with size X” (with X ranging from 1700 KiB to 6 MiB). After leaving the job running for a while, it eventually dies with the messages “Job X cancelled because SparkContext was shut down” (for many job IDs X) and “ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. org.apache.spark.SparkException: Could not find CoarseGrainedScheduler”.

Reasoning:

Since I’ve had to modify the source code of the CrossValidator _fit method in “pyspark.ml.tuning#CrossValidator”, I’m familiar enough with how it operates to know that it distributes the work by parallelizing, for each fold of the dataset, the training of the models with the different parameter combinations. That is, _fit sends the whole dataset to the executors so that each executor trains a model with one specific parameter combination at a time, and it seems that Spark doesn’t like broadcasting the dataset that many times. Here is the relevant part of the pyspark.ml.tuning _fit method (the way I invoke it is sketched after the excerpt):

    for i in range(nFolds):
        validation = datasets[i][1].cache()
        train = datasets[i][0].cache()

        tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
        for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
            metrics[j] += (metric / nFolds)
            if collectSubModelsParam:
                subModels[i][j] = subModel

        validation.unpersist()
        train.unpersist()
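
For context, this is roughly how I invoke the cross-validation from my side. The estimator, evaluator and parameter grid below are placeholders, not my actual pipeline; the point is the numFolds and parallelism settings that drive the loop above.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # df is assumed to be a DataFrame with "features" and "label" columns
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    evaluator = BinaryClassificationEvaluator(labelCol="label")

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=evaluator,
                        numFolds=5,
                        parallelism=4)  # models trained concurrently per fold

    cv_model = cv.fit(df)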

What I've tried:

I have tried the most common solutions for the broadcast warning I'm getting, even though I already suspected they wouldn’t work in my case. Concretely, I’ve changed the number of partitions of the data and the parallelism parameter, as well as the memory allocated to both the executors and the driver, roughly as sketched below.
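
The configuration knobs I’ve been adjusting look roughly like this (the values are illustrative, not my exact settings):

    from pyspark.sql import SparkSession

    # Executor memory and default shuffle partitioning
    # (driver memory is passed with --driver-memory on spark-submit,
    #  since it cannot be changed once the driver JVM is already running)
    spark = (SparkSession.builder
             .appName("distributed-cv")
             .config("spark.executor.memory", "8g")
             .config("spark.sql.shuffle.partitions", "200")
             .getOrCreate())

    # Repartition the training data before handing it to CrossValidator
    df = df.repartition(200)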

I am quite sure that if a distributed implementation of CrossValidator exists in the ml library, it is because it is actually useful. However, I must be missing something, since I can’t see how to make it work when my dataset is big and, because of the implementation, it needs to be broadcast so many times. Maybe I’m missing something?

Jules
  • I am facing a similar issue. The solution described here[1] helps somewhat, but some tasks still fail after a while. I am trying to figure out whether increasing the number of data partitions helps. In my case, tasks fail because CrossValidator seems to hog memory even after tasks finish executing. Take a look at the thread. [1]: https://stackoverflow.com/a/65389100/8761414 – ottovon Dec 27 '21 at 05:58

0 Answers