I'm getting an error in a Spark job that's surprising me:
Total size of serialized results of 102 tasks (1029.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
My job looks like this:
def add(a, b): return a + b
sums = rdd.mapPartitions(func).reduce(add)
rdd has ~500 partitions, and func takes the rows in each partition and returns a large array (a numpy array of 1.3M doubles, about 10 MB). I'd like to sum all these per-partition arrays into a single array.
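For concreteness, func is shaped roughly like this (simplified; row_to_vector stands in for my real per-row logic):

import numpy as np

def func(rows):
    # build one dense vector of ~1.3M doubles (~10 MB) per partition
    total = np.zeros(1_300_000)
    for row in rows:
        total += row_to_vector(row)  # stand-in for the real per-row logic
    yield total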
Spark seems to be holding the full result of mapPartitions(func) in memory (about 5 GB) instead of processing it incrementally, which would require only about 30 MB.
Instead of increasing spark.driver.maxResultSize, is there a way to perform the reduce more incrementally?
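For example, would something like treeReduce do this? Just a sketch of the kind of thing I mean; I don't know whether it actually changes what the driver has to hold:

# combine the per-partition arrays in a few rounds on the executors first,
# so the driver only collects a handful of partially-reduced results
sums = rdd.mapPartitions(func).treeReduce(add, depth=3)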
Update: Actually, I'm kind of surprised that more than two results are ever held in memory.
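In other words, I'd expect the driver to only ever need a running total plus the one result it's currently merging, roughly like this (using toLocalIterator just to illustrate the shape of it; I don't know if that's the right tool):

total = None
# pull one ~10 MB partition result at a time and fold it into a running total
for part in rdd.mapPartitions(func).toLocalIterator():
    total = part if total is None else total + part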