8

I launched a Spark job with these settings (among others):

spark.driver.maxResultSize  11GB
spark.driver.memory         12GB
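
For reference, settings like these are normally supplied when the job is launched. A minimal sketch in PySpark (the app name is a placeholder, and spark.driver.memory generally has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf, rather than from an already-running session):

from pyspark.sql import SparkSession

# Hypothetical sketch mirroring the settings listed above.
# Note: spark.driver.memory only takes effect if it is applied before the
# driver JVM starts (spark-submit --driver-memory 12g or spark-defaults.conf);
# it is shown here only to mirror the configuration in the question.
spark = (
    SparkSession.builder
    .appName("debug-job")  # placeholder name
    .config("spark.driver.memory", "12g")
    .config("spark.driver.maxResultSize", "11g")
    .getOrCreate()
)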

I was debugging my PySpark job, and it kept giving me this error:

serialized results of 16 tasks (17.4 GB) is bigger than spark.driver.maxResultSize (11 GB)

So I increased spark.driver.maxResultSize to 18 GB in the configuration settings, and it worked!

Now, this is interesting because in both cases the spark.driver.memory was SMALLER than the serialized results returned.

Why is this allowed? I would have assumed it to be impossible, because the serialized results (17.4 GB) were larger than the driver memory (12 GB), as shown above.

How is this possible?

makansij

2 Answers

1

It is possible because spark.driver.memory configures the JVM driver process, not the Python interpreter. Data between the two is transferred over sockets, and the driver process does not have to keep all of the data in memory (it does not convert it to a local structure).
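
A rough sketch of where the memory goes with a collect() (hypothetical example; the numbers are placeholders, and the comments just restate the point above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect a (hypothetical) million-row DataFrame into the Python driver.
rows = spark.range(10 ** 6).collect()
# - each executor sends its serialized partition results to the driver JVM,
#   and their total size is checked against spark.driver.maxResultSize
# - the driver JVM forwards those serialized bytes to the Python process over
#   a local socket instead of materializing them all as Java objects on the
#   heap that spark.driver.memory sizes
# - the deserialized Row objects then live in the Python interpreter's memory,
#   which neither setting limits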

0

My understanding is that when we ask Spark to perform an action, the results from all the partitions are serialized, but these results do not need to be sent to the driver unless an operation such as collect() is performed.

spark.driver.maxResultSize defines a limit on the total size of the serialized results of all partitions and is independent of spark.driver.memory. Therefore, your spark.driver.memory can be smaller than your spark.driver.maxResultSize and your code will still work.
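
For example (hypothetical snippet; the output path is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10 ** 6)

# Actions whose results stay on the executors: only tiny summaries reach the driver.
df.count()
df.write.mode("overwrite").parquet("/tmp/out")  # placeholder path

# collect() ships the serialized results of every partition to the driver,
# and their total size is checked against spark.driver.maxResultSize.
rows = df.collect()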

We could probably get a better idea if you share the transformations and actions used in this process, or a code snippet.

KartikKannapur