Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. I persisted one of my RDDs with `StorageLevel.MEMORY_AND_DISK` because it was used twice. During the job I was getting OOM (Java heap space) errors. Then, when I removed the persist completely, the job managed to run through and finish.
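For context, the relevant part of the job looks roughly like this (a minimal sketch rather than the real code; the app name, input path, and transformations are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("memory-heavy-job")   // hypothetical app name
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for the real input; the actual data is much larger.
    val rdd = sc.textFile("hdfs:///data/input")   // hypothetical path
      .map(line => line.split(","))

    // Persisted because the RDD is reused by two downstream actions.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    val total = rdd.count()                        // first use
    val wide  = rdd.filter(_.length > 10).count()  // second use

    println(s"total=$total, wide=$wide")
    spark.stop()
  }
}
```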
I always thought that `MEMORY_AND_DISK` was basically a fully safe option: if you run out of memory, the objects are spilled to disk, done. But now it seems it did not really work the way I expected.
This raises two questions:

- If `MEMORY_AND_DISK` spills objects to disk when the executor runs out of memory, does it ever make sense to use `DISK_ONLY` (except for some very specific configurations like `spark.memory.storageFraction=0`)? See the sketch after this list for what I mean.
- If `MEMORY_AND_DISK` spills objects to disk when the executor runs out of memory, how could removing the caching fix the OOM problem? Did I miss something, and was the problem actually elsewhere?