16

Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. One of my RDDs was used twice, so I persisted it with StorageLevel.MEMORY_AND_DISK. During the job I kept getting OOM "Java heap space" errors. Then, when I removed the persist completely, the job managed to run through and finish. (A rough sketch of the job is below the questions.)

I always thought that MEMORY_AND_DISK was basically a fully safe option: if you run out of memory, the objects get spilled to disk, done. But now it seems that it does not really work the way I expected it to.

This raises two questions:

  1. If MEMORY_AND_DISK spills objects to disk when the executor runs out of memory, does it ever make sense to use DISK_ONLY mode (except for some very specific configurations like spark.memory.storageFraction=0)?
  2. If MEMORY_AND_DISK spills objects to disk when the executor runs out of memory, how could removing the caching fix the OOM problem? Did I miss something, and was the problem actually elsewhere?
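
For context, the job looked roughly like this (a minimal sketch; the input path, parsing and actions are made up, only the persist-and-reuse pattern matters):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object MemoryHeavyJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("memory-heavy-job").getOrCreate()

        // Hypothetical input and parsing; the real job is heavier.
        val records = spark.sparkContext
          .textFile("hdfs:///data/input")   // made-up path
          .map(_.split(","))
          .filter(_.length > 2)

        // The RDD is used twice, hence the persist.
        records.persist(StorageLevel.MEMORY_AND_DISK)

        val total = records.count()                        // first use
        val keys  = records.map(_(0)).distinct().count()   // second use

        println(s"total=$total, distinctKeys=$keys")
        spark.stop()
      }
    }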
Matek
  • 1. Do you get the same problem when you use DISK_ONLY with persisting? 2. It'd be hard to tell without seeing the code. – Islam Hassan Sep 27 '17 at 23:59
  • Are you using heavy data structures? spark.memory.fraction defines the fraction of heap used for execution and storage. This means that when you're caching, you have less heap for your data structures, so OOM errors can appear more easily. – Miguel Sep 28 '17 at 07:18
  • @IslamHassan Well, I must admit I did not try DISK_ONLY; I gave up and used no persisting at all. What would you expect to see in the code regarding the question? (Setting aside the possibility that I'm missing something, I'm asking about the spilling to disk.) – Matek Oct 01 '17 at 20:08
  • @Miguel The objects I keep in the RDD I'm processing might be a bit heavy, but since they live in the RDD I believe they don't count towards what you're describing. As for data structures outside Spark, there are basically none. – Matek Oct 01 '17 at 20:09
  • Actually, it depends on your job design. Could you please give an example of your problem? – Moustafa Mahmoud May 11 '19 at 12:43

2 Answers

6

So, after a few years ;) here's what I believe happened:

  • Caching is not a way to save execution memory. The best you can do is avoid taking execution memory away when you cache (DISK_ONLY).
  • It was most likely a lack of execution memory that made my job throw the OOM error, although I don't remember the actual use case.
  • I used MEMORY_AND_DISK caching, and the MEMORY part claimed its share of the unified region, which made it impossible for the job to finish (the execution memory left, i.e. Unified - Storage, was not enough to do the work).
  • Because of the above, when I removed the caching entirely the job ran slower, but it had enough execution memory to finish.
  • With DISK_ONLY caching the job would therefore presumably finish as well (although not necessarily faster); see the sketch below.

https://spark.apache.org/docs/latest/tuning.html#memory-management-overview
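
To make the last point concrete, here is a minimal sketch of the DISK_ONLY variant (the RDD and path are hypothetical; the two memory settings are only spelled out to show the unified region this answer talks about, and the values shown are the defaults):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("disk-only-cache")
      // Share of the usable heap given to the unified (execution + storage) region.
      .config("spark.memory.fraction", "0.6")
      // Part of the unified region protected from eviction for storage.
      .config("spark.memory.storageFraction", "0.5")
      .getOrCreate()

    // Hypothetical reused RDD, same shape as in the question.
    val records = spark.sparkContext
      .textFile("hdfs:///data/input")
      .map(_.split(","))

    // DISK_ONLY keeps the cached blocks out of the unified region entirely,
    // so execution gets the whole region to itself.
    records.persist(StorageLevel.DISK_ONLY)

    val total = records.count()              // first use materialises the cache
    val width = records.map(_.length).sum()  // second use reads from disk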

Matek
1

MEMORY_AND_DISK doesn't "spill the objects to disk when the executor runs out of memory". It tells Spark to write the partitions that don't fit in memory to disk, so they can be loaded from there when needed.

When dealing with huge datasets, you should definitely consider persisting the data with DISK_ONLY. https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
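
One way to observe this behaviour (a sketch for a spark-shell session; the data RDD is made up just to have something to cache, and getRDDStorageInfo is a developer API but handy for a quick check):

    import org.apache.spark.storage.StorageLevel

    // Made-up RDD, ideally large enough that not every partition fits in memory.
    val data = spark.sparkContext
      .parallelize(1 to 10000000)
      .map(i => (i, i.toString * 10))

    data.persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  // materialise the cache

    // For each cached RDD, show how many bytes ended up in memory vs on disk.
    // Partitions that didn't fit in the storage pool were written to disk at
    // caching time, not "spilled" later when the executor was about to OOM.
    spark.sparkContext.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: memory=${info.memSize} B, disk=${info.diskSize} B, " +
        s"cached=${info.numCachedPartitions}/${info.numPartitions} partitions")
    }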

baitmbarek
  • Except for being more precise about what MEMORY_AND_DISK does, this doesn't really answer the questions, although you're right that DISK_ONLY is the only way of caching that doesn't waste memory. I'll elaborate on that in a separate answer. – Matek Nov 19 '19 at 09:31