
As far as I know, calling .persist() only sets the persistence level, and the next action in the script is what triggers the actual persistence work.
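For concreteness, here is a minimal sketch of the behavior I expect (PySpark; the dataframe and sizes are just illustrative):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    df = spark.range(1_000_000)                # lazy: no job runs yet
    df = df.persist(StorageLevel.MEMORY_ONLY)  # only sets the storage level, still lazy
    df.count()                                 # first action: data is computed and cached here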

However, sometimes, seemingly depending on the dataframe, the persist() call itself leads to a Java heap space (out of memory) error.

What is the intended behavior of persist, and why could this simple line actually lead to this memory error?

Kristian

1 Answer


The whole point of RDD persistence is to store intermediate results in memory so they can be accessed faster on subsequent use. There are several persistence levels, ranging from MEMORY_ONLY (the default) through MEMORY_AND_DISK to DISK_ONLY. Persisting purely to memory means there has to be enough heap space for the persist to succeed. If you run out of heap memory, you can

  • go for a lower persistence level,
  • reduce the size of your partitions,
  • reduce the total number of persisted RDDs, e.g., by unpersisting those you no longer need.

Finding the right balance here is one of the key challenges in Spark, since it determines the tradeoff between memory usage and CPU cost (recomputation).
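A short sketch of what these adjustments look like in PySpark (the dataframe, partition count, and sizes are purely illustrative):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("persist-tuning").getOrCreate()
    df = spark.range(10_000_000)

    # Smaller partitions: more, smaller cached blocks are easier to fit in memory.
    df = df.repartition(200)

    # Lower persistence level: blocks that do not fit in memory spill to disk
    # instead of causing heap pressure.
    df = df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()      # action that actually materializes the cache

    # Unpersist as soon as the cached data is no longer needed,
    # to reduce the total amount of persisted data.
    df.unpersist()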

bluenote10
  • While I agree that this is likely what is happening, I'm also asking why writing the line `persist` invokes any tasks at all. I was under the impression that it was lazily evaluated, yet it seems to immediately invoke work in some cases. – Kristian Sep 02 '16 at 06:31
  • 1
    @Kristian: That sounds indeed unusual. Do you have a minimal example that would show the problem? Calling `persist()` mainly registers the RDD in the Spark context for persistence, which should not require a significant amount of memory. – bluenote10 Sep 02 '16 at 17:08