
This question is similar to the one asked here, but the answer there does not help me clearly understand what user memory in Spark actually is.

Can you help me understand with an example? For instance, an example for execution and storage memory would be: in c = a.join(b, a.id == b.id); c.persist(), the join operation (shuffle etc.) uses execution memory, while the persist uses storage memory to keep c cached. Similarly, can you please give me an example of user memory?
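To make that concrete, here is a runnable sketch of that snippet (the DataFrames are made up, just so there is something to join and cache):

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("memory-example").getOrCreate()

# Two toy DataFrames, just so there is something to join and cache.
a = spark.range(1_000_000).withColumn("val_a", F.rand())
b = spark.range(1_000_000).withColumn("val_b", F.rand())

c = a.join(b, a.id == b.id)  # shuffle/sort buffers -> execution memory
c.persist()                  # cached blocks of c   -> storage memory
c.count()                    # action that materialises the cache
```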

From the official documentation, one thing I understand is that it stores UDFs. Storing UDFs does not warrant even a few MB of space, let alone the 25% of the heap that Spark reserves by default. What kind of heavy objects might end up in user memory that one should be careful about, and take into account when deciding how to set the parameter (spark.memory.fraction) that bounds user memory?

figs_and_nuts

1 Answer


That's a really great question, to which I won't be able to give a fully detailed answer (I'll be following this question to see if better answers pop up), but I've been snooping around in the docs and found out a few things.

I wasn't sure whether I should post this as an answer, because it ends with a few questions of my own, but since it does answer your question to some degree I decided to post it anyway. If that's not appropriate I'm happy to move it somewhere else.

Spark configuration docs

From the configuration docs, you can see the following about spark.memory.fraction:

Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended. For more detail, including important information about correctly tuning JVM garbage collection when increasing this value, see this description.

So we learn it contains:

  • Internal metadata
  • User data structures
  • Imprecise size estimation in case of sparse, unusually large records
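For reference, lowering that fraction is just an ordinary config knob; something like the following (0.5 is purely an illustrative value, not a recommendation) would shrink the execution/storage pool and grow the user-memory region:

```
from pyspark.sql import SparkSession

# Illustrative only: give execution + storage 50% of (heap - 300MB),
# leaving the other 50% outside Spark's memory manager ("user" memory).
spark = (
    SparkSession.builder
    .appName("fraction-demo")
    .config("spark.memory.fraction", "0.5")
    .getOrCreate()
)
```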

Spark tuning docs: memory management

Following the link in the docs, we get to the Spark tuning page. There we find a bunch of interesting info about storage vs. execution memory, but that is not what we're after in this question. There is another bit of text:

spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.

and also

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.

So, this is a similar explanation and also a reference to garbage collection.
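To put numbers on it, here is my own back-of-the-envelope sketch of how a single executor heap is carved up with the defaults (the 4 GiB heap is just an example value):

```
# Rough layout of one executor heap, defaults assumed.
heap = 4 * 1024          # executor heap in MiB (example value)
reserved = 300           # fixed reserved memory
fraction = 0.6           # spark.memory.fraction (default)
storage_fraction = 0.5   # spark.memory.storageFraction (default)

usable = heap - reserved              # 3796 MiB
unified = usable * fraction           # ~2278 MiB shared by execution + storage
storage = unified * storage_fraction  # ~1139 MiB protected for cached blocks
user = usable * (1 - fraction)        # ~1518 MiB of "user" memory

print(f"unified: {unified:.0f} MiB, storage: {storage:.0f} MiB, user: {user:.0f} MiB")
```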

Spark tuning docs: garbage collection

When we go to the garbage collection page, we see a bunch of information about classical GC in Java. But there is a section that discusses spark.memory.fraction:

In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering -Xmn if you’ve set it as above. If not, try changing the value of the JVM’s NewRatio parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds spark.memory.fraction.
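If you wanted to try the NewRatio suggestion, my understanding is that it would be passed to the executor JVMs roughly like this (NewRatio=3 is only an example value, meaning the old generation gets 3/4 of the heap):

```
from pyspark.sql import SparkSession

# Illustrative only: shrink the young generation so the tenured (old)
# generation comfortably covers the spark.memory.fraction region.
spark = (
    SparkSession.builder
    .appName("gc-tuning-demo")
    .config("spark.executor.extraJavaOptions", "-XX:NewRatio=3")
    .getOrCreate()
)
```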

What do I gather from this

As you have already said, the default spark.memory.fraction is 0.6, so 40% is reserved for this "user memory". That is quite large. Which objects end up in there?

This is where I'm not sure, but I would guess the following:

  • Internal metadata
    • I don't expect this to be huge?
  • User data structures
    • This might be large (just intuition speaking here, not sure at all), and I would hope that someone with more knowledge about this can give some good examples.
      • If you make intermediate structures during a map operation on a dataset, do they end up in user memory or in execution memory? I've put a sketch of what I mean below this list.
  • Imprecise size estimation in the case of sparse, unusually large records
    • This seems to apply only in special cases; it would be interesting to know where/how this gets decided.
    • Elsewhere in the docs this is described as "safeguarding against OOM errors in the case of sparse and unusually large records", so it might be more of a safety buffer than anything else?
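To illustrate what I mean by an intermediate structure (this is a guess, not something I can back up from the source): a per-partition structure you build yourself inside mapPartitions is invisible to Spark's memory manager, so in a Scala/Java job it would have to fit in this user-memory region (it can never spill, only OOM). In PySpark the set below actually lives in the Python worker process rather than on the JVM heap, but the pattern is the same:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-structure-guess").getOrCreate()
df = spark.range(10_000_000)

def dedupe_partition(rows):
    seen = set()                 # grows with the partition, untracked by Spark
    for row in rows:
        if row.id not in seen:
            seen.add(row.id)
            yield row

df.rdd.mapPartitions(dedupe_partition).count()
```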
Koedlt
  • Can you please give me an example of a user data structure? – figs_and_nuts Dec 13 '22 at 07:24
  • I'm afraid I can't: had a look around in the source code and I'm not able to identify this. As I put in my answer here, 40% is really large! So I made a SO [question](https://stackoverflow.com/q/74784139/15405732) that might give more info on this! Let's hope we get enlightened :) – Koedlt Dec 13 '22 at 11:34
  • fingers crossed. Upvoted that question – figs_and_nuts Dec 13 '22 at 11:38