1

When performing a couple of joins (4x) on Spark data frames, I get the following error:

org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 4294967296, max: 4294967296)

Even when setting:

--conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4G" \

the problem is not solved.
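
For reference, a minimal Scala sketch of how this option can also be set programmatically when the session is built; the application name is a placeholder, and in practice the option is usually passed via spark-submit as shown above:

import org.apache.spark.sql.SparkSession

// Same option as the spark-submit flag above, set on the builder instead.
// Executor JVM options must be in place before executors start, which a
// builder-level config (or the spark-submit flag) normally satisfies.
val spark = SparkSession.builder()
  .appName("four-way-join") // placeholder name
  .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=4G")
  .getOrCreate()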

Georg Heiler
  • What I have observed is: it already fails on the second join. I can move the first and second join to be broadcast joins; then it works just fine. For the 3rd and 4th one I still need to find a solution – Georg Heiler Mar 23 '20 at 07:38
  • When writing each step to disk (each intermediate result after a join), it works just fine. But this seems to be a rather hacky solution which generates additional IO (see the sketch after these comments) – Georg Heiler Mar 23 '20 at 08:57
  • I tried to broadcast the 3rd and 4th join as well. In-memory cached it is about 11G in size. Unfortunately, somehow I do not find the right settings to broadcast it, so for now I need to resort to writing to disk. However, as it turns out, this is also not reliable in all cases. – Georg Heiler Mar 24 '20 at 05:31
  • `When writing each step to disk (each intermediate result after a join) it is working just fine. But this seems to be a rather hacky solution which generates additional IO` How do you write each step to disk? – Nebi M Aydin Aug 04 '23 at 16:53
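
For illustration, a minimal Scala sketch of the two workarounds mentioned in the comments above, broadcasting a small join side and materializing an intermediate result to disk; the input paths, join key, and output path are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-workarounds").getOrCreate() // placeholder name

// Placeholder inputs standing in for the real tables being joined
val a = spark.read.parquet("/data/a")
val b = spark.read.parquet("/data/b")
val c = spark.read.parquet("/data/c")

// Workaround 1: broadcast the smaller side so this join avoids a shuffle
// (only viable while that side comfortably fits in memory)
val firstJoin = a.join(broadcast(b), Seq("id"))

// Workaround 2: materialize the intermediate result and read it back,
// which cuts the shuffle lineage at the cost of extra IO
firstJoin.write.mode("overwrite").parquet("/tmp/intermediate/first_join")
val firstJoinFromDisk = spark.read.parquet("/tmp/intermediate/first_join")

val secondJoin = firstJoinFromDisk.join(c, Seq("id"))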

2 Answers

4

Seems like there are too many in-flight blocks. Try smaller values of spark.reducer.maxBlocksInFlightPerAddress. For reference, take a look at this JIRA.

Quoting text:

For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration spark.reducer.maxBlocksInFlightPerAddress , to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both the scenarios - when external shuffle is enabled as well as disabled.
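
For illustration, a minimal Scala sketch of where this property is set; the application name and the value 64 are placeholders chosen only to show the syntax, not recommendations from the answer:

import org.apache.spark.sql.SparkSession

// Caps how many map-output blocks are fetched from any single remote address at once.
val spark = SparkSession.builder()
  .appName("joins-with-capped-block-fetch") // placeholder name
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64") // placeholder value
  .getOrCreate()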

D3V
  • Very interesting. But even when setting `--conf spark.shuffle.service=true \` the bug remains. – Georg Heiler Mar 24 '20 at 06:03
  • What version of spark are you on? – D3V Mar 25 '20 at 12:01
  • 2.3.2 (HDP 3.1) – Georg Heiler Mar 25 '20 at 12:03
  • `spark.reducer.maxBlocksInFlightPerAddress` was set to? – D3V Mar 25 '20 at 12:04
  • Not tried yet (the default is Int.MaxValue). But even the above-mentioned workaround (writing to file at every step) tends to fail now and then, always because retries were exhausted with this error message. What would be a recommended number here for, let's say, 30 executors with 5 cores each? – Georg Heiler Mar 25 '20 at 12:06
  • It is hard to say without knowing the complete application profile, but you can start with a reasonable number like 10 and increase it if it proves to be too slow (see the sketch after these comments). – D3V Mar 25 '20 at 12:32
  • Do I understand correctly that this is the number of blocks (for each worker node) which is handled in parallel? I.e. this is not a global parameter, but one per node? – Georg Heiler Mar 25 '20 at 12:35
  • This is very specific to the job and applies to the fetch from the NM. Other jobs will not follow this setting if it is not applied to them, so the effect is very local. Good for testing purposes. If you ever want to make it global for the Spark cluster, you can always put it in the defaults. – D3V Mar 25 '20 at 12:41
  • Well yes ;) obviously, but by global I did not mean the Ambari settings in spark-defaults.conf. What I mean is: if I add or remove executors, do I need to fine-tune this number, or will it just scale automatically? Also, do I understand correctly that this parameter cannot be changed during the runtime of a job? I have a complex job where various files are written. I would prefer to only change the configuration for the specific part where it fails for now. – Georg Heiler Mar 25 '20 at 12:46
  • Isolating it would be difficult if it is a complex job. Autoscaling is the reason why this parameter exists, so it should be fine. – D3V Mar 25 '20 at 13:59
  • When running the job today (with write-outs for each step) the failure did not occur. It looks like this parameter is indeed the solution. Thanks. I will accept the answer once the job runs stably for a couple of runs. – Georg Heiler Mar 26 '20 at 06:24
  • I can now confirm that this indeed fixes my problems! The original join now works just fine without any exceptions and without materializing intermediate steps! – Georg Heiler Mar 30 '20 at 12:55
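
A sketch using the concrete numbers from the comment thread above, 30 executors with 5 cores each and a starting value of 10; the application name is a placeholder, and in practice these would typically be passed to spark-submit:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("four-way-join") // placeholder name
  .config("spark.executor.instances", "30") // 30 executors, per the comments
  .config("spark.executor.cores", "5") // 5 cores each
  .config("spark.reducer.maxBlocksInFlightPerAddress", "10") // suggested starting point
  .getOrCreate()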
0

I had a similar issue:

org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 57445187584, max: 57446760448)

where 57445187584 roughly corresponds to the configured spark.executor.memory=54850m.

Our job was genuinely working with a huge amount of data, and more memory was needed.

What I did was upgrade the Dataproc machine type from e2-highmem-8 to e2-highmem-16 and set the new memory with spark.executor.memory=114429m.
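
For illustration, a minimal Scala sketch of the Spark side of this fix; the machine-type upgrade itself (e2-highmem-8 to e2-highmem-16) happens on the Dataproc cluster, and the application name is a placeholder:

import org.apache.spark.sql.SparkSession

// Raise executor memory to match the larger worker machines.
val spark = SparkSession.builder()
  .appName("large-join-job") // placeholder name
  .config("spark.executor.memory", "114429m")
  .getOrCreate()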

Dmytro Maslenko