
Dear fellow Apache Spark enthusiasts,

I recently kicked off a side project with the goal of turning a couple of ODROID XU4 boards into a stand-alone Spark cluster.

After setting up the cluster, I ran into a problem that seems to be specific to heterogeneous multi-processor systems like the XU4's big.LITTLE design. Spark executor tasks run extremely slowly on the XU4 when using all 8 cores. The reason, as mentioned in a comment on my post below, is that Spark does not wait for the executors that have been kicked off on the slow processors.

http://forum.odroid.com/viewtopic.php?f=98&t=21369&sid=4276f7dc89a8d7825320e7f705011326&p=152415#p152415

One solution is to use fewer executor cores and to set the CPU affinity so that the LITTLE processors are not used. This is, however, a less-than-ideal solution.
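
For anyone wanting to try that workaround, here is a minimal sketch. It assumes the XU4 exposes the LITTLE A7 cores as CPUs 0-3 and the big A15 cores as CPUs 4-7 (verify with lscpu before relying on it), and spark://master-host:7077 is a placeholder master URL:

# In spark-env.sh: advertise only the 4 big cores to Spark.
export SPARK_WORKER_CORES=4

# Start the worker under taskset so the executor JVMs it forks
# inherit the CPU affinity and stay off the LITTLE cores.
taskset -c 4-7 "$SPARK_HOME/sbin/start-slave.sh" spark://master-host:7077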

Is there a way to ask Spark to wait a bit longer for feedback from slower executors? Obviously, waiting too long will have a negative effect on performance. The positive effect of utilising all cores should, however, balance out the negative effect.

Thanks in advance for any help!

TJVR

2 Answers


@Dikei's response highlights two potential causes, but it turns out the problem is not the one he suspects. I have the same setup as @TJVR, and it turns out the driver is missing heartbeats from executors. To address this, I added the following to spark-env.sh:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"

This changes the default timeouts for executor heartbeats. I also set spark.shuffle.consolidateFiles to true to improve performance on my ext4 filesystem. These default changes allowed me to increase the core usage above one without frequently losing executors.
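
If you land here on a newer Spark release where the Akka-based settings above no longer exist, the equivalent knobs should be spark.network.timeout and spark.executor.heartbeatInterval. The following is an untested sketch for that case, and my-app.jar is a placeholder:

# Give slow executors more time before the driver declares them lost,
# and heartbeat less often; the interval must stay well below the timeout.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=30s \
  my-app.jar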

kamprath
  • Great find @claireware. I applied the settings and got much better performance, even when using 8 cores. It seems, however, that 2-3 is the safest number of cores to use. – TJVR Aug 07 '16 at 12:04
  • @TJVR Using more cores adds more RAM overhead, and on the 2GB XU4 board this can be significant. If you are working with a large dataset, I found it best to dial back to 1 core so that more RAM is available for the calculations. However, I have made 2-3 cores work for small datasets. – kamprath Aug 07 '16 at 21:03

Spark does not kill slow executors, but will mark an executor as dead in two cases:

  1. The driver doesn't receive a heartbeat signal within the timeout (default: 120s). The executor has to send a heartbeat message regularly (default: every 10s) to notify the driver that it is still alive. Network issues or a long GC pause can prevent these heartbeats from arriving.

  2. The executor has crashed due to an exception in the code or a JVM runtime error, most likely due to a GC pause as well.

In my opinion, it's probable that GC overhead has killed your slow executors and the driver has had to redo the tasks on different executors. If this is the case, you can try splitting your data into smaller partitions, so that each executor has to process less data at a time.
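
For example, you can raise the partition count from the submit command; this is only a sketch (64 is illustrative, not a tuned value, and my-app.jar is a placeholder), and the same effect is available in application code with rdd.repartition(n):

# More, smaller partitions means less data per task, which eases
# memory pressure on the 2 GB XU4 boards.
spark-submit \
  --conf spark.default.parallelism=64 \
  my-app.jar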

Secondly, you should NOT set spark.speculation to 'true' without testing. It's 'false' by default for a reason; I've seen it do more harm than good in some cases.
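
For completeness, these are the knobs involved if you do experiment with speculation, shown with what I believe are their default values; benchmark before and after enabling any of this:

# Speculative execution relaunches tasks that run much slower than
# the median task of their stage. Tuning knobs shown at their defaults.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  my-app.jar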

Lastly, the following assumption might not hold true:

"The positive effect of utilising all cores should, however, balance out the negative effect."

Slow executors (stragglers) can cause the program to perform much worse, depending on the workload. It's entirely possible that avoiding the slow cores will give the best result.

Kien Truong
  • Hi Dikei, thanks for the insights. I will try smaller jobs to see whether that solves the problem. As each of the 4 ODROID XU4s has 4 LITTLE cores, do you think not using them will speed up the analysis? It sounds like under-utilisation of the available cores. I will spend some time testing different settings, and then try using only a subset of processors. – TJVR Jul 26 '16 at 11:28