
I have an 8-node Spark standalone cluster with 880 GB of RAM and 224 cores in total.

I just can't explain why the Shuffle Read Blocked Time is so long: about 20 minutes per task. Do you have an idea why? What is the bottleneck in such a case?

To give more details, you can see the details for the stage below:

[Screenshot: details for the stage in Apache Spark]

The task metrics from the Spark UI are below:

[Screenshot: summary metrics for completed tasks, from the Spark UI]

The aggregated metrics per executor are below:

[Screenshot: aggregated metrics per executor, from the Spark UI]

And the full DAG for the stage:

[Screenshot: stage DAG]

The Executors tab:

[Screenshot: Executors tab]

The list of stages:

[Screenshot: list of stages]

Thank you!

Klun
  • https://stackoverflow.com/questions/45740567/spark-shuffle-read-takes-significant-time-for-small-data could be related – Vish Sep 14 '21 at 21:30
  • Related to storage? Yes, it could be. Could you elaborate a bit, please? In fact, the underlying storage is GPFS (IBM's equivalent to HDFS), and we can't have IBM FPO enabled (so to be clear: no local disk for spark.local.dir; the GPFS distributed filesystem is also used for Spark's local dir, which might come with some latency) – Klun Sep 14 '21 at 21:36
  • Related to GC? Yes, it could be too. I have 8 executors (one executor = one node), so each executor has about 100 GB of free memory. I use the default GC. Do you have some tips here to improve my ETL? When I look at the "Executors" tab in the UI, the "GC" column doesn't appear in red, however – Klun Sep 14 '21 at 21:39
  • Yup, I was only pointing toward latency/throughput, but the link posted above also hints at GC in previous-stage executors causing this issue – Vish Sep 14 '21 at 21:40
  • Can you please check the GC time in the previous stage? – Vish Sep 14 '21 at 21:41
  • I have added the "Executors tab" at the end of my message. GC seems OK – Klun Sep 14 '21 at 21:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237109/discussion-between-vish-and-klun). – Vish Sep 14 '21 at 21:45
  • It might not be related to your issue, but I find 120 GB of memory spill quite suspicious – BlueSheepToken Sep 14 '21 at 21:46
  • Could you please elaborate, @BlueSheepToken? :) – Klun Sep 14 '21 at 21:51
  • I am on my phone right now; I can elaborate tomorrow. Meanwhile, can you try to increase the number of shuffle partitions? – BlueSheepToken Sep 14 '21 at 21:54
  • Yes, I will try to increase the number of shuffle partitions tomorrow. For now, it is set to 2000 to process my 17 billion rows – Klun Sep 14 '21 at 21:55 (see the config sketch after these comments)
  • Did you find a solution for this? I have the exact same problem. It disappears on a restart of the cluster, but then the "Shuffle Read Blocked Time" gradually starts increasing with each job – vntzy Apr 27 '23 at 12:43
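
Following up on the spark.local.dir and shuffle-partition suggestions in the comments, here is a minimal sketch of where those settings would go, assuming a SparkSession-based job. The local path, the partition count, and the GC flag are illustrative assumptions, not values confirmed for this cluster:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the tuning knobs discussed in the comments above.
// All values are illustrative, not measurements from this cluster.
val spark = SparkSession.builder()
  .appName("shuffle-tuning-sketch")
  // Point shuffle spill files at a real local disk instead of GPFS,
  // if the workers have any local storage (this path is hypothetical).
  .config("spark.local.dir", "/local/scratch/spark")
  // More shuffle partitions mean smaller blocks per task; the comments
  // suggest trying a value above the current 2000.
  .config("spark.sql.shuffle.partitions", "4000")
  // Optional: switch executors to G1GC to probe the GC hypothesis.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .getOrCreate()
```

Raising spark.sql.shuffle.partitions shrinks each shuffle block, which can cut per-task fetch time. Note that in standalone mode, spark.local.dir can be overridden by SPARK_LOCAL_DIRS set on the workers, so that change may need to go into the worker environment instead.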

0 Answers