I'm using Dataproc 1.4 and I have a Spark job with 35,000 partitions (the input size is 3.4 TB). I'm running it on a 120-node cluster of n1-standard-4 machines (480 vCPUs in total).
The problem is that I run into network errors during shuffles (the results are the same with the external shuffle service enabled or disabled), e.g.:
"Connection to xxx:44035 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong."
Is there any recommendation on the number of partitions per node? Should I use more nodes or bigger instance types?
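For reference, that works out to roughly 100 MB per partition and about 73 tasks per core. If it matters, these are the timeout/heartbeat knobs I could raise as a workaround (a minimal Scala sketch; the property names are standard Spark settings, but the values are just placeholders I picked, not tuned recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raising the network/heartbeat timeouts as a workaround.
// The values below are placeholders, not tuned recommendations.
val spark = SparkSession.builder()
  .appName("shuffle-heavy-job")
  // Default is 120s; the error above fires when a connection is idle that long.
  .config("spark.network.timeout", "600s")
  // Must stay well below spark.network.timeout.
  .config("spark.executor.heartbeatInterval", "60s")
  // Retry shuffle fetches a few more times before declaring the connection dead.
  .config("spark.shuffle.io.maxRetries", "10")
  .config("spark.shuffle.io.retryWait", "30s")
  .getOrCreate()
```

I'd rather understand the root cause than just raise the timeout, though.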
**Edit 2021-08-23**
As suggested in the comments, I've tested with 128 GB and 256 GB disks, and both runs had the same write speed (~4 GiB/s). Here is what I observed:
- Between 3:00 PM and 3:20 PM, the Spark job was reading the input and writing the shuffle files. CPU utilization was quite low (~40%). Why?
- After 3:20 PM, a new stage was reading the shuffle files to write the output, and the job failed.
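One thing I'm considering for the failing shuffle-read stage is throttling how much each reducer fetches at once (again just a sketch with placeholder values; these are standard Spark properties, but I don't know whether they are the right fix):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: limiting reduce-side fetch pressure during the shuffle-read stage.
// Property names are standard Spark settings; the values are placeholders.
val spark = SparkSession.builder()
  .appName("shuffle-heavy-job")
  .config("spark.reducer.maxSizeInFlight", "24m")             // default 48m: per-reducer buffer for in-flight blocks
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64")  // cap concurrent blocks fetched from a single executor
  .config("spark.reducer.maxReqsInFlight", "256")             // cap concurrent fetch requests per reduce task
  .getOrCreate()
```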