
I'm using Dataproc 1.4 and I have a Spark job with 35,000 partitions (input size is 3.4 TB). I'm using a 120-node cluster of n1-standard-4 machines (so 480 CPUs total).
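For context, here is the back-of-the-envelope sizing from the numbers above (average input per partition and how many scheduling "waves" of tasks the cluster runs):

```python
# Back-of-the-envelope sizing for the job described above.
input_tb = 3.4            # total input, in TB
partitions = 35_000
nodes = 120
cores_per_node = 4        # n1-standard-4
total_cores = nodes * cores_per_node  # 480

# Average input per partition, in MB (using 1 TB = 1e6 MB).
mb_per_partition = input_tb * 1e6 / partitions

# Number of task "waves": partitions divided by concurrently running tasks,
# assuming one task per core.
waves = partitions / total_cores

print(f"~{mb_per_partition:.0f} MB per partition, ~{waves:.0f} task waves")
# ~97 MB per partition, ~73 task waves
```

So the partitions themselves are around the commonly cited ~100 MB sweet spot; the question is whether ~73 waves of shuffle traffic is what overwhelms the network.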

The problem is that I run into network errors during shuffles (same result with the external shuffle service enabled or disabled), e.g.:

"Connection to xxx:44035 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong."

Is there any recommendation on the number of partitions per node? Should I use more nodes or bigger instance types?
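For reference, these are the kinds of knobs I'm experimenting with (a sketch only; the values are illustrative, not a recommendation):

```shell
# Hedged sketch: raise the network timeout the error message points at
# (default is 120s) and make shuffle fetches more retry-tolerant
# (defaults: maxRetries=3, retryWait=5s). Values here are illustrative.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  ...
```

But raising the timeout feels like treating the symptom, hence the question about partitions per node and instance sizing.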

edit 2021-08-23

As suggested in the comments, I've tested with 128 GB and 256 GB disks, and both runs had the same write speed (~4 GiB/s).

(screenshot: cluster disk write throughput over time)

  • Between 3:00 and 3:20 PM, Spark was reading the input and writing the shuffle files. CPU consumption is quite low (~40%). Why?
  • After 3:20 PM, a new stage was reading the shuffle files to write the output, and the job failed.
asked by Yann Moisan, edited by Simone Lungarella

0 Answers