I'm using Dataproc 1.4 and I have a Spark job with 35,000 partitions (the input size is 3.4 TB). I'm running it on a 120-node cluster of n1-standard-4 machines (480 vCPUs in total).
The problem is that I run into network errors during shuffles (the results are the same with the external shuffle service enabled or disabled), e.g.:
"Connection to xxx:44035 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong."
Is there any recommendation on the number of partitions per node? Should I use more nodes or bigger instance types?
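For reference, that works out to roughly 100 MB per partition and about 73 tasks per core. If it matters, these are the timeout/heartbeat knobs I could raise as a workaround (a minimal Scala sketch; the property names are standard Spark settings, but the values are just placeholders I picked, not tuned recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raising the network/heartbeat timeouts as a workaround.
// The values below are placeholders, not tuned recommendations.
val spark = SparkSession.builder()
  .appName("shuffle-heavy-job")
  // Default is 120s; the error above fires when a connection is idle that long.
  .config("spark.network.timeout", "600s")
  // Must stay well below spark.network.timeout.
  .config("spark.executor.heartbeatInterval", "60s")
  // Retry shuffle fetches a few more times before declaring the connection dead.
  .config("spark.shuffle.io.maxRetries", "10")
  .config("spark.shuffle.io.retryWait", "30s")
  .getOrCreate()
```

I'd rather understand the root cause than just raise the timeout, though.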
**Edit 2021-08-23**
As suggested in the comments, I've tested with 128 GB and 256 GB disks, and both runs had the same write speed (~4 GiB/s). Here is what I observed:
- Between 3:00 PM and 3:20 PM, the Spark job was reading the input and writing the shuffle files. CPU utilization was quite low (~40%). Why?
- After 3:20 PM, a new stage was reading the shuffle files to write the output, and the job failed.
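One thing I'm considering for the failing shuffle-read stage is throttling how much each reducer fetches at once (again just a sketch with placeholder values; these are standard Spark properties, but I don't know whether they are the right fix):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: limiting reduce-side fetch pressure during the shuffle-read stage.
// Property names are standard Spark settings; the values are placeholders.
val spark = SparkSession.builder()
  .appName("shuffle-heavy-job")
  .config("spark.reducer.maxSizeInFlight", "24m")             // default 48m: per-reducer buffer for in-flight blocks
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64")  // cap concurrent blocks fetched from a single executor
  .config("spark.reducer.maxReqsInFlight", "256")             // cap concurrent fetch requests per reduce task
  .getOrCreate()
```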