Spark writes in-progress job output to a `_temporary` folder. Once the job finishes, the files are moved (renamed) to their final destination. However, when there are tens of thousands of partitions, moving the files takes quite some time. Question: how can this move be sped up?
The applications run in yarn-cluster mode, on a bare-metal Hadoop cluster, not on AWS (no S3, EMR, etc.).
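For context, the job is launched roughly like this; a minimal sketch where the application jar, class name, queue, and resource settings are all hypothetical placeholders, not my real values:

```shell
# Hypothetical launch command illustrating the yarn-cluster setup described above.
# All names, paths, and resource numbers are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 100 \
  --executor-memory 8g \
  my-etl-job.jar
```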
Update: my job takes around 1 hour to generate 2.3 TB of data across 25,000 partitions, and another hour to move the data out of `_temporary`.