
I am running a Spark job on AWS EMR 6.6 (Spark 3.2.0), but it seems that Spark is writing a lot of data to disk. I always thought Spark was all in-memory, but it appears that it writes temporary files to local disk each time there is a wide shuffle, i.e. between stages (I am not sure why). However, this is really only an issue because these temp files don't get deleted between stages.

From my understanding, the temp files from one stage are read by the next stage; however, I don't think they should be needed for the stage after that. So if my job has 3 stages, after stage 1 completes I should be able to delete the temporary shuffle files created by stage 1 before stage 3 runs.
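To make the stage boundaries concrete, here is a minimal sketch of the kind of job I mean, assuming PySpark; the paths and column names are placeholders, not my actual job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-stage-demo").getOrCreate()

# Placeholder inputs; any sizable datasets will do.
orders = spark.read.parquet("s3://my-bucket/orders/")
customers = spark.read.parquet("s3://my-bucket/customers/")

# Wide transformation #1: groupBy forces a shuffle -> stage boundary.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# Wide transformation #2: the join shuffles again (unless Spark broadcasts
# the smaller side) -> another stage boundary.
enriched = totals.join(customers, "customer_id")

# The action runs all stages; each shuffle's map output is written to the
# executors' local disks and kept there in case it needs to be re-read.
enriched.write.parquet("s3://my-bucket/output/")
```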

I believe this would resolve my problem, since it means my local storage would hold at most two (sequential) stages' worth of shuffle temp data. However, I can't seem to find any way to do this.
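For illustration, this is roughly the kind of mechanism I'm looking for (a hedged PySpark sketch with placeholder paths; I have not confirmed it actually frees the stage 1 files): checkpoint after the first wide stage to cut the lineage, so the driver's ContextCleaner can drop the old shuffle dependency, and shorten `spark.cleaner.periodicGC.interval` so cleanup happens sooner.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-cleanup-sketch")
    # Run the driver-side ContextCleaner GC more often than the default 30min,
    # so shuffle files with no remaining references get removed sooner.
    .config("spark.cleaner.periodicGC.interval", "5min")
    .getOrCreate()
)

# Placeholder checkpoint location.
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints/")

df = spark.read.parquet("s3://my-bucket/input/")

# Stage 1: wide transformation writes shuffle files to local disk.
stage1 = df.groupBy("key").count()

# checkpoint(eager=True) materializes the result and cuts the lineage, so the
# shuffle dependency behind stage 1 becomes unreferenced on the driver and can
# (in principle) be cleaned up before the later stages run.
stage1 = stage1.checkpoint(eager=True)

# Later stages no longer depend on stage 1's shuffle files.
result = stage1.join(df, "key")
result.write.parquet("s3://my-bucket/output/")
```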

I know I can just increase the EBS storage or use AWS Glue, but I'd like to avoid that.

Mattreex
  • Your assumption is right, Spark does not keep the shuffled data on disk after the stage is completed. The only way it does is if you persist or cache your data. Can you check how much data is being spilled to disk during the stage you are transforming? You can check on the stage details: https://spark.apache.org/docs/latest/web-ui.html#stage-detail – Thiago Baldim Jul 13 '22 at 04:51
  • I have no cache and no persist anywhere in my code. There is 1300 GB of disk spill in stage 22, but my job fails at stage 25. I have no disk spill in any other stage. I am running a cluster of 50 nodes with 200 GB each, so 10TB total. My data should not be skewed. – Mattreex Jul 13 '22 at 14:25
  • @ThiagoBaldim are temp shuffle files supposed to be deleted once you persist/cache your dataframe? – Ajayv Aug 10 '22 at 16:48
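Aside from the Stage detail page mentioned in the comments above, the same spill metrics are exposed by Spark's REST API. A rough sketch, assuming the application is still running and its UI is reachable on the default port 4040; the host is a placeholder:

```python
import requests

# Placeholder driver host; on EMR the Spark UI is usually reached through the
# YARN ResourceManager proxy, so the base URL will differ in practice.
base = "http://driver-host:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(f"{base}/applications/{app_id}/stages").json()

# Print every stage that spilled to disk, in GB.
for s in stages:
    if s.get("diskBytesSpilled", 0) > 0:
        print(s["stageId"], s["name"], s["diskBytesSpilled"] / 1e9, "GB spilled")
```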

0 Answers