I'm running a large, long-running job on a medium-sized dataset (~100 GB of input data). Below are the AWS EMR settings:
EMR release: emr-6.6.0
PySpark version: 3.2.0
EMR cluster:
master - 1 x c5.4xlarge (16 vCores, 32 GiB memory, EBS-only storage: 120 GiB)
core - 15 x r5a.4xlarge (16 vCores, 128 GiB memory, EBS-only storage: 240 GiB)
I configured the Spark settings according to the AWS best-practices guide: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
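For context, the configuration I derived from that blog's formulas looks roughly like the sketch below. The values are approximations for this cluster shape, and the app name is a placeholder; the exact numbers in my job may differ slightly.

```python
from pyspark.sql import SparkSession

# Illustrative values derived from the blog's formulas for 15 x r5a.4xlarge
# core nodes (16 vCores, 128 GiB each). In practice these are passed via
# spark-submit / the EMR configuration API; they are inlined here only for
# readability.
spark = (
    SparkSession.builder
    .appName("large-job")                           # placeholder app name
    .config("spark.executor.cores", "5")            # 5 cores per executor
    .config("spark.executor.instances", "44")       # 3 executors/node * 15 nodes - 1 for the driver
    .config("spark.executor.memory", "37g")         # ~90% of (127 GiB / 3 executors per node)
    .config("spark.executor.memoryOverhead", "5g")  # ~10% of the per-executor share
    .config("spark.driver.memory", "37g")
    .config("spark.driver.cores", "5")
    .config("spark.default.parallelism", "440")     # executors * cores * 2
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)
```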
The job runs fine until the very end, where one executor or the driver (I'm pretty sure it's the driver) hangs for a long time; see the attached image, which is the CPU utilization plot from AWS CloudWatch.
I persisted the key dataframes to avoid re-computing them, and checkpointed the main dataframes to truncate their lineage. I also did not see any meaningful logs during that time period. Does anyone know what could cause a hang like this at the end of a Spark program?
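For reference, the persist/checkpoint pattern in the job looks roughly like the following. The dataframe names, transformations, and paths are placeholders, not the actual code.

```python
from pyspark import StorageLevel

# Placeholder checkpoint location on the cluster's HDFS.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

# Placeholder input path and transformations; the real job is much longer.
key_df = spark.read.parquet("s3://my-bucket/input/")

# Persist dataframes that are reused downstream so they are not re-computed.
key_df = key_df.persist(StorageLevel.MEMORY_AND_DISK)
key_df.count()  # action to materialize the persisted data

# Checkpoint the main dataframe to cut off its lineage.
main_df = key_df.groupBy("some_key").count()
main_df = main_df.checkpoint(eager=True)
```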