I am running a job in spark-shell with the following configuration:

    --num-executors 15
    --driver-memory 15G
    --executor-memory 7G
    --executor-cores 8
    --conf spark.yarn.executor.memoryOverhead=2G
    --conf spark.sql.shuffle.partitions=500
    --conf spark.sql.autoBroadcastJoinThreshold=-1
    --conf spark.executor.memoryOverhead=800
The job is stuck and does not start. The code does a cross join with filter conditions between a large table (~270M rows) and a small table (~100k rows). I have increased the large table's partitions to 16,000, and I have converted the small table to a broadcast variable.
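For context, the point of broadcasting the small table is to replace the 270M × 100k row pair explosion of a cross join + filter with one hash lookup per large-table row. A minimal plain-Scala sketch of that idea (no Spark, and the row types and join key here are hypothetical stand-ins, assuming the filter condition contains at least one equality that can act as a key):

```scala
// Hypothetical row types standing in for the real tables.
case class Big(key: String, payload: Int)
case class Small(key: String, factor: Int)

object BroadcastJoinSketch {
  // Build a lookup map from the small (~100k row) side, analogous to the
  // hash table a broadcast join ships to every executor.
  def join(big: Seq[Big], small: Seq[Small]): Seq[(Big, Small)] = {
    val lookup: Map[String, Small] = small.map(s => s.key -> s).toMap
    // One map probe per big-side row: O(|big|) work instead of the
    // O(|big| * |small|) candidate pairs a cross join + filter produces.
    big.flatMap(b => lookup.get(b.key).map(s => (b, s)))
  }

  def main(args: Array[String]): Unit = {
    val big   = Seq(Big("a", 1), Big("b", 2), Big("x", 3))
    val small = Seq(Small("a", 10), Small("b", 20))
    val joined = join(big, small)
    // "x" has no match in the small side, so only two pairs survive.
    println(joined.map { case (b, s) => s"${b.key}:${b.payload * s.factor}" }.mkString(","))
  }
}
```

If the filter really has no usable equality condition, the broadcast does not remove the pair explosion, it only avoids a shuffle of the large side.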
I have attached the Spark UI screenshots for the job. Should I reduce the number of partitions or increase the number of executors? Any ideas? Thanks for helping out.
![spark ui 1][1] ![spark ui 2][2] ![spark ui 3][3]

After 10 hours the status is: tasks 7341/16936 (16624 failed).
Checking the container error logs shows:
    Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
    java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
Screenshots at ~50% completion: [50% completed ui 1][4] [50% completed ui 2][5]

  [1]: https://i.stack.imgur.com/nqcys.png
  [2]: https://i.stack.imgur.com/S2vwL.png
  [3]: https://i.stack.imgur.com/81FUn.png
  [4]: https://i.stack.imgur.com/h5MTa.png
  [5]: https://i.stack.imgur.com/yDfKF.png