I have a PySpark job on Spark 2.4.0 that is hanging at 198/200 tasks. The two remaining tasks show "RUNNING" and are both on the same node. If I go to the stderr log page in the web browser (http://node2:8081/logPage/?appId=app-20181128112202-0000&executorId=2&logType=stderr), the error is:
"Error: invalid log directory /usr/local/spark/spark-2.4.0-bin-hadoop2.7/work/app-20181128112202-0000/2/"
If I navigate to that directory on the node, there is no /2/ folder, but there is a /3/ folder. This is stage 16, so the node has already done a bunch of work for this application by that point.
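If I understand the standalone layout correctly, the number under work/<appId>/ is the executor ID, so it looks as if executor 2 on that worker exited and a replacement executor 3 was launched, while the UI's log link still points at /2/. To see which executors the driver currently knows about, I've been poking the JVM through py4j (sc is the active SparkContext; this goes through an internal handle, so it may vary by version):

    # Internal py4j hop: sc._jsc is the JavaSparkContext and .sc() is the
    # underlying Scala SparkContext. getExecutorMemoryStatus() returns a
    # map keyed by executor host:port (the driver shows up in it too).
    status = sc._jsc.sc().getExecutorMemoryStatus()
    print(status)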
This only ever occurs on one of the nodes. I've also cleared the work directory on all of the nodes to be sure.
I'm at a loss as to why it's trying to read the logs from /2/ instead of /3/ - any thoughts on how I can debug that?
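Since exactly two of the 200 shuffle partitions are the ones that never finish, data skew also seems worth ruling out. A rough sketch of the check (df is just a stand-in for the DataFrame feeding that stage):

    from pyspark.sql.functions import spark_partition_id

    # Count rows per partition; a couple of partitions that are orders of
    # magnitude bigger than the rest would explain two straggler tasks.
    (df.groupBy(spark_partition_id().alias("pid"))
       .count()
       .orderBy("count", ascending=False)
       .show(10))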
I'm also having trouble finding where Spark assigns the folder number inside the work directory: /usr/local/spark/spark-2.4.0-bin-hadoop2.7/work/app-20181129134852-0000/2/
Edit
I noticed that I'm getting blocked threads that appear to be blocking each other: thread 33 is blocked by thread 224, and thread 224 is blocked by thread 33. I'm not sure how to figure out why they're deadlocked - it seems like something to do with memory, but I'm not quite sure how to confirm what it is.
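For now I've been grabbing thread dumps by hand to see which locks 33 and 224 are holding (the Executors tab of the web UI also has a per-executor Thread Dump link). Something like this, run on node2 itself - the PID here is hypothetical and would come from jps or ps:

    import subprocess
    import time

    executor_pid = 12345  # hypothetical: find the real executor JVM PID with `jps`

    # Take a few dumps spaced apart: if threads 33 and 224 are BLOCKED on each
    # other's locks in every dump, it's a genuine deadlock rather than slow work.
    for i in range(3):
        dump = subprocess.check_output(["jstack", "-l", str(executor_pid)])
        with open("threaddump_%d.txt" % i, "wb") as f:
            f.write(dump)
        time.sleep(10)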