I am running a PySpark script using spark-submit. The job runs successfully.

Now I am trying to collect console output of this job to a file like below.

spark-submit in yarn-client mode

spark-submit --master yarn-client --num-executors 5 --executor-cores 5 --driver-memory 5G --executor-memory 10G --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --py-files customer_profile/customer_helper.py#customer_helper.py,customer_profile/customer_json.json customer_profile/customer.py  > /home/$USER/logs/customer_2018_10_26 2>&1

I am able to redirect all the console output to the file /home/$USER/logs/customer_2018_10_26, which includes all log levels and any stack-trace errors.

spark-submit in yarn-cluster mode

spark-submit --master yarn-cluster --num-executors 5 --executor-cores 5 --driver-memory 5G --executor-memory 10G --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar --py-files customer_profile/customer_helper.py#customer_helper.py,customer_profile/customer_json.json customer_profile/customer.py  > /home/$USER/logs/customer_2018_10_26 2>&1

Using yarn-cluster mode, I am unable to redirect the console output to the file /home/$USER/logs/customer_2018_10_26.

The problem is that if my job fails in yarn-client mode, I can go to the file /home/$USER/logs/customer_2018_10_26 and easily look for the errors.

But if my job fails in yarn-cluster mode, the stack trace is not copied to the file /home/$USER/logs/customer_2018_10_26. The only way I can debug the error is by using yarn logs.
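
For reference, this is roughly what I have to run today to see the stack trace (the application ID below is only a placeholder; the real one shows up in the spark-submit console output):

yarn logs -applicationId application_1540000000000_0001 | less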

I would like to avoid using the yarn logs option. Instead, I want to see the error stack trace in the file /home/$USER/logs/customer_2018_10_26 itself while using yarn-cluster mode.

How can I achieve that?
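
The only workaround I can think of is a wrapper script around spark-submit, but it still relies on yarn logs under the hood, so I am hoping there is something cleaner. A rough, untested sketch (the grep pattern for the application ID and the log path are assumptions on my part, and it requires YARN log aggregation to be enabled):

#!/bin/bash
LOG=/home/$USER/logs/customer_2018_10_26

# same spark-submit command as above, in yarn-cluster mode
spark-submit --master yarn-cluster ... customer_profile/customer.py > "$LOG" 2>&1

# the YARN client prints the application ID in its status lines;
# this grep pattern is a guess and may need adjusting
APP_ID=$(grep -o 'application_[0-9]*_[0-9]*' "$LOG" | head -n 1)

# append the aggregated container logs (driver + executors) to the same file
yarn logs -applicationId "$APP_ID" >> "$LOG"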

  • Are you using YARN log aggregation? Even if you are, logs only go to a specific folder on HDFS that the Spark History Server can read, not into the user directory for those that ran the job... Besides that, you're only capturing the driver output, not the executors – OneCricketeer Oct 27 '18 at 21:32
  • @cricket_007 Yes, I have `Yarn log aggregation` enabled, and I do capture the executors' output as well. Please let me know – nmr Oct 29 '18 at 03:35
  • How are you getting the executor logs? I don't think `> output.log` captures that – OneCricketeer Oct 29 '18 at 03:36
  • @cricket_007 `> output.log` doesn't get the executor logs. It gets only the driver logs. How do I get the executor logs? Could you please help me in this regard – nmr Oct 29 '18 at 15:54
  • You can't get it from `spark-submit` output, unfortunately. That's where YARN log aggregation comes in. That aggregation is not immediate, and it really only helps to get it to the Spark History Server. Other solutions exist that are outside of YARN/Spark, such as Filebeat or Fluentd, to watch the logs on the actual executor machines and send them elsewhere, such as Elasticsearch, for easier gathering – OneCricketeer Oct 29 '18 at 18:32
