I am trying to capture the stdout Spark logs in an S3 path partitioned by date and was wondering how to do so using the spark-submit command. The spark-submit runs on a daily basis, and I want to create the partition based on the date on which the spark-submit command is executed.

In the current process we create Amazon EMR clusters, and I am able to see the logs in YARN, but once the cluster is shut down or terminated I lose them. Hence I want to redirect these print statements, etc., to an S3 path. Any help would be great as I am new to this. Thanks.

Below is the spark-submit command I am using to run the PySpark script:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --packages net.snowflake:snowflake-jdbc:3.12.6,net.snowflake:spark-snowflake_2.11:2.7.2-spark_2.4 \
  --conf spark.speculation=false \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3 \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=1h \
  --conf spark.yarn.executor.memoryOverhead=4098m \
  --conf fs.s3n.multipart.uploads.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  abc.py -r s3a://blahblahblah -path s3a://bleeh bleeh -e dev -dt 2021-08-11 -f xyz
Note: I do see an option where we can redirect the logs by including &> s3:/ However, I want to understand how we can store and read the logs from an S3 path with respect to the run date on which I execute the spark-submit command.
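To make the question concrete, below is a rough sketch of what I have in mind; the bucket name my-log-bucket, the spark-logs/dt=... prefix, and the local file names are placeholders I made up, and I am not sure this actually captures the driver's print output in cluster mode:

# Rough sketch only; bucket, prefix, and file names are placeholders.
# Compute the run date once at submit time and build a date-partitioned prefix.
RUN_DATE=$(date +%Y-%m-%d)                      # e.g. 2021-08-11
LOG_PREFIX="s3://my-log-bucket/spark-logs/dt=${RUN_DATE}"

# Capture whatever spark-submit prints locally (in --deploy-mode cluster I
# believe this is mostly the YARN application status, not the driver prints).
# All the --packages / --conf flags and the other abc.py arguments from the
# command above would go here too.
spark-submit --master yarn --deploy-mode cluster abc.py -dt "${RUN_DATE}" \
  > driver_stdout.log 2> driver_stderr.log

# Then copy the captured files to the date-partitioned S3 prefix
aws s3 cp driver_stdout.log "${LOG_PREFIX}/stdout.log"
aws s3 cp driver_stderr.log "${LOG_PREFIX}/stderr.log"

# Reading the logs back later for a given run date:
aws s3 cp "s3://my-log-bucket/spark-logs/dt=2021-08-11/stdout.log" - | less

If there is a more standard way on EMR to get the YARN/driver logs into S3 keyed by the run date, that would work for me too.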