I am trying to capture the stdout Spark logs in an S3 path partitioned by date and was wondering how to do so using the spark-submit command. The spark-submit runs on a daily basis, and I want to create the partition based on the date on which the spark-submit command is executed.

In the current process we create Amazon EMR clusters, and I am able to see the logs in YARN, but once the cluster is shut down or terminated I lose them. Hence I want to redirect these print statements, etc., to an S3 path. Any help would be great as I am new to this. Thanks.

Below is the spark-submit command I am using to run the PySpark script:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --packages net.snowflake:snowflake-jdbc:3.12.6,net.snowflake:spark-snowflake_2.11:2.7.2-spark_2.4 \
  --conf spark.speculation=false \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3 \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=1h \
  --conf spark.yarn.executor.memoryOverhead=4098m \
  --conf fs.s3n.multipart.uploads.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  abc.py -r s3a://blahblahblah -path s3a://bleeh bleeh -e dev -dt 2021-08-11 -f xyz
Note: I do see an option where we can redirect the logs by including &> s3:/ However, I want to understand how we can store and read the logs from an S3 path with respect to the run date on which I execute the spark-submit command.
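To make the question concrete, below is a rough sketch of what I have in mind; the bucket name my-log-bucket, the spark-logs/dt=... prefix, and the local file names are placeholders I made up, and I am not sure this actually captures the driver's print output in cluster mode:

# Rough sketch only; bucket, prefix, and file names are placeholders.
# Compute the run date once at submit time and build a date-partitioned prefix.
RUN_DATE=$(date +%Y-%m-%d)                      # e.g. 2021-08-11
LOG_PREFIX="s3://my-log-bucket/spark-logs/dt=${RUN_DATE}"

# Capture whatever spark-submit prints locally (in --deploy-mode cluster I
# believe this is mostly the YARN application status, not the driver prints).
# All the --packages / --conf flags and the other abc.py arguments from the
# command above would go here too.
spark-submit --master yarn --deploy-mode cluster abc.py -dt "${RUN_DATE}" \
  > driver_stdout.log 2> driver_stderr.log

# Then copy the captured files to the date-partitioned S3 prefix
aws s3 cp driver_stdout.log "${LOG_PREFIX}/stdout.log"
aws s3 cp driver_stderr.log "${LOG_PREFIX}/stderr.log"

# Reading the logs back later for a given run date:
aws s3 cp "s3://my-log-bucket/spark-logs/dt=2021-08-11/stdout.log" - | less

If there is a more standard way on EMR to get the YARN/driver logs into S3 keyed by the run date, that would work for me too.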