What is the best way to daemonise a Spark Streaming job while also logging any exceptions that occur to a log file with log rotation?
- What do you mean by *daemonise the Streaming Job*? Why would one need to do that after scheduling the job with spark-submit? – Yuval Itzchakov May 27 '16 at 14:37
- With nohup you can achieve this. Please check my answer. – Ram Ghadiyaram May 27 '16 at 16:41
- @YuvalItzchakov What do you mean by `scheduling the job` with spark-submit? You just submit the job to Spark; I haven't heard that you can schedule it for a particular start time. If you have done it, could you please elaborate? – Naresh May 30 '16 at 04:50
- @Naresh: Is the approach below (which we implemented) useful for you? – Ram Ghadiyaram May 30 '16 at 05:48
- If you are okay with it, you can mark it as accepted. – Ram Ghadiyaram May 30 '16 at 06:16
- @RamPrasadG It seems promising, though I haven't tested it yet. I will do so. Can't I do it simply by scheduling the `.sh` file as a `crontab` entry? – Naresh May 30 '16 at 06:18
- As additional information, Flume agents also run in the same way :-) We have done this for Flume as well as Spark Streaming, and both are working fine. – Ram Ghadiyaram May 30 '16 at 09:23
- spark-submit in cluster mode with the supervise flag will also achieve this: it schedules the driver on a worker node, and the supervise flag restarts the job on failures (see the sketch after these comments). – Knight71 Jun 07 '16 at 06:10
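A minimal sketch of the --supervise approach from the last comment, assuming a Spark standalone cluster manager (where --supervise is supported in cluster deploy mode); the master URL and application jar path are placeholders:

# run the driver on a worker node and restart it automatically if it fails
spark-submit \
  --class com.xx.xx.StreamingJob \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  /path/to/mysparkstreamingjob.jar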
2 Answers
This is the way to run two daemonised jobs; based on your requirements the number can be increased.
nohup ./mysparkstreamingjob.sh one > ../../logs/nohup-one.out 2> ../../logs/nohup-one.err < /dev/null &
nohup ./mysparkstreamingjob.sh two > ../../logs/nohup-two.out 2> ../../logs/nohup-two.err < /dev/null &
mysparkstreamingjob.sh will look like this:
#!/bin/sh
echo $CLASSPATH
spark-submit --verbose --jars $(echo /dirofjars/*.jar | tr ' ' ','),$SPARK_STREAMING_JAR --class com.xx.xx.StreamingJob \
--master yarn-client \
--num-executors 12 \
--executor-cores 4 \
--driver-memory 4G \
--executor-memory 4G \
--driver-class-path ../../config/properties/* \
--conf "spark.driver.extraJavaOptions=-XX:PermSize=256M -XX:MaxPermSize=512M" \
--conf "spark.shuffle.memoryFraction=0.5" \
--conf "spark.storage.memoryFraction=0.75" \
--conf "spark.storage.unrollFraction=0.2" \
--conf "spark.memory.fraction=0.75" \
--conf "spark.worker.cleanup.enabled=true" \
--conf "spark.worker.cleanup.interval=14400" \
--conf "spark.shuffle.io.numConnectionsPerPeer=5" \
--conf "spark.eventlog.enabled=true" \
--conf "spark.driver.extraLibrayPath=$HADOOP_HOME/*:$HBASE_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraLibraryPath=$HADOOP_HOME/*:$HBASE_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraClassPath=$(echo /dirofjars/*.jar | tr ' ' ',')" \
--conf "spark.yarn.executor.memoryOverhead=2048" \
--conf "spark.yarn.driver.memoryOverhead=1024" \
--conf "spark.eventLog.overwrite=true" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.akka.frameSize=1024" \
--files xxxx.properties,xxxx.properties \
$SPARK_STREAMING_JAR \
-DprocMySpark$1
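Once both instances are daemonised with nohup as above, a quick sanity check that they are still running (a minimal sketch, matching on the script name and the "one"/"two" tag used above; pgrep/pkill from procps are assumed to be available):

# list the daemonised instances together with their full command lines
pgrep -af mysparkstreamingjob.sh

# stop a single instance if needed, matching on its tag
pkill -f "mysparkstreamingjob.sh one"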
For custom log4j file rotation, you need to configure it yourself and pass that configuration to your spark-submit. Depending on the appender you use, rotation then happens the natural way Java + log4j handle it.
For example:
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=/tmp/log4j.properties"
Moreover, the Spark web UI (available by default) has all the logs, both high level and low level.

Ram Ghadiyaram
You should use Oozie to schedule your Spark Streaming job: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
This will give you a good overview of scheduling, managing and monitoring your Spark jobs: http://blog.cloudera.com/blog/2014/02/new-hue-demos-spark-ui-job-browser-oozie-scheduling-and-yarn-support/
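For illustration only, a minimal Oozie workflow using the Spark action could look roughly like this (the workflow name, jar path and spark-opts are placeholders; see the linked Spark action documentation for the authoritative schema):

<workflow-app name="spark-streaming-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>MySparkStreamingJob</name>
            <class>com.xx.xx.StreamingJob</class>
            <jar>${nameNode}/apps/spark/mysparkstreamingjob.jar</jar>
            <spark-opts>--num-executors 12 --executor-memory 4G</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>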

Puneet Chaurasia