
My Scala program starts streaming from Kafka to a database and then exits to the command shell. The streaming keeps working, and its "application" is visible on the Spark UI page.

My Scala program looks like this:

val ssc = SparkHelper.getStreamingContext(spark, checkpointDir)
val dataFlowStream = KafkaHelper.getKafkaStream[DataFlow, DataFlowDeserializer](
  ssc,
  Topics.dataFlowTopic,
  config.getString("trackrecordsmetrics.kafka.group-id"),
  classOf[DataFlowDeserializer]
)

...

ssc.start()
ssc.awaitTermination()

When I start it, it writes:

21/04/08 08:58:54 INFO Utils: Successfully started service 'driverClient' on port 32841.
21/04/08 08:58:54 INFO TransportClientFactory: Successfully created connection to sparkmaster/IPADDRESS:7077 after 22 ms (0 ms spent in bootstraps)
21/04/08 08:58:54 INFO ClientEndpoint: ... waiting before polling master for driver state
21/04/08 08:58:54 INFO ClientEndpoint: Driver successfully submitted as driver-20210408085854-0016
21/04/08 08:58:59 INFO ClientEndpoint: State of driver-20210408085854-0016 is RUNNING
21/04/08 08:58:59 INFO ClientEndpoint: Driver running on IPADDRESS:45233 (worker-20210405082522-IPADDRESS-45233)
21/04/08 08:58:59 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
21/04/08 08:58:59 INFO ShutdownHookManager: Shutdown hook called
21/04/08 08:58:59 INFO ShutdownHookManager: Deleting directory /tmp/spark-105e3428-ba75-47ca-9d3b-20aa48a45898

And then it exits. I can still see its application on the Spark UI page and see the effects of its work, i.e. the driver client exits while the application keeps working.
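
As far as I can tell, the ClientEndpoint/driverClient messages and the line "spark-submit not configured to wait for completion, exiting spark-submit JVM" mean the Scala job is being submitted to the standalone cluster in cluster deploy mode, where the driver is launched on a worker and the submitting JVM exits on its own. A rough sketch of such a submit command (the class and jar names here are placeholders, not my actual ones):

# sketch: class and jar names are placeholders
spark-submit \
  --master spark://sparkmaster:7077 \
  --deploy-mode cluster \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar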

In contrast, when I run a similar Python script, it works only as long as the script itself is running. The end of my script is similar:

query = df \
    .writeStream \
    .outputMode('update') \
    .foreachBatch(write_data_frame) \
    .start()

query.awaitTermination()

If I press Ctrl-C, the script ends and streaming stops. If I remove the query.awaitTermination() line, streaming never starts.

Is it possible to submit the Python script so that it remains running on the cluster?

I am submitting the Python script with:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.mariadb.jdbc:mariadb-java-client:2.7.2 MYSCRIPT.py

I can't specify cluster deploy mode, because it is not supported for Python applications on a standalone cluster.
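
One workaround I am considering (just a sketch, not verified) is to keep client mode but detach the driver process from the terminal so it survives the end of the shell session, e.g. with nohup:

# sketch: stay in client mode but detach from the terminal; the log file name is arbitrary
nohup spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.mariadb.jdbc:mariadb-java-client:2.7.2 \
  MYSCRIPT.py > myscript.log 2>&1 &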

Dims
  • I'm not sure I understand. Neither Python nor Scala with `awaitTermination` should exit, and you need to leave the process running for streaming to work... Cluster deploy mode should work fine with pyspark – OneCricketeer Apr 08 '21 at 02:37
  • Okay, how do I exit but keep streaming? – Dims Apr 08 '21 at 08:40
  • @OneCricketeer The Scala program also has an `awaitTermination` call, but it exits somehow. This is also what I need to understand. – Dims Apr 08 '21 at 09:11
  • Refer to this: https://stackoverflow.com/questions/37200388/how-to-exit-spark-submit-after-the-submission (`--conf spark.yarn.submit.waitAppCompletion=false`) – yammanuruarun Apr 17 '21 at 12:48

0 Answers