
My Scala program starts streaming from Kafka to a database and then exits to the command shell. The streaming keeps working, and its "application" is visible on the Spark UI page.

My Scala program looks like this:

val ssc = SparkHelper.getStreamingContext(spark, checkpointDir)
val dataFlowStream = KafkaHelper.getKafkaStream[DataFlow, DataFlowDeserializer](
  ssc,
  Topics.dataFlowTopic,
  config.getString("trackrecordsmetrics.kafka.group-id"),
  classOf[DataFlowDeserializer]
)

...

ssc.start()
ssc.awaitTermination()

When I start it, it writes:

21/04/08 08:58:54 INFO Utils: Successfully started service 'driverClient' on port 32841.
21/04/08 08:58:54 INFO TransportClientFactory: Successfully created connection to sparkmaster/IPADDRESS:7077 after 22 ms (0 ms spent in bootstraps)
21/04/08 08:58:54 INFO ClientEndpoint: ... waiting before polling master for driver state
21/04/08 08:58:54 INFO ClientEndpoint: Driver successfully submitted as driver-20210408085854-0016
21/04/08 08:58:59 INFO ClientEndpoint: State of driver-20210408085854-0016 is RUNNING
21/04/08 08:58:59 INFO ClientEndpoint: Driver running on IPADDRESS:45233 (worker-20210405082522-IPADDRESS-45233)
21/04/08 08:58:59 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
21/04/08 08:58:59 INFO ShutdownHookManager: Shutdown hook called
21/04/08 08:58:59 INFO ShutdownHookManager: Deleting directory /tmp/spark-105e3428-ba75-47ca-9d3b-20aa48a45898

And then it exits. I can still see its application on the Spark UI page and see the effects of its work, i.e. the driver client exits while the application keeps working.
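
As far as I can tell, the ClientEndpoint/driverClient messages and the line "spark-submit not configured to wait for completion, exiting spark-submit JVM" mean the Scala job is being submitted to the standalone cluster in cluster deploy mode, where the driver is launched on a worker and the submitting JVM exits on its own. A rough sketch of such a submit command (the class and jar names here are placeholders, not my actual ones):

# sketch: class and jar names are placeholders
spark-submit \
  --master spark://sparkmaster:7077 \
  --deploy-mode cluster \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar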

In contrast, when I run a similar Python script, it works only as long as the script itself is running. The end of my script is similar:

query = df \
    .writeStream \
    .outputMode('update') \
    .foreachBatch(write_data_frame) \
    .start()

query.awaitTermination()

If I press Ctrl-C, the script ends and streaming stops. If I remove the query.awaitTermination() line, streaming never starts.

Is it possible to submit the Python script so that it remains running on the cluster?

I am submitting the Python script with:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.mariadb.jdbc:mariadb-java-client:2.7.2 MYSCRIPT.py

I can't specify cluster deploy mode, because it is not supported for Python applications on a standalone cluster.
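
One workaround I am considering (just a sketch, not verified) is to keep client mode but detach the driver process from the terminal so it survives the end of the shell session, e.g. with nohup:

# sketch: stay in client mode but detach from the terminal; the log file name is arbitrary
nohup spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.mariadb.jdbc:mariadb-java-client:2.7.2 \
  MYSCRIPT.py > myscript.log 2>&1 &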

Dims
  • I'm not sure I understand. Neither Python nor Scala with `awaitTermination` should exit, and you need to leave the process running for streaming to work... Cluster deploy mode should work fine with pyspark – OneCricketeer Apr 08 '21 at 02:37
  • Okay, how do I exit but keep streaming? – Dims Apr 08 '21 at 08:40
  • @OneCricketeer The Scala program also has an `awaitTermination` call, but it exits somehow. This is also what I need to understand. – Dims Apr 08 '21 at 09:11
  • Refer to this: https://stackoverflow.com/questions/37200388/how-to-exit-spark-submit-after-the-submission (`--conf spark.yarn.submit.waitAppCompletion=false`) – yammanuruarun Apr 17 '21 at 12:48

0 Answers