My Scala program starts streaming from Kafka to a database and then exits back to the command shell. The streaming keeps working, and its application is visible on the Spark master page.

My Scala program looks like this:
val ssc = SparkHelper.getStreamingContext(spark, checkpointDir)
val dataFlowStream = KafkaHelper.getKafkaStream[DataFlow, DataFlowDeserializer](
  ssc,
  Topics.dataFlowTopic,
  config.getString("trackrecordsmetrics.kafka.group-id"),
  classOf[DataFlowDeserializer])
...
ssc.start()             // start the streaming computation
ssc.awaitTermination()  // block until the context is stopped
When I start it, it writes:
21/04/08 08:58:54 INFO Utils: Successfully started service 'driverClient' on port 32841.
21/04/08 08:58:54 INFO TransportClientFactory: Successfully created connection to sparkmaster/IPADDRESS:7077 after 22 ms (0 ms spent in bootstraps)
21/04/08 08:58:54 INFO ClientEndpoint: ... waiting before polling master for driver state
21/04/08 08:58:54 INFO ClientEndpoint: Driver successfully submitted as driver-20210408085854-0016
21/04/08 08:58:59 INFO ClientEndpoint: State of driver-20210408085854-0016 is RUNNING
21/04/08 08:58:59 INFO ClientEndpoint: Driver running on IPADDRESS:45233 (worker-20210405082522-IPADDRESS-45233)
21/04/08 08:58:59 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
21/04/08 08:58:59 INFO ShutdownHookManager: Shutdown hook called
21/04/08 08:58:59 INFO ShutdownHookManager: Deleting directory /tmp/spark-105e3428-ba75-47ca-9d3b-20aa48a45898
And then it exits. I can still see its application on the Spark page and see the effects of its work. That is, spark-submit exits while the application keeps working; judging by the ClientEndpoint lines, the jar is submitted in cluster deploy mode, so the driver itself runs on a worker and outlives spark-submit.
By contrast, when I run a similar Python script, it works only while the script itself is running. The end of my script is similar:
query = df \
    .writeStream \
    .outputMode('update') \
    .foreachBatch(write_data_frame) \
    .start()
query.awaitTermination()
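For context, write_data_frame is the foreachBatch callback that writes each micro-batch to MariaDB over JDBC. A minimal sketch of the part of the script above this snippet, where every host, topic, table, and credential is a placeholder rather than my real configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MYSCRIPT').getOrCreate()

# Kafka source (bootstrap servers and topic are placeholders)
df = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'KAFKAHOST:9092') \
    .option('subscribe', 'MYTOPIC') \
    .load()

# foreachBatch callback: writes one micro-batch to MariaDB over JDBC
def write_data_frame(batch_df, batch_id):
    batch_df.write \
        .format('jdbc') \
        .option('url', 'jdbc:mariadb://DBHOST:3306/MYDB') \
        .option('driver', 'org.mariadb.jdbc.Driver') \
        .option('dbtable', 'MYTABLE') \
        .option('user', 'MYUSER') \
        .option('password', 'MYPASSWORD') \
        .mode('append') \
        .save()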
If I press Ctrl-C, the script ends and the streaming stops. If I remove the line query.awaitTermination(), the streaming never starts, presumably because start() returns immediately, the script reaches its end, and the driver shuts down before the query processes anything.
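A small illustration of that non-blocking behavior (none of this is from my real script):

query = df.writeStream \
    .outputMode('update') \
    .foreachBatch(write_data_frame) \
    .start()

# start() has already returned here; the query runs on background threads.
print(query.isActive)  # True while the query is running
print(query.status)    # e.g. {'message': 'Processing new data', ...}

# Without this call the script falls off the end and the driver shuts down,
# taking the query with it.
query.awaitTermination()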
Is it possible to submit a Python script so that it keeps running on the cluster after spark-submit exits?
I am submitting the Python script with:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.mariadb.jdbc:mariadb-java-client:2.7.2 MYSCRIPT.py
I can't use --deploy-mode cluster, because cluster deploy mode is not supported for Python applications on standalone clusters.