
Currently, I'm having an issue where a Spark dataframe (Auto Loader) in one cell may take a few moments to write its data, and the following cell references the table written by the first cell. However, if the entire notebook is run (particularly as a Job), then due to the distributed nature of Spark, the second cell runs before the first cell has fully completed. How can I have the second cell await the finish of the writeStream without putting them in separate notebooks?

Example:

Cell1

autoload = pysparkDF.writeStream.format('delta')....table('TABLE1')

Cell2

df = spark.sql('select count(*) from TABLE1')
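For reference, writeStream....table() starts the query asynchronously and returns a StreamingQuery handle right away, so nothing blocks the notebook. A fuller sketch of Cell1 (the Auto Loader path, file format, and checkpoint location here are made-up placeholders):

  # Auto Loader source; path, format, and checkpoint location are hypothetical
  autoload = (spark.readStream
      .format('cloudFiles')
      .option('cloudFiles.format', 'json')
      .load('/mnt/landing/events')
      .writeStream
      .format('delta')
      .option('checkpointLocation', '/mnt/checkpoints/table1')
      .table('TABLE1'))   # returns immediately; the write continues in the background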

1 Answer


You need to use the awaitTermination function to wait until the stream processing is finished (see the docs). Like this:

  • cell 1
autoload = pysparkDF.writeStream.format('delta')....table('TABLE1')
autoload.awaitTermination()
  • cell 2
df = spark.sql('select count(*) from TABLE1')

although it would be easier to read and harder to get wrong with something like this:

cnt = spark.read.table('TABLE1').count()  # count() returns an int, not a dataframe
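Note that awaitTermination() only returns when the query stops, so a continuously running stream would block forever; in a Job you would typically pair it with a trigger that stops on its own once the backlog is processed. A minimal sketch (the availableNow trigger is an assumption on my part and requires Spark 3.3+; trigger(once=True) is the older alternative):

  autoload = (pysparkDF.writeStream
      .format('delta')
      .trigger(availableNow=True)   # process everything available, then stop
      .table('TABLE1'))
  autoload.awaitTermination()       # now returns once the backlog is written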

Update: To wait for multiple streams:

while len(spark.streams.active) > 0:
    # otherwise awaitAnyTermination() returns immediately after the first stream has terminated
    spark.streams.resetTerminated()
    spark.streams.awaitAnyTermination()
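For example, with two hypothetical queries started in earlier cells (the names and paths are made up), the loop above blocks until both have terminated:

  q1 = df1.writeStream.format('delta').option('checkpointLocation', '/chk/t1').table('TABLE1')
  q2 = df2.writeStream.format('delta').option('checkpointLocation', '/chk/t2').table('TABLE2')
  # ... the while loop then waits for both q1 and q2 to finish (assuming finite triggers)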
  • Thanks, but this only helps for a single stream. How do we wait if there are multiple streams running? We need to wait for all streams to complete. – Sunil Jun 15 '22 at 01:30
  • Is there any way to wait for multiple streams? – Athi Jan 13 '23 at 00:31