
Currently, I'm having an issue where a Spark dataframe (Auto Loader) in one cell may take a few moments to write its data, and the following cell references the table written by the first cell. However, if the entire notebook is run (particularly as a Job), then due to the distributed nature of Spark, the second cell runs before the first cell has fully completed. How can I have the second cell await the finish of the writeStream without putting them in separate notebooks?

Example:

Cell1

autoload = pysparkDF.writeStream.format('delta')....table('TABLE1')

Cell2

df = spark.sql('select count(*) from TABLE1')
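For reference, writeStream....table() starts the query asynchronously and returns a StreamingQuery handle right away, so nothing blocks the notebook. A fuller sketch of Cell1 (the Auto Loader path, file format, and checkpoint location here are made-up placeholders):

  # Auto Loader source; path, format, and checkpoint location are hypothetical
  autoload = (spark.readStream
      .format('cloudFiles')
      .option('cloudFiles.format', 'json')
      .load('/mnt/landing/events')
      .writeStream
      .format('delta')
      .option('checkpointLocation', '/mnt/checkpoints/table1')
      .table('TABLE1'))   # returns immediately; the write continues in the background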

1 Answer


You need to use the awaitTermination function to wait until the stream processing is finished (see the docs). Like this:

  • cell 1
autoload = pysparkDF.writeStream.format('delta')....table('TABLE1')
autoload.awaitTermination()
  • cell 2
df = spark.sql('select count(*) from TABLE1')

although it would be easier to read and harder to get wrong with something like this:

cnt = spark.read.table('TABLE1').count()  # count() returns an int, not a dataframe
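Note that awaitTermination() only returns when the query stops, so a continuously running stream would block forever; in a Job you would typically pair it with a trigger that stops on its own once the backlog is processed. A minimal sketch (the availableNow trigger is an assumption on my part and requires Spark 3.3+; trigger(once=True) is the older alternative):

  autoload = (pysparkDF.writeStream
      .format('delta')
      .trigger(availableNow=True)   # process everything available, then stop
      .table('TABLE1'))
  autoload.awaitTermination()       # now returns once the backlog is written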

Update: To wait for multiple streams:

while len(spark.streams.active) > 0:
    # otherwise awaitAnyTermination() returns immediately after the first stream has terminated
    spark.streams.resetTerminated()
    spark.streams.awaitAnyTermination()
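For example, with two hypothetical queries started in earlier cells (the names and paths are made up), the loop above blocks until both have terminated:

  q1 = df1.writeStream.format('delta').option('checkpointLocation', '/chk/t1').table('TABLE1')
  q2 = df2.writeStream.format('delta').option('checkpointLocation', '/chk/t2').table('TABLE2')
  # ... the while loop then waits for both q1 and q2 to finish (assuming finite triggers)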
  • Thanks, but this only helps for a single stream. How do we wait if there are multiple streams running? We need to wait for all streams to complete. – Sunil Jun 15 '22 at 01:30
  • Is there any way to wait for multiple streams? – Athi Jan 13 '23 at 00:31