
Situation: I am producing a delta folder with data from a previous Streaming Query A, and reading from it later in another DataFrame, as shown here:

DF_OUT.writeStream.format("delta").(...).start("path")

(...)

DF_IN = spark.readStream.format("delta").load("path")

1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.

2 - When I run it in the Scala REPL, however, it runs smoothly.
I am not sure what is happening there, but it sure is puzzling.

org.apache.spark.sql.AnalysisException: Table schema is not set.  Write data into it or use CREATE TABLE to set the schema.;
  at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
  at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
  at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)


3 Answers


From the Delta Lake Quick Guide - Troubleshooting:

  • Table schema is not set error

Problem: When the path of a Delta table does not exist and you try to stream data from it, you will get the following error:

org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;

Solution: Make sure the path of a Delta table is created.
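A minimal guard along those lines, assuming the Delta Lake Scala API (io.delta.tables) is on the classpath, would be to poll until the path holds an initialized Delta table before wiring up the reader:

import io.delta.tables.DeltaTable

// Wait until the table at "path" is initialized (its _delta_log carries a schema)
while (!DeltaTable.isDeltaTable(spark, "path")) {
  Thread.sleep(5000)
}
val DF_IN = spark.readStream.format("delta").load("path")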

– jsay

After reading the error message, I did try to be a good boy and follow the advice, so I made sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling readStream, and voilà!


import java.io.File

// Returns true once the directory exists and contains at least one file
def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  d.exists && d.isDirectory && d.listFiles.exists(_.isFile)
}

DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

// Poll until the writing query has produced at least one file
while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}

print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)

val DF_IN = spark.readStream.format("delta").load(DELTA_DIR)

It ended up working, but I had to make sure to wait long enough for "something to happen" (I don't know what exactly, TBH, but it seems that reading from delta needs some writes to be complete first, maybe metadata?).

However, this is still a hack. I wish it were possible to start reading from an empty delta folder and wait for content to start pouring into it.
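My guess at the "something": the first commit in Delta's transaction log, which is where the schema lives. If so, polling for JSON commit files under _delta_log would be a slightly more targeted check than counting arbitrary files; a sketch, assuming a local filesystem path:

import java.io.File

// Delta stores its schema in JSON commit files under _delta_log;
// the read side can only resolve the schema once the first commit exists.
def hasCommits(dir: String): Boolean = {
  val log = new File(dir, "_delta_log")
  log.isDirectory && log.listFiles.exists(f => f.isFile && f.getName.endsWith(".json"))
}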

– Mehdi LAMRANI
    How about creating an empty zero version of Delta Lake to initialize it just by saving an empty Dataset? E.g. `spark.emptyDataset[(Long, String)].write.format("delta").save("/tmp/delta/t1")`? – Jacek Laskowski Dec 29 '19 at 20:31
  • 1
  • @JacekLaskowski I thought about that, but I wondered if it would affect my schema, as once I write real data into it there will be a schema. So I refrained from doing that. As you are a highly respected expert on the topic I won't cast doubt on your suggestion and will try it and post the result when it works – Mehdi LAMRANI Dec 30 '19 at 13:17
  • 1
    I'd use for the zero version the schema of the streaming query so there's no schema mismatch. – Jacek Laskowski Dec 30 '19 at 19:21
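Following up on these comments, a sketch of what that zero version could look like with an explicit schema (the column names, types, and path here are hypothetical stand-ins for the streaming query's actual schema and target):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema; use the streaming query's schema to avoid a mismatch
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

// Commit version 0 so readStream can resolve the schema before any real data arrives
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write.format("delta").mode("overwrite").save("/tmp/delta/t1")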

For me, I couldn't find the absolute path, so a simple solution was using this alternative:

spark.readStream.format("delta").table("tableName")
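For completeness, a sketch of how the table-based variant might look on both sides, assuming Spark 3.1+ (where DataStreamWriter.toTable and DataStreamReader.table exist) and a hypothetical table name:

// Write side: register the stream output as a metastore table ("events" is hypothetical)
DF_OUT.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // hypothetical path
  .toTable("events")

// Read side: no absolute path needed, the table name resolves it
val DF_IN = spark.readStream.format("delta").table("events")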

– Ohad Bitton