
Situation: I am producing a delta folder with data from a previous Streaming Query A, and reading from it later in another DataFrame, as shown here:

DF_OUT.writeStream.format("delta").(...).start("path")

(...)

DF_IN = spark.readStream.format("delta").load("path")

1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.

2 - When I run it in the Scala REPL, however, it runs smoothly.
I am not sure what is happening there, but it sure is puzzling.

org.apache.spark.sql.AnalysisException: Table schema is not set.  Write data into it or use CREATE TABLE to set the schema.;
  at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
  at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
  at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)


3 Answers


From the Delta Lake Quick Guide - Troubleshooting:

  • Table schema is not set error

Problem: When the path of a Delta table does not exist and you try to stream data from it, you will get the following error:

org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;

Solution: Make sure the path of a Delta table is created.
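A minimal guard along those lines, assuming the Delta Lake Scala API (io.delta.tables) is on the classpath, would be to poll until the path holds an initialized Delta table before wiring up the reader:

import io.delta.tables.DeltaTable

// Wait until the table at "path" is initialized (its _delta_log carries a schema)
while (!DeltaTable.isDeltaTable(spark, "path")) {
  Thread.sleep(5000)
}
val DF_IN = spark.readStream.format("delta").load("path")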

– jsay

After reading the error message, I did try to be a good boy and follow the advice, so I made sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling readStream, and voilà!


import java.io.File

// Returns true once the directory exists and contains at least one file
def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  d.exists && d.isDirectory && d.listFiles.exists(_.isFile)
}

DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

// Poll until the writing query has produced at least one file
while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}

print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)

val DF_IN = spark.readStream.format("delta").load(DELTA_DIR)

It ended up working, but I had to make sure to wait long enough for "something to happen" (I don't know what exactly, TBH, but it seems that reading from delta needs some writes to be complete first, maybe metadata?).

However, this is still a hack. I wish it were possible to start reading from an empty delta folder and wait for content to start pouring into it.
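My guess at the "something": the first commit in Delta's transaction log, which is where the schema lives. If so, polling for JSON commit files under _delta_log would be a slightly more targeted check than counting arbitrary files; a sketch, assuming a local filesystem path:

import java.io.File

// Delta stores its schema in JSON commit files under _delta_log;
// the read side can only resolve the schema once the first commit exists.
def hasCommits(dir: String): Boolean = {
  val log = new File(dir, "_delta_log")
  log.isDirectory && log.listFiles.exists(f => f.isFile && f.getName.endsWith(".json"))
}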

– Mehdi LAMRANI
    How about creating an empty zero version of Delta Lake to initialize it just by saving an empty Dataset? E.g. `spark.emptyDataset[(Long, String)].write.format("delta").save("/tmp/delta/t1")`? – Jacek Laskowski Dec 29 '19 at 20:31
  • 1
  • @JacekLaskowski I thought about that, but I wondered if it would affect my schema, as once I write real data into it there will be a schema. So I refrained from doing that. As you are a highly respected expert on the topic I won't cast doubt on your suggestion and will try it and post the result when it works – Mehdi LAMRANI Dec 30 '19 at 13:17
  • 1
    I'd use for the zero version the schema of the streaming query so there's no schema mismatch. – Jacek Laskowski Dec 30 '19 at 19:21
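Following up on these comments, a sketch of what that zero version could look like with an explicit schema (the column names, types, and path here are hypothetical stand-ins for the streaming query's actual schema and target):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema; use the streaming query's schema to avoid a mismatch
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

// Commit version 0 so readStream can resolve the schema before any real data arrives
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write.format("delta").mode("overwrite").save("/tmp/delta/t1")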

For me, I couldn't find the absolute path, so a simple solution was using this alternative:

spark.readStream.format("delta").table("tableName")
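For completeness, a sketch of how the table-based variant might look on both sides, assuming Spark 3.1+ (where DataStreamWriter.toTable and DataStreamReader.table exist) and a hypothetical table name:

// Write side: register the stream output as a metastore table ("events" is hypothetical)
DF_OUT.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // hypothetical path
  .toTable("events")

// Read side: no absolute path needed, the table name resolves it
val DF_IN = spark.readStream.format("delta").table("events")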

– Ohad Bitton