
I'm trying the simplest Auto Loader example included on the Databricks website:

https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(input_data_path))

(df.writeStream.format("delta")
 .option("checkpointLocation", chkpt_path)
 .table("iot_stream"))

I keep getting this message:

IllegalArgumentException: cloudFiles.schemaLocation Could not find required option: schemaLocation. Please provide a schema location using cloudFiles.schemaLocation for storing inferred schema and supporting schema evolution.

If providing cloudFiles.schemaLocation is required, why is it missing from the examples everywhere? What's the underlying issue here?

hikizume
  • Did you get an answer to this issue? I am also facing the same error, and even after trying other options I am not able to make it work. – Nikunj Kakadiya Oct 03 '22 at 07:44

2 Answers


I suspect what is going on is that you are not explicitly setting .option("cloudFiles.schemaEvolutionMode").

That means it falls back to the default, "addNewColumns", as per https://docs.databricks.com/ingestion/auto-loader/options.html, and that mode requires you to set .option("cloudFiles.schemaLocation", path) on the reader.

Thus you are inadvertently requiring schemaLocation without ever setting it.
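A minimal sketch of that fix, assuming the inferred schema can be stored alongside the checkpoint (any persistent, writable path works as the schema location):

spark.sql("SET spark.databricks.cloudFiles.schemaInference.enabled=true")  # optional, as in the demo notebook

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", chkpt_path)  # persist the inferred schema here
      .load(input_data_path))

(df.writeStream.format("delta")
 .option("checkpointLocation", chkpt_path)
 .table("iot_stream"))

Alternatively, if I recall correctly, passing an explicit schema with .schema(...) and setting cloudFiles.schemaEvolutionMode to "none" also avoids the requirement, since no schema needs to be inferred or evolved.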

Chris de Groot


In the example notebook, this config is being set on the session:

spark.sql("SET spark.databricks.cloudFiles.schemaInference.enabled=true")

Assuming you are running the same Auto Loader read/write code block, are you still receiving the error message even with this schema inference setting enabled?
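For reference, a rough sketch of that setup, reusing input_data_path and chkpt_path from the question and setting the config in the same session before starting the stream:

spark.sql("SET spark.databricks.cloudFiles.schemaInference.enabled=true")

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(input_data_path))

(df.writeStream.format("delta")
 .option("checkpointLocation", chkpt_path)
 .table("iot_stream"))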