
I have the script below, which I run in my Unity Catalog enabled Databricks workspace, and I get the error below. The same schema and code worked for my other tenant in a different workspace, and I was hoping it would be the same for this tenant; right now I don't have time to look into the schema changes. Can anyone tell me what the error suggests? And what will happen if I put .option("mergeSchema", "true") in my code below?

Error:

org.apache.spark.sql.AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: 50a8bd1e32). To enable schema migration using DataFrameWriter or DataStreamWriter, please set: '.option("mergeSchema", "true")'. For other operations, set the session configuration spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation specific to the operation for details.

Code:

table_name = "main.auditlogs.Customer_Logs"
checkpoint_path = "/mnt/_checkpoint"
file_path = "/mnt/topics/audit-logs"

# DDL-style schema string for the incoming JSON audit-log records
schema = "authenticationMechanism STRING, authenticationMechanismId STRING, environment STRUCT<cdfCluster: STRING, loggerInfo STRUCT<transport: STRING>, method STRING, path STRING, principalProjectId STRING, principalProjectIds ARRAY<STRING>, principalUniqueId STRING, principalUniqueName STRING, remoteIp STRING, requestId STRING, requestProjectId STRING, requestProjectUrlName STRING, requestTicket STRING, sourceCommitHash STRING, timestamp STRING, xHeaderFields ARRAY<STRUCT<key: STRING, value: STRING>>>"

(spark.readStream
  .format("cloudFiles")                                  # Auto Loader
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("ignoreMissingFiles", "true")
  .schema(schema)                                        # explicit schema instead of inference
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)                            # process available files, then stop
  .toTable(table_name))

Is there any way I can ignore the incoming schema and load everything using the schema of the table I have already created?


1 Answer


When you use .option("mergeSchema", "true"), the destination table schema will be updated to match the schema of your DataFrame: new columns and new struct fields will be added, and some column types may be widened to accommodate bigger data types.
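For illustration, here is a minimal sketch of where the option goes in the pipeline from the question (it belongs on the write side, after .writeStream), reusing the variables defined above:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .schema(schema)
  .load(file_path)
  .writeStream
  .option("mergeSchema", "true")   # enable schema evolution for this Delta write
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))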

So you have the following choice:

  • add the .option("mergeSchema", "true") when writing the table (as in the sketch above), and your table schema will be updated; a session-level alternative is shown after the example below

  • use only the columns that you have in the destination table, but then you will lose the new columns; just add .select(*orig_table.columns) between .load and .writeStream:

orig_table = spark.read.table(table_name)

(spark.readStream
  ...                              # same Auto Loader options as in the question
  .load(file_path)
  .select(*orig_table.columns)     # keep only the columns that exist in the destination table
  .writeStream
  ...)
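If you would rather enable schema evolution for the whole session instead of per write, the error message also mentions a session configuration; a minimal sketch:

# session-wide alternative to the per-write mergeSchema option (quoted in the error message)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")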

You can find more information in this blog post.

Alex Ott
  • My destination table is a Delta Lake table which I created in the step above without any schema, just by running this command: "%sql CREATE TABLE IF NOT EXISTS main.auditlogs.Customer_Logs". Also, where do I put .option("mergeSchema", "true")? I tried it between .load and .writeStream, and between .schema() and .load(); I get the same error. – ZZZSharePoint Jun 17 '23 at 09:43
  • You need to put `.option` after `.writeStream` – Alex Ott Jun 17 '23 at 10:19
  • Thanks. Another question, a bit away from the original one: my code above has now been running for a day and the process is still unfinished. Is there a way I can optimize the query above and run it faster? I couldn't select a Photon accelerator cluster since Unity Catalog is not supported there. What more can I do? – ZZZSharePoint Jun 18 '23 at 03:15
  • UC and Photon are orthogonal things. Photon is just an execution engine and works fine with UC… – Alex Ott Jun 18 '23 at 06:26
  • But when I create a cluster, it asks me to select the Photon accelerator; when I select it, it says UC is not supported. – ZZZSharePoint Jun 18 '23 at 08:18
  • So my cluster currently running the script does not have the Photon execution engine. – ZZZSharePoint Jun 18 '23 at 08:19
  • Just created a shared UC cluster with Photon, no problems... Maybe you are using an old runtime or something like that? – Alex Ott Jun 18 '23 at 09:51
  • In Azure Databricks? – ZZZSharePoint Jun 18 '23 at 10:04
  • Yes, and on AWS... – Alex Ott Jun 18 '23 at 12:50