
I'm reading Parquet files and trying to load them into the target Delta table using the following code:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  .schema(<schema of my target table>)
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

It throws the error "Parquet column cannot be converted. Column: [my_column], Expected Timestamp, Found: INT64." The column my_column is defined as timestamp in the target table, but in the source data files it is sometimes an INT64.

By specifying option("cloudFiles.schemaEvolutionMode", "rescue") I expect Auto Loader to put all data with mismatching types into the _rescued_data column instead of throwing an error.

Why doesn't it behave like this?
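
For reference, here is a variant I'm considering, where the schema is inferred and cloudFiles.schemaHints is used instead of an explicit .schema(). The column name and hinted type are just taken from my example above, and I'm not sure whether this actually changes the rescue behaviour for existing columns:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  # hint the expected type instead of fixing the full schema up front
  .option("cloudFiles.schemaHints", "my_column timestamp")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

After a run, the rescued rows could be checked with something like this (assuming the default rescued-data column name _rescued_data):

spark.table(table_name).filter("_rescued_data IS NOT NULL").show(truncate=False)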

archjkeee
  • For `rescue`, the doc says _Schema is never evolved and stream does not fail due to schema changes. All new columns are recorded in the rescued data column._ It seems like it only allows for new columns, not changes to existing columns. But I find the documentation a bit ambiguous - I can't find an explanation on how you would capture an invalid record without failing – Nick.Mc Aug 24 '23 at 00:45
  • Provide your sample data. – JayashankarGS Aug 24 '23 at 06:11

0 Answers