
I'm reading Parquet files and trying to load them into the target Delta table using the following code:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  .schema(<schema of my target table>)
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

It throws the error "Parquet column cannot be converted. Column: [my_column], Expected Timestamp, Found: INT64." The column my_column is defined as timestamp in the target table, but in the source data files it is sometimes an INT64.

By specifying option("cloudFiles.schemaEvolutionMode", "rescue") I expect Auto Loader to put all data with mismatching types into the _rescued_data column instead of throwing an error.

Why doesn't it behave like this?
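
For reference, here is a variant I'm considering, where the schema is inferred and cloudFiles.schemaHints is used instead of an explicit .schema(). The column name and hinted type are just taken from my example above, and I'm not sure whether this actually changes the rescue behaviour for existing columns:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  # hint the expected type instead of fixing the full schema up front
  .option("cloudFiles.schemaHints", "my_column timestamp")
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

After a run, the rescued rows could be checked with something like this (assuming the default rescued-data column name _rescued_data):

spark.table(table_name).filter("_rescued_data IS NOT NULL").show(truncate=False)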

archjkeee
  • For `rescue`, the doc says _Schema is never evolved and stream does not fail due to schema changes. All new columns are recorded in the rescued data column._ It seems like it only allows for new columns, not changes to existing columns. But I find the documentation a bit ambiguous - I can't find an explanation on how you would capture an invalid record without failing – Nick.Mc Aug 24 '23 at 00:45
  • Provide your sample data. – JayashankarGS Aug 24 '23 at 06:11

0 Answers