Is it possible to update the schema (change the data type of a column) of a non-empty table in Databricks, loaded by a streaming Auto Loader job, without impacting the checkpoint folder?
Is there any workaround to achieve this?

Update: The data is read by Auto Loader as follows:

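df = spark.readStream \
     .format("cloudFiles") \
     .option("cloudFiles.format", "parquet") \
     .option("cloudFiles.schemaEvolutionMode", "rescue") \
     .schema(schema_str) \
     .load(source_path)  # source_path: placeholder for the input directory, not given in the question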

Is it possible to implement the change without losing the checkpoint history?

marie20
  • Can you provide information on how you read data with Auto Loader, what kind of type change is required, etc.? – Alex Ott Apr 19 '23 at 11:34
  • Hi Alex, thanks for your reply. I have updated the question with the code. The data-type change will be `int` to `long` for a particular column in the df. Can we preserve the checkpoint by any means? – marie20 Apr 20 '23 at 12:06
  • Hi @marie20, I have not used Auto Loader personally, so I am just thinking out loud... Would it be possible to have a dummy file with a single record, enable schema evolution, provide a schema hint for this particular column as `long` (maybe for other columns too, so that we do not have type issues later), then delete that record after the load and revert the schema evolution options? Not sure if it would work, so please do not try it in prod, maybe as a test... – rainingdistros Apr 21 '23 at 09:04
  • Hi @rainingdistros, thank you for your reply. I will try out your solution. However, could you please advise which schema evolution mode I should set the job to? For example, `failOnNewColumns`, `addNewColumns`, or something different? – marie20 Apr 21 '23 at 20:46
  • @marie20, I am quite hesitant to suggest one as I have not used it personally. Please go through the [link](https://docs.databricks.com/ingestion/auto-loader/schema.html#override-schema-inference-with-schema-hints). A little way above the linked section it says that with `addNewColumns - Stream fails. New columns are added to the schema. Existing columns do not evolve data types.` How about trying only the option `cloudFiles.schemaHints` and no other changes? Apologies for the confusion... – rainingdistros Apr 22 '23 at 14:29
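
For reference, the `cloudFiles.schemaHints` idea from the last comment could be sketched roughly as below. This is an untested sketch under stated assumptions, not a confirmed fix: the column name `id`, the `schema_location` argument, and the helper names `autoloader_options` and `build_reader` are all hypothetical. According to the linked documentation, schema hints are only used when no schema is passed to Auto Loader, so the sketch drops `.schema(schema_str)` and relies on `cloudFiles.schemaLocation` for inference instead. Whether the existing checkpoint survives such a change is not established here.

```python
from typing import Dict


def autoloader_options(hints: Dict[str, str]) -> Dict[str, str]:
    """Build Auto Loader reader options, with one schema hint per column.

    `hints` maps a column name to the desired Spark SQL type,
    e.g. {"id": "long"} (the column name is hypothetical).
    """
    return {
        "cloudFiles.format": "parquet",
        "cloudFiles.schemaEvolutionMode": "rescue",
        # Hints are passed as a comma-separated "column type" list.
        "cloudFiles.schemaHints": ", ".join(
            f"{column} {dtype}" for column, dtype in hints.items()
        ),
    }


def build_reader(spark, schema_location: str, hints: Dict[str, str]):
    """Return a streaming reader that infers the schema and applies the hints.

    `spark` is assumed to be an active SparkSession on Databricks.
    No explicit .schema(...) is set, because hints apply only to inferred schemas.
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**autoloader_options(hints))
        # Schema inference needs a location to persist the inferred schema.
        .option("cloudFiles.schemaLocation", schema_location)
    )
```

Calling `autoloader_options({"id": "long"})` yields the hint string `id long`. As the commenter says, this is best tried on a test copy of the stream rather than in prod.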

0 Answers