
I have a big data pipeline in Spark that writes output as Parquet to Delta Lake (backed by storage accounts on Azure). The output schemas keep changing while I'm still figuring out what they need to be, and sometimes that means a column needs to change its datatype (e.g. a string now needs to be an int). I can't simply make this change, though, because the write to the Delta table then fails with schema mismatch errors. So far my workaround has been to rename the column instead (e.g. ColumnA, which was a string, becomes ColumnAInt). This isn't very clean, but I've been told that changing the datatype of a column is very expensive, and I haven't been able to find authoritative documentation on that.
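To make the workaround concrete, this is roughly what I'm doing today (the path, source, and column names are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Placeholder path to the Delta table on the Azure storage account
delta_path = "abfss://container@account.dfs.core.windows.net/my_table"

# Placeholder source for the pipeline output
df = spark.read.parquet("/mnt/raw/input")

# Instead of changing ColumnA's type in place, I write the int values
# under a new column name so the existing Delta schema is untouched
df = df.withColumn("ColumnAInt", col("ColumnA").cast("int")).drop("ColumnA")

# mergeSchema lets the new column be added to the table schema on append
(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save(delta_path))
```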

I have seen this page: https://docs.delta.io/latest/delta-batch.html#-change-column-type but it doesn't mention how expensive this operation is or how it scales with the amount of data. Does anyone have answers regarding that?
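For reference, my understanding of that page is that changing a column's type means rewriting the whole table with the new schema, roughly like the sketch below (again, the path is a placeholder). My question is how the cost of this rewrite scales with the size of the table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

delta_path = "abfss://container@account.dfs.core.windows.net/my_table"  # placeholder

# Read the whole table, cast the column, and overwrite the table with the
# new schema. overwriteSchema is required because the column type changes.
(spark.read.format("delta").load(delta_path)
     .withColumn("ColumnA", col("ColumnA").cast("int"))
     .write.format("delta")
     .mode("overwrite")
     .option("overwriteSchema", "true")
     .save(delta_path))
```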

ROODAY

0 Answers