
I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the Delta table. I am getting the following error:

Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)

The command I use to write data to the Delta table:

DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")

The only option I can think of right now is to set `overwriteSchema` to `true`.

But this will rewrite my target schema completely. My concern is that any sudden change in the source schema would then replace the existing target schema without any notification or alert.
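For reference, the overwrite-schema variant of the write would look roughly like this ("s3 path" is the same placeholder path as above):

```python
# Sketch: with mode("overwrite"), overwriteSchema replaces the target
# table's schema with DF's schema instead of trying to merge the two.
DF.write.mode("overwrite") \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .save("s3 path")
```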

Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV control file that contains all the details about the target table, source table, partition key, etc.).

Is there any better way to tackle this issue? Any help is much appreciated!

  • `overwriteSchema` would overwrite the schema of the delta table and make it the same as the `DF` schema. In case you want to maintain the current types of the delta lake, a better option would be to just explicitly cast before writing: `DF.withColumn("COLUMN1", col("COLUMN1").astype(DecimalType(32, 0))).write.mode("overwrite").format("delta").save("s3 path")` – Rehan Rajput Feb 03 '23 at 10:33
  • The problem here is that there are many columns where the data types have changed, and doing it manually for all the columns will be difficult. Also, the notebook I am using is a common framework used to load each source one-to-one into delta tables, so I can't make these changes in this notebook. – Vaishak Feb 03 '23 at 10:45
  • Would it be possible to cast in the query that is passed to SQL Server? If you are bringing in the entire table by its name, you could try replacing the table name with a query. – rainingdistros Feb 03 '23 at 12:19
  • That solution also went through my mind, but it requires changing the entire control table to hold queries, or changing the structure of the table to accommodate SQL queries. I was thinking of a way to programmatically cast the source data types to the corresponding data types in the target delta table (not sure if it's very feasible). – Vaishak Feb 03 '23 at 15:33
  • @Vaishak In that case, you can use `overwriteSchema`, as it will make sure that the delta lake has the same schema as `DF`. Just know that if you use the `overwrite` mode, the data in your delta lake will be EXACTLY the same as the `DF` dataframe. If you use any other mode (`merge` or `append`), it will try to merge or append to the old data, but I'm not sure what would happen to your column. – Rehan Rajput Feb 03 '23 at 16:33
  • @Vaishak, if your delta lake table is small, you can also just copy it to a different location in `S3` and then merge your `DF` into the copied location. Then you can analyze what happens. Or, if your delta lake table is too big, you can just take a sample of 1,000 values. – Rehan Rajput Feb 03 '23 at 16:34
  • Thought of a very complex and less efficient approach: maybe you can hold and compare the data types identified against the data types expected. If there are changes, you can add `df.withColumn("Col_01", col("Col_01").astype(expected_type))`; if there are no changes, skip this part. Sounds complex even to type it out though :) – rainingdistros Feb 04 '23 at 17:01
  • @rainingdistros I am trying something similar, actually. We have code that converts certain source data types into the target data types. But like you said, it is not very easy. – Vaishak Feb 06 '23 at 10:05

1 Answer


Did a workaround for now: I wrote a custom function that compares the source and target schemas and converts the source data types to the target data types (only columns common to both source and target are considered).
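A minimal sketch of such a schema-alignment helper, assuming the target Delta table already exists at the path being written to (the function name and the case-insensitive column matching are illustrative choices, not the exact code used):

```python
from pyspark.sql.functions import col

def align_to_target_schema(source_df, target_path, spark):
    """Cast columns of source_df that also exist in the target Delta table
    to the target's data types; columns present on only one side are untouched."""
    # Read the current schema of the existing target Delta table
    target_fields = spark.read.format("delta").load(target_path).schema.fields
    target_types = {f.name.lower(): f.dataType for f in target_fields}

    aligned = source_df
    for f in source_df.schema.fields:
        target_type = target_types.get(f.name.lower())
        # Only cast columns common to both schemas whose types differ
        if target_type is not None and target_type != f.dataType:
            aligned = aligned.withColumn(f.name, col(f.name).cast(target_type))
    return aligned

# Usage: align before writing so mergeSchema no longer sees conflicting types
# aligned_df = align_to_target_schema(DF, "s3 path", spark)
# aligned_df.write.mode("overwrite").format("delta") \
#     .option("mergeSchema", "true").save("s3 path")
```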
