ADF data flow etl schema mapping issue

Question

I'm working with data flows in data factory to create a dynamic data flow which has parameterized source/sink/surrogate key names to reuse across many files and accomplishes the following tasks:

Join source parquet to existing delta destination to find new/updated records.
Insert new rows with a numeric surrogate key
compare hashes between source and existing destination and determine if record is different
Update existing rows that that have a difference in the record hash

I am able to get data throughout the flow, but I am running into an issue during the sinks. It seems that I am unable to insert records from my incoming stream to the sink because of a schema mismatch. sample schema below:

Source stream final schema where I've highlighted the columns that I don't want in my delta table:

I am aware of the mismatch between source/sink. It's caused by additional work that goes into creating surrogate keys and determining if records have changed.

When I turn off auto-mapping I'm able to map columns and drop columns without any issues but then I lose the ability to make this flow dynamic. Are there settings in the sink that will allow auto-mapping to drop columns from the source that don't match destination?

Things I've tried:

Unchecking "Allow schema drift". To me this makes sense, I don't want my destination schema to drift, but when I uncheck this, the auto-mapping doesn't seem to work and gives me the error: "Error at Sink 'sink1': The result has 0 output columns. Please ensure at least one column is mapped"

Checking "Allow schema drift". Checking this makes the automapper work but causes the error "Job failed due to reason: at Sink 'sink1': A schema mismatch detected when writing to the Delta table"

Now, I can also enable mergeSchema but then I will be writing the 5 columns I don't want into my delta table. Is there a way to get my sink to work such that automapping will map to columns it finds and drop columns it can't find so that I don't have to write a bunch of transient columns to my table?

score 1 · Accepted Answer · answered May 04 '23 at 11:30

If the columns that you want to remove are created within the flow and their names do not change irrespective of the source and sink columns, then you can use rule-based mapping.
I have columns which I want to remove from the below flow i.e., new_id and tp.

enter image description here

So, I have manually written a condition in select transformation as name != 'new_id' && name != 'tp'.

enter image description here

This will give only the required columns. The following is an output image for the same.

enter image description here

If the columns you want to exclude are dynamic, you need to have an expression to specify in the rule-based select statement based on sink to be able to select only required columns.

Thank you for the answer, this should get me what I need! It's unfortunate that the sink can't handle this. In a SQL db sink for example I have records flowing in with dynamic columns and the automapping will automatically drop columns that don't match the destination schema. It would be beneficial if this capability was in all sinks as an option. — Brian, May 09 '23 at 12:18

ADF data flow etl schema mapping issue

1 Answers1