Azure data factory - multiple primary keys in source excel to be inserted in SQL database

Question

I am working on a pipeline where I am using Excel as the source. The data has a primary key let's say Id, which is repeating multiple times in the Excel.

Now, when I insert it into a SQL database, it fails with the error:

java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object 'dbo.xyz'. The duplicate key value is XXXX.

How can I make take care of this scenario using mapping data flows in ADF?

I am using a mapping data flow here to take care of the different other transformations.

Example of such data, coming from the Excel source

ID Name PhoneNo
1 John Doe 11110000
1 John Doe 88881111
2 Harry Potter 88999000
2 Harry Potter 00001112
3 abc xyz 77771111

I need to take save the top 1 ID and Name (and there are more columns) in one table and Phone No and ID will be saved in another

Hi @user961, Can you provide a sample source and expected output? As the table has a primary key constraint on the Id column, you cannot insert duplicate values into the SQL table. — NiharikaMoola-MT, Jun 14 '22 at 08:13
If the ID repeats multiple times then it cannot be a primary key; a Primary Key is, by definition, unique within the dataset. Perhaps your Primary Key candidate should be made up of a couple of columns (though these are difficult to work with), or perhaps your data set requires normalisation into several tables. — Thom A, Jun 14 '22 at 08:40
@NiharikaMoola-MT - added the sample data for your reference in the post itself. Thanks for your prompt responses. — user961, Jun 14 '22 at 09:18
What does "take care of this scenario" mean? Especially since your description does not make sense since "a primary key [...] which is repeating multiple times" cannot happen, as you were already told. Explain exactly what is given, what it the goal & where you are 1st stuck.. Use enough words, sentences & references to parts of examples to clearly & fully say what you mean. [mre]. PS Give a table in block code format or table format. Input tables to code should be initialization code. See the edit help. — philipxy, Jun 14 '22 at 10:36

NiharikaMoola-MT · Accepted Answer · 2022-06-14T10:43:36.213

You can use the aggregate transformation to remove duplicate values from the source.

Source:

Add sample excel source with duplicate values in ID and Name columns.

Aggregate transformation:

Under the group by property, add the list of columns in which the duplicate rows are identified.

Under the aggregates property, add the aggregate column. Here, we are getting the first value of the Phone column from the duplicate rows.

Expression: first(Phone)

Aggregate output:

Sink1:

Add the aggregate output to sink1 to pass the ID, and Name columns to 1 table.

Sink2:

Add another sink after aggregate transformation to pass Id and phone to a different table.

This approach will miss the data for the second table. I need to save all values for the second table as well. — user961, Jun 16 '22 at 04:48

Azure data factory - multiple primary keys in source excel to be inserted in SQL database

1 Answers1