
I have a very simple ADF pipeline to copy data from a local MongoDB (via a self-hosted integration runtime) to an Azure SQL database.

My pipeline is able to copy the data from MongoDB and insert it into the SQL database. However, if I run the pipeline multiple times, it inserts duplicate data.

I have made the _id column unique in the SQL database, and now running the pipeline throws an error because the SQL constraint won't let it insert the record.

How do I check for a duplicate _id before inserting into the SQL database?

Should I use a pre-copy script / stored procedure? Some guidance / directions on where to add the extra steps would be helpful. Thanks.

newdeveloper

4 Answers

5

Azure Data Factory Data Flow can help you achieve that:


You can follow these steps:

  1. Add two sources: the Cosmos DB table (source1) and the SQL database table (source2).
  2. Use a Join activity to get all the data from the two tables (left join/full join/right join) on Cosmos table.id = SQL table.id.

  3. Use an Alter Row expression to filter out duplicate _id values: if an _id is not already in the SQL table, insert it (see the sketch after this list).

  4. Then map the non-duplicate columns to the sink SQL database table.
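
For step 3, a minimal sketch of the Alter Row "Insert if" condition, assuming a left outer join where the SQL stream is named SQLSource (a hypothetical stream name); the SQL side's _id is null exactly when no matching row exists in the target yet:

    isNull(SQLSource@_id)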

Hope this helps.

Leon Yue
  • Thanks for your time on this. I ended up creating a sproc. Could you please look at my posted answer and comment if you see any risk in it? – newdeveloper Mar 26 '20 at 16:52
  • Forgot to mention, I did try to create a Data Flow before posting the question, but in the first step, where source selection happens, I can't set my MongoDB as the source dataset. It looks like that's a limitation of ADF Data Flow right now? – newdeveloper Mar 26 '20 at 16:55
  • In my case, this option does not work. The id is a string/varchar and, although I followed the above-mentioned steps, the duplicates are still inserted into my SQL DB. Any idea? – Matzka Sep 03 '20 at 08:30
2

You should implement your SQL logic to eliminate duplicates in the pre-copy script.
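
A pre-copy script runs against the sink table before the copy starts, so it can't inspect the incoming rows; the usual pattern is to clear out whatever the copy is about to reload. A minimal sketch, assuming a full reload of the dbo.sensingresults target table used elsewhere in this thread:

-- pre-copy script: empty the target so re-runs don't collide
-- with the unique constraint on _id (full-reload pattern)
TRUNCATE TABLE dbo.sensingresults;

If you only copy a recent window of data rather than everything, a DELETE with a matching filter (for example on sensingTime) would take the place of the TRUNCATE.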

PowerStar
  • How would the pre-copy script look if I want to check whether an _id value already exists? Both the source and destination datasets have an _id field. – newdeveloper Mar 26 '20 at 16:45
1

I ended up solving this with a stored procedure, which looks like a lot less work as far as this requirement is concerned.

I have followed this article: https://www.cathrinewilhelmsen.net/2019/12/16/copy-sql-server-data-azure-data-factory/

I created a table type and used it in the stored procedure to check for duplicates.
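
For reference, the table type could look something like this minimal sketch; the column data types here are assumptions based on the column names, so adjust them to match the real schema:

CREATE TYPE dbo.targetSensingResults AS TABLE
(
    _id             NVARCHAR(50),   -- MongoDB ObjectId stored as a string (assumed)
    sensorNumber    INT,
    applicationType NVARCHAR(100),
    place           NVARCHAR(100),
    spaceType       NVARCHAR(100),
    floorCode       NVARCHAR(50),
    zoneCountNumber INT,
    presenceStatus  NVARCHAR(50),
    sensingTime     DATETIME2,
    createdAt       DATETIME2,
    updatedAt       DATETIME2,
    _v              INT
);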

My sproc is very simple, as shown below:

SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spInsertIntoDb]
    (@sresults dbo.targetSensingResults READONLY)
AS

BEGIN

-- Insert only the rows whose _id is not already in the target table
MERGE dbo.sensingresults AS target
USING @sresults AS source
ON (target._id = source._id)

WHEN NOT MATCHED THEN
    INSERT (_id, sensorNumber, applicationType, place, spaceType, floorCode, zoneCountNumber, presenceStatus, sensingTime, createdAt, updatedAt, _v)
    VALUES (source._id, source.sensorNumber, source.applicationType, source.place, source.spaceType, source.floorCode,
     source.zoneCountNumber, source.presenceStatus, source.sensingTime, source.createdAt, source.updatedAt, source._v);
END
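
As a quick smoke test, the proc can be called directly with a table variable of the same type; running it twice with the same rows should insert them only once. In the Copy activity sink, the proc is wired up via the stored procedure name and the table type parameter name.

-- minimal smoke test (assumes the table type above exists and that
-- you populate all of its columns, or that they are nullable)
DECLARE @t dbo.targetSensingResults;
-- INSERT INTO @t (_id, ...) VALUES (...);  -- fill with sample rows
EXEC dbo.spInsertIntoDb @sresults = @t;
EXEC dbo.spInsertIntoDb @sresults = @t;  -- second call inserts nothing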

I think using a stored proc should do for now, and it will also help in the future if I need to do more transformations.

Please let me know if using a sproc in this case has any potential risks in the future.

newdeveloper
  • I would approach this in a similar fashion but with a few changes. I would have two tasks: a Copy task loading the data into a staging table (with a pre-copy script that truncates the staging table), then a Stored Proc task which calls a stored proc that handles all the logic, e.g. `INSERT INTO ... SELECT ... FROM ... WHERE NOT EXISTS ...`. NB `MERGE` would not explicitly remove duplicates from the source, i.e. it's not doing `SELECT DISTINCT`. Make sense? – wBob Mar 29 '20 at 15:51
  • Could you please help me with the pre-copy script for doing this? Also, how do I make the staging table? We have changed the requirement now; basically it's a lot of data inserts coming from IoT devices, so we need to aggregate the data from MongoDB first (say, by 10-minute intervals) before inserting into the SQL DB. – newdeveloper Mar 31 '20 at 23:37
0

To remove the duplicates you can use a pre-copy script. Alternatively, you can store the incremental or new data in a temp table using the Copy activity, then use a stored procedure to delete from the main table only those IDs that are present in the temp table, insert the temp table's data into the main table, and finally drop the temp table; a sketch follows.
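
A minimal sketch of that stored procedure, assuming a permanent staging table dbo.sensingresults_staging that is truncated between runs (a common variation on the temp table described above; both table names here are assumptions):

-- delete-then-insert upsert via a staging table (sketch)
CREATE OR ALTER PROCEDURE dbo.spUpsertFromStaging   -- hypothetical name
AS
BEGIN
    -- remove main-table rows that are about to be reloaded
    DELETE t
    FROM dbo.sensingresults AS t
    WHERE EXISTS (SELECT 1
                  FROM dbo.sensingresults_staging AS s
                  WHERE s._id = t._id);

    -- insert the fresh copies from staging
    INSERT INTO dbo.sensingresults
    SELECT * FROM dbo.sensingresults_staging;

    -- clear the staging table for the next run
    TRUNCATE TABLE dbo.sensingresults_staging;
END;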

gaurav modi