
I have a set of text files I'm importing into SQL Server via SqlBulkCopy in C#. Some records appear more than once, and I'd like to skip those duplicates rather than throw an error and stop.

For cases where there's a clear unique ID/primary key, all is well - I've used "WITH (IGNORE_DUP_KEY = ON)" on the index to good effect.
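For reference, this is roughly the pattern I used there (the table and column names below are just placeholders, not my real schema):

-- Placeholder table; the real import tables have different columns.
CREATE TABLE dbo.ImportedTransactions
(
    TransactionID int            NOT NULL,
    TradeDate     datetime       NOT NULL,
    Amount        decimal(18, 2) NOT NULL
);

-- With IGNORE_DUP_KEY = ON, rows with a duplicate key are skipped with
-- a warning instead of aborting the whole insert.
CREATE UNIQUE INDEX UX_ImportedTransactions_TransactionID
    ON dbo.ImportedTransactions (TransactionID)
    WITH (IGNORE_DUP_KEY = ON);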

In this case though, no such unique ID can be determined, short of concatenating ALL the fields together or creating a compound key that involves all the fields. I've tried to find a unique transaction ID or similar, but none is available. Sometimes the records are the same except that one digit in a field has changed; in that case I do want the row inserted into the table.

I figure there must be a better approach than creating a key across all fields? If I did that, dates, strings, numeric values etc. would all have to be combined to form the key.

Querying the existing rows out and hashing their values to compare against a hash of each input line was another thought, but I'm sure performance would be terrible :)

Anyone have suggestions on how best to do this?

Thanks

Glinkot

2 Answers


Since you're importing, you can add a staging step that de-duplicates the data: bulk-insert into a staging table with the same schema as your target table, then run a SELECT DISTINCT over it and insert the results into your final destination table. Unless you're talking about tens of millions of rows, this should perform fine, since it only needs to run once per import.
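A minimal sketch of that staging step (table and column names here are only placeholders; substitute your own schema):

-- Staging table with the same shape as the destination (placeholder names).
CREATE TABLE dbo.StagingTransactions
(
    TradeDate   datetime       NOT NULL,
    Amount      decimal(18, 2) NOT NULL,
    Description varchar(100)   NOT NULL
);

-- 1. SqlBulkCopy the text file into dbo.StagingTransactions.
-- 2. Copy only the distinct rows across to the real table:
INSERT INTO dbo.TargetTransactions (TradeDate, Amount, Description)
SELECT DISTINCT TradeDate, Amount, Description
FROM dbo.StagingTransactions;

-- 3. Empty the stage ready for the next file.
TRUNCATE TABLE dbo.StagingTransactions;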

codekaizen
  • Hi codekaizen, this would deduplicate the input file if it happens to have dupes, but I'm aiming to see if the proposed input line exists in the DB table (which will already have records from previous files in it, some of which may be identical to one in the text file) – Glinkot Jun 14 '11 at 06:15
  • You have to use a temp table nevertheless; just add a condition to the insert into the main table, like "not exists (select * from MainTable where ...)" (a sketch of this appears below the comments). – Arvo Jun 14 '11 at 08:50
  • @Arvo is right - checking for existing rows could be another stage after de-duplicating. It wasn't clear if you wanted to check not only the incoming data, but also against the existing data set. In general, staging your input like this makes it easy to maintain and add to in the future, as more requirements are added. – codekaizen Jun 14 '11 at 21:33
  • Thanks for the comments, codekaizen. I didn't know there was an 'except' command; it's an interesting extension of the 'select distinct' idea. Pretty handy, cheers – Glinkot Jun 15 '11 at 05:27
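
A rough sketch of Arvo's "not exists" suggestion, combined with the staging approach above (placeholder names again; the WHERE clause has to cover every field that defines a duplicate, and this assumes the columns aren't nullable - NULLs would need extra IS NULL handling):

INSERT INTO dbo.TargetTransactions (TradeDate, Amount, Description)
SELECT DISTINCT s.TradeDate, s.Amount, s.Description
FROM dbo.StagingTransactions AS s
WHERE NOT EXISTS
(
    SELECT *
    FROM dbo.TargetTransactions AS t
    WHERE t.TradeDate   = s.TradeDate
      AND t.Amount      = s.Amount
      AND t.Description = s.Description
);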

You could use EXCEPT:

insert into TargetTable
select fieldlist from SourceTable
except
select fieldlist from TargetTable;
Manfred Sorg