2

I recently discovered an error in a client API. It sends pages of unique records based on timestamps. Every so often, it sends the exact same record at the end of one page and again at the start of the next page.

I was operating under the assumption that there were no duplicates (this is an event log, after all), so I was doing pure inserts rather than going through the trouble of an update or some sort of 'on duplicate' step.

My question is: what's the best way to get rid of these duplicates, and then have MS SQL quietly discard duplicate records upon insert? The duplicates are definitely inserted AFTER the original has been written in a transaction. They occur only about once every 10,000 rows, so something very memory-light would be great.

The only distinct part of the duplicated records is the "ETL batch ID"; everything else is identical. Otherwise, each record is supposed to have a unique "event ID". If I make this event ID a unique index/key, can I somehow force MS SQL to dump any duplicate keys? Or will the application throw an error message and stop during the query?

user45867
  • 887
  • 2
  • 17
  • 30

1 Answer

0

There is the infamous IGNORE_DUP_KEY index option, which does exactly what you're asking for. I would try to avoid using it, because it changes insert semantics in a very subtle and hard-to-discover way.

It is very efficient, though: http://blogs.msdn.com/b/craigfr/archive/2008/01/30/maintaining-unique-indexes-with-ignore-dup-key.aspx and http://web.archive.org/web/20180404165346/http://sqlblog.com:80/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx.
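
To illustrate, here is a minimal sketch of the option in action; the table, column, and index names are all hypothetical:

    -- Hypothetical event log table; names are placeholders.
    CREATE TABLE dbo.EventLog
    (
        EventId    BIGINT NOT NULL,
        ETLBatchId INT    NOT NULL
    );

    -- A unique index with IGNORE_DUP_KEY = ON silently drops duplicate
    -- keys from an insert instead of failing the whole statement.
    CREATE UNIQUE INDEX UX_EventLog_EventId
        ON dbo.EventLog (EventId)
        WITH (IGNORE_DUP_KEY = ON);

    INSERT INTO dbo.EventLog (EventId, ETLBatchId) VALUES (1, 100);

    -- Re-inserting EventId = 1 raises only the warning
    -- "Duplicate key was ignored."; the row is not inserted and
    -- no error is thrown.
    INSERT INTO dbo.EventLog (EventId, ETLBatchId) VALUES (1, 101);

The subtlety is exactly that warning: callers that expect a duplicate key to fail loudly will never see an error, which is what makes the option hard to discover later.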

If I make this event ID a unique index/key, can I somehow force MS SQL to dump any duplicate keys? Or will the application throw an error message and stop during the query?

Yes, the insert will fail. That is the preferred behavior in most situations. If you can't make that work, IGNORE_DUP_KEY might be the next best thing.
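
If you go the plain unique index route, one common pattern is to deduplicate the existing rows once, then filter duplicates out at insert time. A sketch, again with hypothetical names (dbo.StagingEventLog stands in for wherever the API pages land before the final insert):

    -- One-off cleanup: keep a single row per EventId, delete the rest.
    WITH Ranked AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY EventId
                                  ORDER BY ETLBatchId) AS rn
        FROM dbo.EventLog
    )
    DELETE FROM Ranked
    WHERE rn > 1;

    -- Ongoing loads: insert only rows whose EventId isn't already there.
    INSERT INTO dbo.EventLog (EventId, ETLBatchId)
    SELECT s.EventId, s.ETLBatchId
    FROM dbo.StagingEventLog AS s
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.EventLog AS e
        WHERE e.EventId = s.EventId
    );

At one duplicate per ~10,000 rows, the NOT EXISTS probe is a cheap seek against the unique index, so the overhead should be negligible.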

Community
  • 1
  • 1
usr
  • 168,620
  • 35
  • 240
  • 369
  • Right, well, if IGNORE_DUP_KEY is advised against, what's my alternative? Prevent duplicate keys from reaching the database? I think a process in whatever programming language, or even an update lookup against the DB, would dramatically slow down the process – user45867 Sep 29 '15 at 23:14
  • What if you keep track of the last record of the previous page and compare it to the first record of the current page? That would work if you're confident the pattern is always "the last record of a page is sometimes duplicated by the first record of the next page". – Adam Porad Sep 29 '15 at 23:26
  • You could also consider asking the owner of the client API to fix their bug :) They might not know it exists, or they might have a solution/work-around. – Adam Porad Sep 29 '15 at 23:27