
We have a requirement to insert a large number of records (about 2 to 3 million) into a table. However, we must be able to validate the data and segregate invalid records - primary key, foreign key and non-null violations - into a separate error table for later reference. From my reading, BULK INSERT in SQL Server works well for the inserting part, but I'm not able to figure out the best way to filter out the bad records. Does having a staging table in between help? Although we could check for violations using some queries against the staging table, we would then have to load the good records into the actual table with another insert - either through INSERT...SELECT or MERGE - but is this an efficient approach? I'm concerned, as it would be akin to doing 2x inserts.
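For concreteness, the rough shape I have in mind is something like the below (the table, column and file names are made up purely for illustration):

    -- 1) Bulk load everything into an unconstrained staging table
    BULK INSERT dbo.StagingTable
    FROM 'C:\load\source_file.csv'   -- hypothetical source file
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- 2) Route invalid rows (duplicate PK, missing FK parent, NULL in a required column)
    --    to the error table for later reference
    INSERT INTO dbo.ErrorTable (Id, RefId, Amount)
    SELECT s.Id, s.RefId, s.Amount
    FROM dbo.StagingTable AS s
    WHERE s.Id IS NULL
       OR s.Amount IS NULL
       OR EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id)          -- PK violation
       OR NOT EXISTS (SELECT 1 FROM dbo.RefTable AS r WHERE r.RefId = s.RefId);  -- FK violation

    -- 3) The "second insert" I'm worried about: only the rows that pass every check
    INSERT INTO dbo.TargetTable (Id, RefId, Amount)
    SELECT s.Id, s.RefId, s.Amount
    FROM dbo.StagingTable AS s
    WHERE s.Id IS NOT NULL
      AND s.Amount IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id)
      AND EXISTS (SELECT 1 FROM dbo.RefTable AS r WHERE r.RefId = s.RefId);

(Duplicates within the source file itself would also need handling, e.g. with ROW_NUMBER, but I've left that out here.)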

I'm planning to use .NET SqlBulkCopy for the bulk inserts, but it doesn't have clear error reporting either.

Can somebody point me to a more efficient solution?

EDIT: If this approach is the only solution, what method do you think is best for the second insert? Is it INSERT...SELECT or MERGE? Would either match the efficiency and speed of BULK INSERT? Or is there a better alternative?

Thanks!

  • This has some approaches: http://stackoverflow.com/questions/1004525/sqlbulkcopy-error-handling-continue-on-error – TheGameiswar Jan 25 '17 at 16:56
  • @TheGameiswar Thanks for the link. I did go through it, and indeed the second option is what I think is applicable in this scenario. However, I'm not sure if that's efficient, as we'll be doing 2x inserts - one bulk insert from source to staging table and then another from staging table to target. I just wanted to know if there's further room for optimization. – Bugs Bunny Jan 25 '17 at 17:03
  • @TheGameiswar I edited the question to gather opinions on the best method for the second step. Thanks! – Bugs Bunny Jan 25 '17 at 17:06

1 Answer


Personally I would not consider 2-3M records a large amount. Unless you need the data loaded in seconds, a single (non-bulk) insert will perform adequately.

If I'm nervous about the source data quality, I like to load into a staging table first and then do "soft RI" - check for PKs, UQs, FKs etc. using SQL. If I'm worried about numeric/non-numeric or bad date type issues, I make the staging table VARCHAR(8000) for all columns and use TRY_CONVERT when reading from the table.
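As a rough sketch of that pattern (the table and column names here are only examples):

    -- Staging table: everything VARCHAR so the bulk load itself can never fail on data types
    CREATE TABLE dbo.StagingTable
    (
        Id       VARCHAR(8000) NULL,
        RefId    VARCHAR(8000) NULL,
        Amount   VARCHAR(8000) NULL,
        SaleDate VARCHAR(8000) NULL
    );

    -- TRY_CONVERT returns NULL instead of raising an error when a value won't convert,
    -- so bad types can be detected (and reported) with plain SQL
    SELECT
        TRY_CONVERT(INT, s.Id)                AS Id,
        TRY_CONVERT(INT, s.RefId)             AS RefId,
        TRY_CONVERT(DECIMAL(18, 2), s.Amount) AS Amount,
        TRY_CONVERT(DATE, s.SaleDate)         AS SaleDate
    FROM dbo.StagingTable AS s;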

Once the data is in staging, you can easily filter out just the good rows and report in detail on the bad rows.
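For example, a reporting query along these lines (again, names are illustrative) labels each bad row with the rule it broke; run it against the staging data before the final insert, and invert the conditions to pick out the good rows:

    SELECT s.*,
           CASE
               WHEN s.Id IS NULL                                                         THEN 'Missing primary key'
               WHEN EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id)        THEN 'Duplicate primary key'
               WHEN NOT EXISTS (SELECT 1 FROM dbo.RefTable AS r WHERE r.RefId = s.RefId) THEN 'Unknown foreign key'
               WHEN s.Amount IS NULL                                                     THEN 'NULL in a required column'
           END AS ErrorReason
    FROM dbo.StagingTable AS s
    WHERE s.Id IS NULL
       OR s.Amount IS NULL
       OR EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id)
       OR NOT EXISTS (SELECT 1 FROM dbo.RefTable AS r WHERE r.RefId = s.RefId);

(If the staging columns are all VARCHAR as above, wrap them in TRY_CONVERT first.)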

john McTighe
  • @john McTighe Thanks for the comments. I could say our SLA would roughly be 1 million records per minute. Wouldn't it be better with BULK INSERT? Also, apart from segregating the bad records, we would also need to insert the good ones into the actual table. So which SQL statement do you reckon would be most efficient for that step? – Bugs Bunny Jan 25 '17 at 17:18
  • OK, I'd try this: insert to STG - this can be a bulk insert, as you can be reasonably confident all rows will load (no PKs, constraints etc.) - also, the STG table will be empty before each load. Then update any rows in STG that are invalid for whatever reason - this can be multiple passes. Then insert the good rows as a simple INSERT/SELECT, and report on the bad rows as a separate query afterwards. If this runs fast enough then great! – john McTighe Jan 25 '17 at 17:27
  • @john McTighe I hope it's fast enough! I'll leave this question unanswered for some more time to see if other people have any thoughts. If not, I'll mark yours as the answer. Thanks for the help! – Bugs Bunny Jan 25 '17 at 17:45
  • Hi Bugs, you might need to bulk insert into the final table as well to meet your 1-minute SLA. Should be fine unless your table is stupidly wide. Also make sure that the bulk loads run from the database server so as to eliminate network I/O. – john McTighe Jan 26 '17 at 21:30