
I am importing a file from a user on the web that contains 150K rows; it has to be broken up, resulting in about 1.6M items that will be added to the database.

At the moment I add the primary record first and then add the children afterwards, using the key returned from that first insert.

While I can precompile the query and re-use it, I'd like to group the lot of them together, but I'm concerned that I won't be able to parameterize the queries at that point.

At the moment I'm only importing around 300 rows per second, or roughly 3,000 queries per second, via the query method.
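
For reference, this is roughly the shape of what I'm doing now (simplified; the table, column, and variable names here are placeholders rather than my real schema):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Simplified shape of the current row-by-row import: insert the parent,
// read back its identity, then insert each child using that key.
// All table/column names are placeholders, not the real schema.
static void ImportRowByRow(string connectionString,
                           IEnumerable<(string Name, string[] Items)> rows)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        var parentCmd = new SqlCommand(
            "INSERT INTO Parent (Name) VALUES (@name); SELECT SCOPE_IDENTITY();", conn);
        parentCmd.Parameters.Add("@name", SqlDbType.NVarChar, 100);

        var childCmd = new SqlCommand(
            "INSERT INTO Child (ParentId, Value) VALUES (@parentId, @value);", conn);
        childCmd.Parameters.Add("@parentId", SqlDbType.Int);
        childCmd.Parameters.Add("@value", SqlDbType.NVarChar, 100);

        foreach (var row in rows)
        {
            // Insert the parent and capture its generated identity value.
            parentCmd.Parameters["@name"].Value = row.Name;
            int parentId = Convert.ToInt32(parentCmd.ExecuteScalar());

            // Insert each child with the parent's key: one round trip per row.
            foreach (var item in row.Items)
            {
                childCmd.Parameters["@parentId"].Value = parentId;
                childCmd.Parameters["@value"].Value = item;
                childCmd.ExecuteNonQuery();
            }
        }
    }
}
```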

Middletone
  • The only way I know to get performance on large data loads is to use http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx, but this will not be helpful for the first step, inserting the parent rows to generate the foreign key values. It might help with the child rows though – Adriaan Stander Feb 10 '11 at 04:54
  • I had thought of using a temp table to possibly deal with that. Perhaps add the index of the row to the larger set and update them with the real ID when it's time to move the data into the live tables. I think that would work, but it still doesn't resolve the larger question. – Middletone Feb 10 '11 at 05:02

1 Answer


I'm not sure what your constraints are for how you can load the data, but there are a few good avenues for getting good bulk import rates into the database (see Performing Bulk Copy Operations). When working with a data import process, I have always found it helpful to break it up into phases:

  1. Import phase - various bulk methodologies are available depending on your situation
  2. Staging phase - processing work, e.g. data validation, building key relations, data scrubbing, etc.
  3. Final insert into the "live" tables (hopefully a set-based insert)

It can be very efficient to pick all of the data up in the first pass with little to no logic work and move it to a staging area en masse, either going into temp tables or permanent staging tables for that purpose. Then you can do any processing work on the data to get everything properly structured and cleaned before mass inserting it into its final home in the live tables. This can also give you a layer of insulation from malicious data or SQL injection attacks by introducing one or more intermediary steps.
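
For example (only a rough sketch; the staging table and column names are made up), the import phase can be little more than a SqlBulkCopy from whatever structure you parsed the file into, straight into a staging table:

```csharp
using System.Data;
using System.Data.SqlClient;

// Phase 1 sketch: push the parsed file rows into a staging table as-is.
// "dbo.ImportStaging" and its columns are placeholders for whatever raw
// shape the uploaded file has; no keys or validation are applied yet.
static void BulkLoadToStaging(string connectionString, DataTable rawRows)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.ImportStaging";
        bulk.BatchSize = 10000;     // commit in chunks rather than one huge batch
        bulk.BulkCopyTimeout = 0;   // no timeout for a large load

        // Map source columns to staging columns explicitly so the column
        // order in the DataTable doesn't matter.
        bulk.ColumnMappings.Add("RowNumber", "RowNumber");
        bulk.ColumnMappings.Add("ParentName", "ParentName");
        bulk.ColumnMappings.Add("ChildValue", "ChildValue");

        bulk.WriteToServer(rawRows);
    }
}
```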

This separation allows the bulk import task to be as fast as you can make it, because there is hopefully very little logic required for a mass import into a big staging dumping ground. Then you can apply whatever logic is required to slice up the data however is appropriate. Additionally, if there are several steps that need to be done in the staging phase, you can break the work into however many smaller steps are needed and focus on optimizing the biggest/slowest parts.

In your situation, if there is a way to structure the data once it reaches the staging phase so that it matches what is in the live tables, you might be able to insert it as one big set. Being able to build the PK->foreign-key relations in the staging phase before the final insert (as well as handling any other data processing work) can let you go from iterative inserts to one big bulk set-based insert, and set-based is usually a very good thing. That is, of course, if your system/constraints allow you to do so.
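
As a sketch of what that final set-based pass could look like, assuming the staging rows carry some value that uniquely identifies each parent (all table and column names here are placeholders):

```csharp
using System.Data.SqlClient;

// Phase 3 sketch: one set-based pass from staging into the live tables.
// Assumes the staging rows carry a value (ParentName here) that uniquely
// identifies each parent; table and column names are placeholders.
static void MoveStagingToLive(string connectionString)
{
    const string sql = @"
        BEGIN TRANSACTION;

        -- Insert one parent row per distinct parent value in staging.
        INSERT INTO dbo.Parent (Name)
        SELECT DISTINCT s.ParentName
        FROM dbo.ImportStaging AS s;

        -- Insert all children in one set, resolving the generated parent
        -- keys by joining back on that same value.
        INSERT INTO dbo.Child (ParentId, Value)
        SELECT p.ParentId, s.ChildValue
        FROM dbo.ImportStaging AS s
        JOIN dbo.Parent AS p ON p.Name = s.ParentName;

        COMMIT TRANSACTION;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.CommandTimeout = 0;   // large sets can take a while
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}
```

If there is no natural key to join back on, carrying the staging RowNumber through and capturing the generated identities into a mapping table (as suggested in the comments above) accomplishes the same thing.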

I'm not sure if any of that applies to your situation, or if I'm way off base from what you were asking, but hopefully there is something in there that is useful.

MikeCov
  • I'm accepting this because the concept is correct and should be followed, even though it doesn't address the question's parameters. – Middletone Apr 20 '11 at 16:48