
The tool currently used is Informatica, and we have bookend stored procedures that drop the clustered indexes before the load and add them back afterwards. In the stored procedure that adds the clustered indexes back, the DDL for the indexes is hard-coded (we don't generate it from the sys catalog views because of a fear that Microsoft could change those views and the regenerated DDL would create a bad index or fail). This causes issues: people create clustered indexes but don't think to update the stored procedure, and the next time the bulk load runs those indexes are gone. We previously did this for all indexes but switched the nonclustered indexes to a disable/rebuild approach. That is not an option for the clustered index, though, because the clustered index is essentially the table itself; while it is disabled we would no longer be able to insert into the table.
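For reference, the disable/rebuild pattern we use for the nonclustered indexes looks roughly like this (the table and index names here are made up for illustration):

    -- Hypothetical names; this is the pattern used for nonclustered indexes only.
    -- Disable before the bulk load:
    ALTER INDEX IX_FactSales_CustomerId ON dbo.FactSales DISABLE;

    -- ... Informatica bulk load runs here ...

    -- Rebuild afterwards from the metadata that DISABLE leaves behind:
    ALTER INDEX IX_FactSales_CustomerId ON dbo.FactSales REBUILD;

    -- Doing the same to the clustered index is not viable: while a clustered
    -- index is disabled the underlying data is inaccessible, so the load
    -- itself would fail.
    -- ALTER INDEX PK_FactSales ON dbo.FactSales DISABLE;   -- don't do this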

Performance is important but not everything: good performance with easy maintainability trumps great performance with complex maintainability.

After reading many sites, it seems almost universally agreed that when performing a bulk insert of data that is not ordered the same as your primary key, inserting into a heap and applying the primary key afterwards is faster (http://msdn.microsoft.com/en-us/library/ms177445.aspx, http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx). Most of these sites, however, make assumptions that do not hold at my organization or with my toolset.
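For reference, the pattern those articles describe looks roughly like this; all object names and the file path below are made up, and BULK INSERT simply stands in for whatever load mechanism is actually used:

    -- Hypothetical illustration of the "load a heap, then apply the PK" pattern.
    CREATE TABLE dbo.FactSales_Load    -- no PK yet, so this is a heap
    (
        SaleId     int           NOT NULL,
        CustomerId int           NOT NULL,
        Amount     decimal(18,2) NOT NULL
    );

    BULK INSERT dbo.FactSales_Load
    FROM 'C:\loads\sales.dat'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

    -- Apply the clustered primary key after the load; the heap is sorted once
    -- here instead of maintaining b-tree order row by row during the load.
    ALTER TABLE dbo.FactSales_Load
        ADD CONSTRAINT PK_FactSales_Load PRIMARY KEY CLUSTERED (SaleId);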

Due to our current standards policies we have to use the FULL recovery model, so minimal logging will not occur no matter which option I choose with respect to heap vs. clustered index.

According to our Informatica admins, specifying TABLOCK or ORDER hints on bcp is not possible through the UI, and our organization is averse to customization beyond the UI for maintainability reasons.

So, after all of this, the question is: given the factors above, would you recommend that we continue with our somewhat unreliable stored procedures, insert directly into the clustered index, or is there some third, far superior solution? I also realize there are other Stack questions similar to this one, but they do not address bulk loads specifically and/or make assumptions in their answers that do not apply here.

JStead

1 Answer


My suggestion would be to bulk load into a staging table (a heap, or a CI matching the file order), (re-)build the clustered index there to match the destination table, and then insert straight from the staging table. To reduce blocking, lock escalation, log use, etc., you could do this in batches of 10,000 rows at a time, committing and/or checkpointing every so often.
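To make the batching idea concrete, here is a rough sketch of one way it could look; dbo.StagingSales, dbo.FactSales, and the integer clustered key SaleId are all assumed names, not anything from the answer:

    -- Sketch only: copies rows in keyed batches so each transaction
    -- (and its log impact) stays small.
    DECLARE @BatchSize int = 10000;
    DECLARE @LastKey   int = NULL;     -- watermark over the staging key
    DECLARE @MaxKey    int;

    WHILE 1 = 1
    BEGIN
        -- Upper key bound of the next batch of at most @BatchSize rows.
        SELECT @MaxKey = MAX(b.SaleId)
        FROM (SELECT TOP (@BatchSize) SaleId
              FROM dbo.StagingSales
              WHERE @LastKey IS NULL OR SaleId > @LastKey
              ORDER BY SaleId) AS b;

        IF @MaxKey IS NULL BREAK;      -- nothing left in staging

        INSERT INTO dbo.FactSales (SaleId, CustomerId, Amount)
        SELECT SaleId, CustomerId, Amount
        FROM dbo.StagingSales
        WHERE (@LastKey IS NULL OR SaleId > @LastKey)
          AND SaleId <= @MaxKey
        ORDER BY SaleId;               -- same order as the destination clustered index

        SET @LastKey = @MaxKey;

        CHECKPOINT;  -- under FULL recovery, periodic log backups do the real work
    END

This assumes the staging table has a usable key in the same order as the destination clustered index; if it is a pure heap without one, an IDENTITY column on the staging table can serve the same purpose.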

You might also consider using a pre-processor (C# perhaps) that takes the log file and builds a new one with the proper sort order.

Also, I think you are safer using sys.indexes etc. than hard-coding the index structures in the code. Microsoft is far less likely to change a column name in sys.indexes than someone at your shop (no offense intended) is to change an index but forget to update the hard-coded definition in the procedure.
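As a rough illustration of what generating the DDL from the catalog views could look like (the table name is a placeholder, and real DDL generation would also need to handle filegroups, compression, included columns, and the case where the clustered index backs a PRIMARY KEY constraint):

    -- Minimal sketch: script the clustered index definition from the catalog views.
    DECLARE @TableName sysname = N'dbo.FactSales';   -- placeholder

    SELECT
        N'CREATE '
      + CASE WHEN i.is_unique = 1 THEN N'UNIQUE ' ELSE N'' END
      + N'CLUSTERED INDEX ' + QUOTENAME(i.name)
      + N' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(i.object_id))
      + N'.'    + QUOTENAME(OBJECT_NAME(i.object_id))
      + N' ('
      + STUFF((SELECT N', ' + QUOTENAME(c.name)
                    + CASE WHEN ic.is_descending_key = 1 THEN N' DESC' ELSE N'' END
               FROM sys.index_columns AS ic
               JOIN sys.columns AS c
                 ON c.object_id = ic.object_id
                AND c.column_id = ic.column_id
               WHERE ic.object_id = i.object_id
                 AND ic.index_id  = i.index_id
                 AND ic.is_included_column = 0
               ORDER BY ic.key_ordinal
               FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, N'')
      + N');' AS create_index_sql
    FROM sys.indexes AS i
    WHERE i.object_id = OBJECT_ID(@TableName)
      AND i.type = 1;                                -- 1 = clustered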

Aaron Bertrand
  • +1 for using sys.indexes. Even if MS does change this view it is a once-off change instead of maintaining the sp whenever there are changes to the indexes – WT_W Aug 25 '11 at 01:28
  • This is currently done on a staging table, but a pre-staging table wouldn't be out of the question. I did present a solution using the system tables, and it was a rather fancy one; sadly, the fear of Microsoft changing/adding things is not mine but a supervisor's. I gave you a +1 but held off on the check mark just in case someone has a silver-bullet solution. I appreciate the help though; it confirmed a lot of my thoughts and gave me some things to think about as well. – JStead Aug 25 '11 at 02:12
  • People need to get over their fear of Microsoft changing things. In SQL Server the catalog views are here to stay - these aren't the 2000 days anymore. I wonder if it will take a delayed deployment or a serious data problem for them to realize that hard-coding these definitions is much riskier than relying on the catalog views. That's exactly what they're there for. – Aaron Bertrand Aug 25 '11 at 02:15