29

I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like

CREATE #TempTable (
    Col1    NUMERIC(18,0) NOT NULL,    --This will not be an identity column.
    ,Col2   INT NOT NULL,
    ,Col3   BIGINT,

    ,Col4   VARCHAR(25) NOT NULL,
    --Etc...

    --
    --Create primary key here?
)


INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...

INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...

--
-- ...or create primary key here?

My question is when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data because the index needs to be reorganized as the primary key info is being created. But I realized that my underlining assumption might be wrong...

In case it is relevant, the data types I used are real. In the #TempTable table, Col1 and Col4 will be making up my primary key.

Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.

Though, this aside, my question still stands as which is faster assuming both would succeed?

ahsteele
  • 26,243
  • 28
  • 134
  • 248
Frank V
  • 25,141
  • 34
  • 106
  • 144

9 Answers9

17

This depends a lot.

If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.

If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.

I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.

However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).

Cade Roux
  • 88,164
  • 40
  • 182
  • 265
9

If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT .. INTO is a bulk operation and bulk operations are minimally logged.

eg:

-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...

-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)

-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField) 

Otherwise, Cade Roux had good advice re: before or after.

Peter Radocchia
  • 10,710
  • 2
  • 34
  • 56
3

You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.

Justin
  • 84,773
  • 49
  • 224
  • 367
  • I will not be using identity. You posted this while I was updating my question. My primary key will consist of the NUMERIC(18,0) and a VARCHAR(25) field. – Frank V Jun 22 '09 at 21:16
  • Primary Keys are Clustered by default. This will order your sequentially order the data based upon the PK Values. I agree with this answer, it should happen before you insert. Also note: if you add extra non-clustered indexes. Creating a clustered PK after would cause SQL Server to rebuild the non-clustered indexes. – DBAndrew Jun 22 '09 at 21:46
  • @Justin How can i create the primary key before the inserts. can you complete your suggestion with a small example query? – itro May 21 '19 at 09:34
2

Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.

This prevents you from inserting duplicate/bad data.

Jeff Meatball Yang
  • 37,839
  • 27
  • 91
  • 125
  • This isn't a problem for me. I understand it might be for some, but not a problem at all for me. – Frank V Jun 22 '09 at 22:31
  • I think this makes sense just because it demonstrates what you are logically trying to do more clearly and lets the engine help you when assumptions that used to be safe (i.e., the certainty that no duplicate data will be in the table after the insert) are no longer safe (e.g., resulting from some change in a different part of the database). – binki Jan 11 '18 at 15:14
1

If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.

If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.

But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!

Andomar
  • 232,371
  • 49
  • 380
  • 404
1

I was wondering if I could improve a very very "expensive" stored procedure entailing a bunch of checks at each insert across tables and came across this answer. In the Sproc, several temp tables are opened and reference each other. I added the Primary Key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS statements to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out even when you think you don't need it.

  • 1
    The context of the question was less about whether to create a primary key or not but at which point in time is more efficient -- pre-data population or post-data population. The conclusion prior to this question was that a key/index helped with execution time. – Frank V Aug 14 '12 at 18:55
0

I don't think it makes any significant difference in your case:

  • either you pay the penalty a little bit at a time, with each single insert
  • or you'll pay a larger penalty after all the inserts are done, but only once

When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.

But other than that - no big difference, really.

Marc

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
0

I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...

My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.

So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.

But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.

John MacIntyre
  • 12,910
  • 13
  • 67
  • 106
0

When you add PK on table creation - the insert check is O(Tn) (where Tn is "n-th triangular number", which is 1 + 2 + 3 ... + n) because when you insert x-th row, it's checked against previously inserted "x - 1" rows

When you add PK after inserting all the values - the checker is O(n^2) because when you insert x-th row, it's checked against all n existing rows.

First one is obviously faster since O(Tn) is less than O(n^2)

P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations

Alex from Jitbit
  • 53,710
  • 19
  • 160
  • 149