During a migration project, I'm faced with an update of 4 million records in our SQL Server database.
The update is very simple: a boolean field needs to be set to true/1, and the input I have is a list of all the ids for which this field must be set (one id per line).
I'm not exactly an expert when it comes to SQL tasks of this size, so I started out with a single UPDATE statement containing a "WHERE xxx IN ({list of ids, separated by commas})" clause. I first tried this with a million records. On a small dataset on a test server this worked like a charm, but in the production environment it failed with an error. I shortened the list of ids a couple of times, but to no avail.
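For reference, the first attempt looked roughly like this (table and column names here stand in for the real ones, and the id list was of course far longer):

```sql
UPDATE yyy
SET booleanfield = 1
WHERE id IN ('id0001', 'id0002', 'id0003' /* , ... up to a million ids ... */);
```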
The next thing I tried was to turn each id in the list into its own UPDATE statement ("UPDATE yyy SET booleanfield = 1 WHERE id = '{id}'"). Somewhere, I read that it's good to have a GO every x number of lines, so I inserted a GO every 100 lines (using the excellent 'sed' tool, ported from Unix).
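The generated script therefore looked roughly like this (ids are placeholders; in reality each batch held 100 statements):

```sql
UPDATE yyy SET booleanfield = 1 WHERE id = 'id0001';
UPDATE yyy SET booleanfield = 1 WHERE id = 'id0002';
-- ... 98 more single-row updates ...
GO
UPDATE yyy SET booleanfield = 1 WHERE id = 'id0101';
-- ... and so on, with a GO after every 100 statements
GO
```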
So, I split the list of 4 million update statements into parts of 250,000 each, saved them as .sql files, and loaded and ran the first one in SQL Server Management Studio (2008). Note that I also tried SQLCMD.exe, but to my surprise it ran about 10-20 times slower than Management Studio.
The run took about 1.5 hours and ended with "Query completed with errors". The Messages pane, however, contained a long list of "1 row(s) affected" and "0 row(s) affected" lines, the latter for ids that were not found.
Next, I checked the number of updated records in the table using a COUNT(*) and found a difference of a couple of thousand records between the number of update statements and the number of updated records.
I first thought this might be due to the non-existent ids, but even after subtracting the number of "0 row(s) affected" messages in the output, there was still a mysterious gap of 895 records.
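The verification was a simple count of flagged rows, roughly (same placeholder table/column names as above):

```sql
-- Number of rows that actually ended up with the flag set.
SELECT COUNT(*) AS updated_rows
FROM yyy
WHERE booleanfield = 1;
```

I then compared this count against the number of statements in the script, minus the "0 row(s) affected" messages.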
My questions:
Is there any way to find a description and cause of the errors behind "Query completed with errors"?
How can the mysterious gap of 895 records be explained?
What is a better, or the best, way to do this update? (I'm starting to think that what I'm doing could be very inefficient and/or error-prone.)