
I'm working with a database that stores files in ordinary varbinary(max) columns. I want to change the table to use FILESTREAM instead, so I need a script that creates the new FILESTREAM column and copies the old data into it. This will run on databases that can have thousands of files comprising hundreds of GB of data.

I need the script to handle errors and interruptions (disconnects, memory issues, other crashes, and so on) so that it can be run again and continue without loss of data.

-- Rename old column and create a new column with the old name.
-- Check if the temporary renamed column exists in case this is
-- resuming from a failed prior attempt.
IF NOT EXISTS (SELECT 1 
               FROM sys.columns 
               WHERE [name] = 'old_DataColumn' AND [object_id] = OBJECT_ID('dbo.TableWithData'))
BEGIN
    EXEC sp_rename 'dbo.TableWithData.DataColumn', 'old_DataColumn', 'COLUMN';

    ALTER TABLE [TableWithData]
    ADD [DataColumn] VARBINARY(MAX) FILESTREAM NULL;
END

DECLARE @Id UNIQUEIDENTIFIER;

DECLARE [DataTransferCursor] CURSOR LOCAL FOR 
    SELECT [Id] 
    FROM [TableWithData] 
    WHERE [DataColumn] IS NULL AND [old_DataColumn] IS NOT NULL;
OPEN [DataTransferCursor];

FETCH NEXT FROM [DataTransferCursor] INTO @Id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE [TableWithData]
    SET [DataColumn] = [old_DataColumn]
    WHERE [Id] = @Id;

    FETCH NEXT FROM [DataTransferCursor] INTO @Id;
END

CLOSE [DataTransferCursor];
DEALLOCATE [DataTransferCursor];

-- Do not drop old column until all data has been copied.
IF NOT EXISTS(SELECT 1 
              FROM [TableWithData] 
              WHERE [DataColumn] IS NULL AND [old_DataColumn] IS NOT NULL)
BEGIN
    ALTER TABLE [TableWithData]
    DROP COLUMN [old_DataColumn];
END

Are the cursor query and the check before dropping the column enough to prevent dropping the column prematurely? Is there a more efficient way of achieving this?

Also, are the CLOSE and DEALLOCATE statements necessary for local cursors?

  • Whenever I have had to do something like this in the past, I have written a small external program (C# or PowerShell) that moves batches of records at a time, potentially at a time of day when the system isn't under heavy load, and I let it tick away for hours/days/weeks, however long is required to complete. I wrap each move in a transaction, so it's all or nothing. I wouldn't try to do it in a single SQL script. – Dale K Mar 09 '23 at 21:02
  • I would suggest avoiding dropping columns until all is done. I'd probably create a progress table where you log every successful conversion; that makes it easy to resume (a sketch of this idea follows these comments). Also, what about backups? You probably want to run them once in a while so the transaction log doesn't grow too much. – siggemannen Mar 09 '23 at 23:14
  • When setting `[DataColumn]`, consider also nulling out `[old_DataColumn]` in the same statement: `SET [DataColumn] = [old_DataColumn], [old_DataColumn] = NULL`. Without this, each row roughly doubles in size and SQL Server spends a lot of time reallocating data; with it, the row size may stay close enough to the original that SQL Server can store the updated row in place. Although possibly redundant, I would also restate the `[DataColumn] IS NULL AND [old_DataColumn] IS NOT NULL` conditions in the `UPDATE` statement. – T N Mar 10 '23 at 03:27
  • As for the flow, my usual technique for mass updates is to (1) select all of the target row IDs matching the initial condition into a table variable such as `@SelectedIds`, (2) create a `WHILE EXISTS(SELECT * FROM @SelectedIds) BEGIN ... END` loop, (3) select a batch of, say, 1000 IDs into a second `@BatchIds` table variable, (4) delete those `@BatchIds` values from `@SelectedIds`, (5) use `@BatchIds` to join and update the target table, (6) loop. Updating batches is usually faster than processing one row at a time, as long as the batch size isn't too large (see the batched sketch after these comments). – T N Mar 10 '23 at 03:35
  • Addendum to the above: (5.5) empty the `@BatchIds` table variable before looping. – T N Mar 10 '23 at 03:51
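A minimal sketch of the progress-table idea from siggemannen's comment, in the question's own T-SQL. The table name dbo.FilestreamConversionProgress and its ConvertedAt column are hypothetical, not part of the original script; the transactional block is meant to replace the bare UPDATE inside the cursor loop above, where @Id holds the current row's key.

-- Hypothetical progress table; create it once before the first run.
IF OBJECT_ID('dbo.FilestreamConversionProgress') IS NULL
BEGIN
    CREATE TABLE dbo.FilestreamConversionProgress
    (
        [Id]          UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
        [ConvertedAt] DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );
END

-- Inside the cursor loop: copy and log as one all-or-nothing unit,
-- so a rerun can trust the log to know which rows are already done.
BEGIN TRANSACTION;

UPDATE [TableWithData]
SET [DataColumn] = [old_DataColumn],
    [old_DataColumn] = NULL
WHERE [Id] = @Id
  AND [DataColumn] IS NULL
  AND [old_DataColumn] IS NOT NULL;

INSERT INTO dbo.FilestreamConversionProgress ([Id])
VALUES (@Id);

COMMIT TRANSACTION;

Logging in the same transaction as the copy keeps the progress table and the data consistent even if the script dies mid-row.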
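And a sketch of the batched flow from T N's comments, folding in the null-out-the-old-column trick from the earlier comment. The batch size of 1000 and the table-variable names come straight from the comments; the rest reuses names from the question. Treat it as an illustration of the technique under those assumptions, not a verified drop-in replacement for the cursor.

DECLARE @SelectedIds TABLE ([Id] UNIQUEIDENTIFIER PRIMARY KEY);
DECLARE @BatchIds    TABLE ([Id] UNIQUEIDENTIFIER PRIMARY KEY);

-- (1) Collect every row that still needs converting.
INSERT INTO @SelectedIds ([Id])
SELECT [Id]
FROM [TableWithData]
WHERE [DataColumn] IS NULL
  AND [old_DataColumn] IS NOT NULL;

-- (2) Loop until nothing is left.
WHILE EXISTS (SELECT * FROM @SelectedIds)
BEGIN
    -- (3) Take a batch of 1000 IDs.
    INSERT INTO @BatchIds ([Id])
    SELECT TOP (1000) [Id]
    FROM @SelectedIds;

    -- (4) Remove the batch from the outstanding set.
    DELETE s
    FROM @SelectedIds AS s
    INNER JOIN @BatchIds AS b
        ON b.[Id] = s.[Id];

    -- (5) Copy the batch, nulling the old column in the same statement
    -- so rows do not double in size, and restating the conversion
    -- conditions defensively.
    UPDATE t
    SET t.[DataColumn] = t.[old_DataColumn],
        t.[old_DataColumn] = NULL
    FROM [TableWithData] AS t
    INNER JOIN @BatchIds AS b
        ON b.[Id] = t.[Id]
    WHERE t.[DataColumn] IS NULL
      AND t.[old_DataColumn] IS NOT NULL;

    -- (5.5) Empty the batch table before looping.
    DELETE FROM @BatchIds;
END

Because step (1) only picks up unconverted rows and the UPDATE restates the same conditions, the loop can simply be rerun after an interruption, and the existence check before DROP COLUMN in the original script still applies unchanged.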

0 Answers