Open Visual Studio/Business Intelligence Development Studio (BIDS)/SQL Server Data Tools-BI edition (SSDT).
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx, which will open by default, right-click in the Connection Managers area and choose "New OLE DB Connection". In the Configure OLE DB Connection Manager dialog, click "New..." and then select the server and database for your source data. Click OK, OK.
Repeat the above process, but this time for your destination server (the linked server).
Rename the two connection managers from server\instance.databasename to something better. If the database name does not change across environments, just use the database name. Otherwise, go with a common name: if the database is SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, call the connection manager SLSDB. Otherwise, you end up with people talking about the connection manager named "sales dev database" when it's actually pointing at production.
Add a Data Flow Task to your package and call it something more useful than Data Flow Task. DFT Load Table2 would be my preference, but your mileage may vary.
Double-click the Data Flow Task. Here you will add an OLE DB Source, a Lookup Transformation and an OLE DB Destination. Probably, as always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your current query filters out any duplicates that already exist in the destination, and doing that across servers can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need there. For very large volumes in your source table this may be an untenable approach, and we'd need to do something different.
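For contrast, here is a rough sketch of the kind of cross-server filter we're avoiding; the linked server name LINKEDSRV and destination database DESTDB are placeholders for whatever yours are actually called:
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM dbo.TABLENAME1 AS s
WHERE NOT EXISTS
(
    -- four-part name through the linked server; every source row
    -- pays the cost of this remote check
    SELECT 1
    FROM LINKEDSRV.DESTDB.dbo.TABLENAME2 AS T2
    WHERE T2.id = s.id
);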
From the Source, connect a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, the one that points to the destination. On the General page, change how rows with no matching entries are handled from "Fail component" to "Redirect rows to no match output".
Use a query that pulls back just the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
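If you want a rough feel for whether that full cache will fit, a quick back-of-the-envelope query helps (assuming id is a 4-byte int; the Lookup cache adds per-row overhead on top of this, so treat it as a floor, not a ceiling):
SELECT
    COUNT_BIG(*)                   AS cached_rows
  , COUNT_BIG(*) * 4 / 1048576.0   AS approx_key_mb
FROM dbo.TABLENAME2;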
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in both the source and the destination. You don't care about those, as you are only interested in what exists in the source and not the destination. When you might care is if you have to determine whether the values have changed between source and destination. NoMatch are the rows that exist in the source but not in the destination. That's the stream you want. For completeness, Error would capture things that went very wrong, but I've not experienced it "in the wild" with a Lookup.
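If you do need that change detection, one option (a sketch only, assuming the destination carries the same columns as the source query) is to have the lookup query return the comparison columns as well, and then compare source versus lookup values in a downstream Conditional Split on the Match output:
SELECT
    T2.id
  , T2.title
  , T2.Initials
  , T2.[Last Name]
  , T2.[Address1]
FROM [dbo].[TABLENAME2] AS T2;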
Connect the NoMatch output to the OLE DB Destination. Select your table name there and make sure the data access mode is "Table or view - fast load". Click on the Mappings tab and make sure every column is mapped.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data, and see whether timeouts become a factor.
Design considerations for larger sets
It depends.
Really, it does. But I would start by identifying where the pain point lies.
If the source table is very large and pulling all that data into the pipeline just to filter most of it back out is too expensive, I'd look at adding a preceding Data Flow that brings the Lookup's rows (the T2 query) over to the source database, writes them into a staging table, and makes that one column the clustered key. Then modify your source query to reference the staging table, as sketched below.
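A minimal sketch of that staging approach, assuming you can create objects in the source database and that id is an int (names and types are placeholders to adjust):
CREATE TABLE dbo.STG_TABLENAME2_id
(
    id int NOT NULL
  , CONSTRAINT PK_STG_TABLENAME2_id PRIMARY KEY CLUSTERED (id)
);

-- the source query in the main data flow then filters locally
SELECT
    s.id
  , s.title
  , s.Initials
  , s.[Last Name]
  , s.[Address1]
FROM dbo.TABLENAME1 AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.STG_TABLENAME2_id AS stg
    WHERE stg.id = s.id
);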
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the Lookup is at fault - it can't pull in all the IDs - then either go with the first alternative listed above or look at changing its cache mode from Full to Partial. Do realize that Partial cache will issue a query against the target system for potentially every row that comes out of the source database.
If the destination is giving issues, I'd determine what the issue is. If it's network latency while loading the data, drop the Maximum insert commit size from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then staging the data to a different table on the remote server and doing the insert locally there might be an approach, as sketched below.
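As a sketch of that last fallback (the staging table name and columns are assumptions, and the NOT EXISTS guard is only needed if the Lookup no longer filters duplicates): the data flow writes into a staging table on the destination server, and an Execute SQL Task then runs something like this against that server.
INSERT INTO dbo.TABLENAME2
    (id, title, Initials, [Last Name], [Address1])
SELECT
    stg.id
  , stg.title
  , stg.Initials
  , stg.[Last Name]
  , stg.[Address1]
FROM dbo.STG_TABLENAME1 AS stg
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.TABLENAME2 AS T2
    WHERE T2.id = stg.id
);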