I am working on a T-SQL script with the following logic:

declare @tbl table(id int)

insert @tbl select distinct id -- <-- this takes about 3 minutes
  from link1.some_table
  where <some_conditions>

insert some_local_table(field1, field2, fieldN)  -- <-- this takes about 30 minutes
  select field1, field2, fieldN
  from link2.some_table
  where id in (select id from @tbl)
    and <some conditions>

insert some_OTHER_local_table(field1, field2, fieldN) -- <-- another 30 minutes
  select field1, field2, fieldN
  from link2.some_OTHER_table
  where id in (select id from @tbl)
    and <some conditions>

insert some_YET_OTHER_local_table(field1, field2, fieldN) -- <-- another 30m, and so on
  select field1, field2, fieldN
  from link2.some_YET_OTHER_table
  where id in (select id from @tbl)
    and <some conditions>

... and so on; there are 5 inserts into 5 different local tables, all sourced from linked server link2 but filtered using @tbl, which originates from linked server link1.

Populating each of the local tables takes about 30 minutes, so the whole script takes about 2.5 hours.

I am trying to parallelize this process, so I am putting each of the 5 inserts into a separate SSIS task and executing them concurrently. However, I have no clue how to share the @tbl table variable between tasks in SSIS. In a script, @tbl is just a variable, but SSIS has no table variables. By the way, the table holds about 200K id values, so building a comma-separated string is not feasible.

So the question is: how to implement a shared, "disposable", package-level recordset object that would be initialized from a database query and then used in a couple of parallel tasks within the same package?

I am on SQL Server 2008 R2.

Piotr L
  • You could use a global temp table instead of a table variable. Or even a permanent table that you truncate at the end of the package. – Tab Alleman Feb 25 '15 at 14:20

1 Answer


Couldn't you use a subquery in each of your tasks rather than the table variable?

insert some_local_table(field1, field2, fieldN)
  select field1, field2, fieldN
  from link2.some_table
  where id in (
      select id
      from link1.some_table
      where <some_conditions>
    )
    and <some conditions>

... and so on?

You'd be repeating the 3-minute query in each task but that's small compared to the 2.5 hours you say the package is taking when run serially.

Or, since link1 is a linked server and it might be faster to cache the data locally (as you are doing in the original sequence), give each task its own copy of the table variable:

declare @tbl table(id int)

insert @tbl select distinct id -- <-- this takes about 3 minutes
  from link1.some_table
  where <some_conditions>

insert some_local_table(field1, field2, fieldN)  -- <-- this takes about 30 minutes
  select field1, field2, fieldN
  from link2.some_table
  where id in (select id from @tbl)
    and <some conditions>

... and then

declare @tbl table(id int)

insert @tbl select distinct id -- <-- this takes about 3 minutes
  from link1.some_table
  where <some_conditions>

insert some_OTHER_local_table(field1, field2, fieldN) -- <-- another 30 minutes
  select field1, field2, fieldN
  from link2.some_OTHER_table
  where id in (select id from @tbl)
    and <some conditions>

etc.
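Alternatively, as the comment on the question suggests, the 3-minute query could run once into a shared table that every parallel task can see. A sketch of that variant, assuming a hypothetical permanent staging table dbo.tbl_ids (a global temp table ##tbl would also work, but in SSIS it only survives across tasks if the creating connection is kept open, e.g. with RetainSameConnection=True on the connection manager):

```sql
-- One "prepare" task, run once before the five parallel inserts.
-- dbo.tbl_ids is a hypothetical permanent staging table;
-- truncate it at the start (or end) of each package run.
truncate table dbo.tbl_ids;

insert dbo.tbl_ids (id)
  select distinct id          -- the same 3-minute query, run only once
  from link1.some_table
  where <some_conditions>;

-- Each of the five parallel tasks then filters against the shared snapshot:
insert some_local_table (field1, field2, fieldN)
  select field1, field2, fieldN
  from link2.some_table
  where id in (select id from dbo.tbl_ids)
    and <some conditions>;
```

This also gives every task the same static snapshot of the ids, avoiding the risk of link1.some_table changing between the per-task copies of the query.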

Ann L.
  • Thanks Ann L. Copying the variable between individual tasks would certainly work. There is, however, a small risk that the data will change at link1.some_table, resulting in incoherent results between the queries. Now, caching the data in a local table rather than a local variable is tempting (we get a static @tbl data snapshot locally), but I am afraid of writing a JOIN clause between a linked server table and a local table - the remote query is already very complex (it is a view really). I will give it a try though, and if it works, will mark your answer as correct. Thanks again! – Piotr L Feb 25 '15 at 14:10