I'm trying to get rid of some spurious warnings in my SSIS Progress log. I'm getting a bunch of warnings about unused columns in tasks that use raw SQL to do their work. I have a Data Flow responsible for archiving data in a staging table prior to loading new data. The Data Flow looks like this:

+--------------------+
| OLEDB Source task: |
| read staging table |
+--------------------+
          |
          |
+---------------------------+
| OLEDB Command task:       |
| upsert into history table |
+---------------------------+
          |
          |
+---------------------------+
| OLEDB Command task:       |
| delete from staging table |
+---------------------------+

My 'upsert' task is something like:

--------------------------------------
-- update existing rows first...
update h
set    field1 = s.field1
    ...
from   history h
inner  join staging s
on     h.id = s.id
where  h.last_edit_date <> s.last_edit_date -- only update changed records

-- ... then insert new rows
insert into history
select s.*
from   staging s
left   join history h
on     h.id = s.id
where  h.id is null
--------------------------------------

The cleanup task is also a SQL command:

--------------------------------------
delete from staging
--------------------------------------

Since the upsert task doesn't have any output column definitions, I'm getting a bunch of warnings in the log:

[DTS.Pipeline] Warning: The output column "product_id" (693) on output 
"OLE DB Source Output" (692) and component "read Piv_product staging table" (681) 
is not subsequently used in the Data Flow task. Removing this unused output column 
can increase Data Flow task performance. 

How can I eliminate the references to those columns? I've tried dropping in a few different tasks, but none of them seem to let me 'swallow' the input columns and suppress them from the task's output. I'd like to keep my logs clean so I only see real problems. Any ideas?

Thanks!

Val

6 Answers

Union All - select only the columns you want to pass through - delete any others.

I thought they were going to address this in the 2008 version to allow columns to be trimmed/suppressed from the pipeline.

Cade Roux

OK, I got a workaround on the MSDN forums:

use a Script Component transformation between task 1 and 2; select all the input columns; leave the script body empty.

That consumes the columns; the job processes properly and no warnings are logged.

It's still not clear why I need the OLEDB Source at all, since the SQL in task 2 connects to the source tables and does all the work. But when I remove the OLEDB Source, the dataflow runs without processing any rows, so the staging table never gets emptied, and the downstream process that puts changed rows into the staging table then fails with PK violations. That's a problem for another day, though. This is a bit clunky, but my logs are clean.

Val

The warnings in your pipeline are caused by columns being selected in your data source that aren't being used in any subsequent Tasks.

The easy fix is to double-click your data source (in your case, the "OLEDB Source task: read staging table"), then click on Columns and deselect any columns that you don't need in any of your downstream task items.

This will remove those warnings from your progress log.

However, reading your question and the other answers, you aren't using any of the Source task's columns in the subsequent items, so the source can simply be removed.

Dale Wright

I had the same question and received a good answer. You can find it here.

As "Mark Wojciechowicz" said :

Here are some recommendations for reducing complexity, which will, in turn, improve performance:

  • Reduce the columns at the source. If you are selecting columns that are not subsequently used in any way, then remove them from the query or uncheck them from the source component. Removing columns in this way removes them from the buffer, which will occupy less memory.
  • Reduce the number of components in the dataflow. Very long dataflows are easy to create, a pain to test and even harder to maintain. Dataflows are expecting a unit of work, i.e. a data stream from here to there with a few things in the middle. This is where dataflows shine, in fact, they protect themselves from complexity with memory limitations and a max number of threads. It is better to divide the work into separate dataflows or stored procs. You could stage the data into a table and read it twice, rather than use a multicast, for example.
  • Use the database. SSIS is as much an orchestration tool as it is a data-moving tool. I have often found that using simple dataflows to stage the data, followed by calls to stored procedures to process the data, always out-performs an all-in-one dataflow.
  • Increase the number of times you write the data. This is completely counter intuitive, but if you process data in smaller sets of operations, it is faster running and easier to test. Given a clean slate, I will often design an ETL to write data from the source to a staging table, perform a cleansing step from the stage table to another, optionally, add a conforming step to combine data from different sources to yet another table and, finally, a last step to load a target table. Note that each source is pushed to its own target table and later combined, leveraging the database. The first and last steps are set up to run fast and avoid locking or blocking at either end.
  • Bulk Load. The prior step really does well when you ensure that bulk loading is happening. This can be a tricky thing, but generally you can get there by using "fast load" in the OLEDB destination and by never using the OLEDB Command. Removing indexes and re-adding them is faster than loading in place (with few exceptions).
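
To make the first recommendation in the quoted list concrete: if the OLEDB Source uses a SQL command rather than a table name, listing only the columns the downstream components actually consume keeps everything else out of the pipeline buffers. A minimal sketch against the staging table from the question (the exact column list is an assumption; in the original Data Flow no columns are consumed at all, which is why other answers suggest dropping the source entirely):

--------------------------------------
-- source query trimmed to only the columns the Data Flow uses,
-- instead of "select * from staging"
select id,             -- key used to match history rows
       last_edit_date  -- change-detection column
from   staging
--------------------------------------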
Ardalan Shahgholi

Looking at your problem again, I think you are using SSIS "against the grain". I'm not really following what you are reading out of the staging table, since your upsert doesn't seem to depend on anything in a particular row, nor does the cleanup.

It would seem to me that the cleanup would run once for every row, but that doesn't really make sense.

DataFlows are not typically used to perform bulk actions once per row coming down the pipeline. If you are using a pipeline, UPSERTs get handled with a Lookup (or third-party TableDifference) component and then a split in the pipeline to an OLEDB Destination (BULK INSERT) and either an OLEDB Command (run once per UPDATE) or another OLEDB Destination feeding an "UPDATE staging table".

Normally, I would do this with a DataFlow to load the staging table without any split, then a single Execute SQL Task in the control flow to perform everything else in straight SQL (the UPSERT you already have) by calling an SP.
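
A rough sketch of what such an SP could look like, reusing the UPDATE/INSERT from the question and folding in the staging cleanup (the procedure name is hypothetical, and the column list is elided the same way as in the question):

--------------------------------------
-- hypothetical procedure called from one Execute SQL Task
-- after the Data Flow has loaded the staging table
create procedure dbo.archive_staging
as
begin
    set nocount on

    -- update history rows whose staging counterpart changed
    update h
    set    field1 = s.field1
        -- ...
    from   history h
    inner  join staging s
    on     h.id = s.id
    where  h.last_edit_date <> s.last_edit_date

    -- insert staging rows not yet present in history
    insert into history
    select s.*
    from   staging s
    left   join history h
    on     h.id = s.id
    where  h.id is null

    -- clear the staging table once it has been archived
    delete from staging
end
--------------------------------------

Wrapping the three statements in a transaction is an easy way to keep the archive-and-clear step atomic.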

OLEDB Command is useful if you DON'T want to have a staging table, and instead read a flat file and want to execute an UPDATE or INSERT using a Lookup component or similar. But it will be called once for every row in the pipeline.

Cade Roux

Your Data Flow Task should finish with the "upsert". Then, back in the control flow, create an Execute SQL Task for the delete from staging and link your Data Flow Task to the Execute SQL Task.

I don't use a 3rd party tool for my upserts, but do as Cade suggests: split your dataflow so that new records head straight to an OLE DB Destination (or similar), and update records go to your OLE DB Command for updates. You can split the flow using a Merge Join or a Lookup.
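
For what the Lookup-based split involves on the SQL side, here is a hedged sketch using the column names from the question (the actual parameter mappings are assumptions you set up in the component editors): the Lookup's reference query identifies rows that already exist in history, matched rows go to an OLE DB Command running a parameterized UPDATE, and unmatched rows go to an OLE DB Destination with fast load.

--------------------------------------
-- reference query for the Lookup component
select id, last_edit_date
from   history

-- statement for the OLE DB Command handling matched (existing) rows;
-- each ? is mapped to a pipeline column in the component editor
update history
set    field1         = ?,
       last_edit_date = ?
where  id             = ?
--------------------------------------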

AdamH
  • Adam, thanks -- you're exactly right, and a day or so after I posted this I finally came to that realization. Cade's right, too - I was using SSIS against the grain. I moved the DELETE code to the Control Flow after the Data Flow completes, and all's happy now. Thanks! – Val Nov 21 '08 at 01:02