
I have an SSIS routine that reads from a very dynamic table and inserts whichever rows it finds into a table in a different database, before truncating the original source table.

Because new rows keep arriving in the source table while the package runs, the truncation unsurprisingly removes rows that never made it to the second database.

What is the best way of deleting only those rows that have been migrated?

There is an identity column on the source table but it is not migrated across.

I can't change either table schema.

pwmusic
  • Is the source SQL Server, and if so, what version (2005/2008/2012)? Are the two on the same instance? I *would* find it surprising that data is being lost. I don't lose data and would not accept a solution that was leading to that. Can you post a screenshot of what your package looks like so that we can get an idea of how it is structured and perhaps we can provide strategies on how to prevent this loss. – billinkc Mar 19 '13 at 15:58
  • Source and destination are both SQL-Server 2008 R2. Sorry, I can't find a way of attaching a screenshot. – pwmusic Mar 19 '13 at 16:26
  • The processing is as follows: OLE DB Source: Data Access Mode is 'Table or View' -> RowCount -> Batch Insert into destination -> Fail-over of singleton inserts into destination -> Row-count errors -> Record any errors -> TRUNCATE the source table. I hope that helps. – pwmusic Mar 19 '13 at 16:30
  • Perhaps I'm being overly cautious but this procedure could take up to an hour, during which time the source table will have had a number of new rows inserted into it. I believe the table truncation will 'lose' these new insertions. – pwmusic Mar 19 '13 at 16:34
  • 1
    An *hour*? Yikes. Are you moving BLOB types (text/ntext/image/nvarchar(max)/varchar(max)/varbinary(max) or is the volume just outrageously high (or hardware that bad?) – billinkc Mar 19 '13 at 16:39
  • Hi @billinkc Thanks for your time and suggestions so far. For the initial run we have 2million+ rows, with hundreds being inserted every second. Due to it being on a shared resource we're only looking to schedule a daily run of this routine and although it will be used less frequently during the night it is still a 24/7 system. In a language I once used I would declare the foreach as an update cursor, insert the values into the new table, then delete from the source table from the current position. I can't afford to 'miss' even a single row. – pwmusic Mar 19 '13 at 17:37

2 Answers


An option that might sound stupid, but works, is to delete first and use the OUTPUT clause.

(screenshot: control flow setup)

I created a simple control flow that populates a table for me.

IF EXISTS
(
    SELECT 1 FROM sys.tables AS T WHERE T.name = 'DeleteFirst'
)
BEGIN
    DROP TABLE dbo.DeleteFirst;
END

CREATE TABLE dbo.DeleteFirst
(
    [name] sysname
);

INSERT INTO
    dbo.DeleteFirst
SELECT
    V.name
FROM
    master.dbo.spt_values V
WHERE
    V.name IS NOT NULL;

(screenshot: data flow)

In my OLE DB Source, instead of using a SELECT, DELETE the data you want to send down the pipeline and OUTPUT the DELETED virtual table. Something like

DELETE
    DF
OUTPUT
    DELETED.*
FROM
    dbo.DeleteFirst AS DF;

(screenshot: results)

It works, it works!
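
Applied to the question's own tables, the SQL command for the OLE DB Source might look something like the sketch below. The names dbo.SourceTable and SourceId are placeholders, not the actual schema; the point is that each row only enters the pipeline because it has already been removed from the source, so no TRUNCATE is needed and rows that arrive mid-run simply wait for the next execution.

-- Placeholder names: dbo.SourceTable with identity column SourceId.
-- Rows are deleted and streamed down the data flow in one statement,
-- so nothing is truncated and rows inserted mid-run are left alone.
DELETE
    S
OUTPUT
    DELETED.*   -- or list the migrated columns explicitly and leave out SourceId
FROM
    dbo.SourceTable AS S;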

billinkc
  • Hi @billinkc Many thanks for your very comprehensive suggestion, only some of which I was able to understand, I'm afraid. I finally got round this by rewriting the timing of the table truncation. Thanks again for your time and expertise. – pwmusic Jul 29 '13 at 09:10
  • To clarify the Microsoft documentation, which contradicts itself, the OUTPUT clause is available and working from SQL 2005 onwards. I have just tested 2005 to verify this. – Mike Feb 12 '15 at 22:55
  • 1
    @mike Yeah, MS started a weird thing with the BOL rewrite. Basically, a feature for an unsupported version of the product is written out of documentation. If you click the drop down for Other Versions, you'll see 2005 listed and there it specifies that it's available. Despite the 2014 version stating "Applies to: SQL Server (SQL Server 2008 through current version), Azure SQL Database." – billinkc Feb 12 '15 at 23:13

One option would be to create a table to log the identity of your processed records into, and then a separate package (or dataflow) to delete those records. If you're already logging processed records somewhere then you could just add the identity there - otherwise, create a new table to store the data.
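
A minimal sketch of that first option, assuming the source table is dbo.SourceTable with identity column SourceId and the log table is dbo.MigratedRows (all names invented for illustration): the data flow writes each migrated identity into the log, and a follow-on Execute SQL task deletes only those rows.

-- Invented names: dbo.SourceTable (identity column SourceId), dbo.MigratedRows (log).
-- One-time setup for the log table.
CREATE TABLE dbo.MigratedRows
(
    SourceId int NOT NULL PRIMARY KEY
);

-- After the data flow has copied the rows and recorded their identities in the
-- log, remove only the logged rows from the source, then clear the log.
DELETE S
FROM dbo.SourceTable AS S
    INNER JOIN dbo.MigratedRows AS M
        ON M.SourceId = S.SourceId;

TRUNCATE TABLE dbo.MigratedRows;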

A second option: if you're trying to avoid creating additional tables, then separate record selection from record processing into two stages. Broadly, you'd select all your records in the control flow, then process them one-by-one in the dataflow (a sketch of the SQL involved follows the list). Specifically:

  1. Create a variable of type Object to store your record list, and another variable matching your identity type (int presumably) to store the 'current record identity'.
  2. In the control flow, add an Execute SQL task which uses a query to build a list of identity values to process, then stores them into the recordlist variable.
  3. Add a Foreach Loop Container to process that list; the foreach task would load the current record identifier into the second variable you defined above.
  4. In the foreach task, add a dataflow to copy that single record, then delete it from the source.
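
A hedged sketch of the SQL those steps might run, again with invented names (dbo.SourceTable, identity column SourceId, migrated columns Col1 and Col2); the ? placeholders are the OLE DB parameters you map to the loop variable.

-- 1. Execute SQL task: capture the identities that exist right now; rows
--    inserted while the loop runs are simply picked up on the next cycle.
SELECT SourceId
FROM dbo.SourceTable;

-- 2. Inside the Foreach loop, the data flow's OLE DB Source pulls just the
--    current row (? maps to the 'current record identity' variable)...
SELECT Col1, Col2
FROM dbo.SourceTable
WHERE SourceId = ?;

-- 3. ...and once that row has landed in the destination, an Execute SQL task
--    removes it from the source.
DELETE FROM dbo.SourceTable
WHERE SourceId = ?;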

There are quite a few examples of this online; e.g. this one from the venerable Jamie Thomson, or this one which includes a bit more detail.

You didn't mention the scale of the data; if you have very large numbers of records, the first suggestion is likely the better choice. Note that in both cases you lose the speed advantage of TRUNCATE, because you're using standard DELETE calls instead.

Geoff
  • 2
    An alternative to shoving it into an object variable is to use the RAW file format. You'll have a disk cost associated with staging the data but consumption will be much more straight forward than shredding it. JamieT also wrote about that approach – billinkc Mar 19 '13 at 16:41