
I have a tricky flat file data source. The data is grouped, like this:

Country    City
U.S.       New York
           Washington
           Baltimore
Canada     Toronto
           Vancouver

But I want it to be in this format when it's loaded into the database:

Country    City
U.S.       New York
U.S.       Washington
U.S.       Baltimore
Canada     Toronto
Canada     Vancouver

Has anyone met such a problem before? Got an idea of how to deal with it?
The only idea I have now is to use a cursor, but it is just too slow.
Thank you!

William Xu
  • If you don't have another column, like an id, to help figure out which country the cities belong to, it's impossible for the DBMS to get what you need, because rows don't come in a guaranteed order. – Raffaello.D.Huke Apr 13 '16 at 02:50

2 Answers


Yes, it is possible. First you need to load the data into a table with an IDENTITY column:

-- DROP TABLE #t
CREATE TABLE #t (
    id INTEGER IDENTITY PRIMARY KEY,
    Country VARCHAR(20),
    City VARCHAR(20))

INSERT INTO #t(Country, City)
SELECT a.Country, a.City
 FROM OPENROWSET( BULK 'c:\import.txt', 
     FORMATFILE = 'c:\format.fmt',
     FIRSTROW = 2) AS a;

SELECT * FROM #t

The result will be:

id          Country              City
----------- -------------------- --------------------
1           U.S.                 New York
2                                Washington
3                                Baltimore
4           Canada               Toronto
5                                Vancouver

And now with a bit of recursive CTE magic you can populate the missing details:

;WITH a as(
    SELECT Country
          ,City
          ,ID
    FROM #t WHERE ID = 1
    UNION ALL
    SELECT COALESCE(NULLIF(LTrim(#t.Country), ''), a.Country)
          ,#t.City
          ,#t.ID
    FROM a INNER JOIN #t ON a.ID+1 = #t.ID
    )
SELECT * FROM a
 OPTION (MAXRECURSION 0)

Result:

Country              City                 ID
-------------------- -------------------- -----------
U.S.                 New York             1
U.S.                 Washington           2
U.S.                 Baltimore            3
Canada               Toronto              4
Canada               Vancouver            5

Update:

As Tab Alleman suggested in the comments, the same result can be achieved without the recursive query:

SELECT a.ID
     , COALESCE(NULLIF(LTrim(a.Country), ''),
                (SELECT TOP 1 t.Country
                 FROM #t t
                 WHERE t.ID < a.ID AND LTrim(t.Country) <> ''
                 ORDER BY t.ID DESC)) AS Country
     , a.City
FROM #t a
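
Both the recursive CTE and the correlated subquery implement the same "fill-down" rule: carry the last non-blank Country forward in ID order. A minimal Python sketch of that rule on the sample data, purely to illustrate what the queries compute:

```python
def fill_down(rows):
    """Replace a blank Country with the last non-blank value seen above it.

    rows: list of (country, city) tuples in ID order.
    """
    last = None
    filled = []
    for country, city in rows:
        if country.strip():          # non-blank: remember it
            last = country.strip()
        filled.append((last, city))  # blank: reuse the remembered value
    return filled

data = [("U.S.", "New York"), ("", "Washington"), ("", "Baltimore"),
        ("Canada", "Toronto"), ("", "Vancouver")]
print(fill_down(data))
# → [('U.S.', 'New York'), ('U.S.', 'Washington'), ('U.S.', 'Baltimore'),
#    ('Canada', 'Toronto'), ('Canada', 'Vancouver')]
```

Note that this only works because the IDENTITY column preserves the file's row order; without it, "last non-blank value" is undefined.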

BTW, the format file for your input data is this (if you want to try the scripts, save the input data as c:\import.txt and the format file below as c:\format.fmt):

9.0
  2
  1       SQLCHAR       0       11      ""       1     Country      SQL_Latin1_General_CP1_CI_AS
  2       SQLCHAR       0       100     "\r\n"   2     City         SQL_Latin1_General_CP1_CI_AS
cha
  • Thanks cha. I tried your idea, but it didn't work in my case, because I actually have 20k+ rows in my table, so I got an error saying something like "the CTE recurs for more than 100 times" – William Xu Apr 13 '16 at 15:49
  • You can use a self-join instead of the recursive CTE to avoid the recursion limit. – Tab Alleman Apr 13 '16 at 16:02
  • Thanks @TabAlleman. I have updated my answer to include a version without recursion – cha Apr 14 '16 at 04:03

The answer by cha will work, but here is another approach in case you need to do it in SSIS without temporary/staging tables:

You can run your dataflow through a Script Transformation that uses a Data Flow-level variable. As each row comes in, the script checks the value of the Country column.

If it has a non-blank value, populate the variable with that value and pass the row along in the dataflow.

If Country has a blank value, overwrite it with the value of the variable, which will be the last non-blank Country value you saw.
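
The per-row logic of that script, sketched here in Python for clarity (the real Script Component would be C# or VB.NET, and `last_country` would be a script-local variable, not a package variable — see the edit below):

```python
class CountryFiller:
    """Mimics the Script Transformation: holds the last non-blank Country
    in local state and overwrites blank values as rows stream through."""

    def __init__(self):
        self.last_country = ""  # stands in for the script-local variable

    def process_row(self, row):
        if row["Country"].strip():
            # Non-blank: remember it for the rows that follow.
            self.last_country = row["Country"].strip()
        else:
            # Blank: fill in the last Country we saw.
            row["Country"] = self.last_country
        return row

filler = CountryFiller()
rows = [{"Country": "Canada", "City": "Toronto"},
        {"Country": "", "City": "Vancouver"}]
filled = [filler.process_row(r) for r in rows]
# filled[1]["Country"] is now "Canada"
```

Because this is a single streaming pass with O(1) state per row, it avoids the row-by-row round trips that make a cursor slow.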

EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):

The collection of ReadWriteVariables is only available in the PostExecute method to maximize performance and minimize the risk of locking conflicts. Therefore you cannot directly increment the value of a package variable as you process each row of data. Increment the value of a local variable instead, and set the value of the package variable to the value of the local variable in the PostExecute method after all data has been processed. You can also use the VariableDispenser property to work around this limitation, as described later in this topic. However, writing directly to a package variable as each row is processed will negatively impact performance and increase the risk of locking conflicts.

That comes from this MSDN article, which also has more information about the Variable Dispenser work-around, if you want to go that route, but apparently I misled you above when I said you could set the value of the package variable in the script. You have to use a variable that is local to the script, and then change it in the Post-Execute event handler. I can't tell from the article whether that means you will not be able to read the variable in the script, and if that's the case, then the Variable Dispenser would be the only option. Or I suppose you could create another variable that the script has read-only access to, and set its value to an expression so that it always has the value of the read-write variable. That might work.

Tab Alleman
  • Thanks buddy. I tried using a cursor, which I think is similar to your idea, but I actually have 20k+ rows, so running row-by-row is just too slow... – William Xu Apr 13 '16 at 15:52
  • Ok, I tried your method, but I always got the error message "The collection of variables locked for read and write access is not available outside of PostExecute". But I am putting the variable assignment in PostExecute... sigh, don't know what went wrong. Thanks – William Xu Apr 13 '16 at 22:02
  • Maybe you're not allowed to read the ReadWrite variable in the script either. I edited my answer with some new information I learned. – Tab Alleman Apr 14 '16 at 12:51
  • Thanks Tab Alleman, I used your idea and it works perfectly! It's way faster to massage the data in the dataflow before they're loaded into SQL Server. – William Xu Apr 14 '16 at 15:39
  • Glad to hear it, and good to know for my own future references! – Tab Alleman Apr 14 '16 at 15:41