
I need to stage a couple of thousand txt files, using SSIS, into a SQL Server 2012 instance.

The issue is that some of the txt files have control checksums in their first two lines and some others don't.

If a file has a checksum, its first line starts with "HEADER" - in that case I have to skip the first two lines, then read column headers from line #3 and load data starting from line #4.

If a file does not have a checksum, column names will exist in line #1 and data starts at line #2.
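
To make that rule concrete, the check boils down to something like the sketch below (just a sketch; the method name is a placeholder):

    // Peek at the first line and decide how many rows precede the column headers
    // (needs System.IO).
    static int RowsToSkipBeforeHeaders(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string firstLine = reader.ReadLine() ?? string.Empty;
            // Checksum files start with "HEADER": skip the two checksum lines,
            // so column names sit on line #3 and data starts on line #4.
            return firstLine.StartsWith("HEADER") ? 2 : 0;
        }
    }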

Now, my question is: what is the least invasive way of dealing with such a scenario? I am lazy, so I am looking for minimum effort with maximum effect.

My ideas so far:

  1. Detect the existence of checksums with a C# scripting component and build two separate flows, one for each file type. Disadvantage: two almost identical flows (I am not a big fan of redundancy).

  2. Use PowerShell first to strip the checksums from the files before running the SSIS flow. Disadvantage: plenty of files need to be rewritten (performance hit).

  3. Ask on StackOverflow. Disadvantage: can be seen as ironic by the StackOverflow community.

Any hints?

Piotr L
  • Maybe including a conditional split on the first column, discarding both headers and checksum rows, will work for you. – vercelli Jun 09 '16 at 09:04
  • Well, the thing is that the checksum rows have a different column structure than the actual columns / data. I don't see how a conditional split would help here. – Piotr L Jun 09 '16 at 09:42
  • But if you skip the first row you will get rid of the checksum, right? Then you should discard the remaining headers on a conditional split (I guess the headers do have the same structure as the data). – vercelli Jun 09 '16 at 09:52
  • OK, more details: in a file with checksums, the first row starts with the string "HEADER" followed by a bunch of numbers and characters, with no column separators. The second line has 3 columns: one string and two numbers. The 3rd line contains the actual "normal" data headers, and lines #4 up to #whatever are the data rows to be loaded. – Piotr L Jun 09 '16 at 09:56
  • I would go with idea #1, but in my head canon, where two things are ALMOST identical, there is no redundancy. – Tab Alleman Jun 09 '16 at 13:22

2 Answers


Here's an approach.

  1. Create 2 Flat File Connection Managers and, in design mode, browse to a sample file for each so you can set up the columns etc.
  2. For the Flat File Connection Manager without checksum rows, leave Header Rows to Skip at 0; for the Flat File Connection Manager with checksum rows, set it to skip the first 2 rows (screenshot of the setting omitted).
  3. Set the DelayValidation property of both connection managers to true.
  4. In your Data Flow Task, use a Script Component to detect whether the file has checksum rows (see the sketch after this list).
  5. In the output of the Script Component, have the full path to the file as an output column, plus a flag that indicates whether or not the file has checksum rows, e.g. HasCheckSumRows.
  6. Connect the output of the Script Component to a Conditional Split.
  7. In the Conditional Split, split on the HasCheckSumRows flag and create 2 outputs: one for files with checksum rows, the other for files without.
  8. Create 2 Flat File Sources: one for the checksum-row files, using the flat file connection manager you created for that file type, and the other for the regular files without checksum rows.
  9. Connect the 2 outputs from the Conditional Split to the respective Flat File Source, using the full file path from the output column as the connection string for the flat file connection.
  10. Finally, use a Union All component to merge the rows from both file types back into 1 data stream. This way the workflow isn't duplicated (i.e. no redundancy), since you end up with all the rows in 1 stream.
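
For step 4, the detection script could look roughly like the sketch below (the *.txt filter, the User::SourceFolder variable, and the exact output column names are assumptions, so adjust them to your setup):

    // Script Component configured as a Source, with output columns
    // FilePath (DT_WSTR) and HasCheckSumRows (DT_BOOL) and a read-only
    // package variable User::SourceFolder (add "using System.IO;" at the top).
    public override void CreateNewOutputRows()
    {
        foreach (string file in Directory.GetFiles(Variables.SourceFolder, "*.txt"))
        {
            // Peek at the first line: checksum files start with "HEADER".
            string firstLine;
            using (var reader = new StreamReader(file))
            {
                firstLine = reader.ReadLine() ?? string.Empty;
            }

            Output0Buffer.AddRow();
            Output0Buffer.FilePath = file;
            Output0Buffer.HasCheckSumRows = firstLine.StartsWith("HEADER");
        }
    }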

Hope this makes sense.

Shiva

I would have written this in a comment, but you know, I am new to contributing after using the site for so many years. Anyway, your option 1, redundant data flows plus a check for the existence of the checksum line, is probably the most straightforward approach, and since your data set's columns are limited it may also be the fastest to create and maintain.

For that approach you could put in a Foreach Loop container and set it up to enumerate the files. Add a variable for flagging whether the checksum is there. Add a C# Script Task to test for the checksum and populate the variable (see the sketch below). Then add 2 Data Flow Tasks, one to handle the first format and the second for the alternative format. It would only be a duplication of 4 components (flat file connection, data flow, source, destination). Then, in the Precedence Constraint Editor (double-click the green arrow), change the evaluation to Expression and Constraint, and in the Expression simply use your variable or !variable (its opposite), depending on which data flow is being fed.

This assumes you are loading to staging without first combining all 2000+ files in the same Data Flow Task with unions etc. There is definitely a lot of crossover with @Shiva's answer, but I don't see the need for a Conditional Split or Union All if you are going to a staging table.
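
Something like this for the Script Task (just a sketch; the variable names User::CurrentFile and User::HasChecksum are placeholders and must be listed in the task's ReadOnlyVariables / ReadWriteVariables):

    // Peek at the first line of the file the Foreach Loop handed us and flag
    // whether it is a checksum file (add "using System.IO;" at the top).
    public void Main()
    {
        string file = (string)Dts.Variables["User::CurrentFile"].Value;
        string firstLine;
        using (var reader = new StreamReader(file))
        {
            firstLine = reader.ReadLine() ?? string.Empty;
        }

        Dts.Variables["User::HasChecksum"].Value = firstLine.StartsWith("HEADER");
        Dts.TaskResult = (int)ScriptResults.Success;
    }

The two precedence constraint expressions would then be @[User::HasChecksum] and !@[User::HasChecksum].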


On a different note, if you really want no redundancy of Data Flow Tasks etc., there is no need to run a PowerShell script before the SSIS package. You can create a C# script to do it right in the package. For example, I have a case where line endings are not consistent in a data source that I must import, so I have a C# script that normalizes the line endings and saves the file prior to my Data Flow Task. You could normalize your files to one structure on the fly in the SSIS package. It will take additional system resources, as you point out, but it happens at a time when the file is already being loaded into memory and processed.

@xpil, 15-ish different types wasn't in the original question. I would probably do 1 of 2 things then. First, implement your idea #2, but do it inside an SSIS script: strip out the unwanted lines via a System.IO.File operation, then build out all of the different types, and also set a variable in the script to tell you which type it is, or fall through different DFTs. Or I would just script the entire operation; I might or might not use SSIS, but I would use System.IO.File to load the file, detect the existence and type of the checksum, and then simply use SqlBulkCopy to put it into the table right then rather than creating DFTs. That would be less time-consuming to set up. Though if the files are large (a few hundred MB) you might still want to go the DFT route.

Here are a few snippets from an SSIS Script Task I wrote that could help; of course you will need to alter them for your purposes.

If going the fix-the-file-and-DFT route:

    // Requires System.IO and System.Text.RegularExpressions
    string crlf = "\r\n";
    string contents = File.ReadAllText(_WFRFullPath);
    contents = Regex.Replace(contents, @"\r\n?|\n", crlf);   // normalize every line ending to CRLF
    File.WriteAllText(_SSISWorkspaceFullPath, contents);

Read the file contents, fix them via regex, and write the file back to a new location.
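
The same read/fix/write pattern could also strip the checksum lines up front; a sketch (the "HEADER" test comes from your question, the rest is illustrative):

    // If the first line starts with "HEADER", drop the two checksum lines so that
    // every rewritten file has column headers on line 1 and data from line 2.
    string[] lines = File.ReadAllLines(_WFRFullPath);
    if (lines.Length > 0 && lines[0].StartsWith("HEADER"))
    {
        lines = lines.Skip(2).ToArray();   // Skip needs System.Linq
    }
    File.WriteAllLines(_SSISWorkspaceFullPath, lines);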

If going the load-via-script route, you will just need to read the file into a DataTable, test the format (probably by column names or data types), then load it.
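
The read step could be as simple as the sketch below (I am assuming a tab delimiter and reusing the _jobRecDT / _SSISWorkspaceFullPath names from the other snippets; needs System.Data and System.IO):

    // Build the DataTable from the file: column names from the first line,
    // everything after that as data rows.
    var _jobRecDT = new DataTable();
    string[] fileLines = File.ReadAllLines(_SSISWorkspaceFullPath);

    foreach (string columnName in fileLines[0].Split('\t'))
    {
        _jobRecDT.Columns.Add(columnName);
    }

    for (int i = 1; i < fileLines.Length; i++)
    {
        _jobRecDT.Rows.Add(fileLines[i].Split('\t'));
    }

The SqlBulkCopy part below then pushes that DataTable into the staging table.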

    // Bulk-load the DataTable into the staging table.
    using (SqlConnection sqlConnection = new SqlConnection(sqlConnectionString))
    {
        sqlConnection.Open();

        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlConnection))
        {
            bulkCopy.DestinationTableName = _stagingTableName;

            // Map each DataTable column to the staging column of the same name.
            foreach (DataColumn col in _jobRecDT.Columns)
            {
                bulkCopy.ColumnMappings.Add(col.ColumnName, col.ColumnName);
            }

            bulkCopy.WriteToServer(_jobRecDT);
        }
    }

Note: I don't have the code for it on hand, but if you have large files you can implement a StreamReader, chunk the file, and bulk-copy it up in batches.
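
Roughly like this (a sketch only; the 50,000-row chunk size and the tab delimiter are arbitrary choices):

    // Stream the file and push it to the staging table in fixed-size chunks
    // so the whole file never sits in memory at once.
    const int chunkSize = 50000;

    using (var reader = new StreamReader(_SSISWorkspaceFullPath))
    using (var sqlConnection = new SqlConnection(sqlConnectionString))
    {
        sqlConnection.Open();

        // Build the DataTable schema from the header line.
        var chunk = new DataTable();
        foreach (string columnName in reader.ReadLine().Split('\t'))
        {
            chunk.Columns.Add(columnName);
        }

        using (var bulkCopy = new SqlBulkCopy(sqlConnection))
        {
            bulkCopy.DestinationTableName = _stagingTableName;

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Rows.Add(line.Split('\t'));
                if (chunk.Rows.Count >= chunkSize)
                {
                    bulkCopy.WriteToServer(chunk);   // flush a full chunk
                    chunk.Clear();
                }
            }

            if (chunk.Rows.Count > 0)
            {
                bulkCopy.WriteToServer(chunk);       // flush the remainder
            }
        }
    }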

Matt
  • Matt, you would be absolutely correct if it was just about one file type. In fact in my scenario I have over 12,000 files of 15 (ish) different types, and each type exists in two variants: with and without checksum headers. So I do have a valid reason to remove redundancy as much as possible. – Piotr L Jun 15 '16 at 09:30