3

I am working in a data warehouse project with a lot of sources creating flat files as sources and we are using SSIS to load these into our staging tables, we are currently using the Flat File Source component.

However, after a while, we need an extra column in one of the files and from a date the file specification change to add that extra column. This exercise happens quite frequently and over time accumulate quite a lot versions.

According to answers I can find here and on the rest of the internet the agreed method to handle this scenario seems to be to set up a new flat file source in a new separate data flow for this version, to keep re-runablility for ETL process for old files. Method is outlined here for example: SSIS pkg with flat-file connection with fewer columns will fail

In our specific setup, the additional columns are always additional columns (never remove old columns) and also, for logical reasons the new columns can not be mandantory if we keep re-runability for the older files in their separate data flows.

I don´t think the method of creating a duplicate data flow handling the same set of columns over and over again is a good answer for a data warehouse project as ours and I would prefeer a source component that takes the last file version and have the ability to mark columns as "not mandadory" and deliver nulls if they are missing.

Is anybody aware of a SSIS Flat File component that is more flexible in handle old file versions or have a better solution for this problem? I assume that such a component would need to approach the files on a named column basis rather than the existing left-to-right approach?

Any thoughts or suggestions are welcome!

Community
  • 1
  • 1
MrSallad
  • 31
  • 2
  • 1
    Flat file doesnt have versions and the only way you can get the new versions (without reusability) is to compare the metadata of the new and old flat file. This is easier said than done. – rvphx Nov 15 '12 at 15:20

1 Answers1

1

The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.

You can arrange you flat file connection to return lines rather than individual columns, by only specifying the row delimiter. Connect this to a flat file source component which will output a single column per row. We now have a single row that represents one of the many file types that you are aware of – the next step is to determine which file type you have.

Consume the output from a flat file type with a script component. Pass in a single column and pass out the superset of all possible columns. We have lost the meta data normally gleamed from a file source, so you will need to build up the column name / type / size within the script component output types.

Within the script component, pass the line and break it into its component columns. You will have to perform a pattern match (maybe using RegularExpression.Regex.Match) to identify when a new column starts. Hopefully the file is well formed which will aid you - beware of quotes and commas within text columns.

You can now access the file type by determining the number of columns you have and default the missing columns. Set the rows’ output columns to pass out the constituent parts. You may want to attach a new column to record the file type with your output.

The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.

I would not recommend that you perform the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns / types etc, however it will provide you with a single data flow to handle each file version and can be extended as new columns are passed.

Daryl Wenman-Bateson
  • 3,870
  • 1
  • 20
  • 37