Flat file normalization with a dynamic number of columns

Question

I have a flat file with an unfortunately dynamic column structure. There is a value that is in a hierarchy of values, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...

The same feed the next day may resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...

The thing is, I don't care much about all the tiers; I only care about the id of the last (bottom) tier, and all the other row data that is not a part of the tier columns. I need normalize the feed to something resembling this to inject into a relational database:

StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...

What would be an efficient, easy-to-read mechanism for determining the last tier object id, and organizing the data as described? Every attempt I've made feels kludgy to me.

Some things I've done:

I have tried to examine the column names for regular expression patterns, identify the columns that are tiered, order them by name descending, and select the first record... but I lose the ordinal column number this way, so that didn't look good.
I have placed the columns I want into an IDictionary<string, int> object to reference, but again reliably collecting the ordinal of the dynamic columns is an issue, and it seems this would be rather non-performant.

How about putting up those examples of other attempts so that others who try to answer don't duplicate those efforts? — Mike Perrenoud, Mar 13 '13 at 16:20
@AbeMiessler hmm, looks like it is that, too. I'm used to flat files having fixed column lengths. We use flat file as to distinguish fixed column format from other formats (including csv). Didn' know its used for csv, too. So my comment might be wrong... — kratenko, Mar 13 '13 at 16:29
CSV or any delimited text files are flat. Non-flat would be something like XML or JSON which has structure and can accurately represent hierarchy. — Abe Miessler, Mar 13 '13 at 16:46

score 1 · Accepted Answer · answered Mar 14 '13 at 07:30

I ran into a simular problem a few years ago. I used a Dictionary to map the columns, it was not pretty, but it worked.

First make a Dictionary:

private Dictionary<int, int> GetColumnDictionary(string headerLine)
    {
        Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
        List<string> columnNames = headerLine.Split('|').ToList();

        string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
        for (int index = 0; index < columnNames.Count; index++)
        {
            if (columnNames[index] == "StatisticID")
            {
                columnDictionary.Add(0, index);
            }

            if (columnNames[index] == "FileId")
            {
                columnDictionary.Add(1, index);
            }

            if (columnNames[index] == maxTierObjectColumnName)
            {
                columnDictionary.Add(2, index);
            }

            if (columnNames[index] == "Status")
            {
                columnDictionary.Add(3, index);
            }
        }

        return columnDictionary;
    }

    private string GetMaxTierObjectColumnName(List<string> columnNames)
    {
        // Edit this function if Tier ObjectId is greater then 9
        var maxTierObjectColumnName = columnNames.Where(c => c.Contains("Tier") && c.Contains("Object")).OrderBy(c => c).Last();

        return maxTierObjectColumnName;
    }

And after that it's simply running thru the file:

private List<DataObject> ParseFile(string fileName)
    {
        StreamReader streamReader = new StreamReader(fileName);

        string headerLine = streamReader.ReadLine();
        Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);

        string line;
        List<DataObject> dataObjects = new List<DataObject>();
        while ((line = streamReader.ReadLine()) != null)
        {
            var lineValues = line.Split('|');

            string statId = lineValues[columnDictionary[0]];
            dataObjects.Add(
                new DataObject()
                {
                    StatisticId = lineValues[columnDictionary[0]],
                    FileId = lineValues[columnDictionary[1]],
                    ObjectId = lineValues[columnDictionary[2]],
                    Status = lineValues[columnDictionary[3]]
                }
            );
        }

        return dataObjects;
    }

I hope this helps (even a little bit).

This is very similar to what I ended up doing. I was a little frustrated that there was not a more elegant and efficient solution, but it did work. — Jeremy Holovacs, Mar 14 '13 at 11:50

score 0 · Answer 2 · answered Mar 13 '13 at 16:21

0

Personally I would not try to reformat your file. I think the easiest approach would be to parse each row from the front and the back. For example:

itemArray = getMyItems();
statisticId = itemArray[0];
fileId = itemArray[1];
//and so on for the rest of your pre-tier columns

//Then get the second to last column which will be the last tier
lastTierId = itemArray[itemArray.length -1];

Since you know the last tier will always be second from the end you can just start at the end and work your way forwards. This seems like it would be much easier than trying to reformat the datafile.

If you really want to create a new file, you could use this approach to get the data you want to write out.

answered Mar 13 '13 at 16:21

Abe Miessler

82,532
99
305
486

Well I simplified the example, there are actually multiple cases of variable length tier columns in the feed, so I don't think that will help me in this case, but it's an approach I had not considered. – Jeremy Holovacs Mar 13 '13 at 16:32
So the last tier column will not always be the second from last column? – Abe Miessler Mar 13 '13 at 16:33
Well, positionally, yes, but there are multiple tier-based structures in the feed, each with a potentially dynamic column count, and they are not next to each other, so I don't think this method would work for collecting columnar information across the feed. – Jeremy Holovacs Mar 13 '13 at 16:40
Sounds like you will have to do some crazy parsing of the header columns then to determine where the data you want resides. See if you can get them to send you XML or JSON instead. If it's anything like my experience with trying to get data files from people, it will probably be a lost cause but it's worth a shot. – Abe Miessler Mar 13 '13 at 16:49

score 0 · Answer 3 · answered Mar 13 '13 at 16:34

0

I don't know C# syntax, but something along these lines:

split line in parts with | as separator
get parts [0], [1], [length - 2] and [length - 1]
pass the parts to the database handling code

answered Mar 13 '13 at 16:34

Kwebble

2,065
14
23

Flat file normalization with a dynamic number of columns

3 Answers3