1

I have implemented Cuong's solution here: C# Processing Fixed Width Files

Here is my code:

        var lines = File.ReadAllLines(@fileFull);
        var widthList = lines.First().GroupBy(c => c)
        .Select(g => g.Count())
        .ToList();

        var list = new List<KeyValuePair<int, int>>();

        int startIndex = 0;

        for (int i = 0; i < widthList.Count(); i++)
        {
            var pair = new KeyValuePair<int, int>(startIndex, widthList[i]);
            list.Add(pair);

            startIndex += widthList[i];
        }

        var csvLines = lines.Select(line => string.Join(",",
        list.Select(pair => line.Substring(pair.Key, pair.Value))));

        File.WriteAllLines(filePath + "\\" + fileName + ".csv", csvLines);

@fileFull = File Path & Name

The issue I have is the first line of the input file also contains digits. So it could be AAAAAABBC111111111DD2EEEEEE etc. For some reason the output from Cuong's code gives me CSV headings like 1111RRRR and 222223333.

Does anyone know why this is and how I would fix it?


Header row example:

AAAAAAAAAAAAAAAABBBBBBBBBBCCCCCCCCDEFCCCCCCCCCGGGGGGGGHHHHHHHHIJJJJJJJJKKKKLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOPPPPQQQQ1111RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR222222222333333333444444444555555555666666666777777777888888888999999999S00001111TTTTTTTTTTTTUVWXYZ!"£$$$$$$%&  

Converted header row:

AAAAAAAAAAAAAAAA    BBBBBBBBBB  CCCCCCCCDEFCCCCCC   C   C   C   GGGGGGGG    HHHHHHHH    I   JJJJJJJJ    KKKK    LLLL    MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO  PPPP    QQQQ    1111RRRR    RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR2222    222223333   333334444   444445555   555556666   666667777   777778888   888889999   99999S000   0   1111    TTTTTTTTTTTT    U   V   W   X   Y   Z   !   ",�,$$$$$$,%,&,"  

Jodrell - I implemented your suggestion but the header output is like:

BBBBBBBBBBCCCCCC    CCCCCCCCD   DEFCCCC             GGGGGGGG    HHHHHHH IJJJJJJ     KKKKLLL LLL MMM NNNNNNNNNNNNNNNNNNNNNNNNNNNNN   OOOOOOOOOOOOOOOOOOOOOOOOOOOOO   PPPPQQQQ1111RRRRRRRRRRRRRRRRR   QQQ 111 RRR 33333333    44444444    55555555    66666666    77777777    88888888    99999999    S0000111        111 TTT UVWXYZ!"�$$                                       %&
Community
  • 1
  • 1
hshah
  • 842
  • 4
  • 14
  • 35
  • The code only cares that the heading `char`s are different, there is no special treatment for digits. – Jodrell Oct 08 '12 at 08:57
  • can you provide a two line sample that goes wrong. i.e. header and first row? – Jodrell Oct 08 '12 at 09:00
  • perhaps if you provide the sample, I can't see why it wouldn't work at the moment. – Jodrell Oct 08 '12 at 09:02
  • `1` is repeated, so there is your first problem, the header is invalid. That is probably causing your problem. – Jodrell Oct 08 '12 at 09:10
  • @Jodrell - I added the header rows to the question. A lot of the input files have this at the end of the header row !"£$$$$$$%& and that is part of the data. – hshah Oct 08 '12 at 09:14
  • You can trim that from the header before building the list of column widths but, its a problem you haven't got to yet. – Jodrell Oct 08 '12 at 09:20
  • `"` and `,` are problem chars in your source file because they have a special meaning in CSV files, they need to be escaped. – Jodrell Oct 09 '12 at 09:21
  • I wrote a solution that handles `"` and `,`, its at the top of my answer. – Jodrell Oct 09 '12 at 11:08

2 Answers2

2

As Jodrell already mentioned, your code doesn't work because it assumed that the character representing each column header is distinct. Change the code that parse the header widths would fix it.

Replace:

var widthList = lines.First().GroupBy(c => c)
.Select(g => g.Count())
.ToList();

With:

var widthList = new List<int>(); 
var header = lines.First().ToArray(); 
for (int i = 0; i < header.Length; i++) 
{ 
    if (i == 0 || header[i] != header[i-1]) 
        widthList.Add(0); 
    widthList[widthList.Count-1]++; 
}

Parsed header columns:

AAAAAAAAAAAAAAAA    BBBBBBBBBB  CCCCCCCC    D   E   F   CCCCCCCCC   GGGGGGGG    HHHHHHHH    I   JJJJJJJJ    KKKK    LLLL    MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO  PPPP    QQQQ    1111    RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR    222222222   333333333   444444444   555555555   666666666   777777777   888888888   999999999   S   0000    1111    TTTTTTTTTTTT    U   V   W   X   Y   Z   !   "   £   $$$$$$  %   &
Fung
  • 3,508
  • 2
  • 26
  • 33
  • The only issue with this is that CSV doesn't like the "£$$$$$$%& at the end and shows them all in one cell (for the header) when opened in Excel. – hshah Oct 09 '12 at 06:44
  • Because there is a double quote in one of the column, and Excel treat it specially (ignoring the commas between a pair of "). You may want to: 1) Prepare the file in another way so that " would not be used as a column name, or 2) Use the Text Import Wizard in Excel, using comma as delimiter and {none} for Text qualifier to open the file – Fung Oct 09 '12 at 07:41
  • @hshah, why build the list once to get widths, then iterate that list to build the key value pairs? My code builds the pairs on the first iteration, and saves the second. – Jodrell Oct 09 '12 at 07:46
  • @Fung - I can't change the input files as they come from the mainframe at work. – hshah Oct 09 '12 at 07:54
  • @Jodrell - Both solutions worked for me, but I ended up choosing Fung's one because it was shorter. – hshah Oct 09 '12 at 07:55
  • @hshah, shorter code doesn't necessarily mean shorter execution. – Jodrell Oct 09 '12 at 09:24
1

EDIT

Because the problem annoyed me I wrote some code that handles " and ,. This code replaces the header row with comma delimited alternating zeros and ones. Any commas or double quotes in the body are appropriately escaped.

static void FixedToCsv(string sourceFile)
{
    if (sourceFile == null)
    {
        // Throw exception
    }

    var dir = Path.GetDirectory(sourceFile)
    var destFile = string.Format(
        "{0}{1}",
        Path.GetFileNameWithoutExtension(sourceFile),
        ".csv");

    if (dir != null)
    {
        destFile = Path.Combine(dir, destFile);
    }

    if (File.Exists(destFile))
    {
        // Throw Exception
    }

    var blocks = new List<KeyValuePair<int, int>>();
    using (var output = File.OpenWrite(destFile))
    {
        using (var input = File.OpenText(sourceFile))
        {
            var outputLine = new StringBuilder();

            // Make header
            var header = input.ReadLine();

            if (header == null)
            {
                return;
            }

            var even = false;
            var lastc = header.First();
            var counter = 0;
            var blockCounter = 0;
            foreach(var c in header)
            {
                counter++;
                if (c == lastc)
                {
                    blockCounter++;
                }
                else
                {
                    blocks.Add(new KeyValuePair<int, int>(
                        counter - blockCounter - 1,
                        blockCounter));
                    blockCounter = 1;
                    outputLine.Append(',');
                    even = !even;
                }

                outputLine.Append(even ? '1' : '0');

                lastc = c;
            }

            blocks.Add(new KeyValuePair<int, int>(
                counter - blockCounter,
                blockCounter));

            outputLine.AppendLine();
            var lineBytes = Encoding.UTF.GetBytes(outputLine.ToString());
            outputLine.Clear();
            output.Write(lineBytes, 0, lineBytes.Length);

            // Process Body
            var inputLine = input.ReadLine();
            while (inputLine != null)
            {
                foreach(var block in block.Select(b =>
                    inputLine.Substring(b.Key, b.Value)))
                {
                    var sanitisedBlock = block;
                    if (block.Contains(',') || block.Contains('"'))
                    {
                        santitisedBlock = string.Format(
                            "\"{0}\"",
                            block.Replace("\"", "\"\""));
                    }

                   outputLine.Append(sanitisedBlock);
                   outputLine.Append(',');
                }

                outputLine.Remove(outputLine.Length - 1, 1);
                outputLine.AppendLine();
                lineBytes = Encoding.UTF8.GetBytes(outputLne.ToString());
                outputLine.Clear();
                output.Write(lineBytes, 0, lineBytes.Length);

                inputLine = input.ReadLine();
            }
        }
    }
}

1 is repeated in your header row, so your two fours get counted as one eight and everything goes wrong from there.

(There is a block of four 1s after the Qs and another block of four 1s after the 0s)

Essentialy, your header row is invalid or, at least, doesen't work with the proposed solution.


Okay, you could do somthing like this.

public void FixedToCsv(string fullFile)
{
    var lines = File.ReadAllLines(fullFile);
    var firstLine = lines.First();

    var widths = new List<KeyValuePair<int, int>>();

    var innerCounter = 0;
    var outerCounter = 0
    var firstLineChars = firstLine.ToCharArray();
    var lastChar = firstLineChars[0]; 
    foreach(var c in firstLineChars)
    {
        if (c == lastChar)
        {
            innerCounter++;
        }
        else
        {
            widths.Add(new KeyValuePair<int, int>(
                outerCounter
                innerCounter);
            innerCounter = 0;
            lastChar = c;
        }
        outerCounter++;
    }

    var csvLines = lines.Select(line => string.Join(",",
        widths.Select(pair => line.Substring(pair.Key, pair.Value))));

    // Get filePath and fileName from fullFile here.
    File.WriteAllLines(filePath + "\\" + fileName + ".csv", csvLines);
}
Jodrell
  • 34,946
  • 5
  • 87
  • 124
  • Does it matter if the numbers repeat? I just want to use the header row to determine the field sizes... it shouldn't actually matter what the characters are. I would say the code is broken, as it was written to handle the input files as they are. – hshah Oct 08 '12 at 09:24
  • @Jordell - Hmm. They might not be unique, although I think every effort has been made to make them unique. The characters at the end (!"£$$$$$$%&) are also part of the header and when they change it signifies a new column. Do you know how I can modify the code to make it work with the information that has come to light? – hshah Oct 08 '12 at 09:30
  • perhaps that tail is ment to express somthing about type. – Jodrell Oct 08 '12 at 10:13
  • I asked around and found out that because every effort was made to make the header characters unique, they are just a continuation of that. – hshah Oct 08 '12 at 10:53
  • This might be a silly question, but what does the hadTail do? Ok, realised what it does... does this apply if the "tail" is meant to be there? – hshah Oct 08 '12 at 10:54
  • @hshah tells you that the header had the passed tail string on the end. – Jodrell Oct 08 '12 at 10:55
  • @hshah, you define what the tail might be, I'm just giving you an idea for how you might deal with the tail, I can't really know the correct way without knowing its true meaning. – Jodrell Oct 08 '12 at 10:59
  • The tail is part of the header and the same as AAABBBB111CCC etc. Since whoever set all these up ran out of letters and numbers, they started using special characters instead, and therefore should be treated the same as the numbers and letters. – hshah Oct 08 '12 at 12:11
  • well, then don't pass the tail, I'll just remove it – Jodrell Oct 08 '12 at 12:36
  • Please see my question. The header output is not as we expected :( – hshah Oct 08 '12 at 12:48
  • I've answered your first question, "why dosen't it work" and I've proposed a solution to that problem. You'll have to specify how the tail should be interpreted and how you are struggling to handle it, perhaps this should be a seperate question. – Jodrell Oct 08 '12 at 12:56
  • @hshah, a sperate question that I answered anyway. – Jodrell Oct 11 '12 at 10:43