5

I am trying to parse a text report that is formatted into columns. Each column appears to be right justified with a fixed length. For each line, there are times that not all the columns are used. In that case it appears that spaces are used to justify each column in the line. Example input:

031   91    1221,154
043   66     312,222    1      3,047                       3,047    1.5%    .9%
040  118     529,626    1      1,842                       1,842     .8%    .3%
037   45     427,710
019   80     512,153    1     14,685                      14,685    1.2%   2.8%
009   68     520,301                      1    16,085     16,085    1.4%   3.0%
030   13     106,689                      1     1,581      1,581    7.6%   1.4%
008   54     377,593    1      7,098                       7,098    1.8%   1.8%
018   24     171,264
022   25       8,884    1        433                         433    4.0%   4.8%
035    9      42,043
041   13     112,355

The column widths appear to be as follows (in character counts including white spaces): 3,5,12,6,10,7,10,11,8,7.

What is a good way to parse this? I have tried using a regular expression to do it, but it obviously fails on the first line being read in because I am using an expression that expects the whole line to have data:

string pattern = @"^(?.{3})(?.{5})(?.{12})(?thirtyeightyninenumber>.{6})(?{10})(?.{7}(?.{10})(?.{11})(?.{8})(?.{7})";

Looking for a good way to read this into appropriate variables depending on whether that column has data or not. I feel like I need to throw a bunch of if checks in, but am hoping there is a better way I am not thinking of.

Thanks for any help.

BTW - I am reading the lines using a StreamReader and ReadLine.

Shawn
  • 85
  • 2
  • 6

3 Answers3

8

There is a TextFieldParser available that is specifically meant for reading fixed-width/delimited text files like this.

It's in the Microsoft.VisualBasic.FileIO namespace but you should can still call it from C#.

Add a reference to Microsoft.VisualBasic, a using Microsoft.VisualBasic.FileIO;, then the code looks like this:

TextFieldParser parser = new TextFieldParser(stream);
parser.TextFieldType = FieldType.FixedWidth;
parser.SetFieldWidths(3, 5, 12, 6, 10, 7, 10, 11, 8, 7);
while (!parser.EndOfData)
{
    //Processing row
    string[] fields = parser.ReadFields();

    // Treat each field appropriately e.g. int.TryParse,
    // remove the "%" then float.TryParse etc.
}
parser.Close();

Edit: That said, looking in Reflector, I think this fails if your shortened lines don't have a full width worth of spaces. I'm not sure how to suggest you fix this; you could pre-process your stream to insert any missing spaces per line?

Rawling
  • 49,248
  • 7
  • 89
  • 127
  • I should mention that I tested this and this method is VERY slow on large files. File.ReadLines() worked faster and then I used Substring to get the values. – anitag95 Mar 30 '21 at 06:28
6

Don't use regular expressions for this. You know the number of columns and the widths of those columns, so just use String.Substring and String.Trim:

string field1 = line.Substring(0, 5).Trim();
string field2 = line.Substring(5, 3).Trim();
string field3 = line.Substring(12, 8).Trim();
/* etc, etc */
Sean Bright
  • 118,630
  • 17
  • 138
  • 146
  • 1
    As with my answer, this will fail on lines where the right-most column isn't populated, as the indexes will be out of range. However, your answer will be much easier to adapt to overcome this, I expect : ) – Rawling Sep 20 '12 at 15:04
  • I don't know who down voted because I was considering the Substring method also. Rawling gave me something new to consider (and learn), but with just adding padding to the input line, I suppose Substring would be the easiest solution (until I learn Rawlings method maybe). – Shawn Sep 20 '12 at 15:26
-1

Just check your regex (or whatever else) against myDataString + new string(' ', 1000);

Some_Yahoo
  • 509
  • 6
  • 9