
Today I found out how this problem occurs when reading a text file line by line with C# ReadLine().

Problem:

Assume the text file has 3 lines, each 400 characters long (counted manually). When I read the lines with C# ReadLine() and print each length with Console.WriteLine(str.Length), I get:

Line 1 => 400
Line 2 => 362
Line 3 => 38
Line 4 => 400

I was confused: the text file has only 3 lines, so why does it print 4, and with different lengths? I quickly checked for "\n", "\r", or the combination "\r\n" but did not find any. What I did find was a pair of double quotes (e.g. "abcd") in the second line.

Then I changed my code to print the lines themselves and, to my amazement, got console output like this:

Line 1 > blahblahblabablabhlabhlabhlbhaabahbbhabhblablahblhablhablahb
Line 2 > blablabbablablababalbalbablabal"blabablhabh
Line 3 > "albhalbahblablab
Line 4 > blahblahblabablabhlabhlabhlbhaabahbbhabhblablahblhablhablahb

I then tried removing the double quotes with the Replace function, but I got the same 4 lines, just without the double quotes.

Please let me know of any solution other than manual editing to overcome this. Here is my simple code:

static void Main(string[] args)
{

    FileStream fin;
    string s;
    string fileIn = @"D:\Testing\CursedtextFile\testfile.txt";

    try
    {
        fin = new FileStream(fileIn, FileMode.Open); 
    }
    catch (FileNotFoundException exc)
    {
        Console.WriteLine(exc.Message + " Cannot open file.");
        return;
    }

    StreamReader fstr_in = new StreamReader(fin, Encoding.Default, true);

    int cnt = 0;

    while ((s = fstr_in.ReadLine()) != null)  
    {
        s = s.Replace("\""," ");
        cnt = cnt + 1;
        //Console.WriteLine("Line "+cnt+" => "+s.Length);
        Console.WriteLine("Line " + cnt + " => " + s);
    }

    Console.ReadLine();
    fstr_in.Close();
    fin.Close();
}

Note: I am trying to read and upload 37 text files of 500 MB each from the finance domain. I always face this issue and have to make the changes manually. :(

admdrew
wizavi
  • Are you sure the choice of encoding is correct? – Jon Dec 12 '13 at 21:12
  • I suggest you code this line differently: while ((s = fstr_in.ReadLine()) != null). Try something like: using (StreamReader fstr_in = new StreamReader(fileIn)) { line = fstr_in.ReadLine(); } – NoChance Dec 12 '13 at 21:20
  • The quotes are a red herring. There simply is a linefeed/newline character (or something being interpreted as one) at the position where line 2 incorrectly breaks over to line 3. @Jon is on to something with the encoding. If you've picked the wrong encoding, something that should have been read together with the previous or next byte to form a character is now being read as a character on its own, and happens to be something that breaks the line. – Lasse V. Karlsen Dec 12 '13 at 21:21
  • I agree with Lasse. Using the code the OP posted, I get 3 lines no problem. I used a text editor to save the 3 lines of `blabla"bla"bl`, and your code returns 3 lines, with each having 60 characters. – Kayla Dec 12 '13 at 21:24
  • This post shows the max you can read from console by default: http://stackoverflow.com/questions/5557889/console-readline-max-length – NoChance Dec 12 '13 at 21:47
  • @LasseV.Karlsen Yes, you are right. There is an encoding problem: a newline character (char = 10) is hidden between the double quotes. What is more treacherous is that you cannot see it in Notepad, which shows a single line. But when the same line is copied and pasted into another text editor such as EditPlus, it breaks up automatically and becomes visible. The text file was spooled from SAP; it looks perfectly OK in Notepad but breaks up when read with C# ReadLine(). Is manual editing the only option? – wizavi Dec 23 '13 at 20:17
  • Well, the *best option* is in fact to fix the export. Make the author of that SAP routine write out the data in the appropriate format to begin with. If you can't do that, then you *could* probably scan the file for lone newline characters and replace those, if you expect a "line break" to be both carriage return and newline. – Lasse V. Karlsen Dec 23 '13 at 20:20
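
To check whether a file really contains stray line-break characters like the one discussed in these comments, a byte-level scan can help. The following is only a minimal sketch, not code from the thread: the class and method names are hypothetical, and it assumes a single-byte encoding or UTF-8, where bytes 10 and 13 never occur inside another character (it would give false hits on UTF-16).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LoneLineBreakFinder
{
    // Returns the byte offset of every LF (10) that is not preceded by CR (13)
    // and of every CR that is not followed by LF.
    // Assumption: single-byte encoding or UTF-8.
    public static List<long> Find(Stream stream)
    {
        var offsets = new List<long>();
        long position = 0;   // offset of the byte currently being examined
        int previous = -1;   // previous byte, or -1 at the start of the stream
        int current;

        while ((current = stream.ReadByte()) != -1)
        {
            if (previous == '\r' && current != '\n')
                offsets.Add(position - 1);   // lone CR
            if (current == '\n' && previous != '\r')
                offsets.Add(position);       // lone LF
            previous = current;
            position++;
        }

        if (previous == '\r')
            offsets.Add(position - 1);       // CR as the very last byte

        return offsets;
    }
}
```

Opening the file with File.OpenRead(fileIn) and passing the stream to Find would list the offsets to inspect in a hex editor. An empty list would suggest that the extra line break comes from decoding with the wrong encoding rather than from a stray byte.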

2 Answers

If the problem is that:

  • Proper line breaks should be a carriage return (13) followed by a newline (10)
  • Lone newlines and/or carriage returns are incorrectly being interpreted as line breaks

Then you can fix this in code, but the best and most correct fix is to go to the source and fix the program that writes out this incorrectly formatted file in the first place.

However, here's a LINQPad program that replaces lone newlines or carriage returns with spaces:

void Main()
{
    string input = "this\ris\non\ra\nsingle\rline\r\nThis is on the next line";
    string output = ReplaceLoneLineBreaks(input);

    output.Dump();
}

public static string ReplaceLoneLineBreaks(string input)
{
    if (string.IsNullOrEmpty(input))
        return input;

    var result = new StringBuilder();

    int index = 0;
    while (index < input.Length)
    {
        switch (input[index])
        {
            case '\n':
                // A '\n' reached here was not preceded by '\r' (the '\r' case
                // below consumes the '\n' of a "\r\n" pair), so it is a lone
                // newline and is replaced.
                result.Append(' ');
                index++;
                break;

            case '\r':
                if (index == input.Length - 1 || input[index+1] != '\n')
                {
                    result.Append(' ');
                    index++;
                }
                else
                {
                    result.Append(input[index]);
                    result.Append(input[index + 1]);
                    index += 2;
                }
                break;

            default:
                result.Append(input[index]);
                index++;
                break;
        }
    }
    return result.ToString();
}
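
For files as large as the 500 MB ones mentioned in the question, loading everything into one string may be impractical. Below is a minimal streaming sketch of the same idea. It is an illustration rather than part of the original answer: the class name LoneLineBreakFilter and its Copy method are hypothetical, and it assumes that only a carriage return immediately followed by a newline is a real line break.

```csharp
using System.IO;

public static class LoneLineBreakFilter
{
    // Copies characters from reader to writer, replacing every CR or LF that
    // is not part of a CR LF pair with a single space.
    public static void Copy(TextReader reader, TextWriter writer)
    {
        bool pendingCr = false;   // true when the previous character was a CR not yet written
        int current;

        while ((current = reader.Read()) != -1)
        {
            char c = (char)current;

            if (pendingCr)
            {
                pendingCr = false;
                if (c == '\n')
                {
                    writer.Write("\r\n");   // genuine CR LF pair, keep it
                    continue;
                }
                writer.Write(' ');          // the previous CR was alone
            }

            if (c == '\r')
                pendingCr = true;           // decide once the next character is known
            else if (c == '\n')
                writer.Write(' ');          // lone LF
            else
                writer.Write(c);
        }

        if (pendingCr)
            writer.Write(' ');              // CR as the very last character
    }
}
```

A possible use is to wrap a StreamReader over the input file and a StreamWriter over a new output file in using blocks, both created with the encoding the file was actually written in, and then read the cleaned file with ReadLine() as before.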
Lasse V. Karlsen

If the lines are all of the same length, split the lines by their length instead of watching for end of lines.

const int EndOfLine = 2; // CR LF  or = 1 if only LF.
const int LineLength = 400;

string text = File.ReadAllText(path);
for (int i = 0; i < text.Length - EndOfLine; i += LineLength + EndOfLine) {
    string line = text.Substring(i, Math.Min(LineLength, text.Length - i - EndOfLine));
    // TODO Process line
}

If the last line is not terminated by end-of-line characters, remove both `- EndOfLine` terms. The Math.Min part is only a safety measure; it might not be necessary if no line is shorter than 400.
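
Since File.ReadAllText holds the whole file in memory, a block-wise variant might suit very large files better. This is a rough sketch under the same assumption as above (every record has exactly the same length and terminator size). The class FixedLengthRecords and its Read method are hypothetical names; TextReader.ReadBlock is the standard method used to fill the buffer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FixedLengthRecords
{
    // Yields records of lineLength characters, skipping endOfLineLength
    // terminator characters after each one. Line breaks embedded inside a
    // record are returned as ordinary characters.
    public static IEnumerable<string> Read(TextReader reader, int lineLength, int endOfLineLength)
    {
        var buffer = new char[lineLength + endOfLineLength];
        int read;

        // ReadBlock keeps reading until the buffer is full or the input ends.
        while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            // Math.Min covers a final record that has no terminator.
            yield return new string(buffer, 0, Math.Min(lineLength, read));
        }
    }
}
```

With the values from the answer this would be called as FixedLengthRecords.Read(reader, 400, 2). Note that an extra embedded character shifts every later record by one position, so this only works if the stated lengths are exact.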

Olivier Jacot-Descombes