
I need to take a text file with an unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example I am changing the encoding, but there are other uses for this kind of processing.

What I can't figure out is how to determine whether the last line ends with a newline. Some programs care about the difference between a file with these records:

Rec1<newline>
Rec2<newline>

And a file with these:

Rec1<newline>
Rec2

How can I tell the difference in my code so that I can take appropriate action?

using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
    bool isFirstLine = true;

    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();

        if (isFirstLine)
        {
            writer.Write(line);
            isFirstLine = false;
        }
        else
        {
            writer.Write("\r\n" + line);
        }
    }


    //if (LastLineHasNewline)
    //{
    //  writer.Write("\n");
    //}

    writer.Flush();
}

The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline. Remember, I have no a priori knowledge of the input file encoding.

David Rogers

2 Answers


"Remember, I have no a priori knowledge of the input file encoding."

That's the fundamental problem to solve.

If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.

I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
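To illustrate why the line ending cannot be recognized without knowing the encoding, here is a minimal sketch that prints the bytes produced for "\n" by a few of the built-in .NET encodings. The class and method names (`NewlineBytesDemo`, `NewlineHex`, `PrintAll`) are made up for this illustration and are not part of the question's code.

```csharp
using System;
using System.Text;

public static class NewlineBytesDemo
{
    // Returns the bytes that encode "\n" in the given encoding,
    // formatted as hyphen-separated hex pairs.
    public static string NewlineHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\n"));
    }

    // Prints the newline bytes for several common encodings.
    public static void PrintAll()
    {
        Console.WriteLine("UTF-8:     " + NewlineHex(Encoding.UTF8));             // 0A
        Console.WriteLine("UTF-16 LE: " + NewlineHex(Encoding.Unicode));          // 0A-00
        Console.WriteLine("UTF-16 BE: " + NewlineHex(Encoding.BigEndianUnicode)); // 00-0A
        Console.WriteLine("UTF-32 LE: " + NewlineHex(Encoding.UTF32));            // 0A-00-00-00
    }
}
```

The same character is one, two, or four bytes depending on the encoding, and the byte order differs as well, so a byte-level scan for a line ending only makes sense once the encoding is known.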

Jon Skeet
  • I solved the problem when I finally realized that the stream reader was heuristically determining the encoding and that determination was available to me. The stream reader takes the stream from an unknown encoding to a known encoding (Unicode). There I can act on a known encoding and then output to the original encoding or an encoding of my choosing. – David Rogers Jan 07 '14 at 20:55
  • @anyoneis: Well, StreamReader can choose between *some* encodings heuristically - UTF-8, UTF-16 (big endian) and UTF-16 (little endian) I believe. I don't think it will detect others though... – Jon Skeet Jan 07 '14 at 21:01
  • Thanks! These limitations are discussed here: http://weblog.west-wind.com/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader. I will incorporate his detection into my solution to handle the default encoding correctly and to avoid adding or deleting Byte Order Marks. – David Rogers Jan 07 '14 at 21:13
  • @anyoneis: Okay - just be aware that correct detection isn't even possible in all cases. You're operating with incomplete information. It may well be that it's good enough for the cases you'll be using though. – Jon Skeet Jan 07 '14 at 21:19
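
The detection discussed in these comments can be observed with a small experiment. The sketch below writes text to a temporary file in a chosen encoding, reopens it with a `StreamReader` that looks for a byte order mark, and reports the code page of `CurrentEncoding` after the first read. The names `EncodingDetectionDemo` and `DetectedCodePage` are hypothetical, and UTF-8 is assumed as the fallback when no byte order mark is found.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDetectionDemo
{
    // Writes the text in writeEncoding (including its byte order mark, if the
    // encoding emits one), then returns the code page the reader settled on.
    public static int DetectedCodePage(string text, Encoding writeEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, writeEncoding);

            // Third argument: detectEncodingFromByteOrderMarks.
            using (var reader = new StreamReader(path, Encoding.UTF8, true))
            {
                // CurrentEncoding is only updated once the reader has read
                // from the stream, so read before inspecting it.
                reader.ReadToEnd();
                return reader.CurrentEncoding.CodePage;
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

With a byte order mark present, `DetectedCodePage("Rec1\nRec2", Encoding.Unicode)` should report 1200 (UTF-16 little endian). For a file written without one, for example with `new UnicodeEncoding(false, false)`, the reader stays on the UTF-8 fallback (65001) even though the bytes are UTF-16, which is the incomplete-information case described above.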

As often happens, the moment you go to ask for help, the answer comes to the surface. The commented-out code becomes:

if (LastLineHasNewline(reader))
{
    writer.Write("\n");
}

And the function looks like this:

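private static bool LastLineHasNewline(StreamReader reader)
{
    // CurrentEncoding is only reliable once the reader has read from the
    // stream, so call this after the copy loop has finished.
    byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
    int newlineByteCount = newlineBytes.Length;

    // An unseekable stream, or one shorter than an encoded newline,
    // cannot be checked this way.
    Stream stream = reader.BaseStream;
    if (!stream.CanSeek || stream.Length < newlineByteCount)
        return false;

    // Compare the final bytes of the file with the encoded newline.
    stream.Seek(-newlineByteCount, SeekOrigin.End);

    byte[] inputBytes = new byte[newlineByteCount];
    int bytesRead = stream.Read(inputBytes, 0, newlineByteCount);
    if (bytesRead != newlineByteCount)
        return false;

    for (int i = 0; i < newlineByteCount; i++)
    {
        if (newlineBytes[i] != inputBytes[i])
            return false;
    }
    return true;
}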
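
To try this byte-comparison idea against files in different encodings, here is a self-contained sketch that takes a file path instead of an open reader. `TrailingNewlineCheck` and `EndsWithNewline` are names invented for this example. It assumes the file either starts with a byte order mark or is UTF-8, since that is all the reader is asked to detect here.

```csharp
using System;
using System.IO;
using System.Text;

public static class TrailingNewlineCheck
{
    // Returns true when the last bytes of the file equal "\n" as encoded
    // in the encoding the StreamReader detected for that file.
    public static bool EndsWithNewline(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            // Read once so that byte order mark detection has run
            // before CurrentEncoding is inspected.
            reader.Read();
            byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");

            Stream stream = reader.BaseStream;
            if (stream.Length < newlineBytes.Length)
                return false;

            stream.Seek(-newlineBytes.Length, SeekOrigin.End);

            // Stream.Read may return fewer bytes than requested, so loop.
            byte[] tail = new byte[newlineBytes.Length];
            int total = 0;
            while (total < tail.Length)
            {
                int n = stream.Read(tail, total, tail.Length - total);
                if (n == 0)
                    return false;
                total += n;
            }

            for (int i = 0; i < tail.Length; i++)
            {
                if (tail[i] != newlineBytes[i])
                    return false;
            }
            return true;
        }
    }
}
```

Because "\r\n" also ends in "\n", this reports true for both Windows and Unix line endings. It does not say which of the two was used, so a copy that must preserve the exact ending would need to inspect one more character.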
David Rogers
  • "the moment you go to ask for help, the answer comes to the surface" AKA [Rubber Duck Problem Solving](http://www.codinghorror.com/blog/2012/03/rubber-duck-problem-solving.html) – Scott Chamberlain Jan 07 '14 at 20:23
  • Thanks! I am an olde user, so I had not seen that page. (I have to find a duck...) – David Rogers Jan 07 '14 at 20:50
  • You should read the [full story](http://hwrnmnbsol.livejournal.com/148664.html) behind the quote on that page; it is quite an entertaining read. – Scott Chamberlain Jan 07 '14 at 20:53