I'm receiving files that can be encoded as either Latin-1 ("ISO-8859-1") or UTF-8. I get each file as a stream in C#. How can I detect whether it is Latin-1 or UTF-8? The code below doesn't work: it always reports UTF-8.

    private Encoding GetUtf8EncodeStream(Stream fileStream)
    {
        using var reader = new StreamReader(fileStream, true);
        var encoding = reader.CurrentEncoding;
        if (Equals(encoding, Encoding.UTF8))
        {
            return Encoding.UTF8;
        }
        return Encoding.GetEncoding("ISO-8859-1");
    }

    void Method()
    {
        var encoding = GetUtf8EncodeStream(fileStream);
        using (TextReader reader = new StreamReader(fileStream, encoding))
        {
            // read the file with the detected encoding
        }
    }

I first need to know the encoding, and then I will read it with that encoding.

I need to know the encoding because the files contain the special characters æ, ø and å. If I read a Latin-1 stream with the StreamReader set to UTF-8, I get question marks instead of those characters. And if I do the reverse, where I set the StreamWriter to UTF-8 and the data is in Latin-1, all hell breaks loose ;)

  • Does the stream you're checking include any characters outside the 7-bit ASCII range? If not, it might be detected as UTF-8 because of that (I would assume). – ProgrammingLlama May 28 '20 at 07:44
  • Yes, it has some characters: æ, ø and å. I've updated my question. :) – Mads Illemann May 28 '20 at 07:50
  • You can try to verify whether the file is valid UTF-8. Otherwise, there is no algorithm to detect encodings, I'm afraid. `UTF8Encoding.GetString(byteArray)` will throw an `ArgumentException` if error detection is enabled. – jira May 28 '20 at 08:02
  • "The presence of invalid 8-bit characters outside valid multi-byte sequences can also be used to "auto-detect" that an encoding is actually an extended ASCII encoding rather than UTF-8, and decode it accordingly." See [Wikipedia](https://en.wikipedia.org/wiki/UTF-8) – jira May 28 '20 at 08:05

1 Answer

I found a solution. :) This site gave me the right answer. https://archive.codeplex.com/?p=utf8checker

It checks whether the stream is valid UTF-8, which Latin-1 text containing characters such as æ, ø and å is not. After that, my code was straightforward.

    private Encoding GetUtf8EncodeStream(Stream fileStream)
    {
        if (_utf8Checker.IsUtf8(fileStream))
        {
            return Encoding.UTF8;
        }

        return Encoding.GetEncoding("ISO-8859-1");
    }

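    var encoding = GetUtf8EncodeStream(stream);
    stream.Position = 0; // rewind, because the check has consumed the stream
    using (TextReader reader = new StreamReader(stream, encoding))
    {
        // read the file with the detected encoding
    }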
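
If you would rather not depend on the external checker, the same idea (valid UTF-8 or else Latin-1) can probably be done with the framework's own strict decoder, as one of the comments suggests. This is only a minimal sketch, not the utf8checker implementation: `EncodingSniffer`, `IsValidUtf8` and `DetectUtf8OrLatin1` are names I made up for illustration, and it assumes the stream is seekable and small enough to buffer in memory.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSniffer // hypothetical helper, not part of utf8checker
{
    // A UTF-8 decoder that throws DecoderFallbackException on invalid byte
    // sequences instead of silently substituting replacement characters.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // Decoding is only simulated; the strict decoder throws on bad input.
            StrictUtf8.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Assumption: the stream supports seeking and fits in memory.
    public static Encoding DetectUtf8OrLatin1(Stream stream)
    {
        using var buffer = new MemoryStream();
        stream.Position = 0;
        stream.CopyTo(buffer);
        stream.Position = 0; // rewind so the caller can read from the start

        return IsValidUtf8(buffer.ToArray())
            ? Encoding.UTF8
            : Encoding.GetEncoding("ISO-8859-1");
    }
}
```

Pure ASCII content is reported as UTF-8, which is harmless because both encodings decode those bytes identically. A Latin-1 file whose bytes happen to form valid UTF-8 sequences would be misdetected, but that should be rare in real text.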