
I am dealing with files in many formats, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether a file is being interpreted correctly as UTF-8 or Shift-JIS. If I detect that the file is not in the encoding I read it with, is there a way to just reinterpret my in-memory data without having to re-read the file with a new encoding specified?

Right now, I read in the file assuming Shift-JIS as such:

using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    String line = sr.ReadToEnd();

    // Detection must be done AFTER you read from the file.  Silly rabbit.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}

After that, I do my magic to determine whether the file is in a known format (has a BOM) or the data makes sense as Shift-JIS, and if so, all is well. If the data is garbage, though, I re-read the file via:

using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    String line = sr.ReadToEnd();
}

I am trying to avoid this re-read step and reinterpret the data in memory if possible.

Or is magic already happening and I am needlessly worrying about double I/O access?

Michael Dorgan

1 Answer

var buf = File.ReadAllBytes(path);
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD")) // Unicode replacement character
{
    // Code page 932 is Shift-JIS; decode the same bytes again, no second file read.
    text = Encoding.GetEncoding(932).GetString(buf);
}
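
A stricter variant of the same idea, offered only as a sketch: instead of scanning the decoded string for U+FFFD, a `UTF8Encoding` constructed with `throwOnInvalidBytes: true` throws a `DecoderFallbackException` on the first malformed sequence, and the same buffer can then be decoded as Shift-JIS. The class and method names (`EncodingSniffer`, `DecodeUtf8OrShiftJis`) are hypothetical, and the `RegisterProvider` call assumes .NET Core or .NET 5+; on the classic .NET Framework code page 932 is available without it.

```csharp
using System;
using System.Text;

static class EncodingSniffer
{
    // Hypothetical helper: decodes a buffer that was read from disk once.
    public static string DecodeUtf8OrShiftJis(byte[] buf)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 932
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Skip a UTF-8 BOM (EF BB BF) if present; GetString does not strip it.
        int offset = (buf.Length >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) ? 3 : 0;

        // No BOM on output, throw on invalid bytes instead of emitting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(buf, offset, buf.Length - offset);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: reinterpret the same bytes as Shift-JIS.
            return Encoding.GetEncoding(932).GetString(buf);
        }
    }
}
```

Usage would be `var text = EncodingSniffer.DecodeUtf8OrShiftJis(File.ReadAllBytes(path));`. Note that plain ASCII is valid in both encodings, and a Shift-JIS file could in principle happen to be valid UTF-8, so a language-level sanity check on the result may still be worthwhile.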
Artem
  • What are you attempting to detect with the FFFD check? – Michael Dorgan Aug 12 '15 at 00:16
  • @MichaelDorgan, see https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character – Artem Aug 12 '15 at 08:51
  • Ah, mojibake detection. The problem with UTF8 and Shift-JIS is that the characters overlap each other in such a way that 0xFFFD doesn't get generated, at least in my own tests. Instead, I have to do language analysis to catch errors. Decoding from the raw bytes to UTF8 or Shift-JIS, though, is exactly what I was looking for. Thank you. – Michael Dorgan Aug 12 '15 at 17:44