I need to identify which files in a set are encoded as UTF-8 without a BOM or as ANSI. How can I detect these encodings? My current strategy is to check whether the file matches one of the known encodings that carry a BOM and, if it does not, to declare it UTF-8/ANSI. Is there a direct method to detect these encodings?

  • Does this answer your question? [Howto identify UTF-8 encoded strings](https://stackoverflow.com/questions/377294/howto-identify-utf-8-encoded-strings) – Joe May 19 '20 at 13:17
  • UTF8 is not ANSI (and ANSI is not really ANSI, it is just a bad name interpreted in the wrong way). – Giacomo Catenazzi Jun 11 '20 at 15:15

1 Answer

The usual method works:

  • Check for a BOM, and read the file according to the encoding the BOM indicates. If decoding fails (or there is no BOM), go to the next point.

  • Assume the file is UTF-8, and read it accordingly. If decoding fails, go to the next point. False positives (a non-UTF-8 file that also happens to be valid UTF-8) are very rare.

  • Assume the file is Latin-1 or CP1252 (the so-called ANSI code page, which is a superset of Latin-1 for printable characters).

This is the easiest and safest way. Even if you use another method (encoding detection), you should still implement this fallback as well, because reading a file with the detected encoding may fail.
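The three steps above can be sketched in Python roughly as follows. This is only a minimal sketch: the function name `decode_with_fallback` and the `BOMS` table are made up for illustration, and the final Latin-1 fallback is an added assumption, used because Python's `cp1252` codec rejects the few byte values that CP1252 leaves undefined.

```python
import codecs
from typing import Tuple

# Hypothetical lookup table. The UTF-32 marks come before the UTF-16 ones
# because the UTF-32LE BOM begins with the same bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) following the BOM -> UTF-8 -> CP1252 order."""
    # Step 1: trust the BOM if there is one and the data decodes cleanly.
    for bom, name in BOMS:
        if data.startswith(bom):
            try:
                return data.decode(name), name
            except UnicodeDecodeError:
                break  # BOM-like bytes but invalid content: keep going
    # Step 2: strict UTF-8. Invalid sequences raise UnicodeDecodeError.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: CP1252, then Latin-1, which accepts every byte value.
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"
```

For a file on disk, something like `decode_with_fallback(open(path, "rb").read())` would apply it. Note that pure ASCII input is reported as UTF-8, since the two are indistinguishable.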

Remember that the byte sequences used as BOMs can also occur in genuine ANSI files as ordinary characters. Unfortunately, there are also files that mix encodings: source code, for example, may contain a name in the copyright notice in one encoding and comments in another.
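As a small illustration of the first point, the bytes of the UTF-8 and UTF-16LE BOMs are also perfectly valid CP1252 text. The variable names here are only for the example.

```python
import codecs

# The UTF-8 BOM is EF BB BF; in CP1252 those bytes are three printable characters.
utf8_bom_as_ansi = codecs.BOM_UTF8.decode("cp1252")

# The UTF-16LE BOM is FF FE; in CP1252 that is two printable characters.
utf16_bom_as_ansi = codecs.BOM_UTF16_LE.decode("cp1252")

print(repr(utf8_bom_as_ansi), repr(utf16_bom_as_ansi))
```

So a BOM match is strong evidence, not proof, which is why the decode in step 1 should still be checked for errors.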

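If you want to implement a better algorithm, then after step 1 (the BOM check) look for zero bytes (00). If there are some (or many), fall back to UTF-32 if three consecutive 00 bytes occur, or otherwise to UTF-16: for mostly ASCII text, UTF-16LE puts the 00 in the second byte of each pair (odd offsets, counting from 0), while UTF-16BE puts it in the first byte (even offsets). Ignore or substitute illegal combinations when decoding.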
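A rough sketch of that zero-byte heuristic in Python is shown below. The function name `guess_by_null_bytes` and the `threshold` parameter are hypothetical, and choosing the UTF-32 byte order by comparing the first and last byte of each 4-byte group is an assumption added here, since the description above does not say how to pick it.

```python
from typing import Optional

def guess_by_null_bytes(data: bytes, threshold: float = 0.1) -> Optional[str]:
    """Guess a BOM-less UTF-16/UTF-32 encoding from where zero bytes fall.

    Returns None when there are too few zero bytes to suggest either.
    Only meaningful for text that is mostly ASCII/Latin characters.
    """
    zeros = data.count(0)
    if not data or zeros / len(data) < threshold:
        return None
    if b"\x00\x00\x00" in data:
        # UTF-32: 'A' is 41 00 00 00 in LE and 00 00 00 41 in BE, so compare
        # zero counts in the first and last byte of each 4-byte group.
        first = data[0::4].count(0)
        last = data[3::4].count(0)
        return "utf-32-le" if first < last else "utf-32-be"
    # UTF-16: 'A' is 41 00 in LE (zero at odd offset) and 00 41 in BE.
    even = data[0::2].count(0)
    odd = data[1::2].count(0)
    return "utf-16-le" if odd > even else "utf-16-be"
```

Once an encoding is guessed, `data.decode(encoding, errors="replace")` substitutes U+FFFD for illegal sequences instead of raising, which matches the "ignore/substitute" advice.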

Giacomo Catenazzi