
I have a text file with the following contents: Ã(195) Ü(220) Â(194) ë(235) Ó(211) Ã(195) »(187) §(167) Ã(195) û(251) Ã(195) Ü(220) Â(194) ë(235) Ã(195) û(251) ³(179) Æ(198) Ã(195) û(251) ³(179) Æ(198). For simplicity, along with the text I have added the Unicode values that I got from http://www.fileformat.info/. Going by those values, this file seems to comply with the line "A character from JIS-X-0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE" from https://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP, and my rendering engine displays Japanese characters. However, this is actually a Chinese text file containing 密码用户名密码名称名称, which Notepad++ recognizes as a GB2312-encoded file. Are there some more restrictions for determining whether a file is JIS-X-0208 (EUC-JP) encoded, since it seems to comply with what the Wiki says?
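To make the range check concrete, here is a minimal Python sketch (the byte string is reconstructed from the values listed above, e.g. Ã(195) = 0xC3):

```python
# Bytes reconstructed from the values listed above.
data = bytes.fromhex("C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6")

# Every byte lies in 0xA1-0xFE, so the sequence satisfies the
# two-byte rule for JIS-X-0208 (code set 1) in EUC-JP.
print(all(0xA1 <= b <= 0xFE for b in data))  # True
```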

However, my rendering engine recognizes this file as both EUC-JP and Chinese, but since EUC-JP comes higher in the order, we assume it is Japanese and display Japanese characters.

learn_develop
  • The reference to Unicode, especially next to the Latin-1 rendering of the bytes in the file, is incredibly confused, and/or confusing. There is no Unicode here, other than in the very abstract sense. – tripleee Oct 06 '15 at 09:45
  • Tangentially, see also http://unicodebook.readthedocs.org/en/latest/guess_encoding.html – tripleee Oct 07 '15 at 07:32

2 Answers


Are there some more restrictions for determining if a file is JIS-X-0208(EUC-JP) encoded

A little, in that the lead bytes 0xF5–0xF8 and 0xFD–0xFE are unassigned, and there are also some other unassigned characters sprinkled at the end of blocks throughout.

That doesn't help you here, though, as the byte sequence C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6 is equally valid in GB (密码用户名密码名称名称) and EUC-JP (畜鷹喘薩兆畜鷹兆各兆各).
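A minimal Python sketch, assuming nothing beyond the bytes above, makes the ambiguity concrete:

```python
data = bytes.fromhex("C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6")

# Both decodes succeed without error; the bytes alone cannot tell
# the two encodings apart.
print(data.decode("gb2312"))  # 密码用户名密码名称名称
print(data.decode("euc_jp"))  # 畜鷹喘薩兆畜鷹兆各兆各
```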

Such is the joy of charset sniffing. You'll have to prune and reorder the charsets you have, based on how likely each is to occur in your input. Typically, in a Windows world, EUC-JP is rare (the Shift-JIS-alike code page 932 would be used instead), so the GB-alike code page 936 would usually ‘win’.
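For instance, a first-match sniffer along these lines (a minimal Python sketch; the `sniff` helper is illustrative, and the candidate order is exactly the knob you would tune for your input):

```python
# Candidate encodings, ordered by how likely they are in *your* input.
CANDIDATES = ["utf-8", "gb2312", "euc_jp", "big5"]

def sniff(data):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for enc in CANDIDATES:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return "latin-1", data.decode("latin-1")  # last resort: never fails

data = bytes.fromhex("C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6")
print(sniff(data))  # ('gb2312', '密码用户名密码名称名称')
```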

bobince
  • [Google translate](https://translate.google.com/#auto/en/%E5%AF%86%E7%A0%81%E7%94%A8%E6%88%B7%E5%90%8D%E5%AF%86%E7%A0%81%E5%90%8D%E7%A7%B0%E5%90%8D%E7%A7%B0) gives "Password Username Password Name Name" for the Chinese, and gibberish for the Japanese reading. – tripleee Oct 06 '15 at 09:42
  • @bobince: You seem to have caught me at exactly the point where I am hitting the issue. My rendering engine displays "畜鷹喘薩兆畜鷹兆各兆各" (EUC-JP). Are you suggesting I move GB up in the order relative to EUC-JP? – learn_develop Oct 06 '15 at 10:14
  • @bobince: My rendering engine also seems to determine the encoding as Big5. I haven't investigated whether that is a bug in the code, but I will have a look. – learn_develop Oct 06 '15 at 10:27
  • @tripleee: True, but if you write the text file with the encoding set to EUC-JP, you will see exactly the Japanese text bobince posted. – learn_develop Oct 06 '15 at 10:29
  • Yes, but it doesn't appear to mean anything useful. – tripleee Oct 06 '15 at 10:44
  • What bobince is trying to tell you is that these bytes represent *something* in pretty much every double-byte encoding; you will have to use human recognition to determine whether each respective "something" is useful and correct. – tripleee Oct 06 '15 at 10:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91474/discussion-between-learn-develop-and-tripleee). – learn_develop Oct 06 '15 at 11:05
  • If you need to handle input with unspecified input encodings, you will need to change/configure your “rendering engine” to prefer certain encodings that your own input is likely to be using, at the expense of others that are not well-represented in your input. You will still get errors, because charset-sniffing is fundamentally unreliable and should generally be avoided, but you might be able to reduce the frequency of error. What is your “rendering engine”? If it cannot be reconfigured you may have to pre-process the input beforehand to put it in a known-good encoding such as UTF-8. – bobince Oct 06 '15 at 11:13
  • For the sake of simplicity, let's say the rendering engine is like Notepad++: it takes in any file, determines the encoding internally upon loading from its set of supported encodings, and uses the one that takes precedence over the others. – learn_develop Oct 06 '15 at 11:30

There is no completely reliable way to identify an unknown encoding.

Distribution patterns can probably help you determine whether you are looking at an 8-bit or a 16-bit encoding. Double-byte encodings tend to have a slightly constrained distribution pattern for every other byte. This is where you are now.

Among 16-bit encodings, you can also probably easily determine whether you are looking at a big-endian or little-endian encoding. Little-endian will have the constrained pattern on the even bytes, while big-endian will have it on the odd bytes. Unfortunately, most double-byte encodings seem to be big-endian, so this is not going to help much. If you are looking at little-endian, it's likely UTF-16LE.
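A rough sketch of that even/odd check in Python (positions counted 1-based, as above; the `high_byte_side` helper is illustrative, and the fraction of distinct byte values is only one of many possible constraint measures):

```python
def spread(bs):
    # Crude constraint measure: fraction of distinct values among the bytes.
    return len(set(bs)) / max(len(bs), 1)

def high_byte_side(data):
    odd = data[0::2]   # 1st, 3rd, 5th... bytes (1-based odd positions)
    even = data[1::2]  # 2nd, 4th, 6th... bytes
    # The more constrained (narrower) side is the likelier lead/high byte.
    return "odd: big-endian-like" if spread(odd) < spread(even) \
        else "even: little-endian-like"
```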

Looking at your example data, every other byte seems to be equal to or close to 0xC3, starting at the first byte (but there seem to be some bytes missing, perhaps?)

There are individual byte sequences which are invalid in individual encodings, but on the whole, this is rather unlikely to help you reach a conclusion. If you can remove one or more candidate 16-bit encodings with this tactic, good for you; but it will probably not be sufficient to solve your problem.
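In Python, a strict decode is an easy way to apply this pruning; a sketch, using only codecs that ship with the standard library:

```python
data = bytes.fromhex("C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6")

# Encodings in which the sequence is outright invalid can be dropped;
# the rest remain ambiguous.
for enc in ("utf-8", "utf-16", "shift_jis", "gb2312", "euc_jp", "big5"):
    try:
        data.decode(enc)
        print(enc, "-> valid")
    except UnicodeDecodeError:
        print(enc, "-> invalid")
```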

Within this space, all you have left is statistics. If the text is long enough, you can probably find repeating patterns, or use a frequency table for your candidate encodings to calculate a score for each. Because the Japanese writing system shares a common heritage with Chinese, you will find similarities in their distributions, but also differences. Typologically, Japanese is quite different from Chinese: Japanese running text has particles every few characters, whereas Chinese does not have them at all. So you would look for "no" の, "wa" は, "ka" か, "ga" が, "ni" に, etc., and if they are present, conclude that you are looking at Japanese (or, conversely, surmise that you are perhaps looking at Chinese if they are absent; though if you are looking at lists of names, for example, it could still be Japanese).
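A sketch of that particle heuristic in Python (the `PARTICLES` list and the `particle_score` helper are illustrative choices, not a standard):

```python
PARTICLES = "のはかがにをでと"  # common Japanese particles (hiragana)

def particle_score(data, enc):
    """Count particle occurrences in the text decoded under `enc`."""
    try:
        text = data.decode(enc)
    except UnicodeDecodeError:
        return -1  # not even a valid decode
    return sum(text.count(p) for p in PARTICLES)

# A clearly positive score on long text suggests Japanese running text;
# a score near zero leans Chinese (or, e.g., a Japanese list of names).
```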

For Chinese (and also, tangentially, for Japanese) you can look at http://www.zein.se/patrick/3000char.html for frequency information; but keep in mind that the Japanese particles will be much more common in Japanese running text than any of these glyphs.

For example, 的 (the first item on the list) aka U+7684 will be 0x76 0x84 in UTF-16be, 0xAA 0xBA in Big-5, 0xC5 0xAA in EUC-JP, 0xB5 0xC4 in GB2312, etc.

From your sample data, you likely have item 139 on that list 名 aka U+540D which is 0x54 0x0D in UTF-16be, 0xA5 0x57 in Big-5, 0xCC 0xBE in EUC-JP, and 0xC3 0xFB in GB2312. (Do you see? Hit!)
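A quick Python check reproduces those byte pairs and counts how often each occurs in your sample:

```python
data = bytes.fromhex("C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6")

# Encode 名 (U+540D) under each candidate and count the pair in the data.
for enc in ("utf-16-be", "big5", "euc_jp", "gb2312"):
    pair = "名".encode(enc)
    print(enc, pair.hex(), data.count(pair))  # gb2312's c3fb occurs 3 times
```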

tripleee
  • Both the answers seem pretty convincing, thanks for providing such detailed insight. I will try it out and see if I can maybe figure out a repeating pattern as suggested and work my way around the problem. Just out of curiosity, how often do we encounter files that are EUC-JP encoded? I guess many of the standard formats now use Shift-JIS? – learn_develop Oct 07 '15 at 06:32
  • EUC-JP used to be preferred on U*x but these days, those would by and large have switched to UTF-8, I expect. The source and age of the files are an important factor; in general, it doesn't make sense to try to express an "average probable file encoding". – tripleee Oct 07 '15 at 07:19
  • Marking this as the answer since it is much more informative; even someone with no background on the subject would be able to understand it well. – learn_develop Oct 09 '15 at 09:48