
I want to know if there is a way to detect mojibake (invalid) characters by their byte range. (For a simple example, detecting valid ASCII characters is just a matter of checking whether their byte values are less than 128.) Given the older custom character sets, such as JIS, EUC and, of course, Unicode, is there a way to do this?

The immediate interest is in a C# project, but I'd like to find a solution that is as language/platform independent as possible, so I could use it in C++, Java, PHP or whatever.

Arrigato

James John McGuire 'Jahmic'
  • You spelled arigatou wrong :) – MGZero Jun 30 '11 at 15:05
  • Yes, but it always depends on what system of romanization you're using - just like your use of 'u' for a long vowel. – Michael Jun 30 '11 at 15:11
  • I was using 'romagi', to further confuse things. – James John McGuire 'Jahmic' Jun 30 '11 at 22:23
  • Well, to fully confuse things: we have 4 (+?) alphabets we're dealing with - hiragana (original Japanese), kanji (uhmm, imported from China), katakana (used for modern non-Japanese words), romagi (the English-sounding names or the English spelling equivalents; some standardization... I'm not even close to qualifying...), and of course, English (European) ASCII... anyways. Then there are the character encodings, at least six: JIS, S-JIS, EUC and a bunch of minor ones. Slowly, things are starting to get standardized on Unicode, but once you get bad data in your DB, how do you get it out? – James John McGuire 'Jahmic' Jun 30 '11 at 22:35
  • I'm curious because I don't understand much of this: Is this a question about an abstract sequence of Unicode codepoints, or about a particular encoding, or about transliteration? – Kerrek SB Jul 01 '11 at 12:07
  • @Kerrek - I've tried to generalize the question, since I've come across this problem numerous times. I'm looking for very real solutions, not abstract ones. I'm not sure how much more specific I can be, but... given a code page or encoding, are there byte ranges you can compare a given character (single-byte, multi-byte or Unicode) against to see if it is a valid Japanese character? – James John McGuire 'Jahmic' Jul 02 '11 at 12:09
  • "Arigatou" isn't even the correct choice here. It'd be "yoroshiku onegaishimasu". – Kef Schecter Jul 02 '11 at 15:57
  • @Kef Schecter - Without trying to stray too far from the main point... as my understanding goes, "yoroshiku onegaishimasu" is the formal (default) way of saying 'thank you', but 'arrigato' is very common in casual instances. – James John McGuire 'Jahmic' Jul 04 '11 at 00:04
  • "Arigatou" (which is *never* spelled "arrigato") is used to thank somebody for something that's already done. "Yoroshiku onegaishimasu" is to thank them for something in advance. If you want to be less formal, you can just say "onegaishimasu". – Kef Schecter Jul 05 '11 at 04:11

4 Answers


Detecting 文字化け (mojibake) by byte range is very difficult.

As you know, most Japanese characters are multi-byte. In the case of Shift-JIS (one of the most popular encodings in Japan), the first byte of a Japanese character is in the range 0x81 to 0x9f or 0xe0 to 0xef, and the second byte has a different range. In addition, ASCII characters may be interleaved with Shift-JIS text, which makes detection difficult.
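
As a rough illustration of that first-byte rule only (a sketch, not a full validator - the helper name isShiftJisLeadByte is mine, and a real check would also have to validate the second byte, roughly 0x40-0x7e and 0x80-0xfc, and allow for interleaved single-byte ASCII and half-width katakana, 0xa1-0xdf):

public class SjisLeadByte {
    // True if b could be the lead byte of a two-byte Shift-JIS character,
    // using the ranges quoted above.
    static boolean isShiftJisLeadByte(byte b) {
        int v = b & 0xFF;                                    // treat as unsigned
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xEF);
    }

    public static void main(String[] args) {
        System.out.println(isShiftJisLeadByte((byte) 0x83)); // true  (lead byte of モ)
        System.out.println(isShiftJisLeadByte((byte) 0x41)); // false (ASCII 'A')
    }
}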

In Java, you can detect invalid characters with java.nio.charset.CharsetDecoder.
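
For example, here is a minimal sketch of that approach (my own code, not the answerer's; the method name decodesCleanly is hypothetical): configure the decoder to report malformed or unmappable input instead of replacing it, and treat a decode failure as a sign that the bytes are not valid in that charset.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class MojibakeCheck {
    // True if every byte sequence in data is valid for the given charset.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;                     // invalid byte sequence found
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x83, (byte) 0x82, (byte) 0x83, (byte) 0x57}; // "モジ" in Shift-JIS
        System.out.println(decodesCleanly(bytes, "Shift_JIS")); // true
        System.out.println(decodesCleanly(bytes, "UTF-8"));     // false
    }
}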

t_motooka
  • I think you're right; at the very least, it's difficult. In fact, without a reference indicator, there are cases where you cannot tell whether a byte stream is Unicode or not. But I'm still going to leave this question open for a bit longer to see what other responses may show up. – James John McGuire 'Jahmic' Jul 02 '11 at 12:12

What you're trying to do here is character encoding auto-detection, as performed by Web browsers. So you could use an existing character encoding detection library, like the universalchardet library in Mozilla; it should be straightforward to port it to the platform of your choice.

For example, using Mark Pilgrim's Python 3 port of the universalchardet library:

>>> chardet.detect(bytes.fromhex('83828357836f8350'))
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
>>> chardet.detect(bytes.fromhex('e383a2e382b8e38390e382b1'))
{'confidence': 0.938125, 'encoding': 'utf-8'}

But it's not 100% reliable!

>>> chardet.detect(bytes.fromhex('916d6f6a6962616b6592'))
{'confidence': 0.6031748712523237, 'encoding': 'ISO-8859-2'}

(Exercise for the reader: what encoding was this really?)

Gareth Rees

This is not a direct answer to the question, but I've had luck using the ftfy Python package to automatically detect/fix mojibake:

>>> import ftfy
>>> print(ftfy.fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

It works surprisingly well for my purposes.

joe

I don't have the time and/or priority to follow up on this for the moment, but I think that, if the source is known to be Unicode, some headway can be made into the issue by using these charts and following on some of the work done here. Likewise, for Shift-JIS, using this chart can be helpful. A rough sketch of that kind of code-point range check follows.
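
For instance, once the data has been decoded to Unicode, each code point can be tested against the blocks commonly used for Japanese text (a hedged sketch under that assumption; the helper name isJapaneseCodePoint and the particular block list are my own and are not exhaustive):

public class JapaneseRange {
    // True if cp falls in a Unicode block commonly used for Japanese text.
    static boolean isJapaneseCodePoint(int cp) {
        return (cp >= 0x3040 && cp <= 0x309F)   // Hiragana
            || (cp >= 0x30A0 && cp <= 0x30FF)   // Katakana
            || (cp >= 0x4E00 && cp <= 0x9FFF)   // CJK Unified Ideographs (kanji)
            || (cp >= 0xFF66 && cp <= 0xFF9D);  // Half-width katakana
    }

    public static void main(String[] args) {
        // "文字化け" is all Japanese; the trailing 'A' is plain ASCII.
        "文字化けA".codePoints().forEach(cp ->
            System.out.println(Integer.toHexString(cp) + " " + isJapaneseCodePoint(cp)));
    }
}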

James John McGuire 'Jahmic'