You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:
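    // Close the file only if it was opened; fclose(NULL) is undefined behavior.
    // "rb" avoids newline translation of the raw UTF-8 bytes on some platforms.
    std::shared_ptr<FILE> f(fopen(filename, "rb"), [](FILE *fp) { if (fp) fclose(fp); });
    uint32_t c = 0;
    while (f && utf8_read(f.get(), c))
    {
        if (is_english_char(c))
            ...
        else if (is_cjk_char(c))
            ...
        else
            ...
    }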
Where utf8_read has the signature:
    bool utf8_read(FILE *f, uint32_t &c);
utf8_read reads 1 to 4 bytes per codepoint, depending on the value of the first byte. See http://en.wikipedia.org/wiki/UTF-8 for the encoding, search for an existing algorithm, or use a library function already available to you.
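As a rough illustration of the decoding step, here is a minimal sketch of one way utf8_read could be written. It is an assumption about how the function behaves, not a reference implementation: it returns false both at end of file and on the first malformed sequence, and the name min_cp is made up for this example.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical implementation of utf8_read(): decodes one codepoint into c.
// Returns false at end of file or when the input is not valid UTF-8.
bool utf8_read(FILE *f, uint32_t &c)
{
    int b = fgetc(f);
    if (b == EOF)
        return false;

    int extra;        // number of continuation bytes that must follow
    uint32_t min_cp;  // smallest codepoint allowed for this sequence length
    if (b < 0x80)      { c = b;        extra = 0; min_cp = 0x0;     }
    else if (b < 0xC0) { return false; }  // stray continuation byte
    else if (b < 0xE0) { c = b & 0x1F; extra = 1; min_cp = 0x80;    }
    else if (b < 0xF0) { c = b & 0x0F; extra = 2; min_cp = 0x800;   }
    else if (b < 0xF8) { c = b & 0x07; extra = 3; min_cp = 0x10000; }
    else               { return false; }  // 0xF8..0xFF never occur in UTF-8

    for (int i = 0; i < extra; ++i)
    {
        b = fgetc(f);
        if (b == EOF || (b & 0xC0) != 0x80)
            return false;  // truncated or malformed sequence
        c = (c << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates and values past U+10FFFF.
    if (c < min_cp || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return false;
    return true;
}
```

If you would rather skip bad bytes than stop, one option is to return U+FFFD (the replacement character) for a malformed sequence and keep reading.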
With the UTF-32 codepoint, you can now check ranges. For English, you can check if it is ASCII (c <= 0x7F) or if it is a Latin character (including accented characters in words imported from e.g. French). You may also want to exclude non-printable control characters (e.g. 0x01).
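The range checks described above could look something like the sketch below. Which ranges count as "English" is a judgment call: this version assumes printable ASCII plus the letters of the Latin-1 Supplement and Latin Extended-A blocks, and is_printable_ascii is a hypothetical helper introduced only for this example.

```cpp
#include <cstdint>

// Hypothetical helper: printable ASCII only (space through '~'),
// which excludes control characters such as 0x01 and 0x7F (DEL).
bool is_printable_ascii(uint32_t c)
{
    return c >= 0x20 && c <= 0x7E;
}

// Sketch of is_english_char(): printable ASCII plus accented Latin letters.
bool is_english_char(uint32_t c)
{
    if (is_printable_ascii(c))
        return true;
    // Latin-1 Supplement letters U+00C0..U+00FF, minus the two math
    // signs that sit inside that range (U+00D7 and U+00F7).
    if (c >= 0xC0 && c <= 0xFF)
        return c != 0xD7 && c != 0xF7;
    // Latin Extended-A, U+0100..U+017F.
    return c >= 0x100 && c <= 0x17F;
}
```

Widen or narrow the ranges to suit your text; for example, drop the punctuation part of ASCII if you only want letters.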
For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
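A block-based check might be sketched as a small table of ranges taken from Blocks.txt. The selection below is an assumption about what you want to treat as CJK and is not exhaustive (there are further CJK extension blocks); the Block struct and cjk_blocks table are names made up for this example.

```cpp
#include <cstdint>

// Hypothetical range type: an inclusive codepoint block.
struct Block { uint32_t first, last; };

// A subset of the CJK-related blocks listed in Blocks.txt.
static const Block cjk_blocks[] = {
    { 0x1100,  0x11FF  },  // Hangul Jamo
    { 0x3040,  0x309F  },  // Hiragana
    { 0x30A0,  0x30FF  },  // Katakana
    { 0x3400,  0x4DBF  },  // CJK Unified Ideographs Extension A
    { 0x4E00,  0x9FFF  },  // CJK Unified Ideographs
    { 0xAC00,  0xD7AF  },  // Hangul Syllables
    { 0xF900,  0xFAFF  },  // CJK Compatibility Ideographs
    { 0x20000, 0x2A6DF },  // CJK Unified Ideographs Extension B
};

// Sketch of is_cjk_char(): true if c falls inside any listed block.
bool is_cjk_char(uint32_t c)
{
    for (const Block &b : cjk_blocks)
        if (c >= b.first && c <= b.last)
            return true;
    return false;
}
```

A linear scan is fine for a handful of blocks; with a larger table, keep it sorted and use a binary search instead.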
If you are using a library with Unicode support that has writing script detection (e.g. the glib library), you can use the script type to detect the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt:
Name     : Code : Language(s)
=========:======:=============================================================
Common   : Zyyy : general punctuation / symbol characters
Latin    : Latn : Latin-script languages (English, German, French, Spanish, ...)
Han      : Hani : Chinese characters (Chinese, Japanese); Hans/Hant denote the Simplified/Traditional variants
Hiragana : Hira : Japanese
Katakana : Kana : Japanese
Hangul   : Hang : Korean
NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry (Type == 'script').
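If you go the Scripts.txt route, each data line has the form `0041..005A    ; Latin # ...` (a range) or `00AA          ; Latin # ...` (a single codepoint), with the script given by its name (Latin, Han, ...) rather than its four-letter code. A minimal sketch of parsing one such line follows; ScriptRange and parse_script_line are hypothetical names, and the sketch assumes script names contain only letters and underscores.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical record: an inclusive codepoint range and its script name.
struct ScriptRange { uint32_t first, last; std::string script; };

// Parses one line of Scripts.txt. Returns false for comments, blank
// lines and anything else that is not a data line.
bool parse_script_line(const std::string &line, ScriptRange &out)
{
    unsigned first = 0, last = 0;
    char name[64] = "";

    if (sscanf(line.c_str(), "%x..%x ; %63[A-Za-z_]", &first, &last, name) == 3)
    {
        // "XXXX..YYYY ; Script": first and last are both set.
    }
    else if (sscanf(line.c_str(), "%x ; %63[A-Za-z_]", &first, name) == 2)
    {
        last = first;  // "XXXX ; Script": a single codepoint
    }
    else
    {
        return false;
    }

    out.first = first;
    out.last = last;
    out.script = name;
    return true;
}
```

Collect the parsed ranges into a vector sorted by first codepoint and you can look up the script of any character with a binary search.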