
I have a text file which can contain a mix of Chinese, Japanese, Korean (CJK) and English characters. I have to validate that the file contains only English characters; CJK characters are allowed only on lines that begin with the '$' character, which marks a comment in my text file. Searching the net, I found that I can use fgetws() and the wchar_t type to read wide characters.

Q1) I am wondering how CJK characters would be stored in my text file - what byte order, etc.

Q2) How can I loop through the CJK characters? Since UTF-8 characters can be 1 to 4 bytes long, I cannot simply use i++.

Any help would be appreciated.

Thanks a lot.

– Manik Sidana

3 Answers


You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:

#include <cstdio>   // FILE, fopen, fclose
#include <cstdint>  // uint32_t
#include <memory>   // std::shared_ptr

// NB: check that fopen() succeeded before using f.
std::shared_ptr<FILE> f(fopen(filename, "r"), fclose);
uint32_t c = 0;
while (utf8_read(f.get(), c))
{
    if (is_english_char(c))
        ...
    else if (is_cjk_char(c))
        ...
    else
        ...
}

Where utf8_read has the signature:

bool utf8_read(FILE *f, uint32_t &c);

Now, utf8_read may read 1-4 bytes depending on the value of the first byte. See http://en.wikipedia.org/wiki/UTF-8 for the encoding details, search for an algorithm, or use a library function already available to you.
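A minimal sketch of such a function, assuming the input is well-formed UTF-8 (a production version should also reject overlong encodings, surrogate codepoints, and values above U+10FFFF):

#include <cstdio>   // FILE, fgetc
#include <cstdint>  // uint32_t

// Decode one UTF-8 codepoint from f into c.
// Returns false at EOF or on a malformed byte sequence.
bool utf8_read(FILE *f, uint32_t &c)
{
    int b = fgetc(f);
    if (b == EOF)
        return false;

    int extra;                                                 // continuation bytes to follow
    if      ((b & 0x80) == 0x00) { c = b;        extra = 0; }  // 0xxxxxxx: ASCII
    else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; extra = 1; }  // 110xxxxx
    else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; extra = 2; }  // 1110xxxx
    else if ((b & 0xF8) == 0xF0) { c = b & 0x07; extra = 3; }  // 11110xxx
    else return false;                                         // invalid lead byte

    while (extra-- > 0)
    {
        b = fgetc(f);
        if (b == EOF || (b & 0xC0) != 0x80)
            return false;                  // truncated or bad continuation byte
        c = (c << 6) | (b & 0x3F);         // append 6 payload bits
    }
    return true;
}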

With the UTF-32 codepoint, you can now check ranges. For English, you can check if it is ASCII (c < 0x80) or if it is a Latin character (including support for accented characters in words imported from e.g. French). You may also want to exclude non-printable control characters (e.g. 0x01).
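For example, a rough is_english_char along those lines might look like this (the exact ranges you accept are a policy decision; this sketch takes printable ASCII, common whitespace, and the Latin-1 accented range):

#include <cstdint>  // uint32_t

// Rough "English text" check: printable ASCII plus Latin-1 accented
// characters (U+00C0..U+00FF). Note this range also admits the
// multiplication/division signs (U+00D7, U+00F7); tighten as needed.
bool is_english_char(uint32_t c)
{
    if (c == '\t' || c == '\n' || c == '\r')
        return true;               // common whitespace
    if (c >= 0x20 && c < 0x7F)
        return true;               // printable ASCII
    if (c >= 0xC0 && c <= 0xFF)
        return true;               // Latin-1 accented letters
    return false;                  // control characters and everything else
}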

For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
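As a sketch, a check against a few of the major CJK blocks could look like this (the ranges are taken from Blocks.txt and are not exhaustive - the CJK Extension blocks, half-width forms, etc. are omitted):

#include <cstdint>  // uint32_t

// Rough CJK check against a few major Unicode blocks (see Blocks.txt).
bool is_cjk_char(uint32_t c)
{
    return (c >= 0x1100 && c <= 0x11FF)    // Hangul Jamo
        || (c >= 0x3040 && c <= 0x309F)    // Hiragana
        || (c >= 0x30A0 && c <= 0x30FF)    // Katakana
        || (c >= 0x4E00 && c <= 0x9FFF)    // CJK Unified Ideographs
        || (c >= 0xAC00 && c <= 0xD7AF);   // Hangul Syllables
}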

If you are using a library with Unicode support that has writing-script detection (e.g. the glib library), you can use the script property to classify the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt:

Name     : Code      : Language(s)
=========:===========:========================================================
Common   : Zyyy      : general punctuation / symbol characters
Latin    : Latn      : Latin languages (English, German, French, Spanish, ...)
Han      : Hans/Hant : Chinese characters (Chinese, Japanese)
Hiragana : Hira      : Japanese
Katakana : Kana      : Japanese
Hangul   : Hang      : Korean

NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry (Type == 'script').
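With glib, for instance, a script-based check might look like the following sketch (g_unichar_get_script() is part of glib's Unicode support; the G_UNICODE_SCRIPT_* values mirror the ISO 15924 codes in the table above):

#include <glib.h>

// Classify a codepoint as CJK by its writing script.
gboolean is_cjk_script(gunichar c)
{
    GUnicodeScript script = g_unichar_get_script(c);
    return script == G_UNICODE_SCRIPT_HAN        // Chinese characters
        || script == G_UNICODE_SCRIPT_HIRAGANA   // Japanese
        || script == G_UNICODE_SCRIPT_KATAKANA   // Japanese
        || script == G_UNICODE_SCRIPT_HANGUL;    // Korean
}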

– reece

You need to understand UTF-8 and use a UTF-8 handling library (or write your own). FYI, GLib (from GTK) has UTF-8 handling functions which are able to deal with variable-length UTF-8 characters and strings. There are other UTF-8 libraries, e.g. iconv - inside GNU libc - and ICU, among many others.

UTF-8 fully defines the byte sequence of multi-byte UTF-8 characters, e.g. Chinese ones, so unlike UTF-16 and UTF-32 there is no byte-order (endianness) issue to worry about.
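For example, GLib lets you walk a UTF-8 buffer codepoint by codepoint instead of byte by byte (a sketch using its standard g_utf8_validate(), g_utf8_next_char() and g_utf8_get_char() routines; scan_line is a hypothetical helper name):

#include <glib.h>
#include <stdio.h>

// Iterate over the codepoints of a NUL-terminated UTF-8 line.
void scan_line(const char *line)
{
    if (!g_utf8_validate(line, -1, NULL)) {
        fprintf(stderr, "not valid UTF-8\n");
        return;
    }
    for (const char *p = line; *p != '\0'; p = g_utf8_next_char(p)) {
        gunichar c = g_utf8_get_char(p);        // decoded UTF-32 codepoint
        printf("U+%04X\n", (unsigned) c);
    }
}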

– Basile Starynkevitch
  • Can a check like `if (wmessage[i] > 0x7F)` be sufficient to know that the character is non-ASCII? – Manik Sidana Oct 08 '12 at 06:00
  • You really should use some existing UTF-8 library... or understand all the gory details of UTF-8 encoding to code your own function. – Basile Starynkevitch Oct 08 '12 at 06:02
  • A check like `if (wmessage[i] > 0x7F)` is sufficient to know that a byte is not an ASCII codepoint, but it isn't enough to determine whether the byte is part of valid UTF-8. My advice would be to begin by converting the UTF-8 into UTF-32 (while checking that the input is valid UTF-8); then do whatever you need to with the UTF-32. The UTF-32 should be stored in something like `uint32_t` and *not* `wchar_t`. The conversion from UTF-8 to UTF-32 is relatively easy; but you may also want to convert into a canonical form (which is hard), and using a (well tested) library is the best idea. – Brendan Oct 08 '12 at 11:36
  • My input files have UTF-8 encoding. Brendan, in his comment below, suggested that wchar_t is not portable. Now, since my input file is UTF-8, can I use uint32_t and proceed with the `if (wmessage[i] > 0x7F)` check? Will it be safe and accurate? – Manik Sidana Oct 09 '12 at 06:24
  • No. You should not read UTF-8 files assuming a fixed code-unit size (uint32 or otherwise). But you shouldn't try to reinvent the wheel and write a UTF-8 parsing library either. Others have suggested GLib and iconv. Use those to parse your files. – dda Oct 09 '12 at 10:40

I am pasting a sample program to illustrate wchar_t handling. Hope it helps someone.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define BUFLEN 1024

int main(void) {
    wchar_t *wmessage = L"Lets- beginめん(下) 震災後、保存-食で-脚光-(経済ナビゲーター)-lets- end";
    wchar_t warray[BUFLEN + 1];
    wchar_t a = L'z';
    int i = 0;
    FILE *fp;
    wchar_t *token = L"-";
    wchar_t *state;
    wchar_t *ptr;

    /* Use the environment's locale so wide-character I/O works */
    setlocale(LC_ALL, "");

    /* File in the current directory containing CJK characters */
    fp = fopen("input", "r");
    if (fp == NULL) {
        printf("%s\n", "Cannot open file!!!");
        return -1;
    }
    if (fgetws(warray, BUFLEN, fp) == NULL) {
        printf("%s\n", "Cannot read line!!!");
        fclose(fp);
        return -1;
    }
    wprintf(L"\n*********************START reading from file*******************************\n");
    wprintf(L"%ls\n", warray);
    wprintf(L"\n*********************END reading from file*******************************\n");
    fclose(fp);

    wprintf(L"printing character %lc = <0x%x>\n", a, a);

    /* Flag each wide character above the ASCII range */
    wprintf(L"\n*********************START Checking string for Japanese*******************************\n");
    for (i = 0; wmessage[i] != L'\0'; i++) {
        if (wmessage[i] > 0x7F) {
            wprintf(L"\n This is non-ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        } else {
            wprintf(L"\n This is ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
        }
    }
    wprintf(L"\n*********************END Checking string for Japanese*******************************\n");

    /* Split the line read from the file on '-' */
    wprintf(L"\n*********************START Tokenizing******************************\n");
    state = wcstok(warray, token, &ptr);
    while (state != NULL) {
        wprintf(L"\n %ls", state);
        state = wcstok(NULL, token, &ptr);
    }
    wprintf(L"\n*********************END Tokenizing******************************\n");
    return 0;
}
– Manik Sidana
  • The first rule of `wchar_t` is "never use `wchar_t`". It's a portability nightmare (for historical reasons, it may be a 16-bit unsigned integer that is incapable of handling all valid Unicode codepoints). – Brendan Oct 08 '12 at 11:26
  • `unsigned int` is a bad idea for portability reasons (e.g. it could be 16-bit). I use `char` for ASCII, `uint8_t` for UTF-8 (mostly to make it clear that it's not ASCII but also because it's quicker to type than `unsigned char`), and `uint32_t` for UTF-32. Also note that even for UTF-32, a single "character" may consist of multiple code points. The only option that isn't painful is to use a good library. – Brendan Oct 18 '12 at 16:14