13

I want to process English words and Japanese words differently in this function:

function process_word($word) {
   if($word is english) {
     /////////
   }else if($word is japanese) {
      ////////
   }
}

thank you

Makoto
bbnn
  • Maybe it doesn't have to be about language... I just need to differentiate double-byte characters – bbnn May 18 '10 at 12:15

6 Answers

25

A quick solution that doesn't need the mbstring extension (note that utf8_decode() is deprecated as of PHP 8.2):

if (strlen($str) != strlen(utf8_decode($str))) {
    // $str uses multi-byte chars (isn't plain ASCII English)
} else {
    // $str is ASCII (probably English)
}

Or a modification of the solution provided by @Alexander Konstantinov:

function isKanji($str) {
    return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0;
}

function isHiragana($str) {
    return preg_match('/[\x{3040}-\x{309F}]/u', $str) > 0;
}

function isKatakana($str) {
    return preg_match('/[\x{30A0}-\x{30FF}]/u', $str) > 0;
}

function isJapanese($str) {
    return isKanji($str) || isHiragana($str) || isKatakana($str);
}
Community
Alix Axel
  • This leaves out English words that use diacritics. These are not used very often, but it's a tradeoff you should know about when making the choice :) – Thomas Winsnes May 18 '10 at 14:57
  • @Thomas.Winsnes: You mean stuff like `Hai`, `Wa`, `Ka`, `Arigatou` and so on, right? – Alix Axel May 18 '10 at 14:59
  • No, I mean English words like: naïve, café, résumé, soufflé etc. – Thomas Winsnes May 18 '10 at 15:19
  • @Thomas.Winsnes: Oh, I see. I never understood whether those are considered valid English words or not. Especially "café", which I've never seen / heard in either British or American English. – Alix Axel May 18 '10 at 15:24
  • I always write naïve with a diæresis, and diæresis with a æ. – Zorf May 19 '10 at 16:46
  • Characters in the 4E00-9FBF range are not necessarily Japanese kanji. This range is also used to encode Chinese and Korean usage of those characters, so this is not a reliable test for whether a text is Japanese. http://unicode.org/faq/han_cjk.html#4 – Paul Legato Jul 30 '12 at 02:39
  • @PaulLegato: Do you have any solution then? I can imagine n-gram analysis would work, but it would be so much more computationally expensive. – Alix Axel Jul 31 '12 at 23:00
  • @AlixAxel You can check for the kana Unicode ranges, given in the isHiragana() and isKatakana() functions above. Any text with either of those is almost definitely Japanese, and almost all Japanese text that isn't extremely short will have at least a few characters in those ranges. isJapanese() above should be written as just isHiragana($str) || isKatakana($str), since isKanji() will also return true for Chinese or (some) Korean text. – Paul Legato Aug 02 '12 at 22:38
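The kana-only test suggested in the last comment could be sketched as follows. This is a minimal illustration, not the answer author's code: the function name containsKana() is hypothetical, and the input is assumed to be valid UTF-8. The ranges are the Hiragana and Katakana ranges already used in the answer above.

```php
<?php
// Hypothetical helper: true if $str contains at least one hiragana or
// katakana character. Assumes $str is valid UTF-8.
function containsKana(string $str): bool {
    // U+3040-U+309F is Hiragana, U+30A0-U+30FF is Katakana.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $str) === 1;
}
```

A word written only in kanji (for example 漢字) is not matched by this check, so very short all-kanji input would still need the broader, less reliable kanji range or some other signal.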
22

This function checks whether a word contains at least one Japanese character (I found the Unicode ranges for Japanese characters on Wikipedia).

function isJapanese($word) {
    return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word);
}
Alexander Konstantinov
  • As per the comment above, characters in 4E00-9FBF are not limited to usage in Japanese, so this is not a reliable test. http://unicode.org/faq/han_cjk.html#4 – Paul Legato Jul 30 '12 at 02:40
  • Thanks Alexander for the good code. But, what does the \x do? – Trevor Wood Sep 22 '14 at 16:00
  • @TrevorW, a sequence like \x{4E00} specifies a Unicode character by its hexadecimal code point when the pattern is in UTF-8 mode (the /u modifier). See the PHP manual for more info: http://php.net/manual/en/regexp.reference.escape.php – Alexander Konstantinov Sep 23 '14 at 09:50
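To connect this answer back to the question, here is a rough sketch of how isJapanese() might be wired into process_word(). The isJapanese() body is copied from the answer (with an explicit boolean result); the returned labels 'japanese' and 'other' are placeholders invented for illustration, standing in for whatever processing each branch really needs.

```php
<?php
// Copied from the answer above; returns true if $word contains at least
// one character in the kanji, hiragana, or katakana ranges.
function isJapanese(string $word): bool {
    return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word) === 1;
}

// Sketch of the question's function. The return values are placeholders.
function process_word(string $word): string {
    if (isJapanese($word)) {
        // Japanese-specific processing would go here.
        return 'japanese';
    }
    // Everything else is treated as non-Japanese (English in the
    // question's two-language setting).
    return 'other';
}
```

Because of the kanji range, Chinese text also lands in the first branch, as noted in the comments.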
3

You could try Google's Translation API that has a detection function: http://code.google.com/apis/language/translate/v2/using_rest.html#detect-language

Erik Johansson
Alec
1

Try the mb_detect_encoding function: if the encoding is EUC-JP or UTF-8 / UTF-16, the text may be Japanese; otherwise, treat it as English. It is better if you can guarantee which encoding each language uses, since the UTF encodings can be used for many languages.
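A minimal sketch of that idea, assuming the mbstring extension is loaded and that the input is either plain ASCII or UTF-8. The wrapper name guessEncoding() and the 'unknown' fallback are hypothetical. Note that a result of 'UTF-8' only means the string contains non-ASCII bytes; it says nothing about which language they belong to.

```php
<?php
// Hypothetical wrapper around mb_detect_encoding() (requires mbstring).
// Returns 'ASCII', 'UTF-8', or 'unknown' if neither candidate fits.
function guessEncoding(string $str): string {
    // Third argument true = strict mode: the string must be fully valid
    // in the candidate encoding for that encoding to be reported.
    $encoding = mb_detect_encoding($str, ['ASCII', 'UTF-8'], true);
    return $encoding === false ? 'unknown' : $encoding;
}
```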

Benoit
0

English text usually consists only of ASCII characters (or, more precisely, characters in the ASCII range).

Messa
  • Although it's fairly easy to identify most words as being either English or Japanese, there are some characters that belong to both character sets. For example, a string containing only numbers should return true for both English and Japanese. – Jin Kim Jun 07 '10 at 16:57
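The ASCII-range test, together with the ambiguity raised in the comment, might look like this. Both function names are hypothetical. The patterns deliberately omit the /u modifier so that they are matched byte by byte, which works for any byte string.

```php
<?php
// Hypothetical helper: true if every byte of $str is in the ASCII range.
function isAscii(string $str): bool {
    // Any byte from 0x80 to 0xFF means the string is not pure ASCII.
    return preg_match('/[\x80-\xFF]/', $str) === 0;
}

// Hypothetical helper: true if $str contains at least one ASCII letter.
// Useful for telling "probably English" apart from digits or punctuation.
function hasAsciiLetter(string $str): bool {
    return preg_match('/[A-Za-z]/', $str) === 1;
}
```

A string such as "12345" passes isAscii() but fails hasAsciiLetter(), so it can be treated as language-neutral rather than forced into the English branch.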
0

You can try to convert the charset and check if it succeeds.

Take a look at iconv: http://www.php.net/manual/en/function.iconv.php

If you can convert a string to ISO-8859-1, it might be English; if you can convert it to ISO-2022-JP, it is probably Japanese (I might be wrong about the exact charsets, so you should look them up).
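A rough sketch of this approach, assuming the input is UTF-8 and that the installed iconv implementation returns false for characters it cannot convert (the behaviour described in the PHP manual; some iconv implementations substitute a placeholder character instead, in which case this check will not work). The function names and the labels 'latin', 'japanese', and 'unknown' are hypothetical.

```php
<?php
// Hypothetical helper: true if the UTF-8 string $str converts to $charset.
function convertsTo(string $str, string $charset): bool {
    // @ suppresses the notice iconv raises on an unconvertible character.
    return @iconv('UTF-8', $charset, $str) !== false;
}

// Order matters: plain ASCII converts to both charsets, so the Latin
// check has to come first.
function guessByCharset(string $str): string {
    if (convertsTo($str, 'ISO-8859-1')) {
        return 'latin';
    }
    if (convertsTo($str, 'ISO-2022-JP')) {
        return 'japanese';
    }
    return 'unknown';
}
```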