1

I want to categorize Japanese words based on their first character. For example:

Group 1: all Japanese words that start with

あa いi うu えe おo
かka きki くku けke こko

Group 2: all Japanese words that start with

さsa しshi すsu せse そso
たta ちchi つtsu てte とto

What makes this difficult is that Japanese uses several scripts (hiragana, katakana, and kanji), so the same word can be written in different ways while keeping the same meaning.

It might be possible if I could just convert hiragana, katakana, or kanji to romaji.

Can someone help me? Is this possible in PHP, or is there a WordPress plugin that can do this?

Markipe
  • If the word begins with hiragana or katakana, it's easy to group them. With kanji, however, you'll run into problems: 腹痛, for example, can be read as either ふくつう or はらいた. See [here](http://stackoverflow.com/q/5827439) for more information on this. – Dave Chen Jul 21 '14 at 05:08
  • I see, thanks for your insight, Dave. Yes, that's exactly my problem. KAKASI looks interesting, but how would I use it, and is it really applicable to my situation? One more detail: we're on a shared server, so I don't have full control over it. – Markipe Jul 21 '14 at 05:34
  • KAKASI is a bundled UNIX application, and since you mentioned "shared", I guess that's not an option. I recommend taking a look at [Yahoo's API](http://developer.yahoo.co.jp/webapi/jlp/furigana/v1/furigana.html). – Dave Chen Jul 21 '14 at 05:40

4 Answers

1

You could use a function with repeated lines of simple string replacements.

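function HiraganaToRomaji($hold) {
    // Digraphs such as きゃ are replaced before the single kana they contain (き),
    // so the order of these lines matters.
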
    $hold = str_replace("つ", "tsu", $hold);
    $hold = str_replace("きゃ", "kya", $hold);
    $hold = str_replace("きゅ", "kyu", $hold);
    $hold = str_replace("きょ", "kyo", $hold);
    $hold = str_replace("しゃ", "sha", $hold);
    $hold = str_replace("しゅ", "shu", $hold);
    $hold = str_replace("しょ", "sho", $hold);
    $hold = str_replace("し", "shi", $hold);

    $hold = str_replace("ちゃ", "cha", $hold);
    $hold = str_replace("ちゅ", "chu", $hold);
    $hold = str_replace("ちょ", "cho", $hold);
    $hold = str_replace("ち", "chi", $hold);
    $hold = str_replace("ひゃ", "hya", $hold);
    $hold = str_replace("ひゅ", "hyu", $hold);
    $hold = str_replace("ひょ", "hyo", $hold);
    $hold = str_replace("りゃ", "rya", $hold);
    $hold = str_replace("りゅ", "ryu", $hold);
    $hold = str_replace("りょ", "ryo", $hold);
    $hold = str_replace("ぎゃ", "gya", $hold);
    $hold = str_replace("ぎゅ", "gyu", $hold);
    $hold = str_replace("ぎょ", "gyo", $hold);
    $hold = str_replace("びゃ", "bya", $hold);
    $hold = str_replace("びゅ", "byu", $hold);
    $hold = str_replace("びょ", "byo", $hold);
    $hold = str_replace("ぴゃ", "pya", $hold);

    $hold = str_replace("ぴゅ", "pyu", $hold);
    $hold = str_replace("ぴょ", "pyo", $hold);
    $hold = str_replace("じゃ", "ja", $hold);
    $hold = str_replace("じゅ", "ju", $hold);
    $hold = str_replace("じょ", "jo", $hold);
    $hold = str_replace("ば", "ba", $hold);
    $hold = str_replace("だ", "da", $hold);
    $hold = str_replace("が", "ga", $hold);
    $hold = str_replace("は", "ha", $hold);
    $hold = str_replace("か", "ka", $hold);
    $hold = str_replace("ま", "ma", $hold);
    $hold = str_replace("ぱ", "pa", $hold);

    $hold = str_replace("ら", "ra", $hold);
    $hold = str_replace("さ", "sa", $hold);
    $hold = str_replace("た", "ta", $hold);
    $hold = str_replace("わ", "wa", $hold);
    $hold = str_replace("や", "ya", $hold);
    $hold = str_replace("ざ", "za", $hold);
    $hold = str_replace("な", "na", $hold);
    $hold = str_replace("あ", "a", $hold);

    $hold = str_replace("べ", "be", $hold);
    $hold = str_replace("で", "de", $hold);
    $hold = str_replace("げ", "ge", $hold);
    $hold = str_replace("へ", "he", $hold);
    $hold = str_replace("け", "ke", $hold);
    $hold = str_replace("め", "me", $hold);
    $hold = str_replace("ぺ", "pe", $hold);
    $hold = str_replace("れ", "re", $hold);
    $hold = str_replace("せ", "se", $hold);
    $hold = str_replace("て", "te", $hold);
    $hold = str_replace("ぜ", "ze", $hold);
    $hold = str_replace("ね", "ne", $hold);

    $hold = str_replace("え", "e", $hold);
    $hold = str_replace("び", "bi", $hold);
    $hold = str_replace("ぎ", "gi", $hold);
    $hold = str_replace("ひ", "hi", $hold);
    $hold = str_replace("き", "ki", $hold);
    $hold = str_replace("み", "mi", $hold);
    $hold = str_replace("ぴ", "pi", $hold);
    $hold = str_replace("り", "ri", $hold);
    $hold = str_replace("じ", "ji", $hold);
    $hold = str_replace("に", "ni", $hold);

    $hold = str_replace("い", "i", $hold);
    $hold = str_replace("ぼ", "bo", $hold);
    $hold = str_replace("ど", "do", $hold);
    $hold = str_replace("ご", "go", $hold);
    $hold = str_replace("ほ", "ho", $hold);
    $hold = str_replace("こ", "ko", $hold);

    $hold = str_replace("も", "mo", $hold);
    $hold = str_replace("ぽ", "po", $hold);
    $hold = str_replace("ろ", "ro", $hold);
    $hold = str_replace("そ", "so", $hold);
    $hold = str_replace("と", "to", $hold);
    $hold = str_replace("を", "wo", $hold);
    $hold = str_replace("よ", "yo", $hold);
    $hold = str_replace("ぞ", "zo", $hold);

    $hold = str_replace("の", "no", $hold);
    $hold = str_replace("お", "o", $hold);
    $hold = str_replace("ぶ", "bu", $hold);
    $hold = str_replace("ぐ", "gu", $hold);
    $hold = str_replace("ふ", "fu", $hold);
    $hold = str_replace("く", "ku", $hold);
    $hold = str_replace("む", "mu", $hold);
    $hold = str_replace("ぷ", "pu", $hold);
    $hold = str_replace("る", "ru", $hold);
    $hold = str_replace("す", "su", $hold);

    $hold = str_replace("ゆ", "yu", $hold);
    $hold = str_replace("ず", "zu", $hold);
    $hold = str_replace("ぬ", "nu", $hold);
    $hold = str_replace("う", "u", $hold);
    $hold = str_replace("ん", "n", $hold);

    $hold = preg_replace("/っ([a-z])/", "$1$1", $hold);

    return $hold;
}
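
The same idea can be written more compactly with a lookup table and strtr(), which tries the longest key first and does not rescan its own output, so the ordering of entries stops mattering. This is only a sketch: the function name hiraganaToRomajiTable is made up for illustration, and the table is deliberately partial and would need the remaining kana added.

```php
<?php
// Sketch only: hiraganaToRomajiTable is a hypothetical name, and the table
// below covers just a few kana for illustration.
function hiraganaToRomajiTable(string $text): string
{
    // strtr() with an array always prefers the longest matching key and never
    // re-replaces text it has already produced, so しゃ wins over し.
    $table = [
        'しゃ' => 'sha', 'しゅ' => 'shu', 'しょ' => 'sho',
        'きゃ' => 'kya', 'きゅ' => 'kyu', 'きょ' => 'kyo',
        'し' => 'shi', 'き' => 'ki', 'か' => 'ka', 'さ' => 'sa',
        'ん' => 'n', 'あ' => 'a', 'い' => 'i',
    ];
    $romaji = strtr($text, $table);

    // Double the consonant that follows a small っ, as in the function above.
    // Fall back to the unmodified string if preg_replace() reports an error.
    return preg_replace('/っ([a-z])/', '$1$1', $romaji) ?? $romaji;
}

echo hiraganaToRomajiTable('しゃしん'), "\n"; // shashin
echo hiraganaToRomajiTable('きっさ'), "\n";   // kissa
```

Kana that are missing from the table are left untouched, which makes gaps easy to spot in the output.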
Makyen
Kamil Dąbrowski
0

Here's a JavaScript version for hiragana (note that it converts from romaji to hiragana). You could port it to PHP using preg_replace.

You could also make a katakana version this way. However, Kanji would be a lot more complicated.

function romajiToHiragana(hold) {
    hold = hold.replace(/tsu/g, "つ"); 
    hold = hold.replace(/kya/g, "きゃ");
    hold = hold.replace(/kyu/g, "きゅ");
    hold = hold.replace(/kyo/g, "きょ");
    hold = hold.replace(/sha/g, "しゃ");
    hold = hold.replace(/shi/g, "し"); 
    hold = hold.replace(/shu/g, "しゅ"); 
    hold = hold.replace(/sho/g, "しょ");
    hold = hold.replace(/chi/g, "ち"); 
    hold = hold.replace(/cha/g, "ちゃ"); 
    hold = hold.replace(/chu/g, "ちゅ"); 
    hold = hold.replace(/cho/g, "ちょ"); 
    hold = hold.replace(/hya/g, "ひゃ");
    hold = hold.replace(/hyu/g, "ひゅ");
    hold = hold.replace(/hyo/g, "ひょ");
    hold = hold.replace(/rya/g, "りゃ");
    hold = hold.replace(/ryu/g, "りゅ");
    hold = hold.replace(/ryo/g, "りょ");
    hold = hold.replace(/gya/g, "ぎゃ");
    hold = hold.replace(/gyu/g, "ぎゅ");
    hold = hold.replace(/gyo/g, "ぎょ");
    hold = hold.replace(/bya/g, "びゃ");
    hold = hold.replace(/byu/g, "びゅ");
    hold = hold.replace(/byo/g, "びょ");
    hold = hold.replace(/pya/g, "ぴゃ");
    hold = hold.replace(/pyu/g, "ぴゅ");
    hold = hold.replace(/pyo/g, "ぴょ");
    hold = hold.replace(/ja/g, "じゃ");
    hold = hold.replace(/ju/g, "じゅ");
    hold = hold.replace(/jo/g, "じょ");
    hold = hold.replace(/ba/g, "ば"); 
    hold = hold.replace(/da/g, "だ"); 
    hold = hold.replace(/ga/g, "が"); 
    hold = hold.replace(/ha/g, "は"); 
    hold = hold.replace(/ka/g, "か"); 
    hold = hold.replace(/ma/g, "ま"); 
    hold = hold.replace(/pa/g, "ぱ"); 
    hold = hold.replace(/ra/g, "ら"); 
    hold = hold.replace(/sa/g, "さ"); 
    hold = hold.replace(/ta/g, "た"); 
    hold = hold.replace(/wa/g, "わ"); 
    hold = hold.replace(/ya/g, "や"); 
    hold = hold.replace(/za/g, "ざ");
    hold = hold.replace(/na/g, "な"); 
    hold = hold.replace(/a/g, "あ"); 
    hold = hold.replace(/be/g, "べ"); 
    hold = hold.replace(/de/g, "で"); 
    hold = hold.replace(/ge/g, "げ"); 
    hold = hold.replace(/he/g, "へ"); 
    hold = hold.replace(/ke/g, "け"); 
    hold = hold.replace(/me/g, "め"); 
    hold = hold.replace(/pe/g, "ぺ"); 
    hold = hold.replace(/re/g, "れ"); 
    hold = hold.replace(/se/g, "せ"); 
    hold = hold.replace(/te/g, "て"); 
    hold = hold.replace(/ze/g, "ぜ"); 
    hold = hold.replace(/ne/g, "ね");
    hold = hold.replace(/e/g, "え");
    hold = hold.replace(/bi/g, "び"); 
    hold = hold.replace(/gi/g, "ぎ"); 
    hold = hold.replace(/hi/g, "ひ"); 
    hold = hold.replace(/ki/g, "き"); 
    hold = hold.replace(/mi/g, "み"); 
    hold = hold.replace(/pi/g, "ぴ"); 
    hold = hold.replace(/ri/g, "り"); 
    hold = hold.replace(/ji/g, "じ"); 
    hold = hold.replace(/ni/g, "に"); 
    hold = hold.replace(/i/g, "い");
    hold = hold.replace(/bo/g, "ぼ"); 
    hold = hold.replace(/do/g, "ど"); 
    hold = hold.replace(/go/g, "ご"); 
    hold = hold.replace(/ho/g, "ほ"); 
    hold = hold.replace(/ko/g, "こ"); 
    hold = hold.replace(/mo/g, "も"); 
    hold = hold.replace(/po/g, "ぽ"); 
    hold = hold.replace(/ro/g, "ろ"); 
    hold = hold.replace(/so/g, "そ"); 
    hold = hold.replace(/to/g, "と"); 
    hold = hold.replace(/wo/g, "を"); 
    hold = hold.replace(/yo/g, "よ"); 
    hold = hold.replace(/zo/g, "ぞ"); 
    hold = hold.replace(/no/g, "の");
    hold = hold.replace(/o/g, "お"); 
    hold = hold.replace(/bu/g, "ぶ"); 
    hold = hold.replace(/gu/g, "ぐ"); 
    hold = hold.replace(/fu/g, "ふ"); 
    hold = hold.replace(/ku/g, "く"); 
    hold = hold.replace(/mu/g, "む"); 
    hold = hold.replace(/pu/g, "ぷ"); 
    hold = hold.replace(/ru/g, "る"); 
    hold = hold.replace(/su/g, "す"); 
    hold = hold.replace(/yu/g, "ゆ"); 
    hold = hold.replace(/zu/g, "ず");
    hold = hold.replace(/nu/g, "ぬ");
    hold = hold.replace(/u/g, "う");
    hold = hold.replace(/n/g, "ん");
    return hold;
}
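
For the katakana case mentioned above, one option in PHP is to skip a second table and normalize katakana to hiragana first. The sketch below assumes the mbstring extension is enabled; katakanaToHiragana is a name chosen here for illustration.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
// katakanaToHiragana is a hypothetical helper name.
function katakanaToHiragana(string $text): string
{
    // Mode "c" converts full-width katakana to full-width hiragana;
    // characters that are not full-width katakana pass through unchanged.
    return mb_convert_kana($text, 'c', 'UTF-8');
}

echo katakanaToHiragana('カタカナ'), "\n"; // かたかな
echo katakanaToHiragana('ハシル'), "\n";   // はしる
```

The result can then go through whichever hiragana table you already have.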
Trevor Wood
0

This Gist I put together is a JavaScript solution (rather than PHP). It only handles hiragana and katakana to romaji, and it doesn't cover all edge cases, but you might still find it useful as a starting point: https://gist.github.com/Venryx/ecbea1a0c7a8a6cb21d80886488045f1

Venryx
0

You could use a morphological analyzer such as https://github.com/siahr/igo-php

You could then grab each word's katakana-only conversion from igo's output:

require_once 'Igo.php';

// '/unidic' is the path to the proprietary dictionary folder *** 
$igo = new Igo("/unidic", "UTF-8");
$parts = $igo->parse($wordString);

function getValueAtIndex($str, $i) {
  $val = explode(',', $str);
  if (isset($val[$i])) {
    return $val[$i];
  }
  return null;
}

$katakanaWords = [];

foreach ($parts as $part) {
  // Index 11 of the comma-separated feature string is where this answer reads the katakana form.
  $katakanaWords[] = getValueAtIndex($part->feature, 11);
}

var_dump($katakanaWords);

*** You can take a dictionary (https://sourceforge.net/projects/mecab/files/mecab-ipadic/2.7.0-20070801/) and convert it to a format igo can read (such as the /unidic folder above) using igo's own converter (https://osdn.net/projects/igo/downloads/55029/igo-0.4.5.jar/). With Java on the command line, you could run the following:

java -cp igo-0.4.5.jar net.reduls.igo.bin.BuildDic ipadic mecab-ipadic-2.7.0-20070801 EUC-JP

where mecab-ipadic-2.7.0-20070801 is the folder containing the *.csv and *.def files. This outputs a folder containing the dictionary in a format igo can read.

The PHP code above should output the following if $wordString is "走っていた":

array(4) { [0]=> string(9) "ハシル" [1]=> string(3) "テ" [2]=> string(6) "イル" [3]=> string(3) "タ" } 

You could then use the other answers here to convert katakana to romaji.
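
If the goal is only the grouping from the question, a full romaji conversion may not be needed: the first kana of the reading is enough. The following is a rough sketch under a few assumptions: the mbstring extension is enabled, the group table contains only the rows listed in the question (voiced kana such as が are not assigned to any group), and the names KANA_GROUPS and kanaGroup are invented for illustration.

```php
<?php
// Sketch only. Hypothetical group table copied from the question:
// group 1 = あ and か rows, group 2 = さ and た rows.
const KANA_GROUPS = [
    1 => 'あいうえおかきくけこ',
    2 => 'さしすせそたちつてと',
];

// Returns the group number for a kana reading, or null if its first
// character is not listed in any group (or the reading is empty).
function kanaGroup(string $reading): ?int
{
    // Normalize full-width katakana to hiragana so both scripts share one table.
    $hiragana = mb_convert_kana($reading, 'c', 'UTF-8');
    $first = mb_substr($hiragana, 0, 1, 'UTF-8');
    if ($first === '') {
        return null;
    }
    foreach (KANA_GROUPS as $group => $chars) {
        if (mb_strpos($chars, $first, 0, 'UTF-8') !== false) {
            return $group;
        }
    }
    return null;
}

// Using the katakana readings from the example output above.
foreach (['ハシル', 'テ', 'イル', 'タ'] as $word) {
    echo $word, ' => ', var_export(kanaGroup($word), true), "\n";
}
```

With the table as written, テ and タ fall into group 2 and イル into group 1, while ハシル returns null because the は row is not part of either group in the question.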

ryelin