15

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji 
上 - kanji 
記 - kanji 
こ - kana 
れ - kana
ら - kana
の - kana
フ - kana

(Sorry if I did it incorrectly.)

Evgeniy Berezovsky
alex2k8
  • 2
    What exactly do you mean by 'split'? – Crag Sep 30 '10 at 00:42
  • Updated question to make the goal more clear. – alex2k8 Sep 30 '10 at 00:48
  • 2
    Hieroglyphs are an ancient Egyptian form of text. They have nothing to do with modern Japanese forms of text. – Stephen C Sep 30 '10 at 00:57
  • @alex2k8 - we understand what you mean by "splitting" now. But that is not what splitting means. What you are really trying to do is "classify" Japanese characters (not hieroglyphs!!) as either Kanji or Kana. (Splitting implies putting the characters into different piles / collections ...) – Stephen C Sep 30 '10 at 01:02
  • @alex2k8: you want to differentiate between "ideograms" and "syllabaries" (? plural of "syllabary"). A kana is a syllabary: both katakana and hiragana are syllabaries. As Stephen C said (+1), hieroglyphs have nothing to do here ;) It really shouldn't be hard for there are only about 60 hiragana and 60 katakana (I used to know them ;) – SyntaxT3rr0r Sep 30 '10 at 05:31
  • Thank you all for clarifications. – alex2k8 Sep 30 '10 at 11:53

5 Answers

34

This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION

But, as always, the devil is in the details:

Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS

where Ａ is the full-width character, so it falls in the same block as the half-width katakana ﾌ above. Note that the full-width Ａ is different from the normal (half-width) A:

Character.UnicodeBlock.of('A') == BASIC_LATIN
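To produce the per-character listing from the question, the block lookup can be wrapped in a small helper. This is only a sketch: the class name BlockClassifier, the method classify, and the labels "kana" / "kanji" / "other" are made up for illustration, and treating U+FF66 to U+FF9F as half-width kana is an assumption about what should count as kana.

```java
import java.lang.Character.UnicodeBlock;

public class BlockClassifier {

    // Hypothetical helper: labels a single UTF-16 char as "kana", "kanji", or "other".
    public static String classify(char c) {
        UnicodeBlock block = UnicodeBlock.of(c);
        if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
            return "kana";
        }
        // Half-width katakana share HALFWIDTH_AND_FULLWIDTH_FORMS with full-width
        // Latin letters, so the block alone is not enough; narrow it by code unit.
        // Assumption: U+FF66..U+FF9F is the half-width katakana range of interest.
        if (block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS && c >= 0xFF66 && c <= 0xFF9F) {
            return "kana";
        }
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return "kanji";
        }
        return "other";
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        for (char c : text.toCharArray()) {
            System.out.println(c + " - " + classify(c));
        }
    }
}
```

Note that this only looks at individual chars, so ideographs outside the Basic Multilingual Plane (which need surrogate pairs) would come out as "other".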
Evgeniy Berezovsky
Josh Lee
14

Use a table like this one to determine which Unicode ranges are used for hiragana, katakana and kanji; then you can simply cast the character to an int and check which range it falls in, something like

int val = (int) 'て';
if (val >= 0x3040 && val <= 0x309f)
  return HIRAGANA;
..
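A fuller version of the range check might look like the sketch below. The class RangeClassifier and the enum Kind are hypothetical names; the ranges are the Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks, and it assumes you only care about those three blocks.

```java
public class RangeClassifier {

    // Hypothetical result type for illustration.
    public enum Kind { HIRAGANA, KATAKANA, KANJI, OTHER }

    // Classifies one UTF-16 char by comparing it against fixed block ranges.
    public static Kind classify(char c) {
        int val = c;
        if (val >= 0x3040 && val <= 0x309F) return Kind.HIRAGANA;
        if (val >= 0x30A0 && val <= 0x30FF) return Kind.KATAKANA;
        if (val >= 0x4E00 && val <= 0x9FFF) return Kind.KANJI;
        // Half-width katakana, CJK extension blocks, Latin, punctuation, etc.
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        for (char c : text.toCharArray()) {
            System.out.println(c + " - " + classify(c));
        }
    }
}
```

Anything outside the three ranges, including half-width katakana and the CJK extension blocks, is reported as OTHER, so add further ranges if your text needs them.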
Jack
6

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:

public class JapaneseCharMatchers {
  public static final CharMatcher HIRAGANA = 
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  public static final CharMatcher KATAKANA = 
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  public static final CharMatcher KANJI = 
      CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

This gives you a lot of power for working with Japanese text: the rules for deciding whether a character belongs to one of these groups live in an object that can do a lot of useful things itself, and that can also be used with other APIs such as Guava's Splitter class.

Edit:

Based on jleedev's answer, you could also write a method like:

public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    public boolean matches(char c) {
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:

CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.

Community
ColinD
  • Right, if you only want to test for membership in a specific range, it might be faster to do it yourself. Surprisingly, the UnicodeBlock class doesn’t have a method to test a character for membership, and it seems the only way is its static `of` method, which loops through every block until it finds one. – Josh Lee Sep 30 '10 at 03:10
4

You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.

Peter Mortensen
mP.
  • 1
    Well, in Unicode Kanji has a range of U+4E00 to U+9FBF, Katakana has a range of U+30A0 to U+30FF and Hiragana has a range of U+3040 to U+309F. With that 'splitting' text should be easy, depending on what splitting actually is. – Crag Sep 30 '10 at 00:45
  • This isn't as easy as it sounds, because there are multiple ranges for each. – Noah Jun 10 '12 at 14:21
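One way to sidestep the multiple-ranges problem in Java 7 and later is to classify by Unicode script rather than by block, since a script such as HAN spans all of the ideograph blocks. The sketch below is not from any of the answers; the class name ScriptClassifier and the "kana" / "kanji" / "other" labels are illustrative only.

```java
import java.lang.Character.UnicodeScript;

public class ScriptClassifier {

    // Hypothetical helper: labels a code point by its Unicode script.
    public static String classify(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HIRAGANA:
            case KATAKANA:
                return "kana";   // includes half-width katakana letters
            case HAN:
                return "kanji";  // includes the CJK extension blocks
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        // Iterate by code point so characters outside the BMP are handled too.
        text.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```

Be aware that some characters used with kana, such as the prolonged sound mark ー (U+30FC), are assigned to the COMMON script as far as I know, so they would be reported as "other" here.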
-1

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:

Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.

Function KanjiKanaBreakdown(ByVal text As String) As String

' Note: Asc returns a code in the system code page, not a Unicode code point.
' The bounds below appear to assume a Japanese (Shift-JIS) locale.
Application.ScreenUpdating = False
Dim ch As String
Dim code As Long
Dim result As String
Dim i As Long

For i = 1 To Len(text)
    ch = Mid$(text, i, 1)
    code = Asc(ch)
    If code > -30562 And code < -950 Then
        result = result & ch & " - kanji" & vbLf
    Else
        ' Everything outside the kanji range is labeled kana,
        ' including Latin letters and punctuation.
        result = result & ch & " - kana" & vbLf
    End If
Next

KanjiKanaBreakdown = result
Application.ScreenUpdating = True

End Function
Gaijinhunter