How to classify Japanese characters as either kanji or kana?

Question

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji 
上 - kanji 
記 - kanji 
こ - kana 
れ - kana
ら - kana
の - kana
フ - kana

(Sorry if I did it incorrectly.)

Hieroglyphs are an ancient Egyptian form of text. They have nothing to do with modern Japanese forms of text. — Stephen C, Sep 30 '10 at 00:57
@alex2k8 - we understand what you mean by "splitting" now. But that is not what splitting means. What you are really trying to do is "classify" Japanese characters (not hieroglyphs!!) as either Kanji or Kana. (Splitting implies putting the characters into different piles / collections ...) — Stephen C, Sep 30 '10 at 01:02
@alex2k8: you want to differentiate between "ideograms" and "syllabaries" (? plural of "syllabary"). A kana is a syllabary: both katakana and hiragana are syllabaries. As Stephen C said (+1), hieroglyphs have nothing to do here ;) It really shouldn't be hard for there are only about 60 hiragana and 60 katakana (I used to know them ;) — SyntaxT3rr0r, Sep 30 '10 at 05:31

score 34 · Accepted Answer · edited Nov 05 '13 at 04:44


This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION

But, as always, the devil is in the details:

Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS

where Ａ is the full-width character. So this is in the same category as the halfwidth Katakana ﾌ above. Note that the full-width Ａ is different from the normal (half-width) A:

Character.UnicodeBlock.of('A') == BASIC_LATIN
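
As a rough sketch of how the block lookup could produce the per-character listing from the question: the class name `KanaKanjiClassifier` and the "other" label are made up for illustration, and the constants are written out as `Character.UnicodeBlock.X` (via an import of the nested class), since the bare names used above are not in scope on their own.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
final class KanaKanjiClassifier {

  /** Returns "kanji", "kana", or "other" for a single Unicode code point. */
  static String classify(int codePoint) {
    // of() returns null for code points outside any block; == handles that safely.
    UnicodeBlock block = UnicodeBlock.of(codePoint);
    if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
      return "kana";
    }
    if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
      return "kanji";
    }
    // Assumption: everything else (Latin, punctuation, half-width forms) is "other".
    return "other";
  }

  /** Builds one "X - label" line per code point of the input. */
  static List<String> classifyAll(String text) {
    List<String> lines = new ArrayList<>();
    text.codePoints().forEach(cp ->
        lines.add(new String(Character.toChars(cp)) + " - " + classify(cp)));
    return lines;
  }

  public static void main(String[] args) {
    classifyAll("誰か確認上記これらのフ").forEach(System.out::println);
  }
}
```

Note that in this sketch the half-width ﾌ ends up as "other", because its block is HALFWIDTH_AND_FULLWIDTH_FORMS; if that matters for your text, you would need an extra rule for it.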

edited Nov 05 '13 at 04:44

Evgeniy Berezovsky


answered Sep 30 '10 at 02:04

Josh Lee


Interesting, I didn't know about that. – ColinD Sep 30 '10 at 02:09
but CJK_UNIFIED_IDEOGRAPHS isn't found by default, I assume an additional import statement is needed, beyond that needed for Character. – Noah Aug 19 '12 at 06:41
Didn't even know this was a function! Thanks! – Kenny Cason Jan 01 '15 at 04:12

score 14 · Answer 2 · answered Sep 30 '10 at 00:48


Use a table like this one to determine which Unicode values are used for hiragana, katakana and kanji. You can then cast the character to an int and check which range it falls in, something like:

int val = (int) 'て';  // U+3066
if (val >= 0x3040 && val <= 0x309f)  // Hiragana block
  return HIRAGANA;
..
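
To make the range idea concrete, here is a minimal sketch that covers three blocks. It assumes the standard block boundaries (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF); `KanaRanges`, `Kind` and `kindOf` are names invented for this example.

```java
// Hypothetical helper illustrating a plain range check on the char value.
final class KanaRanges {

  enum Kind { HIRAGANA, KATAKANA, KANJI, OTHER }

  static Kind kindOf(char c) {
    int val = c; // a char widens to its UTF-16 code unit value
    if (val >= 0x3040 && val <= 0x309F) return Kind.HIRAGANA;
    if (val >= 0x30A0 && val <= 0x30FF) return Kind.KATAKANA;
    if (val >= 0x4E00 && val <= 0x9FFF) return Kind.KANJI;
    // Assumption: half-width kana, rarer kanji in extension blocks, etc. are not handled here.
    return Kind.OTHER;
  }

  public static void main(String[] args) {
    for (char c : "誰か確認上記これらのフ".toCharArray()) {
      System.out.println(c + " - " + kindOf(c));
    }
  }
}
```

This only looks at single `char` values, so characters outside the Basic Multilingual Plane would be classified as OTHER.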

answered Sep 30 '10 at 00:48

Jack


Note that jleedev has essentially the same method, but using a table provided by the JVM. – MSalters Sep 30 '10 at 11:40

score 6 · Answer 3 · edited Jun 20 '20 at 09:12

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:

import com.google.common.base.CharMatcher;

public class JapaneseCharMatchers {
  public static final CharMatcher HIRAGANA = 
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  public static final CharMatcher KATAKANA = 
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  // CJK Unified Ideographs block: U+4E00 to U+9FFF
  public static final CharMatcher KANJI = 
      CharMatcher.inRange((char) 0x4e00, (char) 0x9fff);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

Defining these rules as CharMatcher objects gives you a lot of power for working with Japanese text: each matcher can do many useful things on its own, and it can also be passed to other APIs such as Guava's Splitter class.

Edit:

Based on jleedev's answer, you could also write a method like:

public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    public boolean matches(char c) {
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:

CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.
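
If you would rather not depend on Guava, roughly the same idea can be sketched with the standard library. This is only an illustration under that assumption: `BlockFilters` and `retain` are hypothetical names, and `retain` is meant to behave like `retainFrom` but on code points instead of chars.

```java
import java.lang.Character.UnicodeBlock;
import java.util.function.IntPredicate;

// Hypothetical stdlib-only counterpart to the CharMatcher approach.
final class BlockFilters {

  /** Predicate that is true for code points in the given Unicode block. */
  static IntPredicate inUnicodeBlock(UnicodeBlock block) {
    return cp -> UnicodeBlock.of(cp) == block;
  }

  /** Keeps only the code points of text accepted by the predicate. */
  static String retain(String text, IntPredicate keep) {
    return text.codePoints()
        .filter(keep)
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();
  }

  public static void main(String[] args) {
    IntPredicate hiragana = inUnicodeBlock(UnicodeBlock.HIRAGANA);
    IntPredicate kana = hiragana.or(inUnicodeBlock(UnicodeBlock.KATAKANA));
    String text = "誰か確認上記これらのフ";
    System.out.println("Hiragana: " + retain(text, hiragana));
    System.out.println("Kana: " + retain(text, kana));
  }
}
```

Predicates compose with `or`, `and` and `negate`, much like the `HIRAGANA.or(KATAKANA)` matcher above.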

Right, if you only want to test for membership in a specific range, it might be faster to do it yourself. Surprisingly, the UnicodeBlock class doesn’t have a method to test a character for membership, and it seems the only way is its static `of` method, which loops through every block until it finds one. — Josh Lee, Sep 30 '10 at 03:10

score 4 · Answer 4 · edited Jun 09 '12 at 05:36


You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.

edited Jun 09 '12 at 05:36

Peter Mortensen


answered Sep 30 '10 at 00:42

mP.


Well, in Unicode Kanji has a range of U+4E00 to U+9FBF, Katakana has a range of U+30A0 to U+30FF and Hiragana has a range of U+3040 to U+309F. With that 'splitting' text should be easy, depending on what splitting actually is. – Crag Sep 30 '10 at 00:45
This isn't as easy as it sounds, because there are multiple ranges for each. – Noah Jun 10 '12 at 14:21
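
One possible way around the multiple-ranges problem, assuming Java 7 or later, is to classify by Unicode script rather than by block: `Character.UnicodeScript.of` reports HAN for kanji in any of the CJK ideograph blocks and KATAKANA for half-width katakana as well. The sketch below is illustrative only; `ScriptClassifier` is a made-up name.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper that classifies by script instead of by block.
final class ScriptClassifier {

  /** Returns "kanji", "kana", or "other" for a single Unicode code point. */
  static String classify(int codePoint) {
    switch (UnicodeScript.of(codePoint)) {
      case HAN:
        return "kanji";
      case HIRAGANA:
      case KATAKANA:
        return "kana";
      default:
        // Latin, punctuation, and characters shared between scripts land here.
        return "other";
    }
  }

  public static void main(String[] args) {
    "誰か確認上記これらのフﾌ".codePoints().forEach(cp ->
        System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
  }
}
```

Some marks used with kana are assigned to a shared script rather than to Hiragana or Katakana, so they would come out as "other" here; whether that is acceptable depends on your text.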

score -1 · Answer 5 · answered Jun 30 '11 at 03:59

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:

Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.

Function KanjiKanaBreakdown(ByVal text As String) As String

Application.ScreenUpdating = False
Dim kanjiCode As Long
Dim result As String
Dim i As Long

For i = 1 To Len(text)
    If Asc(Mid$(text, i, 1)) > -30562 And Asc(Mid$(text, i, 1)) < -950 Then
        result = (result & (Mid$(text, i, 1)) & (" - kanji") & vbLf)
    Else
        result = (result & (Mid$(text, i, 1)) & (" - kana") & vbLf)
    End If
Next

KanjiKanaBreakdown = result
Application.ScreenUpdating = True

End Function
