8

Which Lucene analyzer can be used to handle Japanese text properly? It should be able to handle Kanji, Hiragana, Katakana, Romaji, and any combination of them.

Makoto
Franz See

2 Answers

4

You should probably look at the CJK package that is in the contrib area of Lucene. There is an analyzer and a tokenizer specifically for dealing with Chinese, Japanese, and Korean.
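
To give a feel for what that analyzer does: as far as I know, it indexes runs of CJK characters as overlapping two-character tokens (bigrams) rather than as dictionary words. The snippet below is a plain-Java sketch of that idea only. It does not call Lucene, and `CjkBigramSketch` and `bigrams` are hypothetical names made up for illustration. It also skips details the real analyzer handles, such as Latin text and single-character runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class CjkBigramSketch {

    // Split text into overlapping pairs of code points (bigrams).
    // Assumption: the input is a single run of CJK characters.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] codePoints = text.codePoints().toArray();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [東京, 京都]
        System.out.println(bigrams("東京都"));
    }
}
```

Note that 東京都 (Tokyo Metropolis) yields the token 京都 (Kyoto), so under this scheme a query for Kyoto would also match text about Tokyo.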

adrianbanks
  • The CJK Analyzer seems to take a naive approach to searching and, in my previous experience, does not return very relevant results. Is there anything specific I need to do to make the CJK Analyzer work better, such as modifying some weights? Thanks – Franz See Dec 24 '09 at 05:40
  • I've never used the CJK analyzer myself so cannot say. You could try asking on the Lucene mailing list (http://lucene.apache.org/java/docs/mailinglists.html#Java User List) for more specific help - there are people who are very experienced with Lucene on that list. – adrianbanks Dec 24 '09 at 09:52
3

I found lucene-gosen while doing a search for my own purposes.

Their example looks fairly decent, but I guess it's the kind of thing that needs extensive testing. I'm also worried about their backwards-compatibility policy (or rather, the complete lack of one).

Hakanai
  • 1
    We didn't use lucene-gosen, but we did use gosen, so I'm accepting this answer (it's close enough, and the project does look interesting). CJK searches very naively: it just matches characters rather than words, unlike gosen, which uses a dictionary for proper parsing. – Franz See Jan 03 '12 at 07:58
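
The difference between matching characters and matching dictionary words can be sketched roughly as follows. This is not gosen's API or algorithm. It is a greedy longest-match segmenter over a tiny made-up dictionary, used here as a simplified stand-in for dictionary-based parsing. The class name, method name, and dictionary entries are all assumptions for illustration, and real morphological analyzers are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of gosen or Lucene.
public class DictionarySegmentSketch {

    // Greedy longest-match: at each position, take the longest dictionary
    // word of at most maxLen chars; fall back to a single character.
    // Assumption: text contains only BMP characters (one char each).
    static List<String> segment(String text, Set<String> dictionary, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = null;
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary, made up for this example.
        Set<String> dictionary = Set.of("東京都", "京都", "に", "住む");
        // prints [東京都, に, 住む]
        System.out.println(segment("東京都に住む", dictionary, 3));
    }
}
```

Here 東京都 stays one token, so a query for 京都 (Kyoto) would not match it, whereas character-level matching would find 京都 inside 東京都.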