Which Lucene analyzer can be used to handle Japanese text?

Question

Which Lucene analyzer can handle Japanese text properly? It should be able to handle Kanji, Hiragana, Katakana, Romaji, and any combination of them.

score 4 · Answer 1 · answered Oct 26 '09 at 14:33

You should probably look at the CJK package that is in the contrib area of Lucene. There is an analyzer and a tokenizer specifically for dealing with Chinese, Japanese, and Korean.

adrianbanks

The CJK Analyzer seems to be a naive way of searching, and in my previous experience it does not provide very relevant search results. Is there anything specific I need to do to make the CJK Analyzer work better, such as modifying some weights? Thanks – Franz See Dec 24 '09 at 05:40
I've never used the CJK analyzer myself, so I cannot say. You could try asking on the Lucene mailing list (http://lucene.apache.org/java/docs/mailinglists.html#Java User List) for more specific help; there are people on that list who are very experienced with Lucene. – adrianbanks Dec 24 '09 at 09:52

score 3 · Accepted Answer · answered Oct 18 '11 at 04:54

I found lucene-gosen while searching for my own purposes.

Their example looks fairly decent, but I guess it's the kind of thing that needs extensive testing. I'm also worried about their backwards-compatibility policy (or rather, the complete lack of one).

Hakanai

We didn't use lucene-gosen, but we did use gosen, so I'm accepting this answer (it's close enough and the project does look interesting). CJK does very naive searching: it just matches characters rather than words, unlike gosen, which uses a dictionary for proper parsing. – Franz See Jan 03 '12 at 07:58
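
To make the "matches characters and not words" point concrete, here is a minimal sketch of overlapping two-character (bigram) tokenization, which is the general idea behind the CJK analyzer's output for CJK text. It is an illustration only and does not use Lucene; the class name `BigramSketch` and the method `bigrams` are hypothetical. The sample string is 東京都 (Tokyo Metropolis), written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or gosen.
class BigramSketch {

    // Split text into overlapping two-character tokens (bigrams).
    // Works on code points so surrogate pairs are not cut in half.
    // A single-character input is returned as one token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        if (cps.length == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Three kanji: "Tokyo-to" (Tokyo Metropolis).
        String text = "\u6771\u4EAC\u90FD";
        // Yields two tokens: the first two kanji and the last two kanji.
        System.out.println(bigrams(text));
    }
}
```

The two tokens produced are 東京 (Tokyo) and 京都 (Kyoto), so under this scheme a query for Kyoto would also match text about Tokyo Metropolis. A dictionary-based tokenizer such as gosen avoids this kind of false match by segmenting the text into actual words.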
