
I have implemented Lucene for my application and it works very well unless the text contains something like Japanese characters.

The problem is that if I have the Japanese string こんにちは、このバイネイです and I search with こ, the first character, it works well, whereas if I use more than one Japanese character (こんにち) in the search token, the search fails and no document is found.

Are Japanese characters supported in Lucene? What settings are needed to get this working?
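Roughly what I am doing, simplified (shown against the Java Lucene API for illustration; my actual code is on the .NET port, and StandardAnalyzer is what I currently have configured):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class JapaneseSearchRepro {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        RAMDirectory dir = new RAMDirectory();

        // Index a single document containing the Japanese string.
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "こんにちは、このバイネイです",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        // Single character: the document is found.
        System.out.println(searcher.search(parser.parse("こ"), 10).totalHits);
        // More than one character: no document is found in my setup.
        System.out.println(searcher.search(parser.parse("こんにち"), 10).totalHits);
        searcher.close();
    }
}
```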

Pranali Desai

3 Answers


Lucene's built-in analyzers do not support Japanese.

You need to install an analyzer such as Sen, which is a Java port of MeCab, a quite popular and fast Japanese analyzer.

There are two types to consider:

  1. CJKAnalyzer, which supports Chinese and Korean too, using the bi-gram method (see the sketch below)
  2. JapaneseAnalyzer, which supports only Japanese, using a morphological analyzer, and is supposed to be very fast
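To illustrate, a minimal sketch of indexing and searching with CJKAnalyzer in a Lucene 3.0-era setup (I have not run this exact code; constructor signatures changed across 3.x, and in 3.1+ CJKAnalyzer takes a Version argument):

```java
import org.apache.lucene.analysis.cjk.CJKAnalyzer; // from the contrib analyzers jar
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CjkBigramExample {
    public static void main(String[] args) throws Exception {
        // CJKAnalyzer indexes CJK text as overlapping bi-grams,
        // e.g. こんにち -> こん, んに, にち
        CJKAnalyzer analyzer = new CJKAnalyzer(); // takes a Version argument in Lucene 3.1+
        RAMDirectory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "こんにちは、このバイネイです",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Use the same analyzer at query time so the query is split
        // into the same bi-grams as the indexed text.
        IndexSearcher searcher = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        System.out.println(searcher.search(parser.parse("こんにち"), 10).totalHits);
        searcher.close();
    }
}
```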
YOU
  • @S.Mark, users can have anything in their text field; how do I decide which analyzer to use? Is there some generic analyzer that would work for all languages? – Pranali Desai Apr 15 '10 at 07:34
  • @Pranali, the bi-gram method would be better for that case. – YOU Apr 15 '10 at 08:00
  • @S.Mark, do you have any sample code or a link for implementing the bi-gram method? Which analyzer is required for this, and how do I configure it? – Pranali Desai Apr 15 '10 at 08:15
  • @Pranali, I don't have actual experience with that, but since CJKAnalyzer uses the bi-gram method, it would be a good try. There is a link in my answer called [sen](https://sen.dev.java.net/); that's a Java port, so you might need to port it to C#, but since MeCab is written in C, it shouldn't be that difficult. – YOU Apr 15 '10 at 08:38
  • @Pranali, there is a list of posts regarding [lucene+.net](http://stackoverflow.com/search?q=lucene+.net) on SO; you may want to look at them. – YOU Apr 15 '10 at 08:56

I don't think there can be an analyzer that will work for all languages (and if there is, I certainly wouldn't want to be the maintainer!). The problem is that different languages have different rules about word boundaries and stemming; the Thai language, for example, doesn't use spaces at all to separate words.

What you will need to do is "tag" blocks of text as one language or another and use the correct analyzer for that particular language. You can attempt to detect the language "automatically" by doing character analysis (e.g. text using predominantly Japanese katakana is likely Japanese).
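As a rough sketch of that detection idea (the Unicode blocks and the threshold here are illustrative assumptions, not a complete solution):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerPicker {
    /** Picks an analyzer by counting characters in Japanese-related Unicode blocks. */
    public static Analyzer pick(String text) {
        int japanese = 0;
        for (int i = 0; i < text.length(); i++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(text.charAt(i));
            if (block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA
                    || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                japanese++;
            }
        }
        // The 10% threshold is arbitrary; kanji share the CJK Unified
        // Ideographs block with Chinese, so this heuristic alone cannot
        // tell Japanese and Chinese apart.
        if (japanese * 10 > text.length()) {
            return new CJKAnalyzer(); // takes a Version argument in Lucene 3.1+
        }
        return new StandardAnalyzer(Version.LUCENE_30);
    }
}
```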

Dean Harding
  • @codeka, do I have to specify the analyzer to be used for certain character ranges, say (A-Z) for English and (こ-す) for Japanese, and then go through the supplied text to find out which analyzer to use? – Pranali Desai Apr 15 '10 at 07:58

You should use the new Japanese analyzers released in Lucene 3.6.0. They are based on the excellent Kuromoji morphological analyzer, recently donated to Lucene in LUCENE-3305.

Docs are a bit sparse as of this writing, so here are a few more links…

  • If you use Solr, here's a sample schema that will work on Websolr.
  • Slides from my presentation at the 20 Apr 2012 herokujp meetup, on full-text search with an emphasis on analyzing Japanese.

(This is all for the Java version of Lucene.)
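For a quick illustration, here is a minimal sketch of tokenizing text with the new analyzer (this assumes the lucene-analyzers-kuromoji module is on the classpath; check the 3.6.0 docs for the details):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class KuromojiDemo {
    public static void main(String[] args) throws Exception {
        // JapaneseAnalyzer wraps the Kuromoji morphological tokenizer,
        // so it produces real words instead of n-grams.
        JapaneseAnalyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("content", new StringReader("こんにちは、世界"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // one segmented word per line
        }
        ts.end();
        ts.close();
    }
}
```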

Nick Zadrozny