I am using Solr 6.2.0 and fieldType text_ja.
I am facing a problem with JapaneseTokenizer. It properly tokenizes
ドラゴンボールヒーロー


"ドラゴン"      "ドラゴンボールヒーロー"      "ボール" "ヒーロー"



But it fails to tokenize ドラゴンボールヒーローズ properly:
ドラゴンボールヒーローズ

"ドラゴン"       "ドラゴンボールヒーローズ"      "ボールヒーローズ"

Hence searching with ドラゴンボール doesn't hit in the latter case.

It also doesn't separate ディズニーランド into two words.
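
A minimal way to reproduce this outside Solr is to run Lucene's JapaneseTokenizer directly. A sketch, assuming lucene-core and lucene-analyzers-kuromoji 6.2.0 on the classpath (the class name is arbitrary; SEARCH mode is what text_ja uses by default):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class InspectJaTokens {
    public static void main(String[] args) throws IOException {
        printTokens("ドラゴンボールヒーロー");
        printTokens("ドラゴンボールヒーローズ");
        printTokens("ディズニーランド");
    }

    // Print the tokens emitted for one input string.
    static void printTokens(String text) throws IOException {
        // null user dictionary, discard punctuation, SEARCH mode (text_ja's default).
        try (JapaneseTokenizer tokenizer =
                 new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.print("\"" + term + "\"  ");
            }
            tokenizer.end();
            System.out.println();
        }
    }
}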

Prashant

2 Answers


First, I'm fairly certain that it is working as intended. Looking into how the Kuromoji morphological analyzer works would probably be the best way to gain a better understanding of its rules and rationale.

There are a couple of things you could try. You could put the JapaneseAnalyzer into EXTENDED mode instead of SEARCH mode, which should give you significantly looser matching (though most likely at the cost of introducing more false positives, of course):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

// EXTENDED mode additionally emits unigrams for unknown words, so matching is looser.
Analyzer analyzer = new JapaneseAnalyzer(
        null,                                // no user dictionary
        JapaneseTokenizer.Mode.EXTENDED,
        JapaneseAnalyzer.getDefaultStopSet(),
        JapaneseAnalyzer.getDefaultStopTags());

Or you could try using CJKAnalyzer, instead.
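
To get a feel for that tradeoff, here is a small sketch of CJKAnalyzer on the same input. It indexes overlapping character bigrams rather than dictionary words, so a query like ドラゴンボール will always match as a run of bigrams (the class name is arbitrary):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkExample {
    public static void main(String[] args) throws IOException {
        // CJKAnalyzer emits overlapping bigrams, e.g. ドラ, ラゴ, ゴン, ...
        try (CJKAnalyzer analyzer = new CJKAnalyzer();
             TokenStream stream = analyzer.tokenStream("f", "ドラゴンボールヒーローズ")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print(term + " ");
            }
            stream.end();
        }
    }
}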

(By the way, EnglishAnalyzer doesn't split "Disneyland" into two tokens either)

femtoRgon
  • Thank you for your suggestion. CJKAnalyzer creates a lot of tokens that don't have any meaning, and EXTENDED mode unigrams English words. I considered using JapaneseTokenizer itself with a custom userDictionary; is there any readily available dictionary? Any ideas? – Prashant Mar 03 '17 at 06:19
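
Regarding the userDictionary idea in that comment: a Kuromoji user dictionary is just a CSV of surface form, segmentation, readings, and part-of-speech tag, so one option is to maintain your own entries for the compounds you care about. A minimal sketch (the single entry and class name below are made up for illustration, not a readily available dictionary):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class UserDictExample {
    public static void main(String[] args) throws IOException {
        // CSV rule: surface form, segmented form, readings, part-of-speech tag.
        String rules =
            "ドラゴンボールヒーローズ,ドラゴンボール ヒーローズ,ドラゴンボール ヒーローズ,カスタム名詞\n";

        UserDictionary userDict = UserDictionary.open(new StringReader(rules));

        JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
                userDict,
                JapaneseTokenizer.Mode.SEARCH,
                JapaneseAnalyzer.getDefaultStopSet(),
                JapaneseAnalyzer.getDefaultStopTags());
        // ... analyze as usual; the compound now splits at the dictionary boundary.
        analyzer.close();
    }
}

In Solr, the same rules file can be referenced from the text_ja field type via the userDictionary attribute of solr.JapaneseTokenizerFactory.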

I was able to solve this using the lucene-gosen Sen tokenizer
and compiling the IPADIC dictionary with custom rules and word weights.

Prashant