
I'm trying to index some Chinese documents with Solr, but it looks like Solr doesn't index some segmented words.

The analyzer I use is IK Analyzer: http://code.google.com/p/ik-analyzer/.

The field to be indexed:

 <field name="hospital_alias_splitted" type="cn_ik" indexed="true" stored="true" multiValued="true" omitNorms="false"/>

cn_ik definition:

<fieldType name="cn_ik" class="solr.TextField" positionIncrementGap="100">
<analyzer>
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" useSmart="false"/>
</analyzer>
</fieldType>

For example, the text to be indexed is "AB" (without quotes). After word segmentation with the Chinese analyzer, I get three tokens: "AB", "A" and "B".

As we can see, the first token "AB" covers the following two tokens.

After these tokens are fed to Solr, it looks like Solr indexes only "AB", while "A" and "B" are ignored: searching for "A" or for "B" returns no results.

My guess is that once Solr has indexed "AB", it has already reached the end of the input text, so "A" and "B" are dropped.

Luke and the Analysis Request Handler don't give me any more hints. I'm not sure whether this is a bug or a feature of Solr.

Any comments or suggestions?

Thanks :)

Shai
emlaggr
  • Could you add field spec from schema? – Fuxi Sep 23 '12 at 11:31
  • It looks like this is a bug in the IK segmenter, which has been fixed in newer IK code. The bug is reproduced when the analyzer and the query parser use different segmentation modes (smart mode and finest-granularity mode). – emlaggr Jan 07 '13 at 09:08
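
For readers hitting the same symptom, here is a minimal sketch of a field type that spells out the same segmentation mode for both indexing and querying. It assumes the IK build in use honors the `useSmart` attribute from the question in both analyzer chains, which is not verified here against any particular IK release. The `type="index"` and `type="query"` attributes are standard Solr schema syntax.

```xml
<!-- Sketch only: declare the IK mode explicitly for both chains so that
     index-time and query-time segmentation cannot silently diverge.
     useSmart="false" is the finest-granularity mode used in the question. -->
<fieldType name="cn_ik" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" useSmart="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" useSmart="false"/>
  </analyzer>
</fieldType>
```

Documents need to be re-indexed after any change to the index-time analyzer, because tokens already in the index are not re-analyzed.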

1 Answer


(As I am not able to comment on the question, I am typing here)

I would recommend trying different analyzers. As you didn't tell us which analyzer you use, I assume you are using a default one such as CJK.

As far as I know, there are more analyzers for Chinese and for similar languages that don't put spaces between words. They might also help you.

It would be really nice to see some part of your schema about that field though...

edit : you can also check this link

denizdurmus
  • Added the schema to the original post. – emlaggr Sep 24 '12 at 02:10
  • Okay... although I am living in China, my Chinese is not good enough to read the notes in the code, so I will try to play with the code. Meanwhile, can you try it with other analyzers? I haven't dealt with CJK before, except for some basic cases, so I am not sure which analyzer could solve this, but as far as I know analyzers behave differently: e.g. 你好 can be parsed as 你好, 你, 好 depending on the analyzer. Another humble idea of mine would be to post a similar question on a Chinese board, as most laowai won't be able to read the documentation in the code ;) – denizdurmus Sep 24 '12 at 06:39