I am trying to use the Carrot2 API to cluster documents in Japanese. It logs this warning:

org.carrot2.text.linguistic.DefaultTokenizerFactory: Tokenizer for Japanese (ja) is not available. This may degrade clustering quality of Japanese content.

As a result, clustering fails and all documents end up in the "other topic" cluster.

Is there any way to solve this problem?

Thanks in advance.

1 Answer

Unfortunately, the open source algorithms available in Carrot2 do not support Japanese. The Japanese language constant was added only to allow for possible future support.

Alternatively, you can try running Carrot2 with a customized linguistic pipeline; the UsingCustomLanguageModel example class in the Carrot2 Java API distribution shows how to do it.
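
To give a rough idea of what a custom tokenizer for Japanese might need to do, here is a minimal sketch of character-bigram tokenization, a common fallback for text without spaces between words. It uses only the JDK and is not part of the Carrot2 API: the class name `BigramTokenizerSketch` and the method `bigrams` are hypothetical, and wiring such logic into Carrot2 would still have to follow the UsingCustomLanguageModel example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Carrot2 class: illustrates bigram tokenization only.
public class BigramTokenizerSketch {

    /** Returns overlapping two-character tokens; whitespace is dropped first. */
    static List<String> bigrams(String text) {
        // Work on code points so that surrogate pairs are never split.
        int[] cps = text.codePoints()
                        .filter(cp -> !Character.isWhitespace(cp))
                        .toArray();
        List<String> tokens = new ArrayList<>();
        // A single remaining character becomes a token on its own.
        if (cps.length == 1) {
            tokens.add(new String(cps, 0, 1));
        }
        // Slide a window of two code points over the text.
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("日本語の文書"));
    }
}
```

This sketch ignores punctuation and does no dictionary-based segmentation, so a proper morphological analyzer would likely give better clusters; it is only meant to show the kind of token stream a custom pipeline would need to produce.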

Stanislaw Osinski
  • I knew that. But Carrot2 does support creating a custom language model to customize the text analyzer. However, since the Carrot2 API lacks guidelines and documentation, it is very difficult for me to override its text analyzer. Is there any detailed documentation or example for overriding the language model? – Tran Vu Anh Oct 27 '15 at 01:09
  • Good point. I've edited the answer to add the link to the customization code example. – Stanislaw Osinski Oct 27 '15 at 14:10