Carrot2 API does not support the Japanese language

Question

I am trying to use the Carrot2 API to cluster documents written in Japanese. It logs this warning:

org.carrot2.text.linguistic.DefaultTokenizerFactory: Tokenizer for Japanese (ja) is not available. This may degrade clustering quality of Japanese content.

As a result, clustering fails and all documents end up in the "other topic" cluster.

Is there a way to solve this problem?

Thanks in advance.

Stanislaw Osinski · Answer 1 · 2015-10-27T14:09:40.697


The open source algorithms available in Carrot² unfortunately do not support Japanese. The Japanese language constant was added only to allow for possible future support.

Alternatively, you can try running Carrot² with a customized linguistic pipeline. The UsingCustomLanguageModel example class in the Carrot² Java API distribution shows how to do it.
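To give an idea of what a custom tokenizer for Japanese might do internally, here is a minimal sketch that uses only the JDK. It is not taken from the Carrot² distribution: the class and method names (JapaneseBigramSketch, tokenize) are made up for illustration, and the wiring into the Carrot² tokenizer interfaces is deliberately left out, so use UsingCustomLanguageModel as the reference for that part. The sketch assumes that overlapping character bigrams are an acceptable stand-in for real word segmentation; a morphological analyzer would likely give better clusters.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into tokens without needing spaces between Japanese words. */
public class JapaneseBigramSketch {

    /** True for code points written without word separators in Japanese. */
    static boolean isJapanese(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN
                || s == UnicodeScript.HIRAGANA
                || s == UnicodeScript.KATAKANA
                || cp == 0x30FC; // prolonged sound mark, handled explicitly
    }

    /** Non-Japanese runs become one token each; Japanese runs become overlapping bigrams. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean runIsJapanese = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean ja = isJapanese(cp);
            boolean word = ja || Character.isLetterOrDigit(cp);
            // Close the current run on a separator or on a script change.
            if (!word || (run.length() > 0 && ja != runIsJapanese)) {
                flush(run, runIsJapanese, tokens);
            }
            if (word) {
                run.appendCodePoint(cp);
                runIsJapanese = ja;
            }
        }
        flush(run, runIsJapanese, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, boolean japanese, List<String> out) {
        if (run.length() == 0) {
            return;
        }
        String s = run.toString();
        if (!japanese) {
            out.add(s);
        } else {
            int[] cps = s.codePoints().toArray();
            if (cps.length == 1) {
                out.add(s); // a lone character is kept as its own token
            }
            for (int k = 0; k + 1 < cps.length; k++) {
                out.add(new String(cps, k, 2));
            }
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        // Escaped form of a short Japanese phrase ("bunsho wo bunrui", i.e. "classify documents").
        System.out.println(tokenize("Carrot2 \u6587\u66f8\u3092\u5206\u985e"));
    }
}
```

Each string returned by tokenize would then be handed to the clustering algorithm as one term. Bigrams inflate the vocabulary and can produce odd cluster labels, so treat this as a starting point rather than a fix.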

edited Oct 27 '15 at 14:09

answered Oct 24 '15 at 20:20


I knew that. But Carrot2 does allow a custom language model to customize the text analyzer. However, since the Carrot2 API lacks guidelines and documentation, it is very difficult for me to override the text analyzer. Is there any detailed documentation or example for overriding the language model? – Tran Vu Anh Oct 27 '15 at 01:09
Good point. I've edited the answer to add the link to the customization code example. – Stanislaw Osinski Oct 27 '15 at 14:10
