Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

Question

I want to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the format of the user_dictionary file for the tokenizer. The Elastic docs (https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html) describe it as a CSV of the following form:

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

The documentation doesn't say much more than that.

The sample entry shown in the docs looks like this:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

Breaking it down field by field:

東京スカイツリー - the dictionary text, "Tokyo Sky Tree".
東京 スカイツリー - the tokens, again "Tokyo Sky Tree". I assume the space separates the tokens, but I wonder why only "Tokyo" is a separate token while "sky tree" is not split into "sky" and "tree".
トウキョウ スカイツリー - the reading forms. Again it is "Tokyo" and "sky tree"; why is it split that way? Can I specify more than one reading form for the text in this column (if there are any)?
カスタム名詞 - the part of speech, which is the bit I don't understand. It means "custom noun". I assume I can define the part of speech as verb, noun, etc. But what are the rules? Should the value follow some naming format? I have seen examples where it is specified as "noun" (名詞), but in this example it is "custom noun".
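
To make the four columns concrete, here is a minimal Python sketch of how I read a single row, assuming the tokens and readings columns are space-separated as in the sample entry. The function name `parse_user_dictionary_line` and the two sanity checks are my own illustration of what a consistent row looks like, not the plugin's actual parser.

```python
import csv
import io


def parse_user_dictionary_line(line):
    """Split one user-dictionary row into its four CSV fields.

    Hypothetical helper for illustration only; it is not part of
    Elasticsearch or the Kuromoji plugin.
    """
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != 4:
        raise ValueError(f"expected 4 fields, got {len(row)}")
    text, tokens_field, readings_field, pos = row

    # Assumption: tokens and readings are separated by whitespace.
    tokens = tokens_field.split()
    readings = readings_field.split()

    # Assumed sanity checks: one reading per token, and the tokens
    # joined together should reproduce the dictionary text.
    if len(tokens) != len(readings):
        raise ValueError("number of tokens and readings differ")
    if "".join(tokens) != text:
        raise ValueError("tokens do not add up to the dictionary text")

    return {"text": text, "tokens": tokens, "readings": readings, "pos": pos}


entry = parse_user_dictionary_line(
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
)
print(entry["tokens"])    # ['東京', 'スカイツリー']
print(entry["readings"])  # ['トウキョウ', 'スカイツリー']
print(entry["pos"])       # カスタム名詞
```

Read this way, the second and third columns line up position by position: 東京 pairs with トウキョウ, and スカイツリー pairs with スカイツリー.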

Does anyone have ideas or materials, especially about the part-of-speech field, such as what values are available? Additionally, what impact does this field have on the tokenizer's overall behavior?

Thanks

Hi, I was trying to solve a similar problem and found a good article here: https://docs.altis-dxp.com/search/search-configuration/custom-dictionaries/#japanese-user-dictionary - 40min, Jul 14 '21 at 15:02
That is a great article! Thanks for sharing it; it clears up all my doubts. - czeczek, Jul 15 '21 at 18:09

score 0 · Answer 1 · answered Sep 01 '22 at 06:31

Did you try to define "Tokyo Sky Tree" like this?

"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"

I ran into another problem with it: Found duplicate term [東京スカイツリー] in user dictionary at line [1]
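
Both rows above start with the same text, 東京スカイツリー, which is presumably what the error is complaining about. A minimal Python sketch for spotting such rows before loading the dictionary, assuming duplicates are judged on the first column only and that blank lines and lines starting with # can be skipped (`find_duplicate_terms` is a hypothetical helper, not an Elasticsearch API):

```python
def find_duplicate_terms(lines):
    """Return (term, first_line, duplicate_line) for repeated first columns.

    Assumption: only the first CSV column (the dictionary text) matters
    for duplicate detection; line numbers are 1-based.
    """
    seen = {}
    duplicates = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        term = stripped.split(",", 1)[0]
        if term in seen:
            duplicates.append((term, seen[term], line_number))
        else:
            seen[term] = line_number
    return duplicates


rules = [
    "東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞",
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
]
print(find_duplicate_terms(rules))  # [('東京スカイツリー', 1, 2)]
```

If that guess is right, keeping only one of the two rows should make the error go away.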
