
I want to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the format of the user_dictionary file for the tokenizer. The Elastic docs (https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html) describe it as a CSV file of the following form:

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

There is not much more in the documentation about it.

The sample entry shown in the docs looks like this:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

Breaking it down, the first element is the dictionary text:

  1. 東京スカイツリー - Tokyo Sky Tree
  2. 東京 スカイツリー - Tokyo Sky Tree again, this time as tokens. I assume the space separates the tokens, but I wonder why only "Tokyo" is a separate token while "Sky Tree" is not split into "sky" and "tree".
  3. トウキョウ スカイツリー - The reading forms. Again it is "Tokyo" and "Sky Tree", so why is it split this way? Can I specify more than one reading form for the text in this column (if there are any, of course)?
  4. カスタム名詞 - The part of speech, which is the bit I don't understand. It means "custom noun". I assume I can define a part of speech such as verb, noun, etc. But what are the rules? Does the part-of-speech name have to follow some format? I have seen examples where it is specified simply as "noun" (名詞), but in this example it is "custom noun".
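To make the breakdown above concrete, here is a minimal Python sketch that splits one entry into its four columns. It assumes only what the docs' format line states: four comma-separated fields, with tokens and readings separated by spaces. `UserDictEntry` and `parse_user_dict_line` are hypothetical names made up for illustration; they are not part of Elasticsearch or Kuromoji.

```python
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    """One line of a Kuromoji user dictionary (hypothetical helper type)."""
    text: str            # the surface text to match
    tokens: List[str]    # how the text should be segmented
    readings: List[str]  # one reading per token
    pos: str             # part-of-speech tag


def parse_user_dict_line(line: str) -> UserDictEntry:
    # Assumption: exactly four comma-separated fields, as in the docs' format line.
    fields = line.strip().split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    text, tokens, readings, pos = fields
    # Tokens and readings are space-separated inside their columns.
    return UserDictEntry(text, tokens.split(), readings.split(), pos)


entry = parse_user_dict_line(
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
)
# Pair each token with the reading in the same position.
pairs = list(zip(entry.tokens, entry.readings))
print(pairs)
```

Read this way, the second and third columns line up position by position: 東京 goes with トウキョウ and スカイツリー goes with スカイツリー.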

Does anyone have ideas or materials, especially about the part-of-speech field, such as which values are available? Additionally, what impact does this field have on the tokenizer's overall behavior?

Thanks

czeczek
  • Hi, I was trying to solve a similar problem and found a good article here: https://docs.altis-dxp.com/search/search-configuration/custom-dictionaries/#japanese-user-dictionary – 40min Jul 14 '21 at 15:02
  • That is a great article! Thanks for sharing it. It clears up all my doubts – czeczek Jul 15 '21 at 18:09

1 Answer


Did you try defining "Tokyo Sky Tree" like this?

"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"

I encountered another problem: Found duplicate term [東京スカイツリー] in user dictionary at line [1]
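The message suggests that the same text appears more than once in the first column. A minimal Python sketch for spotting such entries before loading the file follows. `find_duplicate_terms` is a hypothetical helper, not part of Elasticsearch, and skipping blank lines and lines starting with `#` is an assumption about how the file is laid out.

```python
from typing import Iterable, List, Tuple


def find_duplicate_terms(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (term, first_line, duplicate_line) for repeated first columns."""
    first_seen = {}
    duplicates = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no entry.
        if not line or line.startswith("#"):
            continue
        term = line.split(",", 1)[0]  # first column is the dictionary text
        if term in first_seen:
            duplicates.append((term, first_seen[term], line_no))
        else:
            first_seen[term] = line_no
    return duplicates


sample = [
    "東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞",
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
]
duplicates = find_duplicate_terms(sample)
print(duplicates)
```

For a real file, the same function could be called on the open file object, for example `find_duplicate_terms(open("userdict_ja.txt", encoding="utf-8"))`, where the file name is only a placeholder.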

kewang