I wanted to use Elasticsearch Kuromoji plugin for Japanese language. However, I'm struggling to understand the user_dictionary format of the file in the tokenizer. It's explained in elastic doc https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html as the CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
So there is not much in the documentation about that.
When looking at the sample entry the doc shows, it can looks like below:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー
- Tokyo Sky Tree東京 スカイツリー
- is Tokyo Sky tree - I assuming the space here is to denote token, but wondering why only "Tokyo" is a separate token, but sky tree is not split into "sky" "tree" ?トウキョウ スカイツリー
- Then we have a reading forms. And again, "Tokyo" "sky tree" - again, why it's splited such way. Can I specify more than one reading form of the text in this column (of course if there are any)- And the last is the part of speech, which is the bit I don't understand.
カスタム名詞
means "Custom noun". I assuming I can define the part of speech such as verb, noun etc. But what are the rules, should it follow some format of part of speech name. I saw examples where it's specified as "noun" -名詞
. But in this example is custom noun.
Anyone have some ideas, materials especially around Part of speech field - such as what are the available values. Additionally, what impact has this field to the overall tokenizer capabilities ?
Thanks