I recently worked on training a part-of-speech model for Hindi in spaCy. The model is already trained, but when analyzing any text, the .pos_ attribute of every token always points to X. The fine-grained tags, .tag_ (the ones the model was trained with), are correct though.
The mapping between these fine-grained tags and the "universal" tags (VERB, NOUN, ADJ, etc.) is found in the spacy/lang/hi/tag_map.py file.
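This is roughly the loop I use to print the output below (the model path and the "Lemmatized" flag are specific to my setup):

```python
import spacy

# "lang_model" is the directory produced by the train command (path specific to my setup)
nlp = spacy.load("lang_model")

text = "यूरोप के जिन राजनीतिक दलों को व्यवस्था, राजनेताओं और मीडिया द्वारा अति दक्षिणपंथी कहा जाता है (परन्तु मेरी ओर से सभ्यतावादी कहा जाता है) उनकी आलोचना उनकी भूलों और अतिवादिता के कारण की जाती है|"

for token in nlp(text):
    # "Lemmatized" is just my own flag: whether the lemma differs from the surface form
    print(f"Lemma {token.lemma_}, Lemmatized: {token.lemma_ != token.text}, "
          f"POS: {token.pos_}, TAG: {token.tag_}")
```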
Lemma यूरोप, Lemmatized: False, POS: X, TAG: NNP
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma जिन, Lemmatized: False, POS: X, TAG: DEM
Lemma राजनीतिक, Lemmatized: False, POS: X, TAG: JJ
Lemma दलों, Lemmatized: False, POS: X, TAG: NN
Lemma को, Lemmatized: False, POS: X, TAG: PSP
Lemma व्यवस्था, Lemmatized: False, POS: X, TAG: NN
Lemma ,, Lemmatized: False, POS: SYM, TAG: SYM
Lemma राजनेताओं, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma मीडिया, Lemmatized: False, POS: X, TAG: NN
Lemma द्वारा, Lemmatized: False, POS: X, TAG: PSP
Lemma अति, Lemmatized: False, POS: X, TAG: INTF
Lemma दक्षिणपंथी, Lemmatized: False, POS: X, TAG: NN
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma (, Lemmatized: False, POS: SYM, TAG: SYM
Lemma परन्तु, Lemmatized: False, POS: X, TAG: CC
Lemma मेरी, Lemmatized: False, POS: X, TAG: PRP
Lemma ओर, Lemmatized: False, POS: X, TAG: NST
Lemma से, Lemmatized: False, POS: X, TAG: PSP
Lemma सभ्यतावादी, Lemmatized: False, POS: X, TAG: NNP
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma ), Lemmatized: False, POS: SYM, TAG: SYM
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma आलोचना, Lemmatized: False, POS: X, TAG: NN
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma भूलों, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma अतिवादिता, Lemmatized: False, POS: X, TAG: NN
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma कारण, Lemmatized: False, POS: X, TAG: PSP
Lemma की, Lemmatized: False, POS: X, TAG: VM
Lemma जाती, Lemmatized: False, POS: X, TAG: VAUX
Lemma है|, Lemmatized: False, POS: X, TAG: NNPC
Investigating a little, I found out that the reason .pos_ has this X value is that in the generated lang_model/tagger/tag_map binary file, all of the keys point to 101, which is the "code" assigned to the part-of-speech X, i.e. "other".
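This is how I inspected that file (assuming the tag map is the msgpack blob spaCy 2.x writes, readable with srsly):

```python
import srsly  # shipped as a spaCy dependency

# Read the binary tag map that spacy train wrote next to the tagger model
tag_map = srsly.read_msgpack("lang_model/tagger/tag_map")

for tag, attrs in tag_map.items():
    print(tag, attrs)  # every tag carries the same POS value: 101 (X, "other")
```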
I deduce it generates the keys pointing to 101 because there is no information about how each of the tags provided by the dataset should be mapped to the "universal" ones. The thing is, I can provide a tag_map.py in the definition of my Hindi(Language) class, but when passing a text through the pipeline, it will eventually use the tag map defined in the tagger/ directory created by the output of the train command.
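To be concrete, this is more or less what I mean by providing a tag map in my language class (a partial sketch; the mappings are my reading of the conversion table linked below):

```python
from spacy.language import Language
from spacy.symbols import POS, ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, PRON, PROPN, VERB

# Partial mapping from the corpus tags to universal POS tags,
# taken from the conversion table linked below
TAG_MAP = {
    "NN": {POS: NOUN},
    "NNP": {POS: PROPN},
    "NST": {POS: NOUN},
    "JJ": {POS: ADJ},
    "PSP": {POS: ADP},
    "PRP": {POS: PRON},
    "DEM": {POS: DET},
    "INTF": {POS: ADV},
    "CC": {POS: CCONJ},
    "VM": {POS: VERB},
    "VAUX": {POS: AUX},
}

class HindiDefaults(Language.Defaults):
    tag_map = TAG_MAP

class Hindi(Language):
    lang = "hi"
    Defaults = HindiDefaults
```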
Here's a link that clarifies what I'm explaining: https://universaldependencies.org/tagset-conversion/hi-conll-uposf.html
The tags in the first column (CC, DEM, INTF, etc.) are the ones provided to the model; the universal tags are the ones in the second column.
My question is: where should I define the tag_map to overwrite the one generated by the spacy train command?