I recently worked on training a part-of-speech model for Hindi in spaCy. The model is already trained, but when analyzing any text, the .pos_ attribute of every token always points to X. The fine-grained tags, .tag_ (the ones the model was trained with), are correct though.
The mapping between these fine-grained tags and the "universal" tags (VERB, NOUN, ADJ, etc.) is found in the spacy/lang/hi/tag_map.py file.
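This is roughly the loop I use to print the output below (the model path and the "Lemmatized" flag are specific to my setup):

```python
import spacy

# "lang_model" is the directory produced by the train command (path specific to my setup)
nlp = spacy.load("lang_model")

text = "यूरोप के जिन राजनीतिक दलों को व्यवस्था, राजनेताओं और मीडिया द्वारा अति दक्षिणपंथी कहा जाता है (परन्तु मेरी ओर से सभ्यतावादी कहा जाता है) उनकी आलोचना उनकी भूलों और अतिवादिता के कारण की जाती है|"

for token in nlp(text):
    # "Lemmatized" is just my own flag: whether the lemma differs from the surface form
    print(f"Lemma {token.lemma_}, Lemmatized: {token.lemma_ != token.text}, "
          f"POS: {token.pos_}, TAG: {token.tag_}")
```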
Lemma यूरोप, Lemmatized: False, POS: X, TAG: NNP
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma जिन, Lemmatized: False, POS: X, TAG: DEM
Lemma राजनीतिक, Lemmatized: False, POS: X, TAG: JJ
Lemma दलों, Lemmatized: False, POS: X, TAG: NN
Lemma को, Lemmatized: False, POS: X, TAG: PSP
Lemma व्यवस्था, Lemmatized: False, POS: X, TAG: NN
Lemma ,, Lemmatized: False, POS: SYM, TAG: SYM
Lemma राजनेताओं, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma मीडिया, Lemmatized: False, POS: X, TAG: NN
Lemma द्वारा, Lemmatized: False, POS: X, TAG: PSP
Lemma अति, Lemmatized: False, POS: X, TAG: INTF
Lemma दक्षिणपंथी, Lemmatized: False, POS: X, TAG: NN
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma (, Lemmatized: False, POS: SYM, TAG: SYM
Lemma परन्तु, Lemmatized: False, POS: X, TAG: CC
Lemma मेरी, Lemmatized: False, POS: X, TAG: PRP
Lemma ओर, Lemmatized: False, POS: X, TAG: NST
Lemma से, Lemmatized: False, POS: X, TAG: PSP
Lemma सभ्यतावादी, Lemmatized: False, POS: X, TAG: NNP
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma ), Lemmatized: False, POS: SYM, TAG: SYM
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma आलोचना, Lemmatized: False, POS: X, TAG: NN
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma भूलों, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma अतिवादिता, Lemmatized: False, POS: X, TAG: NN
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma कारण, Lemmatized: False, POS: X, TAG: PSP
Lemma की, Lemmatized: False, POS: X, TAG: VM
Lemma जाती, Lemmatized: False, POS: X, TAG: VAUX
Lemma है|, Lemmatized: False, POS: X, TAG: NNPC
Investigating a little, I found out that the reason .pos_ has this X value is that in the generated lang_model/tagger/tag_map binary file, all of the keys point to 101, which is the "code" assigned to the part-of-speech X, i.e. "other".
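This is how I inspected that file (assuming the tag map is the msgpack blob spaCy 2.x writes, readable with srsly):

```python
import srsly  # shipped as a spaCy dependency

# Read the binary tag map that spacy train wrote next to the tagger model
tag_map = srsly.read_msgpack("lang_model/tagger/tag_map")

for tag, attrs in tag_map.items():
    print(tag, attrs)  # every tag carries the same POS value: 101 (X, "other")
```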
I deduce it generates the keys pointing to 101 because there is no information about how each of the tags provided by the dataset should be mapped to the "universal" ones. The thing is, I can provide a tag_map.py in the definition of my Hindi(Language) class, but when passing a text through the pipeline, it will eventually use the tag map defined in the tagger/ directory created by the output of the train command.
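To be concrete, this is more or less what I mean by providing a tag map in my language class (a partial sketch; the mappings are my reading of the conversion table linked below):

```python
from spacy.language import Language
from spacy.symbols import POS, ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, PRON, PROPN, VERB

# Partial mapping from the corpus tags to universal POS tags,
# taken from the conversion table linked below
TAG_MAP = {
    "NN": {POS: NOUN},
    "NNP": {POS: PROPN},
    "NST": {POS: NOUN},
    "JJ": {POS: ADJ},
    "PSP": {POS: ADP},
    "PRP": {POS: PRON},
    "DEM": {POS: DET},
    "INTF": {POS: ADV},
    "CC": {POS: CCONJ},
    "VM": {POS: VERB},
    "VAUX": {POS: AUX},
}

class HindiDefaults(Language.Defaults):
    tag_map = TAG_MAP

class Hindi(Language):
    lang = "hi"
    Defaults = HindiDefaults
```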
Here's a link that clarifies what I'm explaining: https://universaldependencies.org/tagset-conversion/hi-conll-uposf.html
The tags in the first column (CC, DEM, INTF, etc.) are the ones provided to the model; the universal tags are the ones in the second column.
My question is: where should I define the tag_map to overwrite the one generated by the spacy train command?