I'm trying to build a POS Tagger for a Voise Assistant. However, the nltk's pos tagger nltk.pos_tag doesn't work well for me. For example:
sent = 'open Youtube'
tokens = nltk.word_tokenize(sent)
nltk.pos_tag(tokens, tagset='universal')
>>[('open', 'ADJ'), ('Youtube', 'NOUN')]
In the above case I'd want the word open to be a verb and not an adjective. Similarly, it tags the word 'close' as an adverb and not a verb.
I have also tried using an n-gram tagger
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff = default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff = unigram_tagger)
trigram_tagger = nltk.TrigramTagger(train_sents, backoff = bigram_tagger)
I have used the brown corpus from nltk
. But it still gives the same result.
So I'd like to know:
- Is there a better tagged corpus to train a tagger for making a voice/virtual assistant?
- Is there a higher n-gram than trigram i.e. that looks at 4 words or more together like trigram and bigram look at 3 and 2 words together respectively. Will it improve the performance?
- How can I fix this?