
I'm trying to build a POS tagger for a voice assistant. However, NLTK's POS tagger `nltk.pos_tag` doesn't work well for me. For example:

import nltk

sent = 'open Youtube'
tokens = nltk.word_tokenize(sent)
nltk.pos_tag(tokens, tagset='universal')
>> [('open', 'ADJ'), ('Youtube', 'NOUN')]

In the above case I'd want the word 'open' to be tagged as a verb, not an adjective. Similarly, it tags the word 'close' as an adverb rather than a verb.

I have also tried using an n-gram tagger with backoff:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
trigram_tagger = nltk.TrigramTagger(train_sents, backoff=bigram_tagger)

I used the Brown corpus from NLTK, but it still gives the same result.
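To make the backoff chain concrete, here is a minimal, NLTK-free sketch of the idea behind it: each tagger consults its own context table and defers to the next tagger on a miss. The class and the tiny lookup tables are hypothetical stand-ins for what training on the Brown corpus would produce, not NLTK internals:

```python
# Minimal sketch of the backoff idea used by the NLTK n-gram taggers above.
# The tables here are tiny, hypothetical stand-ins for trained frequency data.

class BackoffTagger:
    def __init__(self, table, backoff=None, default=None):
        self.table = table        # context -> tag
        self.backoff = backoff    # next tagger to try on a miss
        self.default = default    # used only when there is no backoff

    def tag_word(self, context):
        # Try this tagger's table first, then fall back to shorter context.
        if context in self.table:
            return self.table[context]
        if self.backoff is not None:
            # A bigram context ('<s>', 'open') shrinks to the unigram 'open'.
            shorter = context[-1] if isinstance(context, tuple) else context
            return self.backoff.tag_word(shorter)
        return self.default

default = BackoffTagger({}, default='NOUN')
unigram = BackoffTagger({'open': 'ADJ', 'close': 'VERB'}, backoff=default)
bigram = BackoffTagger({('<s>', 'open'): 'VERB'}, backoff=unigram)

print(bigram.tag_word(('<s>', 'open')))   # bigram table hit -> 'VERB'
print(bigram.tag_word(('open',)))         # falls back to unigram -> 'ADJ'
print(bigram.tag_word(('Youtube',)))      # falls through to default -> 'NOUN'
```

This also shows why the chain can't fix the original problem: if the unigram data says 'open' is most often an adjective and the longer contexts have never seen your command, the backoff just reproduces the same wrong tag.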

So I'd like to know:

  1. Is there a better tagged corpus to train a tagger for making a voice/virtual assistant?
  2. Is there an n-gram tagger higher than a trigram tagger, i.e. one that looks at 4 or more words together, the way trigram and bigram taggers look at 3 and 2 words respectively? Would it improve performance?
  3. How can I fix this?
    `pos_tag` uses sentence structure to deduce what the word class is. Given `'open YouTube'` or `'close YouTube'` it doesn't have enough to go on. The Brown corpus was *manually* tagged and it reports the word *open* as falling into 4 lexical categories: `ADJ`, `V`, `N`, `ADV`. It seems like you're expecting the tagger to understand the virtual assistant context without any input from you. It won't. You could start by hypothesizing that all commands issued to the virtual assistant will begin with a verb, and doing your own tagging on that basis: `open/VERB YouTube/NOUN`. – BoarGules Apr 10 '18 at 14:51
  • That was my initial thought too, hypothesizing the first word as a verb, but then it won't be generic. Is there a tagged corpus of voice commands? Are there better ways to train a tagger? @BoarGules – Mohit Motwani Apr 11 '18 at 08:56
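BoarGules's suggestion in the comment above can be sketched as a tiny rule-based tagger: assume every command starts with a verb and tag the remaining tokens as the object. The function name and the NOUN-for-everything-else rule are assumptions for illustration, not part of NLTK:

```python
# Sketch of the comment's hypothesis: voice commands begin with a verb.
# Everything after the first token is treated as the object (tagged NOUN).

def tag_command(command):
    tokens = command.split()
    if not tokens:
        return []
    return [(tokens[0], 'VERB')] + [(t, 'NOUN') for t in tokens[1:]]

print(tag_command('open Youtube'))   # [('open', 'VERB'), ('Youtube', 'NOUN')]
print(tag_command('close the browser'))
```

As the follow-up comment notes, this is not generic: it breaks on commands like "what time is it", so it only works if you constrain the command grammar.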

1 Answer


Concerning question #3

I think this is not a general solution, but it works at least for the "do this/that" context you mention. If you put a "to" at the beginning of the command, the tagger will tend to "understand" a verb instead of an adjective, noun, or adverb!

I took this screenshot using the Freeling demo just to compare interpretations:

[screenshot of the Freeling demo output]

Specifically, if you want to use Freeling there are Java/Python APIs available, or you can call it from the command line.

Regarding question #2, I think including more context works better for complete sentences or larger texts; it may not be the case for commands to a basic virtual assistant.

Good luck!
