Python and NLTK: Baseline tagger

Question

I am writing code for a baseline tagger. Based on the Brown corpus, it assigns each word its most common tag. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user's input sentence it would be tagged as a plural noun. If the word is not found in the corpus, it is tagged as a noun by default. The code I have so far returns every tag for the word, not just the most frequent one. How can I make it return only the most frequent tag per word?

import nltk 
from nltk.corpus import brown

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u==item[0][0]:
                t = (u,item[0][1])
                result.append(t)
        continue
        t = (u, "NN")
        result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print tags

if __name__ == '__main__':
    main()

wahaha, i'm going to start asking for payment soon, if i answer all your nltk questions. lolz, just joking, give me a min to type. — alvas, Jan 08 '14 at 10:41
sorry went for lunch, below's the `most_frequent_pos_tagger()` you'll need. — alvas, Jan 08 '14 at 12:36

score 3 · Answer 1 · edited May 23 '17 at 11:52

If it's English, NLTK has a default POS tagger. A lot of people complain about it, but it's a nice quick fix (more like a band-aid than paracetamol); see POS tagging - NLTK thinks noun is adjective:

>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]

If you want to train a baseline tagger from scratch, I recommend you follow an example like this one, but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger

By building a UnigramTagger like the one in spaghetti-tagger, you should automatically get the most common tag for every word.
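A minimal sketch of that idea, assuming NLTK's UnigramTagger with a DefaultTagger backoff so that unseen words get "NN", as the question asks. The helper name train_baseline and the tiny hand-made training set are hypothetical; with the Brown corpus you would pass brown.tagged_sents() instead.

```python
from nltk.tag import DefaultTagger, UnigramTagger

def train_baseline(tagged_sents, default_tag="NN"):
    """Train a most-frequent-tag tagger that falls back to default_tag."""
    backoff = DefaultTagger(default_tag)
    return UnigramTagger(tagged_sents, backoff=backoff)

# Hypothetical toy data: a list of sentences, each a list of (word, tag) pairs.
toy_sents = [
    [("the", "AT"), ("works", "NNS")],
    [("it", "PPS"), ("works", "VBZ")],
    [("the", "AT"), ("works", "NNS")],
]

tagger = train_baseline(toy_sents)
# "zzz" never occurs in the training data, so the backoff tagger handles it.
tagged = tagger.tag("the works zzz".split())
print(tagged)
```

The backoff is what gives the "noun by default" behaviour; without it, the unigram tagger returns None as the tag of an unknown word.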

However, if you want to do it the non-machine-learning way, first count the word:POS pairs. What you'll need is some sort of type-token ratio, here a mapping from each word to the list of tags it was seen with. Also see Part-of-speech tag without context using nltk:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain

def type_token_ratio(documentstream):
    ttr = defaultdict(list)
    for token, pos in list(chain(*documentstream)):
        ttr[token].append(pos)  
    return ttr

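def most_freq_tag(ttr, word, default="NN"):
    tags = ttr.get(word)  # .get avoids inserting unseen words into the defaultdict
    return Counter(tags).most_common(1)[0][0] if tags else default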

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ." 
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best (tag, count) pair for the word.
print(Counter(documents_ttr['quick']).most_common(1)[0])

# Best tags for a sentence.
print([most_freq_tag(documents_ttr, i) for i in sent1.split()])

NOTE: A document stream can be defined as a list of sentences, where each sentence is a list of tokens with or without tags. The code above expects each token to be a (word, tag) pair.

score 0 · Answer 2 · cyborg · 2014-01-08T10:55:01.357

Create a dictionary called word_tags whose keys are the (unannotated) words and whose values are lists of tags in descending order of frequency (based on your fdist).

Then:

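for u in uinput:
    # word_tags[u][0] is the most frequent tag; unseen words default to "NN".
    tag = word_tags[u][0] if u in word_tags else "NN"
    result.append((u, tag))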
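One possible way to build such a word_tags dictionary, sketched with the standard library only. The function names build_word_tags and baseline_tag and the toy data are hypothetical; the toy list stands in for brown.tagged_words(), and the "NN" default follows the question.

```python
from collections import Counter, defaultdict

def build_word_tags(tagged_words):
    """Map each word to its tags, most frequent tag first."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {
        word: [tag for tag, _ in tag_counts.most_common()]
        for word, tag_counts in counts.items()
    }

def baseline_tag(sentence, word_tags, default_tag="NN"):
    """Tag each word with its most frequent tag, or default_tag if unseen."""
    return [
        (word, word_tags[word][0] if word in word_tags else default_tag)
        for word in sentence.split()
    ]

# Hypothetical toy data standing in for brown.tagged_words().
toy_words = [("works", "NNS"), ("works", "VBZ"), ("works", "NNS"), ("the", "AT")]
word_tags = build_word_tags(toy_words)
result = baseline_tag("the works unseenword", word_tags)
print(result)
```

Building the dictionary once and reusing it for every input sentence avoids rescanning the whole corpus per word, which is what the loop in the question does.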

edited Jan 08 '14 at 10:55

answered Jan 08 '14 at 10:45

score 0 · Answer 3 · answered Jan 12 '14 at 23:55

You can simply use Counter to find the most repeated item in a list:

from collections import Counter
default_tag = Counter(tags).most_common(1)[0][0]
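As a small worked example using the counts from the question (23 verb tags and 30 plural-noun tags for "works"). The tag names VBZ and NNS are an assumption about how those two readings would be labelled; the tags list is made up for illustration.

```python
from collections import Counter

# Hypothetical tag list for one word: 23 verb tags and 30 plural-noun tags.
tags = ["VBZ"] * 23 + ["NNS"] * 30

# most_common(1) returns a list holding a single (tag, count) pair.
best_tag, count = Counter(tags).most_common(1)[0]
print(best_tag, count)
```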

If your question is "how does a unigram tagger work?", you might be interested in reading the NLTK source code: http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger

In any case, I suggest you read chapter 5 of the NLTK book, especially the section on the lookup tagger: http://nltk.org/book/ch05.html#the-lookup-tagger

Just like the sample in the book, you can build a conditional frequency distribution, which gives the best tag for each given word.

import nltk
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())

In this case, cfd["fox"].max() will return the most likely tag for "fox" according to the Brown corpus. You can then build a dictionary of the most likely tag for each word of your sentence:

likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())

Note that this will raise an error for words in your sentence that do not occur in the corpus, because cfd[word] is then empty. But if you understand the idea, you can make your own tagger.
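One way to handle words that are missing from the corpus, sketched under assumptions: check membership in the ConditionalFreqDist before calling max(), and fall back to "NN" as the question asks. The helper name likely_tags and the toy pairs are hypothetical; the toy list stands in for nltk.corpus.brown.tagged_words().

```python
import nltk

def likely_tags(words, cfd, default_tag="NN"):
    """Return (word, tag) pairs, using default_tag for words not in cfd."""
    tagged = []
    for word in words:
        # Test membership first: cfd[word] on an unseen word yields an empty
        # FreqDist, and max() on an empty FreqDist raises an error.
        tag = cfd[word].max() if word in cfd else default_tag
        tagged.append((word, tag))
    return tagged

# Hypothetical toy (word, tag) pairs standing in for the Brown corpus.
toy_pairs = [("fox", "NN"), ("quick", "JJ"), ("quick", "JJ"), ("quick", "RB")]
cfd = nltk.ConditionalFreqDist(toy_pairs)
tagged = likely_tags("the quick fox".split(), cfd)
print(tagged)
```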
