How can I add frequency in nltk naivebayes classifier?

Question

I'm learning how to use the Naive Bayes classifier in nltk.

In section 1.3 (Document Classification) of the NLTK book (http://www.nltk.org/book/ch06.html), there is an example that builds featuresets:

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

So each entry in featuresets has the form ({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg').

But I want to change the feature values from 'contains(waste)': False to 'contains(waste)': 2. I think a form like 'contains(waste)': 2 describes a document better because it captures the frequency of each word. Each entry would then look like ({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg').

But I'm worried that 'contains(waste)': 2 and 'contains(waste)': 1 would be treated as totally different features by the Naive Bayes classifier. In that case it couldn't capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

In other words, to the program the pair 'contains(waste)': 2 and 'contains(waste)': 1 might look just as unrelated as the pair 'contains(lot)': 1 and 'contains(waste)': 1.

Can nltk's NaiveBayesClassifier understand word frequency?

This is the code I used:

import random

import konlpy.tag
from nltk.classify import naivebayes

def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from Korean words using konlpy.
    #Parameter : dictionary data(dict of contents ex.{'politic':{'parliament': [content,content]}..})
    #Return : list featuresets([({'word':True,...},'politic')] == featureset + category)

    featuresets = []
    twitter = konlpy.tag.Twitter()#Korean word splitter

    for big_cat in data:

        for small_cat in data[big_cat]:
            #save category name needed in featuresets 
            category = str(big_cat[0:3])+'/'+str(small_cat)
            count = 0
            print(small_cat)

            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')  #progress indicator
                #one_news is list in list so open it!
                doc = one_news
                #split word as using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])#delete useless sentences. 
                #keep only the split words that are two or more characters long
                list_of_up_two_word = [word for word in list_of_splited_word if len(word)>1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                #save 
                featuresets.append((dict_of_featuresets,category))

    return featuresets


def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make featuresets
    #Parameter : list list_of_up_two_word(ex.['비누','떨어','지다'])
    #Return : dictionary {word : True for word in data}

    #PROBLEM :(
    #cannot consider the frequency of a word
    return {word : True for word in data}

def naive_train(featuresets):
    #belongs_to : Main
    #Role : Learning by naive bayes rule
    #Parameter : list featuresets([({'word':True,...},'pol/pal')])
    #Return : object classifier(nltk naivebayesclassifier object),
    #         list test_set(the featuresets that are randomly selected)

    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)

    return classifier,test_set

featuresets = split_and_count_word(data)
classifier,test_set = naive_train(featuresets)

score 1 · Answer 1 · answered Nov 13 '16 at 20:21

NLTK's Naive Bayes classifier treats feature values as logically distinct categories. Values are not limited to True and False, but they are never treated as quantities: f=2 and f=3 simply count as two different values. The only way to add quantity to such a model is to sort the counts into "buckets", for example f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model does not "know" that "few" lies between "one" and "several". You'll need a different machine learning tool to handle quantity directly.
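The bucketing idea could be sketched roughly as follows. This is only an illustration: the bucket boundaries are the example ranges given above, the label "one" stands in for f=1, and bucket() is a hypothetical helper. The make_featuresets() shown here is an assumed replacement for the function of the same name in the question, not a tested drop-in.

```python
from collections import Counter

def bucket(count):
    """Map a raw word count (>= 1) to a coarse label.

    Boundaries follow the example ranges in the answer; tune them for real data.
    """
    if count == 1:
        return "one"
    if count <= 5:
        return "few"
    if count <= 10:
        return "several"
    return "many"

def make_featuresets(words):
    """Hypothetical variant of the question's make_featuresets:
    maps each word to a bucket label instead of True."""
    return {word: bucket(n) for word, n in Counter(words).items()}

# Toy document: "waste" twice, "lot" seven times, "film" once.
features = make_featuresets(["waste"] * 2 + ["lot"] * 7 + ["film"])
print(features)
```

The resulting dictionaries can be passed to NaiveBayesClassifier.train() like any other featureset, since each bucket label is just another nominal value. Words that do not occur in a document are simply absent from its dictionary, as in the original code.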

answered Nov 13 '16 at 20:21

alexis

Thanks for giving me the idea. So you mean that I can't add a word that is already contained in the feature dictionary? For example, the dictionary would be {"hello":True,"hello":True,"my":True...}. Also, can you recommend another useful machine learning module? – dizwe Nov 14 '16 at 06:20
As you already pointed out in your comment to @aberger, no, you can't have the same key twice in a dict. I can't directly point you to a quantified solution, sorry. NLTK's [`MaxentClassifier`](http://www.nltk.org/api/nltk.classify.html#nltk.classify.maxent.MaxentClassifier) uses numeric weights, but they are normally created by the API from the "nominal" features you supply, so you'll have to poke around for the right way to use it. Look also at scikit-learn. The best classifier depends on your task, so experiment with a few! – alexis Nov 14 '16 at 08:58
Thanks I'll try it! – dizwe Nov 18 '16 at 01:59
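As one possible way to follow up on the scikit-learn suggestion: NLTK ships a wrapper, nltk.classify.scikitlearn.SklearnClassifier, that accepts featuresets in the usual NLTK shape and passes numeric values through to a scikit-learn estimator. The sketch below pairs it with MultinomialNB, which models word counts as quantities. It assumes both nltk and scikit-learn are installed; the toy training data and the variable names are made up for illustration.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy featuresets in the usual NLTK shape, with raw counts as values
# (illustrative data only, not taken from the question).
train_set = [
    ({"waste": 3, "boring": 2}, "neg"),
    ({"waste": 1, "awful": 2}, "neg"),
    ({"great": 3, "fun": 1}, "pos"),
    ({"great": 1, "lovely": 2}, "pos"),
]

# MultinomialNB treats each value as a count, so a higher count
# carries more weight than a lower one.
classifier = SklearnClassifier(MultinomialNB()).train(train_set)

# Same two words in both documents; only the counts differ.
mostly_waste = {"waste": 3, "great": 1}
mostly_great = {"waste": 1, "great": 3}
print(classifier.classify(mostly_waste))
print(classifier.classify(mostly_great))
```

Because the two test documents contain the same words and differ only in their counts, any difference in the predicted labels comes from frequency alone, which is the behaviour the nominal NLTK classifier cannot provide.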
