
I am working on a simple Naive Bayes text classifier which uses the Brown Corpus for training and test data. So far, I have gotten an accuracy of 53% with the simple approach, without any preprocessing. To improve my classifier, I've added some preprocessing (stopword removal, lemmatizing, stemming, POS-tagging), but performance gets worse (11%). What am I doing wrong? I've only just started with Python, so I am thankful for any help I can get.

import nltk, random

from nltk.corpus import brown, stopwords
from nltk.stem.porter import PorterStemmer

documents = [(list(brown.words(fileid)), category)
        for category in brown.categories()
        for fileid in brown.fileids(category)]

random.shuffle(documents)

stop = set(stopwords.words('english'))


without_stop = [w for w in brown.words() if w not in stop]  # remove stopwords

lowercase = [w.lower() for w in without_stop]  # lowercase


porter = PorterStemmer()
stemmed = [porter.stem(w) for w in lowercase]  # stem

wnl = nltk.WordNetLemmatizer()
lemmatized = [wnl.lemmatize(w) for w in stemmed]  # lemmatize
tagged = nltk.pos_tag(lemmatized)  # pos-tag


all_words = nltk.FreqDist(tagged)  # frequency distribution over (word, tag) pairs

word_features = list(all_words.keys())[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features


featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

1 Answer


Maybe I am missing something, but I can't see the classification problem that you are trying to solve.

You are randomly shuffling the documents and then splitting them into training and test sets, after enriching them with a plethora of additional data from stemming, POS-tagging, etc.

How does the split follow the division between classes? The results you were getting on the raw text were better because the problem space was of much smaller rank (no additional features exploding its size). Hence, with the relatively small Brown Corpus, the classifier could still handle the problem.
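One concrete thing to check in the posted code: nltk.pos_tag turns the token list into (word, tag) tuples, so word_features ends up holding tuples while document_features compares them against plain word strings. Every contains(...) feature then evaluates to False, which leaves the classifier with little more than the class priors and would explain a drop to around 11%. A minimal sketch of the mismatch:

import nltk

tagged = nltk.pos_tag(['the', 'quick', 'fox'])
print(tagged[-1])                    # a (word, tag) tuple, e.g. ('fox', 'NN')

document_words = {'the', 'quick', 'fox'}  # a document is plain strings
print(tagged[-1] in document_words)  # False -- a tuple never equals a string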

State your classification problem and align features to it. Then code it.
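For illustration, here is a minimal sketch of one way to line the features up with the documents, assuming you keep the binary contains(...) features: apply the preprocessing per document, build the vocabulary from those same preprocessed tokens, and take the 2000 most frequent words rather than an arbitrary slice of dictionary keys (POS tags are left out so that features and documents stay in the same space):

import nltk, random
from nltk.corpus import brown, stopwords
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(words):
    # lowercase, drop stopwords, then stem -- applied per document
    return [porter.stem(w.lower()) for w in words if w.lower() not in stop]

documents = [(preprocess(brown.words(fileid)), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
random.shuffle(documents)

# build the vocabulary from the same preprocessed tokens,
# using the 2000 most frequent words
all_words = nltk.FreqDist(w for doc, _ in documents for w in doc)
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words)
            for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Whether stemming or stopword removal actually helps here is an empirical question; the point is that the feature vocabulary and the documents must go through the same pipeline.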

sophros