Train customized corpus, with NLTK for Python

Question

I am trying to train a classifier on a corpus of my own documents. My documents are structured the same way as the original movie_reviews corpus data: 1K positive text files in the folder 'pos' and 1K negative text files in the folder 'neg'. Each text file contains 25 lines of tweets, which have been cleaned: URLs, usernames, capital letters and punctuation are removed.

How can I adjust this code to use my own text data instead of the movie_reviews?

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define the split of % training / % test
SPLIT = 0.8

def word_feats(words):
    return dict([(word, True) for word in words])


posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

cutoff = int(len(posfeats) * SPLIT)

trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

print('Train on %d instances\nTest on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('Accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))

classifier.show_most_informative_features()

Does this help? http://stackoverflow.com/a/5113509/1215687 – Walrus the Cat, Apr 24 '15 at 17:07

score 0 · Answer 1 · answered Apr 26 '15 at 18:22

You can log in as the root user and open this file:

/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py

In this file you will find the existing movie_reviews corpus, which is loaded using LazyCorpusLoader:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

Then try adding something similar to this:

My_Movie = LazyCorpusLoader(
    'My_Movie', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

Here My_Movie is the name you have chosen for your own corpus. Once everything is done, save and exit.

Finally, place your corpus in the NLTK data directory where the movie_reviews corpus is stored, in a folder named My_Movie so that it matches the name given to LazyCorpusLoader.

Then try importing it:

from nltk.corpus import My_Movie  # your newly created corpus

Hope this will work.
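
If editing the installed NLTK package is not an option, the feature sets can also be built straight from the 'pos' and 'neg' folders. The following is only a minimal sketch of that alternative and does not use NLTK's corpus readers: it reads the files with pathlib and assumes the cleaned tweets can be tokenized by splitting on whitespace. The helpers load_feats and split_feats are hypothetical names introduced here, and the tiny temporary corpus stands in for the real folders.

```python
from pathlib import Path
import tempfile

SPLIT = 0.8  # same train/test split as in the question


def word_feats(words):
    # Bag-of-words features, as in the question's code.
    return {word: True for word in words}


def load_feats(root, label):
    """Return one (features, label) pair per .txt file in root/label (hypothetical helper)."""
    feats = []
    for path in sorted(Path(root, label).glob('*.txt')):
        # Assumption: the tweets are already cleaned, so whitespace splitting is enough.
        words = path.read_text(encoding='utf-8').split()
        feats.append((word_feats(words), label))
    return feats


def split_feats(negfeats, posfeats, split=SPLIT):
    # Same cutoff logic as the question: first part for training, rest for testing.
    cutoff = int(len(posfeats) * split)
    trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
    testfeats = negfeats[cutoff:] + posfeats[cutoff:]
    return trainfeats, testfeats


# Demo on a throwaway corpus; replace `root` with the folder holding 'pos' and 'neg'.
with tempfile.TemporaryDirectory() as root:
    for label, text in (('pos', 'great fun'), ('neg', 'boring mess')):
        folder = Path(root, label)
        folder.mkdir()
        for i in range(5):
            (folder / ('%d.txt' % i)).write_text('%s tweet%d\n' % (text, i), encoding='utf-8')
    negfeats = load_feats(root, 'neg')
    posfeats = load_feats(root, 'pos')

trainfeats, testfeats = split_feats(negfeats, posfeats)
print('Train on %d instances\nTest on %d instances' % (len(trainfeats), len(testfeats)))
```

The resulting trainfeats and testfeats have the same shape as in the question, so they can be passed to NaiveBayesClassifier.train and nltk.classify.util.accuracy unchanged.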
