NLTK document classification

Question

In Chapter 6 of the NLTK book, section 2.1, the code loads the movie reviews corpus for document classification. The code in the book is as follows:

import random

from nltk.corpus import movie_reviews

# Build one (word list, category) pair per review file.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

I have my own dataset, comma-separated as (text, category), where each text is an email and the category is either positive or negative. Can I call .words() on my own file? Also, what does the code mean when it calls movie_reviews.categories()? I am having trouble understanding how to structure the data to get it into the form needed by the code. I have looked at the individual corpus files, but I can't figure out what to do from here. Any help would be appreciated. Thanks!

hi! have you tried calling `words()` on your file? if so, what happens or what kind of error do you get? what about `categories()`? — arturomp, Dec 18 '13 at 07:40
also, have you looked at this question? http://stackoverflow.com/q/4951751/583834 — arturomp, Dec 20 '13 at 21:35

score 1 · Accepted Answer · answered Dec 18 '13 at 07:42

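words() just returns "the given file(s) as a list of words and punctuation symbols" according to the documentation. In that respect, you can definitely call words() on any text file you have, as long as you load the file through a corpus reader such as PlaintextCorpusReader. Note that nltk.corpus.words is itself a separate word-list corpus, not a function you can apply to a file.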

As for categories(), further down in the documentation, it says that it "Return[s] a list of the categories that are defined for this corpus, or for the file(s) if it is given." However, the source for it is a bit more obscure. Notice that different corpora have different ways of indicating their categories. movie_reviews does it through directory names, but abc and reuters have explicit categories in a file. qc has the categories in the same file as the text.
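To make the directory-name approach concrete, here is a minimal sketch of how a layout like the one movie_reviews uses could be read with CategorizedPlaintextCorpusReader. The pos/ and neg/ folder names, the file names, and the sample texts are all assumptions made up for illustration, and the files are written to a temporary directory only so the snippet is self-contained:

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical layout: <root>/pos/*.txt and <root>/neg/*.txt, one email per file.
root = tempfile.mkdtemp()
samples = {
    "pos/email1.txt": "Great service, thank you!",
    "neg/email2.txt": "Still waiting for a refund.",
}
for relpath, text in samples.items():
    path = os.path.join(root, *relpath.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as handle:
        handle.write(text)

# cat_pattern: the first regex group, matched against each file id,
# becomes that file's category (here, the directory name).
reader = CategorizedPlaintextCorpusReader(
    root,
    r"(?:pos|neg)/.*\.txt",
    cat_pattern=r"(pos|neg)/.*",
)

# Same comprehension as in the book, with the custom reader swapped in.
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]
```

With a reader like this, categories() and fileids(category) behave the way the book's code expects, so the rest of the chapter's example should carry over with only the reader name changed.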

It might take a bit of experimenting with your own data to see if you can replicate this behaviour, but a reasonable first step would be to add a directory containing a subset of your data to nltk_data/corpora and to play around with the formats you see in other corpora.
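If the goal is only to get the same list of (words, category) pairs that the book's code builds, a corpus reader may not be needed at all. The sketch below reads comma-separated (text, category) rows directly. It assumes the text column comes first and is quoted when it contains commas; the sample rows are invented, and the regex tokenizer is a simple stand-in for words(), not NLTK's own implementation:

```python
import csv
import io
import random
import re

# Stand-in for the real file; in practice this would be something like
# open("emails.csv", newline="") (hypothetical file name).
sample_csv = io.StringIO(
    '"Thanks, the issue is fixed!",positive\n'
    '"I never received my order.",negative\n'
)


def tokenize(text):
    # Split into words and runs of punctuation, roughly what words() returns.
    return re.findall(r"\w+|[^\w\s]+", text)


# One (word list, category) pair per row, matching the book's `documents`.
documents = [(tokenize(text), category)
             for text, category in csv.reader(sample_csv)]
random.shuffle(documents)
```

Each entry in documents then has the same shape as in the book, so it can be passed to the feature extractor from that section unchanged.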
