In Chapter 6 of the NLTK book (section 2.1), the code loads the movie reviews corpus for document classification. The code in the book is as follows:
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
I have my own dataset: a comma-separated file of (text, category) rows, where the text is the body of an email and the category is either positive or negative. Can I call .words() on my own file? Also, what does movie_reviews.categories() return? I am having trouble understanding how to structure my data to get it into the form the code needs. I have looked at the individual corpus files, but I can't figure out what to do from there. Any help would be appreciated. Thanks!
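For reference, the book code ends up with a list of (word list, category) pairs, one per document, and the categories in that corpus are just its label names. Below is a minimal sketch of how a two-column CSV might be read into the same shape. It makes several assumptions: the function names load_documents and list_categories are hypothetical, each row has exactly two fields (text, then category), and tokenization is plain lowercase whitespace splitting rather than one of NLTK's tokenizers.

```python
import csv


def load_documents(csv_path):
    """Read (text, category) rows and return [(list_of_words, category), ...].

    Hypothetical helper: assumes a two-column CSV with no header row.
    """
    documents = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            text, category = row
            # Assumption: simple whitespace tokenization, so punctuation stays
            # attached to words; an NLTK tokenizer could be swapped in here.
            words = text.lower().split()
            documents.append((words, category.strip()))
    return documents


def list_categories(documents):
    """Return the distinct category labels found in the loaded documents."""
    return sorted({category for _, category in documents})
```

With a file such as emails.csv (a made-up name), something like documents = load_documents("emails.csv") followed by random.shuffle(documents) should give a list that could be used in place of the one built from movie_reviews.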