How to create a corpus for sentiment analysis in NLTK?

Question

I'm looking to use my own created corpus within Visual Studio Code for MacOSX; I have read probably a hundred forums and I can't wrap my head around what I'm doing wrong as I'm pretty new to programming.

This question seems to be the closest thing I can find to what I need to do; however, I don't know how to do the following:

"on a Mac it would be in ~/nltk_data/corpora, for instance. And it looks like you also have to append your new corpus to the __init__.py within .../site-packages/nltk/corpus/."

When answering, please be aware that I am using Homebrew, and I don't want to permanently lose access to the default path, since I may also need a stock NLTK corpus in the same script.

If needed, I can post my attempt using "PlaintextCorpusReader" along with the traceback below. However, I would rather not use PlaintextCorpusReader at all; I would prefer to simply copy and paste my .txt files into an appropriate location and load them the way the quoted answer describes.

Thank you.

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 42, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'

EDIT:

Thank you for your responses.

I have taken your advice and moved the folder out of NLTK's corpora.

I've been doing some experimenting with my folder location and I've gotten different tracebacks.

If you are saying the best way to do it is with PlaintextCorpusReader then so be it; however, maybe for my application I'd want to use CategorizedPlaintextCorpusReader?

sys.argv is definitely not what I meant, so I can read up on that later.

First, here is my code without my attempt to use PlaintextCorpusReader. It produces the traceback above when the folder "short_reviews", which contains the pos.txt and neg.txt files, is outside of the NLP folder:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

# def main():
#     file = open("short_reviews/pos.txt", "r")
#     short_pos = file.readlines()
#     file.close

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w. lower())

for w in short_neg_words:
    all_words.append(w. lower())

all_words = nltk.FreqDist(all_words)

However, when I move the folder "short_reviews" containing the text files into the NLP folder using the same code as above but without the use of PlaintextCorpusReader the following occurs:

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 47, in <module>
    for r in short_pos.split('\n'):
AttributeError: 'builtin_function_or_method' object has no attribute 'split'

When I move the folder "short_reviews" containing the text files into the NLP folder using the code below with the use of PlaintextCorpusReader the following Traceback occurs:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

from nltk.corpus import PlaintextCorpusReader
corpus_root = 'short_reviews'
word_lists = PlaintextCorpusReader(corpus_root, '*')
wordlists.fileids()


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

# def main():
#     file = open("short_reviews/pos.txt", "r")
#     short_pos = file.readlines()
#     file.close

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w. lower())

for w in short_neg_words:
    all_words.append(w. lower())

all_words = nltk.FreqDist(all_words)


Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata2", line 18, in <module>
    word_lists = PlaintextCorpusReader(corpus_root, '*')
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 62, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 87, in __init__
    fileids = find_corpus_fileids(root, fileids)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 763, in find_corpus_fileids
    if re.match(regexp, prefix+fileid)]
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Please post your code that achieves the error, so that it's helpful for posterity. — alvas, Oct 26 '16 at 22:56
Actually there's a neat trick to use `nltk.data.find` with `LazyCorpusReader` and that will make your corpus look like it's native to NLTK. I'll upload an answer when I'm freer over the weekend. — alvas, Oct 27 '16 at 00:44
Please, do that. I think it'd still be conducive to learn how the PlaintextCorpusReader functions as well — pythlang, Oct 27 '16 at 00:58
As for `PlaintextCorpusReader`, there's an answer here: http://stackoverflow.com/a/20922201/610569 — alvas, Oct 27 '16 at 01:25
apologies: I'd like to use this custom corpora to yield more reliable results when using positive/negative sentiment analysis on short-texted platforms like Twitter. I think with your guys' help it is routed correctly but I am now getting a UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128) but I can't solve it using codes I've found, if my routing is even correct. I can post another EDIT with my new routing and traceback if you want or create a new thread. I'm still interested in LazyCorpusReader when you have time — pythlang, Oct 27 '16 at 16:42
Don't bother with `LazyCorpusReader`. If you're getting an encoding error, your code is already finding and opening your files. Clearly, your new problem is that some of the files are not ascii. Be sure to google and look around the site for the solution, because your new question will definitely be a duplicate. — alexis, Oct 27 '16 at 17:04
Yes, if your corpus has categories, use a `Categorized...` reader. That was really not clear from your original question. Check the nltk book or source for how to define your categories. — alexis, Oct 27 '16 at 17:07
thank you @alexis; i'd be interested in learning how lazycorpusreader works either way. I'll make another attempt to figure out the ascii/utf thing. thank you @alvas — pythlang, Oct 27 '16 at 17:51
Actually, now I don't get what you want to do... Are you trying to create a new corpus in `nltk` and make it look like `nltk.corpus.brown` or `nltk.corpus.movie_reviews`? Or are you encountering unicode errors? If it's the latter, the easy answer is to use `python3`; there'll still be encoding pains, but fewer of them. — alvas, Oct 28 '16 at 09:22
well, if i have the option of making the corpus appear as a native corpus such as "nltk.corpus.movie_review" then yes, that'd be ideal. I am encountering unicode errors, however. I'm not sure how to run python3 in visual studio code but it is installed on my computer and am able to access it from the terminal by typing "python3"; I could save the file as file.py then load it into the terminal in python3. here is a [link](https://pythonprogramming.net/static/downloads/short_reviews/positive.txt) to my corpus. there is also a neg.txt — pythlang, Oct 28 '16 at 16:31

alexis · Accepted Answer · 2016-10-26T22:28:00.227

The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk.corpus.__init__.py to load it like a native corpus. In fact, do not do these things.

You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it's the right tool to use. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that will load all .txt files in this folder like this:

myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")
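To make the call above concrete, here is a minimal sketch, written for Python 3. The temporary folder and the sample sentences are made up for illustration; only `PlaintextCorpusReader`, `fileids()` and `words()` come from NLTK. The point worth noticing is that the second argument is a regular expression, not a shell glob, so "all .txt files" is `r".*\.txt"`. A bare `'*'` is not a valid regular expression, which is consistent with the "nothing to repeat" error in the question.

```python
import io
import os
import re
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical stand-in for your real folder: a temporary directory
# holding two tiny files, one review per line.
corpus_root = tempfile.mkdtemp()
samples = {
    "pos.txt": u"a gorgeous film\nwitty and warm\n",
    "neg.txt": u"a dull mess\nflat and tedious\n",
}
for name, text in samples.items():
    path = os.path.join(corpus_root, name)
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

# A glob-style "*" is rejected by the regex engine itself.
try:
    re.compile("*")
except re.error as exc:
    print("bad pattern:", exc)

# The regex ".*\.txt" selects every file whose name ends in ".txt".
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

print(reader.fileids())               # the files the reader found
print(list(reader.words("pos.txt")))  # tokens from a single file
```

With your own data you would replace `corpus_root` with the path to the folder that holds your text files, either absolute or relative to the directory you run the script from.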

If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, then just do so: you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (see the NLTK documentation for it). If you want something else again, please edit your question to explain what you are trying to do.
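For the categorized case, a sketch along these lines may help. It assumes Python 3, a folder containing exactly pos.txt and neg.txt with one review per line, and UTF-8 files; the temporary folder, the sample text and the `documents` list are illustrative, not part of NLTK. The `cat_pattern` argument is a regular expression whose first group becomes the category name.

```python
import io
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical demo folder standing in for "short_reviews".
corpus_root = tempfile.mkdtemp()
samples = {
    "pos.txt": u"a gorgeous film\nna\u00efve but charming\n",
    "neg.txt": u"a dull mess\nflat and tedious\n",
}
for name, text in samples.items():
    path = os.path.join(corpus_root, name)
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r"(pos|neg)\.txt",               # which files belong to the corpus
    cat_pattern=r"(pos|neg)\.txt",   # group 1 is the category: "pos" or "neg"
    encoding="utf-8",                # assumption: pass your files' real encoding
)

# Build (review, label) pairs, one per non-empty line in each category.
documents = [
    (line, category)
    for category in reader.categories()
    for line in reader.raw(categories=category).splitlines()
    if line.strip()
]

print(reader.categories())
print(documents)
```

Because the reader decodes the files itself, naming the right `encoding` here is also the place to address decoding errors; if your files are not UTF-8, substitute whatever encoding they actually use.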

thank you, I am pretty sure that this worked along with alvas' link which I'd found but wasn't sure how to apply without a simple explanation like yours on where to direct the path. — pythlang, Oct 27 '16 at 16:44
