NLTK custom categorized corpus not reading files

Question

I have created my own corpus, similar to the movie_reviews corpus in NLTK (categorized by neg|pos).

Within the neg and pos folders are txt files.

Code:

from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader('C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

When I try to read or interact with one of these files, I am unable to.

e.g. len(mr.categories()) runs, but does not return anything:

>>>

I have read multiple documents and questions on here regarding custom categorized corpora, but I am still unable to use them.

Full code:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader('C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

len(mr.categories())

I eventually want to run a Naive Bayes classifier on my data, but I am unable to read the content.

Paths: C:\mycorpus\pos

C:\mycorpus\neg

The pos folder contains 'cv.txt' and the neg folder contains 'example.txt'.

score 3 · Accepted Answer · answered Feb 15 '18 at 15:44

I am using Linux, and the following modification to your code (with toy corpus files) works correctly for me:

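import os

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

# Root directory of the toy corpus; it contains the neg/ and pos/ folders.
mr = CategorizedPlaintextCorpusReader(
    '/home/ely/programming/nltk-test/mycorpus',
    r'(?!\.).*\.txt',  # any .txt file whose ID does not start with a dot
    cat_pattern=os.path.join(r'(neg|pos)', '.*')  # first group is the category
)

# Print explicitly so the count is shown when this runs as a script.
print(len(mr.categories()))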

This suggests it is a problem with the cat_pattern string using / as a file system delimiter when you're on a Windows system.
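One way to probe this, as a rough sketch rather than NLTK's actual code, is to try the pattern by hand against sample file IDs written with each separator. The `first_group` helper and the sample IDs below are hypothetical; which separator NLTK really puts in its file IDs on Windows is an assumption worth verifying by printing `mr.fileids()`.

```python
import re

CAT_PATTERN = r'(neg|pos)/.*'


def first_group(pattern, file_id):
    """Return the first captured group if pattern matches at the start
    of file_id, otherwise None (hypothetical helper, not an NLTK API)."""
    match = re.match(pattern, file_id)
    return match.group(1) if match else None


# Hypothetical file IDs: once with '/', once with a backslash.
forward = ['neg/example.txt', 'pos/cv.txt']
backward = ['neg\\example.txt', 'pos\\cv.txt']

forward_cats = [first_group(CAT_PATTERN, f) for f in forward]
backward_cats = [first_group(CAT_PATTERN, f) for f in backward]

print(forward_cats)   # categories found for '/'-separated IDs
print(backward_cats)  # categories found for '\'-separated IDs
```

If the IDs reported by `mr.fileids()` look like the first list, the original pattern already fits them; if they look like the second, the separator in the pattern is the mismatch.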

Using os.path.join as in my example, or pathlib if you are on Python 3, would be a good way to solve it: the code stays OS-agnostic and you don't trip over regular-expression escape backslashes mixed with file system delimiters.

In fact, you may want to use this approach for every file system delimiter in your argument strings; it's generally a good habit to get into for making code portable and avoiding strange string-munging tech debt.
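To see what the two approaches actually produce, here is a small stdlib-only sketch. It uses `ntpath`/`posixpath` and the pure path classes so that both platforms' results can be inspected from any OS; the variable names are illustrative. pathlib yields a path object, so it is converted with `str()` here, since `re` expects a string pattern.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

group = r'(neg|pos)'
rest = '.*'

# What os.path.join would return on Windows and on POSIX systems.
win_join = ntpath.join(group, rest)
posix_join = posixpath.join(group, rest)

# The pathlib equivalents, converted to plain strings.
win_path = str(PureWindowsPath(group) / rest)
posix_path = str(PurePosixPath(group) / rest)

print('Windows os.path.join:', win_join)
print('POSIX   os.path.join:', posix_join)
print('Windows pathlib     :', win_path)
print('POSIX   pathlib     :', posix_path)
```

On Windows the joined pattern contains `\.`, which a regular expression reads as an escaped literal dot rather than a separator followed by "any character", so the resulting string is worth checking before relying on it.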

ely

I will certainly use this in the future to avoid this. I am using Python 3, so could you provide the syntax for pathlib rather than os.path.join? Thank you for your answer – Yunter Feb 15 '18 at 16:14
@Yunter That syntax is best described [in the docs](https://docs.python.org/3/library/pathlib.html). Essentially, you'll create a `pathlib.Path` object, and then the binary operator `/` will have semantics for file system path resolution, so you could do `Path(r'(neg|pos)') / '.*'`, and even though it uses the `/` operator, it resolves it to the appropriate file system delimiter automatically for you. Really it's not better than `os.path`, just a different syntax for similar operations. Note that you can use `os.path` in Python 2 or Python 3. – ely Feb 15 '18 at 16:31
Apologies - I misinterpreted "or pathlib if using Python 3". Using the 'os.path.join' provided in the code leaves me with the same issue of not returning anything. I will post the contents of the directory above. – Yunter Feb 15 '18 at 16:55

score 1 · Answer 2 · answered Feb 15 '18 at 15:34

It seems to me that there is something weird with your

cat_pattern=r'(neg|pos)/.*'

since you are on an MS-DOS-based system (Windows, I guess), where folder nesting is indicated with \, not / (unless I'm misunderstanding something).

zar3bski

Yes I'm on a Windows OS. So I should change the '/.*' to '\.*' ? – Yunter Feb 15 '18 at 15:41
Or just use cat_pattern=os.path.join(r'(neg|pos)', '.*') as ely suggests above. This way, you have an OS-independent solution – zar3bski Feb 15 '18 at 15:53
Thank you for the answer, I am using Python 3 so will have to use pathlib according to Ely. – Yunter Feb 15 '18 at 16:15
