I want to process a bunch of text files using NLTK, splitting them on a particular keyword. I am therefore trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.

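from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):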

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView

However, my knowledge of inheritance is rusty, and it seems my override is not taken into account. The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is identical to the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I guess I'm missing something obvious, but what?

Skippy le Grand Gourou

1 Answer

The documentation actually suggests two ways of defining a custom corpus view:

  1. Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.
  2. Subclass StreamBackedCorpusView, and override the read_block() method.
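
In both cases the block reader takes a stream, reads some of it, and returns a list of tokens. For the original goal of splitting files on a keyword, a block reader might look like the sketch below. The names `read_keyword_block` and `KEYWORD` are hypothetical, and the sketch assumes the keyword sits alone on its own line; it is exercised on an in-memory stream rather than on a real corpus.

```python
import io

KEYWORD = "SPLIT"  # hypothetical separator, assumed to sit alone on a line


def read_keyword_block(stream, keyword=KEYWORD):
    """Read lines up to the next separator line and return them as one token.

    Returns a list holding zero or one string, since a block reader is
    expected to return a list of tokens.
    """
    lines = []
    while True:
        line = stream.readline()
        if not line:  # end of file
            break
        if line.strip() == keyword:  # separator reached: stop and drop it
            break
        lines.append(line)
    chunk = "".join(lines).strip()
    return [chunk] if chunk else []


# Drive the reader by hand on an in-memory stream.
sample = io.StringIO("first part\nstill first\nSPLIT\nsecond part\n")
blocks = []
while True:
    pos = sample.tell()
    blocks.extend(read_keyword_block(sample))
    if sample.tell() == pos:  # nothing was consumed, so the stream is exhausted
        break

print(blocks)
```

A function like this could presumably be passed as the block reader argument in place of a debugging stub. It keeps no state between calls, which seems the safer choice for a view that may re-read blocks from arbitrary positions.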

It also suggests the first way is easier, and indeed I managed to get it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
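
As for why the subclass in the question had no visible effect: as far as I can tell from the NLTK source, `PlaintextCorpusReader.words()` passes its own block reader to the view constructor, and the constructor stores any block reader it receives as an instance attribute named `read_block`. An instance attribute shadows a method of the same name defined in a subclass. The sketch below reproduces that pattern with stand-in classes; `BaseView`, `CustomView` and `default_reader` are hypothetical and are not NLTK's code.

```python
class BaseView:
    """Hypothetical stand-in for a view class; not NLTK's actual code."""

    def __init__(self, block_reader=None):
        # Assumed pattern: a supplied block reader is stored on the instance,
        # where it takes precedence over any method defined on the class.
        if block_reader:
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("subclass must override read_block")


class CustomView(BaseView):
    def read_block(self, stream):
        return ["from subclass"]


def default_reader(stream):
    return ["from default reader"]


shadowed = CustomView(default_reader)  # block reader supplied: override is hidden
overridden = CustomView()              # no block reader: override is used

print(shadowed.read_block(None))
print(overridden.read_block(None))
```

If that is indeed what happens, the subclassing route only takes effect when the view is built without a block reader, for example by constructing the custom view directly inside a reader method such as `custom()` instead of relying on the inherited `words()`.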
Skippy le Grand Gourou