Overriding of CorpusView.read_block() not taken into account

Question

I want to process a bunch of text files using NLTK, splitting them on a particular keyword. I am therefore trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.

class CustomCorpusView(StreamBackedCorpusView):

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView

However, my knowledge of inheritance is rusty, and it seems my override is not taken into account. The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is identical to the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I guess I'm missing something obvious, but what?

Oh there's a way! Let me find some time to answer later on if no one answers =) — alvas, May 14 '19 at 02:52

score 0 · Accepted Answer · answered May 14 '19 at 07:09

The documentation actually suggests two ways of defining a custom corpus view:

Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.

Subclass StreamBackedCorpusView, and override the read_block() method.
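These two ways interact, which probably explains why the override in the question is never called. The sketch below does not use NLTK's real classes. It assumes, from my reading of the NLTK source, that the view constructor stores a supplied block_reader on the instance and that PlaintextCorpusReader.words() passes its own block reader to self.CorpusView. SketchView, SketchReader, CustomView and CustomReader are made-up stand-ins.

```python
class SketchView:
    """Simplified stand-in for StreamBackedCorpusView (not NLTK's actual code)."""

    def __init__(self, block_reader=None):
        # Assumed behavior: a supplied block_reader is stored on the instance,
        # where it shadows any read_block() method defined on the class.
        if block_reader:
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("abstract in the base view")


class CustomView(SketchView):
    def read_block(self, stream):
        return ["from the subclass override"]


class SketchReader:
    """Simplified stand-in for a reader such as PlaintextCorpusReader."""

    CorpusView = SketchView

    def _read_word_block(self, stream):
        return ["from the reader's own block reader"]

    def words(self):
        # The reader hands its own block reader to the view constructor.
        view = self.CorpusView(self._read_word_block)
        return view.read_block(None)


class CustomReader(SketchReader):
    CorpusView = CustomView


shadowed = CustomReader().words()       # view built with a block_reader
direct = CustomView().read_block(None)  # view built without one
print(shadowed)
print(direct)
```

If that assumption holds, swapping the CorpusView class attribute is not enough on its own: the override is only reached when the view is constructed without a block_reader argument.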

The documentation also suggests the first way is easier, and indeed I got it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat  # explicit import instead of a star import

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
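To go from the debugging stub to actual keyword splitting, the block reader could look something like the sketch below. It is only an illustration: read_keyword_block and the keyword "SECTION" are made up, and it is exercised on an in-memory stream rather than through NLTK.

```python
import io
from functools import partial


def read_keyword_block(stream, keyword):
    """Read one line from the stream and return its pieces split on keyword."""
    line = stream.readline()
    # Strip whitespace around each piece and drop empty pieces.
    return [piece.strip() for piece in line.split(keyword) if piece.strip()]


# "SECTION" is a hypothetical keyword, chosen only for this example.
block_reader = partial(read_keyword_block, keyword="SECTION")

stream = io.StringIO("intro SECTION first part SECTION second part\n")
pieces = block_reader(stream)
print(pieces)

# At end of file readline() returns "", so the block is empty.
at_eof = block_reader(stream)
print(at_eof)
```

A callable like block_reader could then be passed to self.CorpusView in place of self._custom_read_block. Note that this version works line by line, so a segment that spans several lines would come back as separate pieces.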
