I'm trying to create my own corpus from a set of text files. However, I want to preprocess the text files before they get corpus-ized, and I can't figure out how to do that short of writing a script that runs through every text file, does the preprocessing, saves a new text file, and then builds the corpus from the new, post-processed files. (This seems inefficient, because I have ~200 MB of files that I would need to read through twice, and it would not scale to a much larger corpus.)
The preprocessing that I want to do is very basic text manipulation:
- Make every word as listed in the corpus lower case
- Remove any items entirely enclosed in brackets, e.g., [coughing]
- Remove the first four characters of each line, which are digits (line numbers from the original transcriptions)
Critically, I want to do this preprocessing BEFORE the words enter the corpus - I don't want, e.g., "[coughing]" or "0001" as an entry in my corpus, and instead of "TREE" I want "tree."
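To make the three steps concrete, here is roughly what I mean for a single line. This is only a sketch of the desired transformation, not something that is hooked into the corpus reader yet; the names `clean_line` and `BRACKETED` are my own for illustration, and I'm assuming the line number is always exactly four digits at the very start of the line.

```python
import re

# Hypothetical helper names, used only to illustrate the desired cleanup.
BRACKETED = re.compile(r"\[[^\]]*\]")   # anything enclosed in [ ... ]
LINE_NUMBER = re.compile(r"^\d{4}")     # assumed: four digits at line start

def clean_line(line):
    """Apply the three preprocessing steps to one transcript line."""
    line = LINE_NUMBER.sub("", line)    # drop the leading line number
    line = BRACKETED.sub(" ", line)     # remove items such as [coughing]
    return line.lower()                 # lower-case what remains

print(clean_line("0001 The TREE [coughing] fell").split())
# ['the', 'tree', 'fell']
```

What I can't work out is where to apply something like this so that it runs while the files are being read, rather than in a separate pass.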
I've got the basic corpus reader code (below), but I can't figure out how to apply this kind of pattern matching while it reads in the files and builds the corpus. Is there a good way to do this?
import nltk
from nltk.corpus import PlaintextCorpusReader

corpusdir = "C:/corpus/"
newcorpus = PlaintextCorpusReader(corpusdir, '.*')  # read every file in the directory
corpus_words = newcorpus.words()  # get words in the corpus
fdist = nltk.FreqDist(corpus_words)  # make frequency distribution of the words in the corpus
This answer seems sort of on the right track, but the relevant words are already in the corpus and the poster wants to ignore/strip punctuation before tokenizing the corpus. I want to affect which types of words are even entered (i.e., counted) in the corpus at all.
Thanks in advance!