
I'm trying to create my own corpus out of a set of text files. However, I want to preprocess the text files before they get corpus-ized, and I can't figure out how to do that short of writing a script that runs through every text file, does the preprocessing, saves a new text file, and then builds the corpus from the new, post-processed files. (This seems inefficient, because I have ~200 MB of files that I would need to read through twice, and it would not scale to a much larger corpus.)

The preprocessing that I want to do is very basic text manipulation:

  • Make every word as listed in the corpus lower case
  • Remove any items entirely enclosed in brackets, e.g., [coughing]
  • Remove the digits at the start of each line, which are always the first four characters of the line (they're line numbers from the original transcriptions)

Critically, I want to do this preprocessing BEFORE the words enter the corpus - I don't want, e.g., "[coughing]" or "0001" as an entry in my corpus, and instead of "TREE" I want "tree."

I've got the basic corpus reader code, but the problem is that I can't figure out how to modify pattern matching as it reads in the files and builds the corpus. Is there a good way to do this?

import nltk
from nltk.corpus import PlaintextCorpusReader

corpusdir = "C:/corpus/"
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
corpus_words = newcorpus.words()     # get words in the corpus
fdist = nltk.FreqDist(corpus_words)  # make frequency distribution of the words in the corpus

This answer seems sort of on the right track, but the relevant words are already in the corpus and the poster wants to ignore/strip punctuation before tokenizing the corpus. I want to affect which types of words are even entered (i.e., counted) in the corpus at all.

Thanks in advance!

Jona

1 Answer


I disagree with your inefficiency comment because once the corpus has been processed, you can analyze the processed corpus multiple times without having to run a cleaning function each time. That being said, if you are going to be running this multiple times, maybe you would want to find a quicker option.

As far as I can understand, PlaintextCorpusReader needs files as an input. I used code from Alvas' answer on another question to build this response. See Alvas' fantastic answer on using PlaintextCorpusReader here.

Here's my workflow:

from glob import glob
import re
import os
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import FreqDist

mycorpusdir = glob('path/to/your/corpus/*')

# captures bracket-ed text 
re_brackets = r'(\[.*?\])'
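# exactly 4 digits at the start of a line (the transcription line numbers);
# (?m) turns on multiline mode so ^ matches at the start of every line
re_numbers = r'(?m)^\d{4}'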
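
One detail worth checking in the number pattern: an unanchored \d{4} matches any run of four digits, so it would also strip things like years from the transcript text, while anchoring it with ^ under re.MULTILINE limits it to the start of each line. Here is a small sketch of the difference; the sample text and the variable names are made up for illustration:

```python
import re

# Made-up transcript sample: the first four characters of each line are line numbers.
sample = "0001 It was 1999 when\n0002 the TREE fell [coughing]"

# Matches four digits anywhere in the text.
unanchored = re.compile(r'\d{4}')
# Matches four digits only at the start of a line.
anchored = re.compile(r'^\d{4}', re.MULTILINE)

stripped_everywhere = unanchored.sub('', sample)
stripped_line_starts = anchored.sub('', sample)

print(repr(stripped_everywhere))
print(repr(stripped_line_starts))
```

If your transcripts never contain other four-digit numbers, both versions give the same result, but the anchored one is the safer default.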

Lowercase everything, remove bracketed text and line numbers:

corpus = []
for path in mycorpusdir:
    # read the whole file and close it again right away
    with open(path) as fin:
        text = fin.read()
    # lowercase everything
    all_lower = text.lower()
    # remove bracketed text
    no_brackets = re.sub(re_brackets, '', all_lower)
    # remove the four-digit line numbers
    just_words = re.sub(re_numbers, '', no_brackets)
    corpus.append(just_words)

Make new directory for the processed corpus:

corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory, numbered in processing order.
for filename, text in enumerate(corpus):
    with open(os.path.join(corpusdir, str(filename) + '.txt'), 'w') as fout:
        print(text, file=fout)

Call PlaintextCorpusReader:

newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

corpus_words = newcorpus.words()
fdist = FreqDist(corpus_words)

print(fdist)
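
If reading everything twice does turn out to be a problem, one alternative is to skip the intermediate files and count words while streaming each file once. This is only a rough sketch, not a drop-in replacement for PlaintextCorpusReader: the helper names (clean_line, iter_words, count_words) are made up, the word pattern is a crude stand-in for NLTK's tokenizer, and I'm assuming the files are UTF-8.

```python
import re
from collections import Counter

RE_LINE_NUMBER = re.compile(r'^\d{4}')   # four-digit line number at the start of a line
RE_BRACKETS = re.compile(r'\[.*?\]')     # bracketed items such as [coughing]
RE_WORD = re.compile(r"[a-z0-9']+")      # assumption: crude tokenizer, not NLTK's


def clean_line(line):
    """Strip the line number and bracketed items, then lowercase."""
    line = RE_LINE_NUMBER.sub('', line)
    line = RE_BRACKETS.sub('', line)
    return line.lower()


def iter_words(paths):
    """Yield cleaned words one at a time, reading each file line by line."""
    for path in paths:
        with open(path, encoding='utf-8') as fin:  # assumption: UTF-8 files
            for line in fin:
                yield from RE_WORD.findall(clean_line(line))


def count_words(paths):
    """Build a frequency count without holding the full text in memory."""
    return Counter(iter_words(paths))
```

You would call it as count_words(glob('path/to/your/corpus/*')). Since FreqDist is a subclass of Counter, passing iter_words(...) to FreqDist instead should give you the usual NLTK methods on top, but you lose the other corpus reader features (sents(), paras(), and so on).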
matt_07734
Thanks - this is essentially what I ended up doing (preprocessing the files to do my text processing, then saving the preprocessed files as new files, and then running PlaintextCorpusReader on the new files). And I'd seen and upvoted alvas's fantastic answer already, but thanks for the pointer! I think I've come around and agree that running through all the files twice (first to preprocess, then to corpus-ize) isn't *so* inefficient. And your code is a good way to do that. I'm just still interested to see if there's a way to do this text processing on the way *into* the corpus. – Jona May 10 '18 at 21:50