
I have a corpus of 70,429 files (296.5 MB). I am trying to find bigrams across the whole corpus. I have written the following code:

allFiles = ""
for dirName in os.listdir(rootDirectory):
     for subDir in os.listdir(dirName):
         for fileN in os.listdir(subDir):
             FText = codecs.open(fileN, encoding="'iso8859-9'")
             PText = FText.read()
             allFiles += PText
tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size = 3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k,v in finder.ngram_fd.most_common(100):
    print(k,v)

There is a root directory, which contains subdirectories, and each subdirectory contains numerous files. What I have done is:

I read all of the files one by one and append their contents to the string called allFiles. Finally, I split the string into tokens and call the relevant bigram functions. The problem is:

I ran the program for a day and couldn't get any results. Is there a more efficient way to find bigrams in a corpus that contains this many files?

Any advice and suggestions will be greatly appreciated. Thanks in advance.

yns
  • One thing to try is to process each file during your directory walk, inside the loop, and store the output of `BigramCollocationFinder`. Might be very memory intensive but possibly faster? – avip Mar 13 '16 at 20:03
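In case it is useful, here is a minimal sketch of what that comment suggests, assuming the same directory layout and encoding as in the question; merging the per-file counts through the finder's word_fd/ngram_fd frequency distributions (Counter subclasses) is my own choice, not something the comment spells out.

import os
import codecs
from nltk.collocations import BigramCollocationFinder
from nltk.probability import FreqDist

word_fd = FreqDist()
bigram_fd = FreqDist()
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subDirPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subDirPath):
            with codecs.open(os.path.join(subDirPath, fileN), encoding="iso8859-9") as f:
                tokens = f.read().split()
            # Build a finder for this file only, then fold its counts into the running totals
            finder = BigramCollocationFinder.from_words(tokens, window_size=3)
            word_fd.update(finder.word_fd)
            bigram_fd.update(finder.ngram_fd)

# Reconstruct a single finder from the merged counts and continue as in the question
merged = BigramCollocationFinder(word_fd, bigram_fd, window_size=3)
merged.apply_freq_filter(2)
for k, v in merged.ngram_fd.most_common(100):
    print(k, v)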

2 Answers


By trying to read a huge corpus into memory at once, you're blowing out your memory, forcing a lot of swap use, and slowing everything down.

The NLTK provides various "corpus readers" that can return your words one by one, so that the complete corpus never has to be held in memory all at once. This might work if I understand your corpus layout right:

from nltk.corpus.reader import PlaintextCorpusReader

# The fileids argument is a regex: match files two subdirectory levels below the root
reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
finder.apply_freq_filter(2)  # Continue processing as before
...
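For completeness, the elided part can simply reuse the rest of the question's pipeline; the PMI ranking at the end is just one optional alternative, not something this answer prescribes:

bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)
# or rank candidates by an association measure instead of raw frequency:
# print(finder.nbest(bigram_measures.pmi, 100))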

Addendum: Your approach has a bug: you're collecting bigrams (within your window of 3) that span from the end of one document to the beginning of the next; that's nonsense you want to get rid of. I recommend the following variant, which collects bigrams from each document separately.

# Treat each document as its own word stream so no bigrams cross file boundaries
document_streams = (reader.words(fname) for fname in reader.fileids())
BigramCollocationFinder.default_ws = 3  # from_documents() uses the class default window size
finder = BigramCollocationFinder.from_documents(document_streams)
alexis

Consider parallelizing your process with a worker pool from Python's multiprocessing module (https://docs.python.org/2/library/multiprocessing.html), emitting a dictionary of {word: count} for each file in the corpus into a shared list. After the worker pool completes, merge the dictionaries before filtering by the number of occurrences.
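A minimal sketch of that idea, under some assumptions not in this answer: a four-worker pool, a hypothetical count_file helper, plain adjacent-pair counting instead of NLTK's windowed finder, and collections.Counter for the per-file dictionaries.

import os
import codecs
from collections import Counter
from multiprocessing import Pool

def count_file(path):
    """Return a {bigram: count} dictionary (Counter) for a single file."""
    with codecs.open(path, encoding="iso8859-9") as f:
        tokens = f.read().split()
    return Counter(zip(tokens, tokens[1:]))  # adjacent word pairs only

def all_file_paths(root):
    """Yield every file path under root, including subdirectories."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

if __name__ == "__main__":
    pool = Pool(processes=4)
    # rootDirectory as defined in the question
    per_file_counts = pool.map(count_file, list(all_file_paths(rootDirectory)))
    pool.close()
    pool.join()

    merged = Counter()                     # merge the per-file dictionaries
    for counts in per_file_counts:
        merged.update(counts)

    # Keep pairs seen at least twice (the question's frequency filter), print the top 100
    frequent = Counter({bg: n for bg, n in merged.items() if n >= 2})
    for bigram, count in frequent.most_common(100):
        print(bigram, count)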

manglano