I have a corpus of 70,429 files (296.5 MB). I am trying to find bigrams over the whole corpus, and I have written the following code:
import os
import codecs
import nltk
from nltk.collocations import BigramCollocationFinder

allFiles = ""
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subPath):
            FText = codecs.open(os.path.join(subPath, fileN), encoding="iso8859-9")
            PText = FText.read()
            allFiles += PText

tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)
The root directory contains subdirectories, and each subdirectory contains numerous files. What I do is read the files one by one and append their contents to the string allFiles. Eventually, I split that string into tokens and call the relevant bigram functions. The problem is that I ran the program for a whole day and got no results. Is there a more efficient way to find bigrams in a corpus with this many files?
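For comparison, since the output above is only raw pair frequencies (`ngram_fd`), the same counts can be produced without building one giant string. The repeated `allFiles += PText` copies the whole accumulated string on every file, which is quadratic over a ~300 MB corpus; extending a token list is amortised linear. This is a minimal sketch, not the NLTK method: it walks the tree with `os.walk` and counts pairs within a window of 3 using `collections.Counter` (the function name `count_bigrams` is mine, and it assumes the same `iso8859-9` encoding as above):

```python
import os
import codecs
from collections import Counter

def count_bigrams(rootDirectory, window_size=3, min_freq=2):
    """Count word pairs co-occurring within `window_size` positions,
    mirroring ngram_fd with apply_freq_filter(min_freq)."""
    tokens = []
    for dirPath, _, fileNames in os.walk(rootDirectory):
        for fileN in fileNames:
            # os.walk yields the full directory path, so joining works
            # at any nesting depth
            with codecs.open(os.path.join(dirPath, fileN),
                             encoding="iso8859-9") as f:
                # list.extend is amortised O(n); string += is not
                tokens.extend(f.read().split())
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # pair w1 with each word up to window_size - 1 positions ahead
        for w2 in tokens[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    # keep only pairs seen at least min_freq times
    return Counter({p: c for p, c in counts.items() if c >= min_freq})
```

The top 100 pairs would then come from `count_bigrams(rootDirectory).most_common(100)`. Note this still holds all tokens in memory at once; only the quadratic string growth is avoided.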
Any advice and suggestions will be greatly appreciated. Thanks in advance.