
I have created a Python file with the following code. I want the code to do the following:

  1. Extract the content from a text file, strip punctuation, remove non-alphabetic tokens, and convert to lower case
  2. Create unigrams and bigrams and combine them
  3. Remove stop words (only after creating bigrams, not before) and then remove duplicate words
  4. Show the number of words before and after execution and save the output as a text file

I want to run this code on huge text files.

Can someone help me make this code more efficient? I'm a newbie and wrote this code with help from the internet.

Code:

#<<<---------- INPUT TEXT FILE ------------>>>
# load data
filename = 'input.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
#<<<---------- CLEAN TEXT ------------>>>
# split into words
import nltk
from nltk.tokenize import word_tokenize
tokens = nltk.word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
#join words as a sentence
cleantext = " ".join(words)
#<<<---------- CREATE UNIGRAMS ------------>>>
unigrm1 = nltk.word_tokenize(cleantext)
#<<<---------- CREATE BIGRAMS ------------>>>
tokens1 = nltk.word_tokenize(cleantext)
bigrm = nltk.bigrams(tokens1)
bigrm = list(nltk.bigrams(cleantext.split()))
bigrm1 = [' '.join(t) for t in bigrm]
#<<<---------- COMBINE UNIGRAMS & BIGRAMS ------------>>>
ngram1 = unigrm1 + bigrm1
ngram2 = ", ".join(ngram1)
#<<<---------- REMOVE STOP WORDS & DUPLICATES ------------>>>
# stop words removal
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text_tokens = word_tokenize(ngram2)
tokens_without_sw = [word for word in text_tokens if not word in stopwords.words()]
words = (" ").join(tokens_without_sw)
words = words.replace(" ,", ",")
words = words.replace(",,,,,", ",")
words = words.replace(",,,", ",")
words = words.replace(",,", ",")
words = words.split(", ")
words.sort()
# remove duplicates
k = [] 
for i in words:   
    # If condition is used to store unique string  
    # in another list 'k'  
    if (words.count(i)>1 and (i not in k)or words.count(i)==1): 
        k.append(i) 
#<<<---------- SHOW NUMBER OF WORDS ------------>>>
countwords = text.split()
print('Number of words in raw file :', len(countwords))
file.close()
print('Number of words in extracted file :', len(k))
file.close()
#<<<---------- SAVE AS OUTPUT TEXT FILE ------------>>>
# save as text output
import sys
file = open('output.txt', 'w+')
sys.stdout = file
print(*map(''.join, k), sep=', ')
file.close()
#<<<---------- END OF CODES ------------>>>

1 Answer


This line can be removed, as bigrm is reassigned on the next line:

bigrm = nltk.bigrams(tokens1)
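Going one step further: since cleantext is built by joining alphabetic words with single spaces, word_tokenize and str.split return the same tokens here, so a single tokenization can feed both the unigram and the bigram steps. A minimal sketch of that simplification:

unigrm1 = cleantext.split()  # same tokens as word_tokenize(cleantext) for this cleaned text
bigrm1 = [' '.join(t) for t in nltk.bigrams(unigrm1)]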

In this section file.close() is called twice, but the file is no longer open, so both calls can be discarded:

#<<<---------- SHOW NUMBER OF WORDS ------------>>>
countwords = text.split()
print('Number of words in raw file :', len(countwords))
print('Number of words in extracted file :', len(k))
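The same issue goes away entirely if the input file is opened with a with block, which closes the file automatically. A small sketch of that pattern:

with open('input.txt', 'rt') as file:
    text = file.read()
# the file is closed automatically when the with block exits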

Also, sys.stdout should be reset after being used.

orig_stdout = sys.stdout
sys.stdout = file
print(*map(''.join, k), sep=', ')
file.close()
sys.stdout = orig_stdout

That way you can continue to interact with the terminal after running the code, which should be a slight plus :)
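If you would rather not touch sys.stdout at all, print also accepts a file argument. And since k already contains strings, the map(''.join, k) step is a no-op that can be dropped. A sketch:

with open('output.txt', 'w') as out:
    print(*k, sep=', ', file=out)  # writes the items comma-separated, no stdout redirection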
