
I am trying to create a Python program that reads two text files: one containing an article and the other containing a list of "stop words" (one word per line). I would like to determine how many of these stop words appear in the article file, i.e. the cumulative total of each stop word's frequency.

I tried to do this with nested for loops: the outer loop iterates over each line of the file containing the article, and the inner loop iterates over the list of stop words, checking whether each stop word is in the current line and, if so, how often. I then add that count to an accumulator that keeps track of the total number of stop words found in the article file.

Currently, when I run it, it says there are 0 stop words in the file, which is incorrect.

import string

def main():

    analyzed_file  = open('LearnToCode_LearnToThink.txt', 'r')
    stop_word_file = open('stopwords.txt', 'r')

    stop_word_accumulator = 0

    for analyzed_line in analyzed_file.readlines():

        formatted_line = remove_punctuation(analyzed_line)

        for stop_word_line in stop_word_file.readlines():
            stop_formatted_line = create_stopword_list(stop_word_line)
            if stop_formatted_line in formatted_line:
                stop_word_frequency = formatted_line.count(stop_formatted_line)
                stop_word_accumulator += stop_word_frequency

        print("there are ",stop_word_accumulator, " words")


        stop_word_file.close()
        analyzed_file.close()


def create_stopword_list(stop_word_text):
    clean_words = [] # create an empty list
    stop_word_text = stop_word_text.rstrip() # remove trailing whitespace characters
    new_words = stop_word_text.split() # create a list of words from the text
    for word in new_words: # normalize and add to list
        clean_words.append(word.strip(string.punctuation).lower())
    return clean_words



def remove_punctuation(text):
    clean_words = [] # create an empty list
    text = text.rstrip() # remove trailing whitespace characters
    words = text.split() # create a list of words from the text
    for word in words: # normalize and add to list
        clean_words.append(word.strip(string.punctuation).lower())
    return clean_words


main()

2 Answers


You can use NLTK to check for stop words and count them (here using NLTK's Italian stop word list, since the sample text is Italian):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

x = ("Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura, "
     "ché la diritta via era smarrita. Ahi quanto a dir qual era è cosa dura "
     "esta selva selvaggia e aspra e forte che nel pensier rinova la paura! "
     "Tant' è amara che poco è più morte; ma per trattar del ben ch'i' vi "
     "trovai, dirò de l altre cose chi v ho scorte.")

stopWords = set(stopwords.words('italian'))  # the sample text is Italian

word_tokens = word_tokenize(x)  # split the text into tokens

stopwords_x = [w for w in word_tokens if w in stopWords]
print(len(stopwords_x))                           # number of stop word tokens
print(len(stopwords_x) / len(word_tokens) * 100)  # percentage of stop words

You have numerous problems:

  1. `readlines` will only work once - after that, you're at the end of the file and it returns an empty list (see the short demo after this list).
  2. It's absurdly inefficient to recreate the list of stop words for every line in the other file anyway.
  3. `one_list in another_list` and `one_list.count(another_list)` don't do what you seem to think they do.
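
A quick illustration of points 1 and 3 (a hypothetical minimal session, not from the original post):

with open('stopwords.txt') as f:
    print(f.readlines())  # first call: a list of lines
    print(f.readlines())  # second call: [] - the file position is at the end

line_words = ['the', 'quick', 'brown', 'fox']
stop_words = ['the', 'a']
print(stop_words in line_words)      # False - tests for the whole *list* as one element
print(line_words.count(stop_words))  # 0 - counts occurrences of the whole *list*
print(line_words.count('the'))       # 1 - counting a single word does work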

Instead, try something like:

stop_words = get_stop_word_list(stop_words_file_name)

stop_word_count = 0

with open(other_file_name) as other_file:  # note 'context manager' file handling
    for line in other_file:
        cleaned_line = clean(line)
        for stop_word in stop_words:
            if stop_word in cleaned_line:
                stop_word_count += cleaned_line.count(stop_word)

There are more efficient approaches (using e.g. sets and collections.Counter), but that should get you started.
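
For instance, using a set for O(1) membership tests and a collections.Counter to tally every word in a single pass (a sketch built on the same hypothetical helpers as above - get_stop_word_list and clean, where clean is assumed to return a list of lowercase words):

from collections import Counter

stop_words = set(get_stop_word_list(stop_words_file_name))  # set: O(1) lookups

word_counts = Counter()
with open(other_file_name) as other_file:
    for line in other_file:
        word_counts.update(clean(line))  # Counter tallies each word's frequency

# sum the frequencies of only the stop words
stop_word_count = sum(count for word, count in word_counts.items()
                      if word in stop_words)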

  • I suggest replacing the inner for loop with `stop_word_count += sum(map(cleaned_line.count, stop_words))` (maybe replace `map` with `imap`). Is there a reason why you checked if the word was present before calling `count`? – Alex Hall Oct 09 '15 at 15:09
  • Ok, I'll try that out @jonrsharpe, and let you know if that works; if it doesn't, I'll post the code of what I revised it to – heyyo9028 Oct 09 '15 at 15:11
  • @AlexHall mostly to keep it reasonably close to what the OP was currently trying to do! It's only marginally less efficient than what you're suggesting, and really you can do it with only one pass over each line if you use dictionaries. – jonrsharpe Oct 09 '15 at 15:12
  • @heyyo9028 you're joking, right? Indentation matters in Python, how are we supposed to read it in a comment box? Get a [rubber duck](https://en.wikipedia.org/wiki/Rubber_duck_debugging). – jonrsharpe Oct 09 '15 at 15:18
  • Ok, I'm going to put what I have now in my original post – heyyo9028 Oct 09 '15 at 15:21
  • @heyyo9028 this isn't an incremental debugging service. Please put some more effort into understanding what I've told you, reading the related docs and implementing it yourself. **Don't** edit the question, as it invalidates my answer. Please take it away and think about it for a while (at least 24hrs, as a rule). – jonrsharpe Oct 09 '15 at 15:22
  • By the way, trying what I did with your suggestion caused the rstrip() built-in function to not work, according to the traceback after running it. Hence why I was using readlines() to begin with, so that rstrip() would work – heyyo9028 Oct 09 '15 at 15:25
  • Note that I carefully *didn't* use the same function names as you - you will need to rewrite those helper functions, too. Learn to read and understand the error messages the interpreter gives you, they're often very useful. – jonrsharpe Oct 09 '15 at 15:26
  • @AlexHall would your suggestion work with basically what I have now if i simply add it in? – heyyo9028 Oct 09 '15 at 15:36
  • @heyyo9028 no, you're further from working code than that. Alex's suggestion is a tweak to what I've proposed. – jonrsharpe Oct 09 '15 at 15:38
  • Sorry, I just don't understand how what I have now is totally wrong, because when I was testing the way I did it on each separate file, it counted the correct number of words for each one – heyyo9028 Oct 09 '15 at 15:39
  • like I don't see what is possibly different with what you suggested – heyyo9028 Oct 09 '15 at 15:46