When counting the occurrence of a string in a file, my code does not count the very first word

Question

Code

def main():
    try:
        file=input('Enter the name of the file you wish to open: ')
        thefile=open(file,'r')
        line=thefile.readline()
        line=line.replace('.','')
        line=line.replace(',','')
        thefilelist=line.split()
        thefilelistset=set(thefilelist)
        d={}
        for item in thefilelist:
            thefile.seek(0)
            wordcount=line.count(' '+item+' ')
            d[item]=wordcount
        for i in d.items():
            print(i)
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')
main()

Problem

Basically, I am trying to read the file, count the number of times each word appears in it, and then print each word with its count next to it.

It all works except that it will not count the first word in the file.

File I am using

My practice file, which I am testing this code on, contains this text:

This file is for testing. It is going to test how many times the words in here appear.

Output

('for', 1)
('going', 1)
('the', 1)
('testing', 1)
('is', 2)
('file', 1)
('test', 1)
('It', 1)
('This', 0)
('appear', 1)
('to', 1)
('times', 1)
('here', 1)
('how', 1)
('in', 1)
('words', 1)
('many', 1)

Note

Notice that it says 'This' appears 0 times, but it does in fact appear in the file.

Any ideas?

score 7 · Answer 1 · answered Nov 10 '15 at 22:04

My guess would be this line:

wordcount=line.count(' '+item+' ')

You are looking for "space" + YourWord + "space", and the first word is not preceded by space.
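
A small sketch of that behaviour, using a shortened version of the sample sentence with the period already removed (the variable names and the shortened text are assumptions for illustration, not the asker's exact data):

```python
# The first word starts at index 0, so nothing precedes it, not even a space.
line = "This file is for testing It is going to test"

first_word_hits = line.count(" This ")  # needs a space on BOTH sides
inner_word_hits = line.count(" is ")    # "is" is surrounded by spaces twice

print(first_word_hits)  # 0
print(inner_word_hits)  # 2
```

The same reasoning applies to the last word of a line if it is not followed by a space.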

Guilherme

Yep this is correct; what I was going to suggest as well. – Saroekin Nov 10 '15 at 22:07

emvee · Answer 2 · 2015-11-10T22:36:07.520

I would suggest more use of Python utilities. A big flaw is that you only read one line from the file.

Then you create a set of unique words and then start counting them individually which is highly inefficient; the line is traversed many times: once to create the set and then for each unique word.

Python has a built-in "high performance counter" (https://docs.python.org/2/library/collections.html#collections.Counter) which is specifically meant for use cases like this.

The following few lines replace your program; it also uses "re.split()" to split each line by word boundaries (https://docs.python.org/2/library/re.html#regular-expression-syntax).

The idea is to execute this split() function on each of the lines of the file and update the wordcounter with the results from this split. Also re.sub() is used to replace the dots and commas in one go before handing the line to the split function.

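import re, collections
from functools import reduce  # reduce is not a builtin in Python 3

# Written for Python 3, to match the question's use of input() and print().
with open(input('Enter the name of the file you wish to open: '), 'r') as file:
    # Drop dots and commas, split each line on runs of non-word characters,
    # skip the empty strings the split leaves behind, and fold it all into one Counter.
    for d in reduce(lambda acc, line: acc.update(w for w in re.split(r"\W+", line) if w) or acc,
                    map(lambda line: re.sub(r"[.,]", "", line), file),
                    collections.Counter()).items():
        print(d)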

I see Guido has come along and downvoted a nice, functional programming oriented, solution. I know it's frowned upon in Python but it is still valid, correct and efficient. — emvee, Nov 12 '15 at 10:54
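
For comparison, here is a minimal sketch of the same Counter idea written as a plain loop rather than with reduce. The function name count_words and the sample text are illustrative assumptions, not part of the answer above. Note that \w+ also treats digits and underscores as word characters.

```python
import re
from collections import Counter

def count_words(lines):
    """Count words across an iterable of lines (an open file works too)."""
    counts = Counter()
    for line in lines:
        # findall returns only the matched words, so no empty strings appear
        counts.update(re.findall(r"\w+", line))
    return counts

sample = ["This file is for testing. It is going to test how many times",
          "the words in here appear."]
counts = count_words(sample)
print(counts["This"], counts["is"])  # 1 2
```

Because count_words accepts any iterable of strings, passing an open file object counts every line of the file, not just the first.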

score 3 · Accepted Answer · answered Nov 10 '15 at 22:18

If you want a simple fix, the problem is in this line:

wordcount=line.count(' '+item+' ')

There is no space before "This".

I think there are a couple of ways to fix it, but I recommend using a with block and .readlines().

I also recommend using more of Python's capabilities. A couple of points: first, if the file has more than one line, this code won't work. Second, if one sentence runs straight into the next with no space (lastwordofsentence.Firstwordofnextsentence), removing the period joins the two words into one. Please change your replace calls to substitute a space, that is, change '' to ' '. split() treats a run of spaces as a single separator, so the extra spaces do no harm.
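
A quick sketch of the difference between the two replacements, on a made-up sentence (the text and variable names are assumptions for illustration):

```python
text = "It works.Then it breaks."

# Removing the period glues the neighbouring words together.
joined = text.replace(".", "").split()
# Replacing it with a space keeps them apart; split() ignores the extra spaces.
separated = text.replace(".", " ").split()

print(joined)     # ['It', 'worksThen', 'it', 'breaks']
print(separated)  # ['It', 'works', 'Then', 'it', 'breaks']
```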

Also, please post whether you are using Python 2.7 or 3.X. It helps with small possible syntax problems.

filename = input('Enter the name of the file you wish to open: ')
# A with block closes the file for you, even on error (it does not catch IOError itself)
with open(filename, "r") as f:
    all_lines = f.readlines()

d={} # Create empty dictionary

# Iterate through all lines in file
for line in all_lines:

    # Replace periods and commas with spaces
    line=line.replace('.',' ')
    line=line.replace(',',' ')

    # Get all words on this line
    words_in_this_line = line.split() # Split into all words

    # Iterate through all words
    for word in words_in_this_line:
        #Check if word already exists in dictionary
        if word in d: # Word exists increment count
            d[word] += 1
        else: #Word doesn't exist, add it with count 1
            d[word] = 1

# Print all words with frequency of occurrence in file
for i in d.items():
    print(i)

Also, just to be sure: this will count `this` and `This` as different words. Keep it as is if you want that behaviour; if you want them treated as the same word, simply use `word = word.lower()` or something similar as the first line of the `for` loop – napkinsterror, Nov 10 '15 at 22:22
Also, if you work through a basic regex tutorial and `import re`, you should not need to replace commas, periods and other punctuation at all. Simply calling `re.findall(r'\w+', line)` and iterating over the result finds all words made up of letters, digits and underscores, or use `re.findall(r'[A-Za-z]+', line)` for words made up of letters only. Learning regex is weird and takes about 10 - 20 minutes, but will make your life easier in the long run. Start here http://regexone.com/ – napkinsterror, Nov 10 '15 at 22:34
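
The two suggestions in these comments could be combined roughly as follows. This is only a sketch; the helper name normalized_words and the sample sentence are assumptions made for illustration.

```python
import re

def normalized_words(line):
    # \w+ matches runs of letters, digits and underscores, so punctuation is
    # skipped; lower() makes "This", "this" and "THIS" compare equal.
    return [word.lower() for word in re.findall(r"\w+", line)]

words = normalized_words("This is this, and THIS.")
print(words)  # ['this', 'is', 'this', 'and', 'this']
```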

score 1 · Answer 4 · answered Nov 10 '15 at 22:06

You check if line contains ' '+item+' ', which means you are searching for a word starting and ending with a space. Because "This" is the first word of the line, it is not surrounded by two spaces.

To fix that, you can use the following code:

wordcount=(' '+line+' ').count(' '+item+' ')

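The code above ensures that the first and the last word are counted correctly, provided the line does not end in a newline (readline() keeps it, so strip it first with line.strip()).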
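
A short sketch of the padding approach, with one caveat worth knowing. The sample line is an assumption for illustration; the trailing newline mimics what readline() returns.

```python
line = "This file is for testing\n"   # readline() keeps the trailing newline
padded = " " + line.strip() + " "     # strip the newline, then pad both ends

first_hits = padded.count(" This ")
last_hits = padded.count(" testing ")
print(first_hits, last_hits)  # 1 1

# Caveat: str.count() counts non-overlapping matches, and two adjacent
# occurrences share the space between them, so only one is found.
repeat_hits = " is is ".count(" is ")
print(repeat_hits)  # 1
```

Splitting the line into words and counting the list entries avoids the shared-space problem.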

Alex · Answer 5 · 2015-11-10T22:28:19.030

The problem is in this line: wordcount=line.count(' '+item+' '). The first word will not have a space in front of it. I have also removed some other redundant statements from your code:

import string

def main():
    try:
        #file=input('Enter the name of the file you wish to open: ')
        thefile=open('C:/Projects/Python/data.txt','r')
        line=thefile.readline()
        # Python 3: str.maketrans with a third argument deletes those characters
        line = line.translate(str.maketrans("", "", string.punctuation))
        thefilelist=line.split()
        d={}
        for item in thefilelist:
            if item not in d:
                d[item] = 0
            d[item] = d[item]+1 
        for i in d.items():
            print(i)   
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')


main()

edited Nov 10 '15 at 22:28

answered Nov 10 '15 at 22:09

Alex

Are you sure `line.count(item)` works? Searching for the word `the` inside `... Motherboard ...` for example will increase the counter even if `Motherboard` is definitely not the same word as `the`. – ByteHamster Nov 10 '15 at 22:13
It looks like similar words like: `happy` and `unhappy` will not be counted properly. Just a thought. – Dariusz Bączkowski Nov 10 '15 at 22:14
Thanks @ESYSCODER, I have fixed this bug in my code. – Alex Nov 10 '15 at 22:20
I have also modified the code to handle more punctuation characters. – Alex Nov 10 '15 at 22:28
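
The substring problem raised in these comments can be seen in a few lines. The sample text and variable names are assumptions for illustration.

```python
text = "the Motherboard and the other board"

# str.count() matches substrings, so "Motherboard" and "other" also hit.
substring_hits = text.count("the")
# Splitting first compares whole words only.
whole_word_hits = text.split().count("the")

print(substring_hits)   # 4
print(whole_word_hits)  # 2
```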

score 0 · Answer 6 · answered Nov 10 '15 at 22:11

'This' does not have a space in front of it.

Quick fix:

line= ' ' + thefile.readline()

But there are many problems in your code. For example:

What about a multi-line file?
What about a file without a . at the end?
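
One way to cover both cases is to split the whole text on whitespace and strip punctuation from each token, instead of searching for space-delimited substrings. This is a hedged sketch, not the asker's method; the function name count_words_in_text and the sample text are assumptions.

```python
import string

def count_words_in_text(text):
    counts = {}
    for token in text.split():                   # splits on spaces and newlines alike
        word = token.strip(string.punctuation)   # trims leading/trailing punctuation
        if word:                                 # skip tokens that were only punctuation
            counts[word] = counts.get(word, 0) + 1
    return counts

# Two lines, and no period at the end of the last one.
sample = "This file is for testing.\nIt is going to test"
result = count_words_in_text(sample)
print(result["This"], result["is"], result["test"])  # 1 2 1
```

With a real file, the text would come from reading the whole file (for example f.read() inside a with block) rather than a single readline() call.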

Dariusz Bączkowski
