How to count word occurrences without being constrained to only exact matches

Question

I have a file which has content like below.

Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response

I have a Python script which counts the number of times a particular word occurs in a file. Here is the script.

#!/usr/bin/env python

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

I expect the output to be 4, since Hello occurs 4 times, but I get 2. Here is the output of the script:

Number of words in /path/to/file.txt is: 2

I understand that Hello; is not counted as Hello because the token is not exactly the word being searched for.
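To illustrate the mismatch, here is a minimal sketch that prints the tokens `split()` produces for the first line of the sample. It only demonstrates the behaviour described above and is not part of the original script.

```python
# First line of the sample file from the question.
line = "Someone says; Hello; Someone responded Hello back"

# split() breaks on whitespace only, so punctuation stays attached to the word.
words = line.split()
print(words)
# ['Someone', 'says;', 'Hello;', 'Someone', 'responded', 'Hello', 'back']

# Only the bare token compares equal to the search string.
exact_matches = words.count("Hello")
print(exact_matches)  # 1
```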

Question:
Is there a way to make my script count Hello even if it is followed by a comma, a semicolon or a dot? I am looking for a simple technique that doesn't require searching for substrings again within each word found.

Possible duplicate of [Does Python have a string 'contains' substring method?](https://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-substring-method) — mkrieger1, Jun 06 '19 at 20:45
This is not a duplicate of contains, `helloes` and `helloing` are different words, but contain hello. — munk, Jun 06 '19 at 20:45

score 1 · Accepted Answer · answered Jun 06 '19 at 20:45

Regex would be a better tool for this, since you want to ignore punctuation. It could be done with clever filtering and .count() methods, but this is more straightforward:

import re
...
search_string = "Hello"
with open(filename, 'r') as file:
    filetext = file.read()
occurrences = len(re.findall(search_string, filetext))

print("Number of words in " + filename + " is: " + str(occurrences))

If you want to accept either an upper- or lowercase first letter, you could change search_string accordingly:

search_string = r"[Hh]ello"
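Note that the character class only covers the first letter. As a minimal sketch, assuming you want to ignore case entirely, the `re.IGNORECASE` flag can be passed to `re.findall` instead. The sample string below is made up for illustration and is not from the question's file.

```python
import re

# Hypothetical sample text, chosen to show the difference between the two patterns.
sample = "Hello; hello, HELLO. hElLo"

# [Hh]ello only accepts an upper- or lowercase first letter.
first_letter_only = len(re.findall(r"[Hh]ello", sample))

# re.IGNORECASE makes every letter of the pattern case-insensitive.
any_case = len(re.findall(r"Hello", sample, flags=re.IGNORECASE))

print(first_letter_only, any_case)  # 2 4
```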

Or if you want explicitly the word Hello but not aHello or Hellon, you could match the word boundary \b before and after it (see the documentation for more fun tricks):

search_string = r"\bHello\b"
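Putting the pieces together, a sketch of a whole-word counter might look like the following. The helper name `count_word` is hypothetical, and `re.escape` is an addition not in the original answer; it is there in case the search word contains regex metacharacters. This assumes the word starts and ends with a letter, digit or underscore, since `\b` only marks a boundary next to such characters.

```python
import re

def count_word(text, word):
    """Count whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape keeps characters such as '.' or '+' in `word` literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))

# Sample content from the question.
text = """Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response"""

print(count_word(text, "Hello"))             # 4
print(count_word("aHello Hellon", "Hello"))  # 0
```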

This one looks to be the simplest. Thanks for the tip about `r"\bHello\b"`. That is useful to know for this kind of a problem. — AdeleGoldberg, Jun 06 '19 at 20:54

score 1 · Answer 2 · answered Jun 06 '19 at 20:47

You can use regex and Counter from the collections module:

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

import re
from collections import Counter
from pprint import pprint

# \b\w+\b matches each run of word characters, so punctuation is left out.
c = Counter(re.findall(r'\b\w+\b', txt))
pprint(c)

Prints:

Counter({'Someone': 4,
         'Hello': 4,
         'again': 2,
         'said': 2,
         'response': 2,
         'says': 1,
         'responded': 1,
         'back': 1,
         'No': 1,
         'waiting': 1,
         'for': 1})
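If only one word is of interest, the count can be read straight from the Counter. This is a minimal sketch built on the same sample text; the word "Goodbye" is only there to show that a missing word yields 0 rather than an error.

```python
import re
from collections import Counter

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

# Build the word counts once, then look up any word by key.
c = Counter(re.findall(r'\b\w+\b', txt))

hello_count = c['Hello']
missing_count = c['Goodbye']  # Counter returns 0 for keys it has not seen

print(hello_count, missing_count)  # 4 0
```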

Jack Walsh · Answer 3 · 2019-06-06T20:55:48.250

You can use regular expressions to find the answer.

import re
filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"


with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
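            # Match search_string as a whole word; \b already excludes trailing punctuation.
            b = re.search(r'\b' + re.escape(search_string) + r'\b', i)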
            if b:
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

This counts every whitespace-separated token that contains Hello as a whole word, so "Hello", "Hello;" and "Hello," all match. You can expand the regex to fit any other needs (such as lowercase).

It ignores words such as "Helloing", which a plain "Hello" pattern would count.

If you prefer not to use regex, you can check whether removing a trailing semicolon makes the token a match, as below:

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string) or (i[:-1] == search_string and i[-1] == ';'):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))
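The check above only handles a trailing semicolon. As a sketch of a more general regex-free variant, each token can be stripped of surrounding punctuation with `str.strip` and `string.punctuation` before comparing. The helper name `count_tokens` is hypothetical, and the lines are passed in as a list here instead of being read from a file so the example runs on its own.

```python
import string

def count_tokens(lines, word):
    """Count tokens equal to `word` once surrounding punctuation is removed."""
    count = 0
    for line in lines:
        for token in line.split():
            # strip() removes any leading/trailing ASCII punctuation characters.
            if token.strip(string.punctuation) == word:
                count += 1
    return count

# Sample content from the question.
lines = [
    "Someone says; Hello; Someone responded Hello back",
    "Someone again said; Hello; No response",
    "Someone again said; Hello waiting for response",
]

print(count_tokens(lines, "Hello"))  # 4
```

Because it still compares whole tokens, "Helloing" is not counted, but neither is a token such as "Hello;Hello" that has no whitespace between the words.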
