This is about plain word counting: collecting which words appear in a document and how often.
I am trying to write a function where the input is a list of text lines. I go through all the lines, split them into words, accumulate the recognized words, and finally return the complete list.
First, I have a while loop that goes through all the characters in each line but skips whitespace. Inside this while loop I also try to recognize what kind of word I have. In this context, there are three kinds of words:
- those starting with a letter;
- those starting with a digit;
- and those which contain only one character which is neither letter nor digit.
I have three if statements which check what kind of character I have. When I know what kind of word I have encountered, I try to extract the word itself. When the word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word.
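To make the three cases concrete, here is a minimal sketch of the per-character check using the built-in str methods isspace(), isalpha() and isdigit(). The helper name char_kind is hypothetical and only for illustration; it is not part of wordfreq.py.

```python
def char_kind(ch):
    """Classify one character (hypothetical helper, not in wordfreq.py)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    # Anything else forms a one-character word on its own.
    return "symbol"


print([char_kind(c) for c in "1, a&"])
# ['digit', 'symbol', 'space', 'letter', 'symbol']
```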
The problem is in the third branch, which handles the case where the current character is neither a letter nor a digit.
When I give the input
wordfreq.tokenize(['15, delicious& Tarts.'])
I want the output to be
['15', ',', 'delicious', '&', 'tarts', '.']
When I test the function in the Python Console, it looks like this:
PyDev console: starting.
Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52)
[Clang 6.0 (clang-600.0.57)] on darwin
>>> import wordfreq
>>> wordfreq.tokenize(['15, delicious& Tarts.'])
['15', 'delicious', 'tarts']
The function takes neither the comma, the ampersand, nor the dot into account! How do I fix this? See below for the code.
(The lower() call is there because I want to ignore capitalization, e.g. 'Tarts' and 'tarts' should count as the same word.)
# wordfreq.py
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
                start = start + 1
    return words
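
A side note on the inner loops: they index line[start] and line[end] without checking the length, so a line that ends in whitespace, a letter, or a digit can raise IndexError. Below is a minimal sketch of a bounds-checked variant with the same three cases. The name tokenize_checked is hypothetical and not part of wordfreq.py, and this is only one possible way to write it.

```python
def tokenize_checked(lines):
    """Hypothetical bounds-checked variant of tokenize (illustration only)."""
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            ch = line[start]
            if ch.isspace():
                # Skip whitespace one character at a time.
                start += 1
                continue
            end = start + 1
            if ch.isalpha():
                # Extend over consecutive letters, never past the end of the line.
                while end < len(line) and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
            elif ch.isdigit():
                # Extend over consecutive digits, never past the end of the line.
                while end < len(line) and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
            else:
                # Any other character is a one-character word.
                words.append(ch)
            start = end
    return words


print(tokenize_checked(['Tarts 15 ']))
# ['tarts', '15']
```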