I have a .txt file of "Alice's Adventures in Wonderland" and need to strip all the punctuation and make all of the words lowercase, so that I can count the unique words in the file. The wordlist referred to below is a single list of all the individual words from the book, as strings, so wordlist looks like this:
["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE',
'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I',
'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning',
'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank,'
The code I have so far is:
from string import punctuation

def wordcount(book):
    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)
This works for stripping the punctuation and making every word lowercase. However, newlist = lower_case.split() puts each word into its own one-item list, so I cannot iterate over one big list to count the unique words. I added the .split() so that, when iterated over, Python does not count every letter as a word; each word is kept intact because it is its own list item. Any ideas on how I can improve this, or is there a more efficient approach? Here is a sample of the output:
['down']
['the']
['rabbit-hole']
['alice']
['was']
['beginning']
['to']
['get']
['very']
['tired']
['of']
['sitting']
['by']
['her']
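
One possible direction, sketched under the assumption that wordlist really is the flat list of strings shown above: since each item is already a single word, the .split() may not be needed at all, and the cleaned words could go straight into one set, which keeps only distinct values. The names unique_words and sample below are made up for illustration and are not part of the original code.

```python
from string import punctuation


def unique_words(wordlist):
    """Return the set of distinct, cleaned words found in wordlist."""
    cleaned = set()
    for word in wordlist:
        # strip() only removes leading and trailing punctuation, so
        # 'bank,' becomes 'bank' while "Alice's" and 'Rabbit-Hole'
        # keep their inner apostrophe and hyphen.
        token = word.strip(punctuation).lower()
        if token:  # skip items that were nothing but punctuation
            cleaned.add(token)
    return cleaned


# Hypothetical sample taken from the start of the list in the question.
sample = ["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
          'ADVENTURES', 'IN', 'WONDERLAND', 'Down', 'the',
          'Rabbit-Hole', 'the', 'bank,']

words = unique_words(sample)
print(len(words))      # 8
print(sorted(words))
```

Adding each string to the set with add() treats the whole word as one element, so there is no risk of it being broken into letters. If one big list is wanted instead (for example, to count occurrences later), the same loop could append each token to a list rather than adding it to a set.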