
I have a .txt file of "Alice's Adventures in Wonderland" and need to strip all the punctuation and make every word lowercase so I can count the number of unique words in the file. The `wordlist` referred to below is a single list containing every word from the book as a string, so `wordlist` looks like this:


    ["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S", 
    'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE', 
    'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I', 
    'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 
    'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 
    'sister', 'on', 'the', 'bank,'

   

The code I have so far is:


from string import punctuation
def wordcount(book):

    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)

This works for stripping punctuation and making all words lowercase. However, `newlist = lower_case.split()` creates a separate one-item list for every word, so I cannot iterate over one big list to find the number of unique words. I used `.split()` so that, when iterating, Python does not count every letter as a word; each word stays intact because it is its own list item. Any ideas on how I can improve this, or is there a more efficient approach? Here is a sample of the output:


    ['down']
    ['the']
    ['rabbit-hole']
    ['alice']
    ['was']
    ['beginning']
    ['to']
    ['get']
    ['very']
    ['tired']
    ['of']
    ['sitting']
    ['by']
    ['her']

gavmross

1 Answer


Here is a modification of your code, along with its output:

from string import punctuation

wordlist = "Alice fell down down down!.. down into, the hole."

single_list = []
for word in wordlist.split(" "):
    no_punc = word.strip(punctuation)
    lower_case = no_punc.lower()
    newlist = lower_case.split()
    #print(newlist)
    single_list.append(newlist[0])

print(single_list)
#to get the unique
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))

and that produces:

['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole']

and the unique set:

{'fell', 'alice', 'down', 'into', 'the', 'hole'}

and the length of the unique:

6

(This may not be the most efficient approach, but it stays close to your current code and is fast enough for a book with thousands of elements. If this were a backend process serving multiple requests, it would be worth optimizing further.)
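If the words are already stored as one flat list of strings, as in the question, a more compact variant is possible. The following is a minimal sketch, not a drop-in replacement: the function name `unique_words` and the `sample` list are made up for illustration. It builds the set directly and drops any item that is empty after stripping.

```python
from string import punctuation


def unique_words(words):
    """Return the distinct words, ignoring case and leading/trailing punctuation.

    `words` is assumed to be a flat list of strings, one word per item.
    """
    # Trim punctuation from both ends of each word, then lowercase it.
    cleaned = (w.strip(punctuation).lower() for w in words)
    # Skip items that are empty after stripping (e.g. "" or "--").
    return {w for w in cleaned if w}


# Hypothetical sample in the same shape as the list in the question.
sample = ["Alice's", "ALICE'S", "Down", "the", "Rabbit-Hole", "THE", "bank,"]
unique = unique_words(sample)
print(sorted(unique))  # ["alice's", 'bank', 'down', 'rabbit-hole', 'the']
print(len(unique))     # 5
```

Note that `str.strip` only removes punctuation at the start and end of each word, so the apostrophe in "alice's" and the hyphen in "rabbit-hole" are kept.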

EDIT----------

You may be reading the file with a library that returns a list, in which case the code above raises `AttributeError: 'list' object has no attribute 'split'`. You might also see `IndexError: list index out of range`, which is caused by an empty string. In either case, use this modification:

from string import punctuation

wordlist2 = ["","Alice fell down down down!.. down into, the hole.", "There was only one hole for Alice to fall down into"]


single_list = []
for wordlist in wordlist2:
    for word in wordlist.split(" "):
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        #print(newlist)
        if newlist:  # skip tokens that are empty after stripping
            single_list.append(newlist[0])

print(single_list)
#to get the unique
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))

producing:

['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole', 'there', 'was', 'only', 'one', 'hole', 'for', 'alice', 'to', 'fall', 'down', 'into']
{'there', 'fall', 'fell', 'alice', 'for', 'down', 'was', 'into', 'the', 'to', 'only', 'hole', 'one'}
13
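
To see where the `IndexError` comes from, it may help to look at a single token in isolation. This is only an illustrative sketch, and the helper name `first_token` is hypothetical. An empty string, or a token made entirely of punctuation, is empty after stripping; calling `.split()` on an empty string returns an empty list, so indexing it with `[0]` fails.

```python
from string import punctuation


def first_token(word):
    """Return the cleaned word, or None if nothing is left after stripping."""
    cleaned = word.strip(punctuation).lower()
    parts = cleaned.split()  # "".split() returns [], so parts may be empty
    return parts[0] if parts else None


print(first_token("Down,"))  # down
print(first_token(""))       # None (parts[0] would raise IndexError here)
print(first_token("!.."))    # None (only punctuation, nothing left)
```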
Vass
  • One important thing I forgot to mention, in order to handle all of the words individually, they are stored in `wordlist` as one big `list` of individual `strings` so in order to work with `wordlist`, I have to work with it as a `list` object. So running your code @Vass resulted in the error `list has no attribute split` (since `wordlist` is a list) – gavmross Nov 04 '21 at 00:51
  • @gavmross, made an edit where the data of the text is in a list which fixes that error you mention (I reproduced the error and fixed it) – Vass Nov 04 '21 at 00:54
  • Thank you for the edit, but it is still not working. I edited my question so you can see how each word is its own item in the list, and I am still getting `list index out of range` for `single_list.append(newlist[0])`. I am guessing it has something to do with `newlist` being reassigned for every single word, so the last list it is assigned to is ['end'], but I am definitely not sure. @Vass – gavmross Nov 04 '21 at 01:04
  • @gavmross, I made a modification to the data `wordlist2 = ["","Alice fell down down down!.. down into, the hole.", "There was only one hole for Alice to fall down into"] ` and it produces the error, so an empty string exists and can be fixed by testing for the length, will update – Vass Nov 04 '21 at 01:47