
I'm working on code that analyzes input text. One function I need help with builds a list of the words used, in order of descending frequency.

By referring to similar topics on Stack Overflow, I was able to keep only alphanumeric characters (removing all quotation marks, punctuation, etc.) and put each word into a list.

Here is the list I have now (in a variable called word_list):

['Hi', 'beautiful', 'creature', 'Said', 'by', 'Rothchild', 'the', 'biggest', 'enemy', 'of', 'Zun', 'Zun', 'started', 'get', 'afraid', 'of', 'him', 'As', 'her', 'best', 'friend', 'Lia', 'can', 'feel', 'her', 'fear', 'Why', 'the', 'the', 'hell', 'you', 'are', 'here']

(FYI, the text file is just a random piece of fan fiction I found on the web.)

However, I'm having trouble turning this list into one ordered by descending frequency. For example, 'the' appears 3 times in the list, so 'the' should become the first element. The next element would be 'of', which occurs 2 times.

I tried several things from similar questions (Counter, sorted), but they keep raising errors.

Can someone show me how to sort the list?

In addition, after sorting the list, how can I keep only one copy of each repeated word? (My current idea is to use a for loop and indexing: compare each element with the previous one and remove it if it is the same.)

Thank you.

Rahul Agarwal
DS Park

1 Answer


You can use a collections.Counter for the sorting, in a couple of different ways:

from collections import Counter

lst = ['Hi', 'beautiful', 'creature', 'Said', 'by', 'Rothchild', 'the', 'biggest', 'enemy', 'of', 'Zun', 'Zun', 'started', 'get', 'afraid', 'of', 'him', 'As', 'her', 'best', 'friend', 'Lia', 'can', 'feel', 'her', 'fear', 'Why', 'the', 'the', 'hell', 'you', 'are', 'here']

c = Counter(lst)  # mapping: {item: frequency}

# now you can use the counter directly via most_common (1.)
by_most_common = [x for x, _ in c.most_common()]
# or as a sort key (2.); lst itself is left untouched
by_sort_key = sorted(set(lst), key=c.get, reverse=True)

# e.g. (the order among equal-frequency items is arbitrary for the set-based version)
# ['the', 'Zun', 'of', 'her', 'Hi', 'hell', 'him', 'friend', 'Lia', 
#  'get', 'afraid', 'Rothchild', 'started', 'by', 'can', 'Why', 'fear', 
#  'you', 'are', 'biggest', 'enemy', 'Said', 'beautiful', 'here', 
#  'best', 'creature', 'As', 'feel']

These approaches remove the duplicates by using either the Counter keys (1.) or a set (2.).
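To make the duplicate removal concrete, here is a minimal sketch on a shortened, made-up word list (the names `words`, `counts`, `unique_from_counter` and `unique_from_set` are illustrative only, not from the question). It shows that the keys of a Counter and a set built from the same list hold the same unique words; only their ordering guarantees differ.

```python
from collections import Counter

# Shortened sample list, made up for illustration.
words = ['the', 'of', 'Zun', 'the', 'of', 'the', 'Zun', 'her']
counts = Counter(words)

# Counter is a dict subclass, so iterating it yields each word once.
unique_from_counter = list(counts)

# A set also holds each word once, but its iteration order is arbitrary.
unique_from_set = set(words)

# Same words either way; the frequencies live only in the Counter.
print(sorted(unique_from_counter) == sorted(unique_from_set))
print(counts['the'], counts['of'], counts['her'])
```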

However, if you want the sort to be stable with respect to the original list (keeping the order of first occurrence for items with equal frequency), you can remove the duplicates with the collections.OrderedDict based recipe instead:

from collections import OrderedDict

lst = sorted(OrderedDict.fromkeys(lst), key=c.get, reverse=True)

# ['the', 'of', 'Zun', 'her', 'Hi', 'beautiful', 'creature', 'Said', 
# 'by', 'Rothchild', 'biggest', 'enemy', 'started', 'get', 'afraid', 
# 'him', 'As', 'best', 'friend', 'Lia', 'can', 'feel', 'fear', 'Why',  
# 'hell', 'you', 'are', 'here']
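
As a side note, on Python 3.7 and later a plain dict also preserves insertion order, so the same stable result should be reachable without OrderedDict. The following is a minimal sketch under that assumption, using a shortened, made-up list; the names `word_list`, `counts`, `unique_in_order`, `by_frequency` and `via_most_common` are illustrative.

```python
from collections import Counter

# Shortened sample list, made up for illustration.
word_list = ['Hi', 'the', 'of', 'Zun', 'Zun', 'of', 'her', 'her', 'the', 'the']
counts = Counter(word_list)

# dict.fromkeys drops duplicates and keeps first-seen order (Python 3.7+).
unique_in_order = list(dict.fromkeys(word_list))

# sorted() is stable, also with reverse=True, so ties keep first-seen order.
by_frequency = sorted(unique_in_order, key=counts.get, reverse=True)

# On Python 3.7+, most_common() also orders ties by first encounter.
via_most_common = [word for word, _ in counts.most_common()]

print(by_frequency)
```

On older interpreters, where dict order is not guaranteed, the OrderedDict version above remains the safer choice.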
user2390182
  • Looks like I'm getting slow, +1 Although I would probably use `list.sort` since it seems that's what OP wanted. – cs95 Dec 11 '18 at 07:45
  • @coldspeed The OP writes his attempts included `sorted`. Also, which `list` are we calling `list.sort` on? The original list would cause more work than needed as the duplicates have not been removed yet. `sorted` has the advantage that it can deal with a `set` or `DictKeys` object turning it into a `list` in one go. – user2390182 Dec 11 '18 at 08:15
  • Thank you so much for the answer. I think I understood clearly about Counter by going over the code you wrote and testing with mine. – DS Park Dec 11 '18 at 18:02
  • Also, thank you for ordereddict which was the thing that I didn't know! – DS Park Dec 11 '18 at 18:02