
I was trying to write code to find the 10 most frequent words in a text. I'm new to Python and more used to languages like C#, Java, or even C++. Here is what I did:

f = open("bigtext.txt","r")

word_count = {}

Basically, my idea is to create a dictionary that holds the number of times each word occurs in my text. If the word is not yet in the dictionary, I add it with a value of 1. If the word is already in the dictionary, I increment its value by 1.

for x in f.read().split():
    if x not in word_count:
        word_count[x] = 1
    else:
        word_count[x] += 1

sorted(word_count.values)

Here, I sort my dictionary by values (since I'm looking for the 10 most frequent words, I need the 10 words with the largest values).

for keys,values in word_count.items():
    values = values + 1
    print(word_count[-values])
    if values == 10:
        break

Here is the part where it all fails. I now know for sure (since I sorted my dictionary by its values) that my 10 most frequent words are the last 10 elements of my dictionary. I want to display those. So I decided to initialize values at 1 and walk through my dictionary backward until values = 10, so that I don't display more than I need. Unfortunately, I get the following error:

File "<ipython-input-19-f5241b4c239c>", line 13
    for keys,values in word_count.items()
                                         ^
SyntaxError: invalid syntax

I know that my mistake is that I didn't display my dictionary backward correctly, but I don't know how else to proceed. If someone can tell me how to properly display the last 10 elements of my dictionary, I would very much appreciate it. Thank you.

wencakisa
  • You forgot a colon `:` after `items()` – UnholySheep Mar 03 '18 at 21:56
  • Take a look at [collections.Counter](https://docs.python.org/2/library/collections.html#collections.Counter). It will do the counting for you ;) – zvone Mar 03 '18 at 21:56
  • Dictionaries can’t be sorted. Your `sorted()` call will return a list of all the values (no keys) from the dictionary, but won’t do anything to the dictionary itself. – Ben Mar 03 '18 at 21:57
  • You're right, I did forget the `:` after `items()`. Now the error is the following: `'builtin_function_or_method' object is not iterable` – Georges Ridgmont Mar 03 '18 at 21:58
  • I would recommend looking into [`nltk`](https://www.nltk.org/). This will allow you to ignore common stopwords, etc. – user3483203 Mar 03 '18 at 21:59
  • I see, Ben. Thank you for your remark. I can't sort a dictionary. Can you suggest a more suitable data structure for my problem, please? – Georges Ridgmont Mar 03 '18 at 22:01
  • I recommend using the Python NLTK library. It is much larger and will help you with similar tasks in the future. Take a look at the accepted answer to this question: https://stackoverflow.com/questions/40669141/python-nltk-counting-word-and-phrase-frequency – Roman Gherta Mar 03 '18 at 22:06
  • I'll second the comment about collections.Counter. It has a built-in function for exactly this task; your entire code will fit in one line. – sbat Mar 03 '18 at 22:09

2 Answers


If you didn’t want to use collections.Counter, you could do something like this:

for word, count in sorted(word_count.items(), key=lambda x: -x[1])[:10]:
    print(word, count)

This puts all the words in the dictionary, along with their counts, into a list of tuples, sorts that list by the second item in each tuple (the count) in descending order, and then prints only the first (i.e. highest) ten.
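For comparison, the collections.Counter route mentioned in the comments might look roughly like the sketch below. The sample string is made up and only stands in for the contents of bigtext.txt; `Counter` and its `most_common` method are part of the standard library.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of bigtext.txt.
text = "the cat and the dog and the bird"

# Counter tallies each whitespace-separated word in a single pass.
word_count = Counter(text.split())

# most_common(n) returns up to n (word, count) tuples,
# ordered from the most frequent word to the least frequent.
top_words = word_count.most_common(10)

for word, count in top_words:
    print(word, count)
```

With a real file, `text` would be the result of reading the file, and the rest should stay the same.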

Ben

I would like to say a big thank you to Ben, who told me that I can't sort a dictionary like that.

So this is my final solution (hoping it helps someone else):

my_words = []

for keys, values in word_count.items():
    my_words.append((values,keys))

I created a list and added to it each value from my dictionary together with its corresponding word, as a (value, word) tuple.

my_words.sort(reverse = True)

I then sorted my list by value in reverse order (so that my 10 most frequent words would be the first 10 elements of my list).

print("The 10 most frequent words in this text are:")
print()

# Each tuple is (count, word), so unpack in that order.
for count, word in my_words[:10]:
    print(count, word)

I then simply displayed the first 10 elements of my list.
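One detail about sorting (value, word) tuples with reverse = True: when two words have the same count, the comparison falls back to the word itself, so tied words come out in reverse alphabetical order. Below is a minimal sketch, assuming a small made-up dictionary in place of the real word_count, that compares this with sorting by count descending and word ascending.

```python
# Hypothetical counts; in the code above these come from word_count.
word_count = {"pear": 2, "apple": 5, "fig": 2, "kiwi": 1}

# Same approach as above: (count, word) tuples sorted in reverse.
# Ties on count are broken by the word, also in reverse order.
by_tuple = sorted(
    ((count, word) for word, count in word_count.items()),
    reverse=True,
)

# Alternative: negate the count so the highest comes first,
# while tied words stay in normal alphabetical order.
by_key = sorted(word_count.items(), key=lambda item: (-item[1], item[0]))

print(by_tuple[:10])
print(by_key[:10])
```

Both versions put the most frequent words first; they differ only in how ties are ordered, which matters mostly when the cutoff at 10 falls in the middle of a group of equally frequent words.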

I would also like to thank all of you who told me about NLTK. I will try it later for a more efficient and accurate solution.

Thank you so much for your help.