
I'm doing a frequency analysis on a text file to find the 100 most commonly used words in it. Currently I'm using this code:

from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print Counter(words).most_common(100)

The code above works and the outputs are:

[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]

However, I want to display it in a table form with a header "Word" and "Count". I've tried using the prettytable package and came up with this:

from collections import Counter
import re
import prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())

for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt

It gives me ValueError: too many values to unpack. My question is, what's wrong with my code, and is there a way to display the data using prettytable? Also, how can I fix my code?

Bonus question: is there a way to leave out certain words while counting the frequency? E.g. skip the words and, if, of, etc.

Thanks.

Leon Z.
Vin23

2 Answers


Is this what you are trying to do?

from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

L = [('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]


for e in L:
    x.add_row([e[0],e[1]])

print x

Here is the result:

+-----------------------+--------+
|         Words         | Counts |
+-----------------------+--------+
|          the          |  1998  |
|           t           |  1829  |
|         https         |  1620  |
|           co          |  1604  |
|           to          |  1247  |
|          and          |  1053  |
|           in          |  957   |
|           a           |  899   |
|           of          |  821   |
|           i           |  789   |
|           is          |  784   |
|          you          |  753   |
|          will         |  654   |
|          for          |  601   |
|           on          |  574   |
|         thank         |  470   |
|           be          |  455   |
|         great         |  447   |
|        hillary        |  440   |
|           we          |  390   |
|          that         |  373   |
|           s           |  363   |
|           it          |  346   |
|          with         |  345   |
|           at          |  333   |
|           me          |  327   |
|          are          |  311   |
|          amp          |  290   |
|        clinton        |  288   |
|         trump         |  287   |
|          have         |  286   |
|          our          |  264   |
|    realdonaldtrump    |  256   |
|           my          |  244   |
|          all          |  237   |
|        crooked        |  236   |
|           so          |  233   |
|           by          |  226   |
|          this         |  222   |
|          was          |  217   |
|         people        |  216   |
|          has          |  210   |
|          not          |  210   |
|          just         |  210   |
|        america        |  204   |
|          she          |  190   |
|          they         |  188   |
|       trump2016       |  180   |
|          very         |  180   |
|          make         |  180   |
|          from         |  175   |
|           rt          |  170   |
|          out          |  169   |
|           he          |  168   |
|          her          |  164   |
| makeamericagreatagain |  164   |
|          join         |  161   |
|           as          |  158   |
|          new          |  157   |
|          who          |  155   |
|         again         |  154   |
|         about         |  145   |
|           no          |  142   |
|          get          |  138   |
|          more         |  137   |
|          now          |  136   |
|         today         |  136   |
|       president       |  135   |
|          can          |  134   |
|          time         |  123   |
|         media         |  123   |
|          vote         |  117   |
|          but          |  117   |
|           am          |  116   |
|          bad          |  116   |
|         going         |  115   |
|          maga         |  112   |
|           u           |  112   |
|          many         |  110   |
|           if          |  110   |
|        country        |  108   |
|          big          |  108   |
|          what         |  107   |
|          your         |  105   |
|          cnn          |  105   |
|         never         |  104   |
|          one          |  101   |
|           up          |  101   |
|          back         |   99   |
|          jobs         |   98   |
|        tonight        |   97   |
|           do          |   97   |
|          been         |   97   |
|         would         |   94   |
|         obama         |   93   |
|        tomorrow       |   88   |
|          said         |   88   |
|          like         |   88   |
|         should        |   87   |
|          when         |   86   |
+-----------------------+--------+

EDIT 1: If you want to leave out certain words you could do something like this:

for e in L:
    if e[0] != "and" and e[0] != "if" and e[0] != "of":
        x.add_row([e[0], e[1]])

EDIT 2: To sum up:

from collections import Counter
import re

words = re.findall(r'\w+', open('tweets.txt').read().lower())
counts = Counter(words).most_common(100)

from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

skip_list = ['and', 'if', 'or']  # see Joe's comment

for e in counts:
    if e[0] not in skip_list:
        x.add_row([e[0],e[1]])

print x
Loïc Poncin
  • Yes, something like this. but is it possible to not have the long lists of different words? – Vin23 Feb 22 '17 at 15:31
  • You mean that you want to pick each data from the text file and put it directly in the table ? Can you give me a link of the text file ? I want to see how the data is arranged in the file. – Loïc Poncin Feb 22 '17 at 15:38
  • You can define `skip_list = ['and', 'if', 'or']` and `if e[0] not in skip_list:` – Joe Bobson Feb 22 '17 at 15:42
  • Of course, why didn't I think of this ... Joe's answer is better if you want to leave out specific words – Loïc Poncin Feb 22 '17 at 15:45
  • Sorry, I have to admit that I do not really see how to help you avoid using a list; this is the first time I have used regex and collections. – Loïc Poncin Feb 22 '17 at 16:04
  • Off topic: I am wondering, why are you doing that ? I know that psychologists use that method to study the behaviour of some people. Or are you just interested in programming? – Loïc Poncin Feb 22 '17 at 16:21
  • Because the tweets I've gathered in the txt file are from one particular user. The code you wrote lets me change just one line to run it for a different user. – Vin23 Feb 22 '17 at 16:29
  • `[e[0], e[1]]` -> `e` – Mad Physicist Feb 22 '17 at 16:37
  • Also, this version will create a table with < 100 elements because it will remove the skip words after calling `most_common`. – Mad Physicist Feb 22 '17 at 16:38
  • Yes I didn't think about that. @Vin23 you should follow what MadPhysicist did. – Loïc Poncin Feb 22 '17 at 16:50

I am not sure how you expected the for loop you wrote to work. The error you are getting occurs because you are attempting to iterate over the tuple ('Word', words), which has two elements. On the first iteration, the statement for label, data in ('Word', words) tries to unpack the four-character string 'Word' into two names: it assigns 'W' to label, 'o' to data, and has 'r' and 'd' left over, hence the ValueError. Perhaps you meant to zip the items together instead? But then why would you make a new table for each word?
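
A minimal reproduction of the unpacking failure (using a toy words list), alongside the pairing that was probably intended:

```python
words = ['the', 'cat', 'sat']

# Iterating over the 2-tuple yields the string 'Word' first; unpacking
# its four characters into two names raises ValueError.
error = None
try:
    for label, data in ('Word', words):
        pass
except ValueError as exc:
    error = str(exc)

# The intended pairing is a one-element list of (label, data) tuples:
pairs = []
for label, data in [('Word', words)]:
    pairs.append((label, len(data)))
```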

Here is a rewritten version:

from collections import Counter
import re, prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)

To skip elements in the most common count, you can simply discard them from the counter before calling most_common. One easy way to do that is to define a list of invalid words, and then to filter them out with a dict comprehension:

bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})
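
Since a Counter behaves like a dict, an equivalent sketch (with made-up counts standing in for the real tweet data) deletes the unwanted keys in place instead of rebuilding the counter:

```python
from collections import Counter

# Toy counts standing in for the real tweet data.
c = Counter({'the': 1998, 'of': 821, 'great': 447, 'if': 110})
bad_words = ['the', 'if', 'of']

# Counter supports dict-style deletion; the default argument to pop
# avoids a KeyError if a bad word never occurred in the text.
for w in bad_words:
    c.pop(w, None)
```

This mutates the counter in place rather than building a new one; which you prefer is a matter of taste.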

Alternatively, you can do the filtering on the list of words before you make a counter out of it:

words = filter(lambda x: x not in bad_words, words)
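
One caveat: under Python 3, filter returns a lazy iterator rather than a list. A list comprehension (toy data here) behaves the same under both Python 2 and 3:

```python
bad_words = ['the', 'if', 'of']
words = ['the', 'great', 'wall', 'of', 'china']

# A list comprehension always yields a concrete list, so the result
# can be reused or indexed regardless of Python version.
words = [w for w in words if w not in bad_words]
```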

I prefer operating on the counter because that requires less work since the data has already been aggregated. Here is the combined code for reference:

from collections import Counter
import re, prettytable

bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())

c = Counter(words)
c = Counter({k: v for k, v in c.items() if k not in bad_words})

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)

print(pt)
Mad Physicist
  • i got an error from your code. File "test4.py", line 7, in pt.set_field_names(["Words", "Counts"]) File "C:\Python27\lib\site-packages\prettytable.py", line 217, in __getattr__ raise AttributeError(name) AttributeError: set_field_names – Vin23 Feb 22 '17 at 16:40
  • @Vin23. I fixed that. – Mad Physicist Feb 22 '17 at 16:41
  • @Vin23. The docs are a bit out of date for the library, my first version was based off of that. – Mad Physicist Feb 22 '17 at 16:42
  • This answer has only one advantage over Loïc's, which is that it makes a table of the 100 most common words *after* the skipped words were removed, not before. – Mad Physicist Feb 22 '17 at 16:43