0

I want to build a data structure that records the number of occurrences of each word and maps the words in the right order.

For example:

word_1 => 10 occurrences

word_2 => 5 occurrences

word_3 => 12 occurrences

word_4 => 2 occurrences

and each word has one id to represent it:

kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

so an ordered list (most frequent first) would be:

ordered_vocab = [2, 0, 1, 3]

At the moment my code is this:

# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")

How can I do this efficiently?

denis Candido

4 Answers

3

That's what Counters were made for:

from collections import Counter
cnt = Counter()

with open(DATASET_FILE) as fp:
    for line in fp:  # iterate lazily instead of loading every line at once
        for word in line.split():
            cnt[word] += 1

Or (shorter and more "beautiful" using a generator):

from collections import Counter

with open(DATASET_FILE) as fp:
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
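
To get from the Counter to the ordered list of ids the question asks for (and to print the top few words), something along these lines might work. This is only a sketch: it assumes the kw2id mapping from the question covers every counted word, and it uses a small stand-in Counter with the question's example counts instead of reading DATASET_FILE.

```python
from collections import Counter

# Word-to-id mapping from the question (assumed to contain every counted word).
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# Stand-in for the Counter built from the file, using the question's example counts.
cnt = Counter({'word_1': 10, 'word_2': 5, 'word_3': 12, 'word_4': 2})

# most_common(n) returns the n most frequent (word, count) pairs.
top_3 = cnt.most_common(3)
print(top_3)

# most_common() without an argument returns all pairs, most frequent first;
# map each word to its id while keeping that order.
ordered_vocab = [kw2id[word] for word, _ in cnt.most_common()]
print(ordered_vocab)
```

With the example counts this gives the same ordered_vocab as in the question.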
Jan
  • How can I print, for example, the top 3 words of the Counter object? – denis Candido Oct 24 '17 at 18:48
  • Nvm, just use an iteration... Thanks a lot, this solves the problem. – denis Candido Oct 24 '17 at 18:49
  • @denisCandido: You're welcome. – Jan Oct 24 '17 at 19:29
  • but in that case that's a triplicate answered a zillion times... I found 3 Q&A in 2 minutes with the same stuff. – Jean-François Fabre Oct 24 '17 at 19:46
  • That's not the ultimate goal. If you have already seen such a question, or a simple Google search finds 3 hits, you just shouldn't answer (or flag as dupes and answer in the original questions if your answer is different/better/more up to date). Doesn't work every time, but sometimes it does: https://stackoverflow.com/questions/9072844/how-can-i-check-if-a-string-contains-any-letters-from-the-alphabet/43002960#43002960 – Jean-François Fabre Oct 24 '17 at 19:58
  • Sorry, but I used the search and didn't find any question like this. – denis Candido Oct 25 '17 at 09:31
2

This is a slightly faster version of your code. I'm sorry I don't know numpy very well, but maybe this will help. The changes I have made are enumerate and defaultdict(int) (you do not have to accept this answer, just trying to help).

from collections import defaultdict

# build a vocabulary with the number of occurrences
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    for count, line in enumerate(file_handle, 1):  # start at 1 so the progress message matches the lines processed
        for word in line.split():
            vocab[word] += 1
        if not count % 100000:
            print(count, "documents processed")

Also, defaultdict(int) appears to be about twice as fast as Counter() when incrementing from 0 in a for loop (running Python 3.44):

from collections import Counter
from collections import defaultdict
import time

words = " ".join(["word_"+str(x) for x in range(100)])
lines = [words for i in range(100000)]

counter_dict = Counter()
default_dict = defaultdict(int)

start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print (end-start)

start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print (end-start)

Results (seconds, Counter first, then defaultdict):

5.353034019470215
2.554084062576294

If you would like to dispute this claim I refer you to this question: Surprising results with Python timeit: Counter() vs defaultdict() vs dict()

ragardner
1

You can use collections.Counter. Counter accepts a list (or any iterable) and automatically counts the number of occurrences of each element.

from collections import Counter
l = [1,2,2,3,3,3]
cnt = Counter(l)

So what you can do, besides the above answer, is to create a list of words out of the file and pass that list to Counter, instead of iterating through each element manually. Note that this method is not suitable if your file is too big to fit in memory.
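
A minimal sketch of that idea, assuming the file is small enough to read in one go. The function name count_words_from_list is just a hypothetical helper for illustration:

```python
from collections import Counter


def count_words_from_list(path):
    """Read the whole file, split it into a list of words, and count them."""
    with open(path) as fp:
        # read() loads the entire file into memory, hence the size caveat.
        words = fp.read().split()
    # Counter counts every element of the list in a single call.
    return Counter(words)
```

You would call it as count_words_from_list(DATASET_FILE) and get back the same kind of Counter as in the example above.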

Ha Vu
0

The String:

>>> a = 'word_1 word_2 word_3 word_2 word_4'

The IDs:

>>> d = {'word_1':0, 'word_2':1, 'word_3':2, 'word_4': 3}

To generate word counts:

>>> s = dict(zip(a.split(), map(lambda x: a.split().count(x), a.split())))
>>> s
{'word_1': 1, 'word_2': 2, 'word_3': 1, 'word_4': 1}

To generate ordered list:

>>> a = sorted(s.items(), key=lambda x: x[1], reverse=True)
>>> ordered_list = list(map(lambda x: d[x[0]], a ))
>>> ordered_list
[1, 0, 2, 3]
Sachit Nagpal