I want to build a data structure that records the number of occurrences of each word and lists the words in order of frequency.
For example:
word_1 => 10 occurrences
word_2 => 5 occurrences
word_3 => 12 occurrences
word_4 => 2 occurrences
Each word has an id that represents it:
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}
so the list of ids, ordered from most to least frequent, would be:
ordered_vocab = [2, 0, 1, 3]
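To make the target concrete, here is a minimal sketch of the result I am after for the toy example. It assumes the counts already sit in a dict; the name `counts` is only for illustration and is not part of my real code.

```python
# Hypothetical counts, copied from the toy example above.
counts = {'word_1': 10, 'word_2': 5, 'word_3': 12, 'word_4': 2}
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# Sort the words by descending count, then translate each word to its id.
ordered_vocab = [kw2id[word] for word in sorted(counts, key=counts.get, reverse=True)]
print(ordered_vocab)  # [2, 0, 1, 3]
```

This works for four words, but I am not sure it is a sensible approach for a large dataset.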
So far, my code for counting the words is this:
# Build a vocabulary with the number of occurrences of each word
vocab = {}
count = 0
with open(DATASET_FILE) as f:
    for line in f:
        for word in line.split():
            if word in vocab:
                vocab[word] += 1
            else:
                vocab[word] = 1
        count += 1  # each line is treated as one document
        if not count % 100000:
            print(count, "documents processed")
How can I build the counts and the ordered list of ids efficiently?