Given a corpus such as this:
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .
I could simply do this to get a dictionary with word frequencies:
>>> from collections import Counter
>>> word_freq = Counter()
>>> for line in text.split('\n'):  # text holds the corpus as one string
...     for word in line.split():
...         word_freq[word] += 1
...
But if the aim is an ordered dictionary sorted from highest to lowest frequency, I have to do this:
>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
...
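The loop above can probably be collapsed, since OrderedDict accepts an iterable of (key, value) pairs and most_common() returns exactly that. A minimal sketch, with a small made-up text standing in for the corpus:

```python
from collections import Counter, OrderedDict

# Toy stand-in for the corpus (hypothetical sample text).
text = "the session of the parliament\nthe session resumed"

# split() with no argument splits on any whitespace, newlines included.
word_freq = Counter(text.split())

# most_common() yields (word, count) pairs, highest count first,
# and OrderedDict keeps the pairs in the order they are inserted.
sorted_word_freq = OrderedDict(word_freq.most_common())

print(list(sorted_word_freq.items()))
```

This does not change the amount of sorting work; it only removes the explicit Python-level loop.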
Imagine that I have 1 billion keys in the Counter object: building the sorted dictionary would then mean going through the corpus (non-unique instances) once to count, and through the vocabulary (unique keys) once more when iterating over most_common().
Note: The Counter.most_common() would call an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472
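To make the note concrete: as far as I can tell from the linked source, most_common() called without an argument amounts to a single sorted() over the (key, count) items, so its cost should be roughly O(V log V) in the number of unique keys V, without touching the corpus again. A small sketch checking that equivalence on made-up counts:

```python
from collections import Counter
from operator import itemgetter

# Hypothetical counts, only for illustration.
word_freq = Counter({"the": 5, "of": 3, "session": 3, "bug": 1})

# What most_common() appears to do when n is not given:
# a full sort of the items by count, in descending order.
manual = sorted(word_freq.items(), key=itemgetter(1), reverse=True)

assert manual == word_freq.most_common()
```

When only the top n entries are needed, most_common(n) uses heapq.nlargest instead of a full sort, which avoids sorting the whole vocabulary.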
Given this, I have seen the following code that uses numpy.argsort():
>>> import numpy as np
>>> words = list(word_freq.keys())    # list() so the keys can be indexed (Python 3 returns a view)
>>> freqs = list(word_freq.values())  # same order as words
>>> sorted_word_index = np.argsort(freqs)  # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
...
...
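One way to compare the two approaches is to time them on synthetic data. The sketch below is only a rough harness, not a definitive benchmark: the vocabulary size, the random counts and the helper names with_most_common and with_argsort are all assumptions made for illustration, and results will depend on the machine and on the Python and NumPy versions.

```python
import timeit
from collections import Counter, OrderedDict

import numpy as np

# Synthetic counts standing in for a real vocabulary (hypothetical data).
rng = np.random.default_rng(0)
counts = rng.integers(1, 1000, size=50_000)
word_freq = Counter({"w%d" % i: int(c) for i, c in enumerate(counts)})

def with_most_common(counter):
    # Pure-Python route: sorted() inside most_common(), then OrderedDict.
    return OrderedDict(counter.most_common())

def with_argsort(counter):
    # NumPy route: sort the indices by count, then rebuild the mapping.
    words = list(counter.keys())
    freqs = np.fromiter(counter.values(), dtype=np.int64, count=len(counter))
    # Negating gives highest-first; a stable sort keeps ties in insertion order.
    order = np.argsort(-freqs, kind="stable")
    return OrderedDict((words[i], int(freqs[i])) for i in order)

a = with_most_common(word_freq)
b = with_argsort(word_freq)
assert list(a.items()) == list(b.items())  # both routes agree

for fn in (with_most_common, with_argsort):
    seconds = timeit.timeit(lambda: fn(word_freq), number=5)
    print(fn.__name__, round(seconds, 3))
```

Both variants still build the OrderedDict in a Python-level loop, so any gain from argsort is limited to the sorting step itself.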
Which is faster?
Is there any other faster way to get such an OrderedDict from a Counter?
Other than OrderedDict, are there other Python objects that achieve the same sorted key-value pairs?
Assume that memory is not an issue. Given 120 GB of RAM, there shouldn't be much of an issue keeping 1 billion key-value pairs in memory, right? Assume an average of 20 characters per key for 1 billion keys and a single integer for each value.
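The memory assumption can be sanity-checked with a back-of-the-envelope estimate. The sketch below is rough and CPython-specific: it measures the size of one 20-character ASCII key and one integer with sys.getsizeof, estimates the per-entry dict overhead from a smaller probe dict (the probe size of 100,000 is an arbitrary choice), and scales up to 1 billion entries. It ignores allocator overhead and the extra bookkeeping of an OrderedDict, so the real figure is likely higher.

```python
import sys

n_keys = 1_000_000_000      # number of key-value pairs from the question
sample_key = "a" * 20       # 20 ASCII characters, the assumed average key
sample_value = 12345        # a typical count, stored as a Python int

per_key = sys.getsizeof(sample_key)      # str object, header included
per_value = sys.getsizeof(sample_value)  # int object, header included

# Estimate the dict's own per-entry cost from a smaller probe dict.
probe = {i: None for i in range(100_000)}
per_slot = sys.getsizeof(probe) / len(probe)

total_bytes = n_keys * (per_key + per_value + per_slot)
print("approx. %.0f GB" % (total_bytes / 1e9))
```

Python objects carry headers well beyond the raw 20 bytes per key and 8 bytes per value, so it seems worth running this before assuming that everything fits comfortably in 120 GB.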