Just for fun, I would like to estimate the conditional probability that a word (from a natural language) appears in a text, given the two words that precede it. That is, I would take a large collection of, e.g., English texts and count how often each combination n(i|jk) and n(jk) appears (where j, k, i are successive words).
The naive approach would be to use a 3-D array (for n(i|jk)), with a mapping from each word to its index along each of the three dimensions. The index look-up could be done efficiently using tries (at least that's my best guess), but already for a vocabulary on the order of 1000 words I would run into memory constraints. I also expect this array to be only sparsely filled, with most entries being zero, so I would waste a lot of memory. So no 3-D array.
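To put a rough number on that memory constraint, here is a small back-of-the-envelope sketch. The 32-bit counter width and the name `dense_trigram_bytes` are my own assumptions for illustration; the point is only that the dense table grows with the cube of the vocabulary size.

```cpp
#include <cstdint>

// Bytes needed for a dense V x V x V table of counters.
// Assumption: each counter is a 32-bit unsigned integer.
constexpr std::uint64_t dense_trigram_bytes(std::uint64_t vocabulary_size) {
    return vocabulary_size * vocabulary_size * vocabulary_size
           * sizeof(std::uint32_t);
}

// 1000 words -> 1000^3 counters * 4 bytes = 4 * 10^9 bytes (about 4 GB).
static_assert(dense_trigram_bytes(1000) == 4000000000ULL,
              "dense table for 1000 words");
```

A realistic vocabulary of tens of thousands of words would push this far beyond any RAM I have available.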
What data structure would be better suited for such a use case while still handling a large number of small updates efficiently, as happens when counting word occurrences? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.
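To make the question more concrete, here is a minimal sketch of one sparse alternative I can imagine: map each word to an integer ID and store only the combinations that actually occur in hash maps. Everything here is an assumption for illustration, not a known-best solution: the class name `TrigramCounter`, the 21-bit ID width (which caps the vocabulary at about 2 million words so that three IDs fit into one 64-bit key), and the choice to count n(jk) only when a third word follows, so that the n(i|jk) for a fixed pair sum to n(jk).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

class TrigramCounter {
public:
    // Feed one text; words are assumed to be already split and normalized.
    void add_text(const std::vector<std::string>& words) {
        for (std::size_t t = 2; t < words.size(); ++t) {
            const std::uint64_t j = id_of(words[t - 2]);
            const std::uint64_t k = id_of(words[t - 1]);
            const std::uint64_t i = id_of(words[t]);
            ++bigram_[pack(j, k)];                 // n(jk)
            ++trigram_[(pack(j, k) << kBits) | i]; // n(i|jk)
        }
    }

    std::uint64_t count_bigram(const std::string& j, const std::string& k) const {
        std::uint64_t a, b;
        if (!lookup(j, a) || !lookup(k, b)) return 0;
        auto it = bigram_.find(pack(a, b));
        return it == bigram_.end() ? 0 : it->second;
    }

    std::uint64_t count_trigram(const std::string& j, const std::string& k,
                                const std::string& i) const {
        std::uint64_t a, b, c;
        if (!lookup(j, a) || !lookup(k, b) || !lookup(i, c)) return 0;
        auto it = trigram_.find((pack(a, b) << kBits) | c);
        return it == trigram_.end() ? 0 : it->second;
    }

    // Relative-frequency estimate n(i|jk) / n(jk); 0 if the pair was never seen.
    double probability(const std::string& j, const std::string& k,
                       const std::string& i) const {
        const std::uint64_t n_jk = count_bigram(j, k);
        return n_jk == 0 ? 0.0
                         : static_cast<double>(count_trigram(j, k, i)) / n_jk;
    }

private:
    static constexpr unsigned kBits = 21; // assumed ID width: 3 * 21 = 63 bits

    static std::uint64_t pack(std::uint64_t a, std::uint64_t b) {
        return (a << kBits) | b;
    }

    // Return the ID of a word, assigning the next free ID to unseen words.
    std::uint64_t id_of(const std::string& w) {
        auto it = ids_.find(w);
        if (it != ids_.end()) return it->second;
        if (ids_.size() >= (1ULL << kBits))
            throw std::length_error("vocabulary too large for 21-bit ids");
        const std::uint64_t id = ids_.size();
        ids_.emplace(w, id);
        return id;
    }

    // Read-only ID look-up that never inserts.
    bool lookup(const std::string& w, std::uint64_t& id) const {
        auto it = ids_.find(w);
        if (it == ids_.end()) return false;
        id = it->second;
        return true;
    }

    std::unordered_map<std::string, std::uint64_t> ids_;
    std::unordered_map<std::uint64_t, std::uint64_t> bigram_;
    std::unordered_map<std::uint64_t, std::uint64_t> trigram_;
};
```

With this, `add_text({"the", "cat", "sat", "on", "the", "cat", "sat", "down"})` stores six trigram occurrences in five map entries, and memory grows with the number of distinct combinations seen rather than with the cube of the vocabulary. Whether hash maps are actually the best fit for this many small updates is exactly what I am unsure about.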