
Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and the next-to-last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j, k, i are successive words).

The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to positions in the 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints: 1000³ 4-byte counters are already 4 GB. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.

What data structure would be better suited for such a use case and still be efficient for the many small updates I make when counting the word appearances? (Maybe there is a completely different way of doing this?)

(Of course I also need to count n(jk), but that's easy, because it's only 2-D :) The language of choice is C++, I guess.

fuenfundachtzig

1 Answer


C++ code:

#include <map>

using std::map;

struct bigram_key{
    int i, j;// indexes of the two preceding words in the dictionary

    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j): i(a_i), j(a_j){}

    // keys need a strict ordering to be usable in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};

struct bigram_data{
    int count;// n(ij)
    map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
};

map<bigram_key, bigram_data> trigrams;

The dictionary could be a vector of all found words like:

vector<string> dictionary;

but for a faster word->index lookup it could be a map:

map<string, int> dictionary;
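
For illustration, here is a minimal sketch of that lookup; the helper name word_index is my own, not part of the answer. It returns the existing index of a word, or assigns the next free index to an unseen word:

#include <map>
#include <string>
#include <utility>
using std::map;
using std::string;

map<string, int> dictionary;

// return the index of `word`; unseen words get the next free index
int word_index(const string &word){
    map<string, int>::iterator it = dictionary.find(word);
    if(it == dictionary.end())
        it = dictionary.insert(std::make_pair(word, (int)dictionary.size())).first;
    return it->second;
}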

When you read a new word, you add it to the dictionary and get its index k. You already have the indexes i and j of the previous two words, so then you just do:

trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;

For better performance you may search for the bigram only once:

bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
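
To put it together, here is a minimal sketch of the whole counting loop. It assumes the definitions above (bigram_key, bigram_data, trigrams) plus the hypothetical word_index helper, and that words arrive as whitespace-separated tokens on standard input:

#include <iostream>
#include <string>
using std::cin;
using std::string;

// bigram_key, bigram_data, trigrams, dictionary and word_index as above

int main(){
    string word;
    int i = -1, j = -1;// indexes of the two previous words; -1 = none yet
    while(cin >> word){
        int k = word_index(word);
        if(i >= 0 && j >= 0){
            bigram_data &bigram = trigrams[bigram_key(i, j)];
            bigram.count++;            // n(ij)
            bigram.trigram_counts[k]++;// n(k|ij)
        }
        i = j;// shift the two-word window
        j = k;
    }
    return 0;
}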

Is it understandable? Do you need more details?

Juraj Blaho
  • A down-to-earth approach, using only STL. Might be the best thing to go for as a start. I like the way of using a map to store the (int,int) tuples. – fuenfundachtzig Dec 10 '10 at 22:42
  • Well, I left the question open to motivate people to give an alternative answer. I am still wondering if there is a more efficient (in terms of memory consumption) way of storing the `n(k|ij)` table. I could imagine the map introduces quite an overhead? – fuenfundachtzig Dec 13 '10 at 13:03
  • @fuenfundachtzig If the table is sparse, the map will be more efficient (you can assume the probability is zero if a key is absent from the map). If not, the dense data structure that stores all possible outcome probabilities for a lexicographic ordering of inputs will be the most efficient (if the full joint distribution is necessary). If the joint distribution can be decomposed into independent distributions, of course storing those independent distributions will be more efficient (see Lewis Product approximations). These are just implementations of map. So: you should accept the answer. – user Aug 08 '13 at 08:01
  • I might not have thought it thoroughly through, but it seems to me an unordered_map would do the trick more efficiently and elegantly, i.e. just encode the key as "<word1> <word2> <word3>" and use that to get the count of that trigram. This would scale well for n-grams (only O(k) lookup times, where k is the length of the lookup string). More efficient memory-wise too. – Mr.WorshipMe Apr 01 '16 at 12:25
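
For what it's worth, here is a minimal sketch of that unordered_map variant; the space-joined key encoding is my reading of the comment, not something it spells out:

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main(){
    // trigram counts keyed by the three words joined with spaces,
    // e.g. "the quick fox" -> 2
    unordered_map<string, long long> trigram_counts;

    string w1, w2, w3;
    if(cin >> w1 >> w2){
        while(cin >> w3){
            trigram_counts[w1 + " " + w2 + " " + w3]++;
            w1 = w2;// shift the window by one word
            w2 = w3;
        }
    }

    for(const auto &entry : trigram_counts)
        cout << entry.first << ": " << entry.second << "\n";
    return 0;
}

The trade-off: each key stores its words redundantly, so memory per entry is higher than with integer indexes, but lookups hash the key in O(length of the key) and avoid the tree traversal of the nested maps.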