5

I'm trying to implement prediction by analyzing sentences. Consider the following [rather boring] sentences

Call ABC
Call ABC again
Call DEF

I'd like to have a data structure for the above sentences as follows:

Call: (ABC, 2), (again, 1), (DEF, 1)
ABC: (Call, 2), (again, 1)
again: (Call, 1), (ABC, 1)
DEF: (Call, 1)

In general, Word: (Word_it_appears_with, Frequency), ....

Please note the inherent redundancy in this type of data. Obviously, if the frequency of ABC is 2 under Call, the frequency of Call is 2 under ABC. How do I optimize this?

The idea is to use this data when a new sentence is being typed. For example, if Call has been typed, from the data, it's easy to say ABC is more likely to be present in the sentence, and offer it as the first suggestion, followed by again and DEF.

I realise this is one of a million possible ways of implementing prediction, and I eagerly look forward to suggestions of other ways to do it.

Thanks

WeNeigh
  • 3,489
  • 3
  • 27
  • 40
  • I'm fairly sure that there is no well-established answer because your goal is not tangible enough. Basically, this is an AI problem, and AI solutions usually have their own quirks that people can live with; however, without knowing your exact context, it's hard to tell what quirks would be acceptable. For this reason, I'm voting to close your question. (It's a very interesting one, just not suited for Stack Overflow in my opinion.) – zneak Nov 11 '11 at 20:41
  • That said, you could use a tree representation for your words, and have each branch of the tree hold a probability. This could work well if the input is repetitive and the syntax relatively fixed, but you'll have trouble matching natural language like that. – zneak Nov 11 '11 at 20:44
  • I could use a tree, yes, but I would like to eliminate the redundancy in the data: The frequency of word1 occurring with word2 will obviously be the same as word2 occurring with word1. Also, the input is continuous, so probabilities are out of the question. – WeNeigh Nov 14 '11 at 19:57

3 Answers3

1

Maybe using a bidirectional graph. You can store the words as nodes, with edges as frequencies.

Mike Dinescu
  • 54,171
  • 16
  • 118
  • 151
0

You can use the following data structure too:

Map<String, Map<String, Long>>
James Jithin
  • 10,183
  • 5
  • 36
  • 51
  • Guava has implemented this in the Table class. http://docs.guava-libraries.googlecode.com/git-history/v10.0.1/javadoc/com/google/common/collect/Table.html – John B Nov 11 '11 at 20:45
0

I would consider one of two options:

Option 1:

class Freq {
    String otherWord;
    int freq;
}

Multimap<String, Freq> mymap;

or maybe a Table

Table<String, String, int>

Given the above Freq: you might want to do bi-directional mapping:

class Freq{
    String thisWord;
    int otherFreq;
    Freq otherWord;
}

This would allow for very quick updating of data pairs.

John B
  • 32,493
  • 6
  • 77
  • 98