I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the entire set, I want to find out which words occur together in a sentence, and how often.

Sample sentences:

"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."

I would like to produce the following output. I should be able to pass three starting words as parameters to the program: "Chief, accepted, Proposal"

Chief accepted Proposal            5
Chief accepted Proposal has        3
Chief accepted Proposal has been   3

... 
...
for all combinations.

I understand that the combinations might be huge.

I have searched online but could not find anything. I have written some code but can't get my head around it. Maybe someone who knows the domain can help.

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

            try {
                String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
                for (String t : tmp){
                      String[] keys = t.split(" ");
                      String[] uniqueKeys;
                      int count = 0;
                      System.out.println(t);
                      uniqueKeys = getUniqueKeys(keys);
                        for(String key: uniqueKeys)
                        {
                            if(null == key)
                            {
                                break;
                            }           
                            for(String s : keys)
                            {
                                if(key.equals(s))
                                {
                                    count++;
                                }               
                            }
                            System.out.println("Count of ["+key+"] is : "+count);
                            count=0;
                        }
                }
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }

private static String[] getUniqueKeys(String[] keys) {
        String[] uniqueKeys = new String[keys.length];

        uniqueKeys[0] = keys[0];
        int uniqueKeyIndex = 1;
        boolean keyAlreadyExists = false;

        for (int i = 1; i < keys.length; i++) {
            for (int j = 0; j <= uniqueKeyIndex; j++) {
                if (keys[i].equals(uniqueKeys[j])) {
                    keyAlreadyExists = true;
                }
            }

            if (!keyAlreadyExists) {
                uniqueKeys[uniqueKeyIndex] = keys[i];
                uniqueKeyIndex++;
            }
            keyAlreadyExists = false;
        }
        return uniqueKeys;
    }

Could someone help in coding this please?

  • Yes, you would have a very large set of permutations. You can use a Map, like TreeMap to store the keys in the map as the unique string and the value of the map as a count. Alternatively, you could create your own small datastructure to store the name/value information. – pczeus Mar 11 '16 at 05:39
  • What does the output of 3 mean for Chief, accepted, Proposal? Does that mean there are 3 sentences where those 3 words occur in the sentence? Does capitalization matter? – stackoverflowuser2010 Mar 11 '16 at 05:58
  • sorry "Chief accepted proposal" should be 5 and "Chief accepted proposal has" should be 3...will edit – Jonathan Grey Mar 11 '16 at 06:06
  • @JonathanGrey: Why does "Chief accepted proposal" have a value of 5? – stackoverflowuser2010 Mar 11 '16 at 06:17
  • Maybe one question you should consider: what are you going to do with the permutations? Depending on that, do you actually need to generate all of them? Given the set of dimensions, can you build a graph with an explicit sink/top node and operate on that, with edges denoting occurrence counts? – abasu Mar 11 '16 at 06:18
  • Why don't you try the `Apriori Algorithm` to achieve this? I did the same to find frequent patterns in a dataset of files. – ELITE Mar 11 '16 at 06:21
  • I want to use this for training a text classification model that I will use later for text classification. I want to use the permutations to find the words most commonly used with the three parameters I supplied, i.e. "Chief accepts proposal." All the words that occur together would give me a better model. I would then feed this model into another program to classify the sentences of another dataset. – Jonathan Grey Mar 11 '16 at 08:02
  • "Chief accepted proposal" has a value of 5 because these 3 words occur together in 5 sentences. – Jonathan Grey Mar 11 '16 at 08:09

1 Answer

You can apply standard information retrieval data structures, particularly an inverted index. Here is how you do it.

Consider your original sentences. Number them with some integer identifier, like so:

  1. "Proposal 201 has been accepted by the Chief today.",
  2. "Proposal 214 and 221 are accepted, as per recent Chief decision",
  3. "This proposal has been accepted by the Chief.",
  4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
  5. "Proposal 214, ValueMania, has been accepted by the Chief."

For every pair of distinct words that you encounter in a sentence, add it to an inverted index that maps the pair to a set (a group of unique items) of sentence identifiers. For a sentence with N distinct words, there are N-choose-2 = N(N-1)/2 pairs.

The appropriate Java data structure is Map<String, Map<String, Set<Integer>>>. Order each pair alphabetically, ignoring case, so that the pair "has" and "Proposal" is stored only as ("has", "Proposal") and never as ("Proposal", "has").

This map will contain the following:

"has", "Proposal" --> Set(1, 5)
"accepted", "Proposal" --> Set(1, 2, 5)
"accepted", "has" --> Set(1, 3, 5)
etc.

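For example, the word pair "has" and "Proposal" has the set (1, 5), meaning that both words were found in sentences 1 and 5. Matching is case-sensitive in this example, so sentence 3, which contains lowercase "proposal", is not included.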
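Building the index described above could look roughly like the following. This is only a sketch under some assumptions: the class name PairIndex and its methods are made up for illustration, a "word" is taken to be any run of letters or digits (so the comma in "accepted," is dropped), and pairs are ordered case-insensitively with a case-sensitive tie-break so that "Proposal" and "proposal" stay separate keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; the name and methods are illustrative only.
final class PairIndex {

    // Alphabetical order ignoring case; ties are broken case-sensitively so
    // that "Proposal" and "proposal" remain two different words.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Assumption: a word is a maximal run of letters or digits.
    // Returns the distinct words of the sentence, sorted by ORDER.
    static List<String> distinctWords(String sentence) {
        Set<String> words = new TreeSet<>(ORDER);
        for (String token : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    // Maps (first word, second word) to the 1-based ids of the sentences
    // that contain both words. The first word always sorts before the second.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new TreeMap<>(ORDER);
        for (int i = 0; i < sentences.size(); i++) {
            int id = i + 1;
            List<String> words = distinctWords(sentences.get(i));
            for (int a = 0; a < words.size(); a++) {
                for (int b = a + 1; b < words.size(); b++) {
                    index.computeIfAbsent(words.get(a), k -> new TreeMap<>(ORDER))
                         .computeIfAbsent(words.get(b), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }
}
```

Calling PairIndex.build on the five numbered sentences and then reading index.get("has").get("Proposal") should give the set of sentences 1 and 5. If you want "proposal" and "Proposal" to count as the same word, lowercase each token before adding it.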

Now suppose you want to look up the number of co-occurrences of the words "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll(), which modifies the set it is called on, so intersect into a copy). The result here is the set (1, 5). Its size is 2, meaning there are two sentences that contain "accepted", "has", and "Proposal".
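A lookup along these lines might be sketched as follows. The class name CoOccurrence and the method sentencesContainingAll are hypothetical, and the sketch assumes the index stores each pair with its words in case-insensitive alphabetical order, as described earlier. The first pair's set is copied before intersecting because retainAll would otherwise modify the set stored in the index.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; the name and method are illustrative only.
final class CoOccurrence {

    // Assumed pair ordering: ignore case first, then break ties case-sensitively.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Returns the ids of the sentences that contain every word in the list,
    // found by intersecting the id sets of all pairs drawn from the list.
    static Set<Integer> sentencesContainingAll(
            Map<String, Map<String, Set<Integer>>> index, List<String> words) {
        if (words.size() < 2) {
            throw new IllegalArgumentException("need at least two words");
        }
        Set<Integer> result = null;
        for (int a = 0; a < words.size(); a++) {
            for (int b = a + 1; b < words.size(); b++) {
                String first = words.get(a);
                String second = words.get(b);
                if (ORDER.compare(first, second) > 0) { // normalise the pair order
                    String tmp = first;
                    first = second;
                    second = tmp;
                }
                Set<Integer> ids = index
                        .getOrDefault(first, Collections.emptyMap())
                        .getOrDefault(second, Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<>(ids); // copy, so the index is not modified
                } else {
                    result.retainAll(ids);       // set intersection
                }
            }
        }
        return result;
    }
}
```

The co-occurrence count is then the size of the returned set. A pair that never occurs yields an empty set, so the count for any list containing it is 0.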

To generate all pairs, simply iterate through your map. To generate all word tuples of size N, iterate over the words and use recursion to extend each tuple one word at a time.
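One possible shape for that recursion is sketched below. The class WordTuples and its methods are made-up names, and the sketch only enumerates the tuples; each tuple would still need to be looked up in the index to get its count. Keep in mind that the number of tuples grows very quickly with the vocabulary size, so in practice you would probably restrict the candidate words, for example to those that co-occur with your three starting words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name and methods are illustrative only.
final class WordTuples {

    // Returns every combination of `size` words from the list, keeping the
    // words of each combination in the same order as in the input list.
    static List<List<String>> combinations(List<String> words, int size) {
        List<List<String>> out = new ArrayList<>();
        extend(words, size, 0, new ArrayList<>(), out);
        return out;
    }

    // Recursively extends `current` with words at positions >= start.
    private static void extend(List<String> words, int size, int start,
                               List<String> current, List<List<String>> out) {
        if (current.size() == size) {
            out.add(new ArrayList<>(current)); // store a copy of the finished tuple
            return;
        }
        for (int i = start; i < words.size(); i++) {
            current.add(words.get(i));
            extend(words, size, i + 1, current, out);
            current.remove(current.size() - 1); // backtrack
        }
    }
}
```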

stackoverflowuser2010