I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the entire set, I want to find out which words occur together in a sentence, and how often.
Sample sentences:
"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."
I would like to produce the following output. I should be able to pass three starting words as parameters to the program, for example "Chief, accepted, Proposal":
Chief accepted Proposal 5
Chief accepted Proposal has 3
Chief accepted Proposal has been 3
...
...
for all combinations.
I understand that the combinations might be huge.
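To make the expected numbers concrete, here is a minimal sketch of the counting I have in mind. It is not a full solution: it assumes matching is case-insensitive, that punctuation is ignored, and that a word counts once per sentence no matter how often it repeats. The class and method names (CoOccurrenceSketch, countSentencesWithAll, extendByOneWord) are made up for illustration, and it only extends the starting words by one extra word; longer combinations would need the same step repeated.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CoOccurrenceSketch {

    // Lower-case a sentence and split it into its distinct words.
    // Assumption: anything that is not a letter or digit is a separator.
    static Set<String> words(String sentence) {
        Set<String> result = new LinkedHashSet<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Count the sentences that contain every word of the combination.
    static int countSentencesWithAll(List<String> sentences, List<String> combination) {
        Set<String> wanted = new HashSet<>();
        for (String w : combination) {
            wanted.add(w.toLowerCase(Locale.ROOT));
        }
        int count = 0;
        for (String s : sentences) {
            if (words(s).containsAll(wanted)) {
                count++;
            }
        }
        return count;
    }

    // For every sentence containing all seed words, count each additional word once.
    static Map<String, Integer> extendByOneWord(List<String> sentences, List<String> seeds) {
        Set<String> wanted = new HashSet<>();
        for (String w : seeds) {
            wanted.add(w.toLowerCase(Locale.ROOT));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : sentences) {
            Set<String> present = words(s);
            if (!present.containsAll(wanted)) {
                continue;
            }
            for (String w : present) {
                if (!wanted.contains(w)) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");
        List<String> seeds = Arrays.asList("Chief", "accepted", "Proposal");

        String prefix = String.join(" ", seeds);
        System.out.println(prefix + " " + countSentencesWithAll(sentences, seeds));
        for (Map.Entry<String, Integer> e : extendByOneWord(sentences, seeds).entrySet()) {
            System.out.println(prefix + " " + e.getKey() + " " + e.getValue());
        }
    }
}
```

With the five sample sentences this gives 5 for the three starting words and 3 once "has" is added, which matches the numbers in my expected output.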
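I have searched online but could not find anything suitable. I have written some code, but I can't get my head around the problem; so far it only counts how often each word occurs within a single sentence. Maybe someone who knows the domain can point me in the right direction.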
ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();
try {
    // One sentence per array element
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            // getUniqueKeys pads the unused tail of the array with nulls
            if (null == key) {
                break;
            }
            // Count how often this word occurs in the current sentence
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
// Returns the distinct words of a sentence in first-seen order;
// unused trailing slots of the returned array stay null.
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];
    if (keys.length == 0) {
        return uniqueKeys;
    }
    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;
    for (int i = 1; i < keys.length; i++) {
        // Only compare against the slots filled so far
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }
        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}
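For comparison, the same per-sentence word count can probably be written more compactly with the standard collections. This is only a sketch: the class and method names (WordCountSketch, countWords) are hypothetical, and it splits on single spaces exactly like my code above, so punctuation stays attached to the words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountSketch {

    // Count how often each token occurs in one sentence, keeping first-seen order.
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : sentence.split(" ")) {
            if (token.isEmpty()) {
                continue; // skip empty tokens produced by repeated spaces
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : countWords("the Chief and the patch").entrySet()) {
            System.out.println("Count of [" + e.getKey() + "] is : " + e.getValue());
        }
    }
}
```

The LinkedHashMap does the job of both getUniqueKeys and the inner counting loop, so there are no null slots to check for.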
Could someone help me code this, please?