1

I want to find the frequency of a multi-token string (a phrase) inside a document. I am not looking for single-word/term frequency: the search string will always consist of multiple terms, and the number of terms is dynamic.

Example: finding the frequency of "words with friends" inside a document.

Any help or pointers would be much appreciated.

Thanks, Debjani

user430354
  • Are you saying that there will be multiple phrases to search by and you want to know the frequency of each of the phrases? – Ali Aug 12 '11 at 10:09

2 Answers

3

You can read the document line by line using a `BufferedReader`, and then use `split()` on each line to count the occurrences of the phrase. Note that this only finds occurrences that sit entirely on one line.

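int count = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    // Limit -1 keeps trailing empty strings, so a match at the end of a line is counted
    count += strLine.split("words with friends", -1).length - 1;
}
return count;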

EDIT: If you want to perform a case-insensitive search, you can use a `Pattern`:

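Pattern myPattern = Pattern.compile("words with friends", Pattern.CASE_INSENSITIVE);
int count = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    // Limit -1 keeps trailing empty strings, so a match at the end of a line is counted
    count += myPattern.split(strLine, -1).length - 1;
}
return count;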
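
For reference, the line-by-line idea above could be wrapped into a self-contained helper roughly as follows. This is only a sketch: the class name `LinePhraseCounter` and its method signatures are hypothetical, the phrase is assumed to be non-empty, and `Pattern.LITERAL` is an addition here so that a dynamic phrase containing regex metacharacters is still matched as plain text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class LinePhraseCounter {

    // Counts case-insensitive occurrences of a literal, non-empty phrase, one line at a time.
    static int count(Reader source, String phrase) throws IOException {
        // LITERAL treats the phrase as plain text; CASE_INSENSITIVE still applies with it.
        Pattern pattern = Pattern.compile(phrase, Pattern.LITERAL | Pattern.CASE_INSENSITIVE);
        int count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Limit -1 keeps trailing empty strings, so (pieces - 1) equals the match count.
                count += pattern.split(line, -1).length - 1;
            }
        }
        return count;
    }

    // Convenience overload for a document that is already held in a String.
    static int count(String document, String phrase) {
        try {
            return count(new StringReader(document), phrase);
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `LinePhraseCounter.count(document, "words with friends")` on a document held in a `String` uses the second overload. As with the snippets above, a phrase that is broken across a line ending is not counted.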
Ankur
  • Yes, that's why the whole document is to be read into a `String` and then `split()`ed. – asgs Aug 12 '11 at 10:20
  • @stivlo but that would mean there is an `end of line` in between – Ankur Aug 12 '11 at 10:20
  • @Ankur I think it may be a mistake to view this as a very strict match on an exact string, rather than as a search for a particular sequence of words in the document. You would perhaps also need to take case into consideration: would 'Words with friends' be an acceptable match for the 'words with friends' example in the question? – Anthony Grist Aug 12 '11 at 10:31
  • @Ankur: I am getting the document content as a String. In that case, how can I use your method? And yes, I want it to be case-insensitive. – user430354 Aug 15 '11 at 03:45
  • @user430354: You can use my second method with `Pattern`; the `strLine` passed to it is just a string. – Ankur Aug 15 '11 at 15:08
1

Why not use regex? A `Matcher` is designed for this sort of task: call `find()` repeatedly and count the matches.

http://download.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html
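
A minimal sketch of what a `Matcher`-based count might look like is shown below. The class name `PhraseFrequency` is hypothetical, and two choices are assumptions rather than anything stated in the question: the tokens of the phrase are allowed to be separated by any run of whitespace (so a phrase broken across a line ending still counts), and each token is quoted so that it is matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class PhraseFrequency {

    // Counts non-overlapping, case-insensitive occurrences of a multi-token phrase.
    static int count(String document, String phrase) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return 0; // an empty phrase would otherwise match at every position
        }
        // Build a regex such as \Qwords\E\s+\Qwith\E\s+\Qfriends\E from the tokens.
        StringBuilder regex = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+"); // assumption: any whitespace run may separate tokens
            }
            regex.append(Pattern.quote(token)); // match each token literally
        }
        Matcher matcher = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE)
                                 .matcher(document);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

This version works directly on a document held in a `String`. It does not enforce word boundaries, so "swords with friends" would also be counted; wrapping the expression in `\b` would be one way to change that if whole-word matching is required.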

James Scriven