3

I am trying to count the number of distinct words in the text, using Java.

A word can be a unigram, bigram or trigram noun. I have already extracted all three kinds with the Stanford POS tagger, but I can't work out how to count the words whose frequency is greater than or equal to one, two, three, four and five.

J. Martin
mahi
  • You may find a general algorithm here: [Counting words in Java](http://stackoverflow.com/questions/1983586/how-to-count-words-in-java). Is this homework? – Atreys Jun 23 '11 at 13:20
  • In my opinion, this question may need to be rewritten and greatly expanded, because it seems that you are not asking for a programmatic way of establishing a word count in the normal sense. For instance: Collectible green Cars is not 3 words, but one in this sense? I.e. these three words refer to one thing, those cars that are collectible and green at the same time? – J. Martin Apr 28 '12 at 14:16

3 Answers

4

I might not be understanding correctly, but if all you need is the number of distinct words in a given text, then (depending on where and how you get the words) you could read them with a java.util.Scanner and add each one to an ArrayList, skipping any word that is already in the list. The size of the list is then the number of distinct words, something like the example below:

import java.util.ArrayList;
import java.util.Scanner;

public ArrayList<String> makeWordList(){
    Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
    ArrayList<String> listOfWords = new ArrayList<String>();

    while(scan.hasNext()){ //keep reading until the input runs out
        String word = scan.next(); //Scanner splits on whitespace by default
        if(!listOfWords.contains(word)){ //add the word if it isn't added already
            listOfWords.add(word);
        }
    }

    return listOfWords; //return the list you made of distinct words
}

public int getDistinctWordCount(ArrayList<String> list){
    return list.size();
}

Now, if you actually have to check the number of characters in the word before you add it to the list, you just need a statement that tests the length of the word string first. For example:

if(word.length() <= someNumber){
//do whatever you need to
}

Sorry if I'm not understanding the question and just gave some crappy unrelated answer =P but I hope it helps in some way!

If you need to keep track of how often you see each word, even though you only want to count it once, you could keep a second list of frequencies whose indices match the ArrayList, so you know which word each frequency corresponds to. Better yet, use a HashMap where the key is the distinct word and the value is its frequency (basically the same code as above, but with a HashMap instead of an ArrayList and a variable to count the frequency):

import java.util.HashMap;
import java.util.Scanner;

public HashMap<String, Integer> makeWordList(){
    Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
    HashMap<String, Integer> listOfWords = new HashMap<String, Integer>();
    while(scan.hasNext())
    {
        String word = scan.next(); //Scanner splits on whitespace by default
        if(!listOfWords.containsKey(word))
        {                             //add word if it isn't added already
            listOfWords.put(word, 1); //first occurrence of this word
        }
        else
        {
            int countWord = listOfWords.get(word) + 1; //get current count and increment
            //put() overwrites the value of an existing key, so no remove() is needed
            listOfWords.put(word, countWord);
        }
    }
    return listOfWords; //return the HashMap you made of distinct words
}

public int getDistinctWordCount(HashMap<String, Integer> list){
       return list.size();
}

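//get the frequency of the given word (0 if the word was never seen)
public int getFrequencyForWord(String word, HashMap<String, Integer> list){
    Integer count = list.get(word); //null when the word is not a key in the map
    return count == null ? 0 : count;
}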
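
For the part of the question about how many words occur at least one, two, three, four and five times, a word-to-frequency map like the one above can be scanned once per threshold. This is only a minimal sketch, assuming whitespace-separated tokens; the class name `FrequencyThresholds` and both method names are made up for illustration, and `Map.merge` needs Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyThresholds {

    // Builds a word -> frequency map by splitting the text on runs of whitespace.
    static Map<String, Integer> countFrequencies(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty input yields one empty token; skip it
            }
            frequencies.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return frequencies;
    }

    // Counts how many distinct words occur at least minFrequency times.
    static int countWordsWithFrequencyAtLeast(Map<String, Integer> frequencies, int minFrequency) {
        int count = 0;
        for (int frequency : frequencies.values()) {
            if (frequency >= minFrequency) {
                count++;
            }
        }
        return count;
    }
}
```

Calling `countWordsWithFrequencyAtLeast(frequencies, k)` for k from 1 to 5 gives the five counts; with k = 1 the result is simply the number of distinct words. Bigrams and trigrams would work the same way if each n-gram is stored as a single string key.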
yeaaaahhhh..hamf hamf
Wolfcow
  • What is variable "sc" and "cs" ?? – Jonathan Laliberte Feb 21 '17 at 17:30
  • 1
    it's been a long time since i even looked at this. I think sc was just supposed to represent another input, could be another file or command line etc. and then cs, i'm really not sure where that came from, it probably should say scan.hasNext() - sorry for the confusion. could have just been typing error too. lol – Wolfcow Jul 19 '17 at 19:20
3

You can use a Multiset (from Google Guava):

  • split the string on space
  • create a new multiset from the result

Something like

String[] words = string.split(" ");
Multiset<String> wordCounts = HashMultiset.create(Arrays.asList(words));
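
With the Guava snippet, `wordCounts.count(word)` gives the frequency of a word and `wordCounts.elementSet().size()` gives the number of distinct words. If adding Guava is not an option, the same split-then-count idea can be approximated with the standard library alone. This is a sketch of that substitute rather than of Multiset itself; the class name `StreamWordCounts` and the method name `wordCounts` are assumptions for illustration, and it needs Java 8 or later.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamWordCounts {

    // Splits on single spaces, drops empty tokens, and groups equal words
    // together while counting how many times each one appears.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.split(" "))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```

The size of the returned map is the number of distinct words, and each value is that word's frequency as a `Long`.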
Bozho
1

There can be many solutions to this problem, but one that helped me was as simple as the one below:

import java.util.HashSet;
import java.util.Set;

public static int countDistinctWords(String str){
    Set<String> noOWoInString = new HashSet<String>();
    String[] words = str.split(" ");
    //a Set ignores duplicates, so adding every word leaves only the distinct ones
    for(String wrd:words){
        noOWoInString.add(wrd);
    }
    return noOWoInString.size();
}
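
One caveat with splitting on a single space: consecutive spaces produce empty tokens, and "Cat", "cat" and "cat," all count as different words. Whether that matters depends on the input. A possible variation, assuming words should be compared case-insensitively and punctuation should be ignored (the class name `NormalizedDistinctWords` is made up for illustration):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NormalizedDistinctWords {

    // Lower-cases the text, splits on runs of characters that are neither
    // letters nor digits, and counts the distinct non-empty tokens.
    static int countDistinctWords(String text) {
        Set<String> distinct = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                distinct.add(token);
            }
        }
        return distinct.size();
    }
}
```

Note that this also splits hyphenated words and contractions at the punctuation, so "don't" becomes "don" and "t"; the pattern would need adjusting if those should stay whole.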

Thanks, Sagar

Sagar