TreeSet to find k most frequent words in a book?

Question

The commonly asked question of finding the k most frequent words in a book (where words can be added dynamically) is usually solved using a combination of a trie and a heap.

However, I think a TreeSet alone should suffice and be cleaner, with O(log n) performance for inserts and retrievals.

The treeset would contain a custom object:

class MyObj implements Comparable<MyObj> {
  String value;
  int count;

  public void incrementCount() { count++; }

  // override equals and hashCode to make this object unique by string 'value'

  // override compareTo to compare count
}

Whenever we insert a word into the TreeSet, we first check whether it is already present. If it is, we get the object and increment its count variable.

Whenever we want to find the k most frequent words, we just iterate over the first k elements of the TreeSet.

What are your views on this approach? I feel it is easier to code and understand, and it also matches the time complexity of the trie-and-heap approach for getting the k most frequent words.

EDIT: As stated in one of the answers, incrementing the count variable after a MyObj has been inserted won't re-sort the TreeSet/TreeMap. So, after incrementing the count, I will additionally need to remove and reinsert the object in the TreeSet/TreeMap.
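
The remove-and-reinsert idea from the edit could look roughly like the sketch below. It rests on assumptions that are not in the question: the class name TopKWords is made up, a side HashMap holds the current count of each word (a TreeSet has no lookup by word), and the comparator breaks ties on the word itself so that words with equal counts stay distinct. Note that the word has to be removed while its old count is still in effect, before the count changes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not from the question.
final class TopKWords {
    // Current count per word; gives the lookup a TreeSet cannot provide.
    private final Map<String, Integer> counts = new HashMap<>();

    // Words ordered by count (highest first), then alphabetically as a tie-break.
    private final TreeSet<String> byFrequency = new TreeSet<>((a, b) -> {
        int byCount = Integer.compare(counts.get(b), counts.get(a));
        return byCount != 0 ? byCount : a.compareTo(b);
    });

    void add(String word) {
        Integer old = counts.get(word);
        if (old != null) {
            // Remove first: the comparator must still see the old count here.
            byFrequency.remove(word);
        }
        counts.put(word, old == null ? 1 : old + 1);
        byFrequency.add(word); // reinsert at the position for the new count
    }

    // Returns up to k words, most frequent first.
    List<String> topK(int k) {
        List<String> result = new ArrayList<>();
        for (String w : byFrequency) {
            if (result.size() == k) {
                break;
            }
            result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        TopKWords words = new TopKWords();
        for (String w : "a b c b c c d".split(" ")) {
            words.add(w);
        }
        System.out.println(words.topK(2)); // prints [c, b]
    }
}
```

Each add costs one HashMap lookup plus an O(log n) remove and an O(log n) insert, and reading the top k is an in-order walk over the first k entries.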

Thiyagu · Answer 1 · 2018-03-31T13:08:07.377

Once you add an object to a TreeSet, if the properties used by its compareTo method change, the TreeSet (or the underlying TreeMap) does not reorder the elements. Hence, this approach does not work as you expect.

Here's a simple example to demonstrate it:

public static class MyObj implements Comparable<MyObj> {
    String value;
    int count;

    MyObj(String v, int c) {
        this.value = v;
        this.count = c;
    }
    public void incrementCount(){
        count++;
    }

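    @Override
    public int compareTo(MyObj o) {
        return Integer.compare(this.count, o.count); // Orders by frequency, ascending
    }

    @Override
    public String toString() {
        return value + "-" + count; // e.g. a-1, the format used in the printed output
    }
}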
public static void main(String[] args) {
    Set<MyObj> set = new TreeSet<>();
    MyObj o1 = new MyObj("a", 1);
    MyObj o2 = new MyObj("b", 4);
    MyObj o3 = new MyObj("c", 2);
    set.add(o1);
    set.add(o2);
    set.add(o3);
    System.out.println(set);
    // The above prints [a-1, c-2, b-4]

    // Increment the count of c 4 times
    o3.incrementCount();
    o3.incrementCount();
    o3.incrementCount();
    o3.incrementCount();
    System.out.println(set);
    // The above prints [a-1, c-6, b-4]
    // As we can see, the object corresponding to c-6 does not get moved to the end.

    // Insert a new object
    set.add(new MyObj("d", 3));
    System.out.println(set);
    // This prints [a-1, d-3, c-6, b-4]
}

EDIT:
Caveats/Problems:

Comparing two words by count alone would drop one of them if both have the same frequency, because the TreeSet treats them as equal. So you need to compare the actual words when their frequencies are the same.
It would work if we remove and reinsert the object with the updated frequency. But for that, you first need to get that object from the TreeSet (the MyObj instance for a given value, to know its frequency so far). A Set does not have a get method. Its contains method just delegates to the underlying TreeMap's containsKey method, which identifies the object using the compareTo logic (and not equals). Since compareTo also takes the frequency of the word into account, we cannot find the word in the set to remove it (unless we iterate over the whole set on each add).
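
The first caveat can be seen in a small sketch. The names TieDemo and WordCount are hypothetical stand-ins for the MyObj class, and the two comparators are only illustrations: one compares by count alone, the other falls back to the word when counts are equal.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical demo class, not part of the original answer.
final class TieDemo {
    // Minimal stand-in for MyObj.
    static final class WordCount {
        final String value;
        final int count;

        WordCount(String value, int count) {
            this.value = value;
            this.count = count;
        }

        @Override
        public String toString() {
            return value + "-" + count;
        }
    }

    // Compares by count only: two words with the same count compare as 0.
    static final Comparator<WordCount> BY_COUNT_ONLY =
            (a, b) -> Integer.compare(a.count, b.count);

    // Compares by count, then by the word itself when counts are equal.
    static final Comparator<WordCount> BY_COUNT_THEN_WORD = (a, b) -> {
        int byCount = Integer.compare(a.count, b.count);
        return byCount != 0 ? byCount : a.value.compareTo(b.value);
    };

    // Adds two different words with the same count and returns the resulting set.
    static TreeSet<WordCount> addTwoWithSameCount(Comparator<WordCount> cmp) {
        TreeSet<WordCount> set = new TreeSet<>(cmp);
        set.add(new WordCount("b", 4));
        set.add(new WordCount("f", 4)); // rejected as a duplicate if cmp returns 0
        return set;
    }

    public static void main(String[] args) {
        System.out.println(addTwoWithSameCount(BY_COUNT_ONLY));      // prints [b-4]
        System.out.println(addTwoWithSameCount(BY_COUNT_THEN_WORD)); // prints [b-4, f-4]
    }
}
```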

How about if I reinsert the object? (remove and insert) – Manas Saxena, Mar 31 '18 at 12:12

score 1 · Answer 2 · answered Mar 31 '18 at 13:41

A TreeMap should work if you remove and reinsert the object: use the frequency as an Integer key and a list of MyObj as the value, so that the keys are sorted by frequency. An updated version of the above code demonstrates it:

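import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MyObj {
    String value;
    int count;

    MyObj(String v, int c) {
        this.value = v;
        this.count = c;
    }

    public int getCount() {
        return count;
    }

    public void incrementCount() {
        count++;
    }

    @Override
    public String toString() {
        return value + " " + count;
    }

    // Adds the object to the list stored under its current count, creating the list if needed.
    public static void put(Map<Integer, List<MyObj>> map, MyObj value) {
        List<MyObj> myObjs = map.get(value.getCount());
        if (myObjs == null) {
            myObjs = new ArrayList<>();
            map.put(value.getCount(), myObjs);
        }
        myObjs.add(value);
    }

    public static void main(String[] args) {
        // Keys are frequencies, so the map iterates in ascending order of count.
        TreeMap<Integer, List<MyObj>> set = new TreeMap<>();
        MyObj o1 = new MyObj("a", 1);
        MyObj o2 = new MyObj("b", 4);
        MyObj o3 = new MyObj("c", 2);
        MyObj o4 = new MyObj("f", 4);

        put(set, o1);
        put(set, o2);
        put(set, o3);
        System.out.println(set);
        // The above prints {1=[a 1], 2=[c 2], 4=[b 4]}

        // A second word with count 4 joins the existing list instead of being dropped.
        put(set, o4);
        System.out.println(set);
        // The above prints {1=[a 1], 2=[c 2], 4=[b 4, f 4]}
    }
}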
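
The code above only inserts objects; the remove-and-reinsert step on a count change might be sketched as below. This is an assumption-laden variant rather than the answer's own code: the class name FrequencyBuckets is made up, each bucket is a TreeSet of words instead of a List of MyObj (so removing a word from its bucket is O(log n)), and a side HashMap remembers which bucket each word currently lives in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not from the answer.
final class FrequencyBuckets {
    // Current count per word, i.e. the key of the bucket the word is in.
    private final Map<String, Integer> counts = new HashMap<>();

    // Frequency -> words with that frequency, keys kept in sorted order.
    private final TreeMap<Integer, TreeSet<String>> buckets = new TreeMap<>();

    void add(String word) {
        int old = counts.getOrDefault(word, 0);
        if (old > 0) {
            // Remove the word from its old bucket; drop the bucket if it is now empty.
            TreeSet<String> oldBucket = buckets.get(old);
            oldBucket.remove(word);
            if (oldBucket.isEmpty()) {
                buckets.remove(old);
            }
        }
        counts.put(word, old + 1);
        // Reinsert under the new frequency, creating the bucket on demand.
        buckets.computeIfAbsent(old + 1, c -> new TreeSet<>()).add(word);
    }

    // Returns up to k words, walking the buckets from highest frequency down.
    List<String> topK(int k) {
        List<String> result = new ArrayList<>();
        for (TreeSet<String> bucket : buckets.descendingMap().values()) {
            for (String w : bucket) {
                if (result.size() == k) {
                    return result;
                }
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FrequencyBuckets fb = new FrequencyBuckets();
        for (String w : "a b c b c c d f f".split(" ")) {
            fb.add(w);
        }
        System.out.println(fb.topK(3)); // prints [c, b, f]
    }
}
```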
