Comparator and Priority Queues

Question

I'm implementing Huffman coding: I read a file, generate a Huffman code for each character, then write the binary output to a file. To read the characters I use a Scanner; each character goes into a node that holds the character and a frequency of 1, and the node is then added to a PriorityQueue. Since the Node class has a compareTo method that compares only frequency, how can I give this specific PriorityQueue a comparator that compares the characters when ordering the queue?

Literal example: The queue of characters should be sorted as follows:

[A:1][A:1][A:1][B:1][C:1]
Next step:
[A:1][A:2][B:1][C:1]
Final:
[A:3][B:1][C:1]

Here are some snippets:

protected class Node implements Comparable<Node>{
    Character symbol;
    int frequency;
    
    Node left = null;
    Node right = null;

    // Orders nodes by ascending frequency only.
    @Override
    public int compareTo(Node n) {
        return Integer.compare(this.frequency, n.frequency);
    }
    
    public Node(Character c, int f){
        this.symbol = c;
        this.frequency = f;
    }

    @Override
    public String toString(){
        return "["+this.symbol +","+this.frequency+"]";
    }
}

This is the PriorityQueue that needs a custom comparator:

public static PriorityQueue<Node> gatherFrequency(String file) throws Exception{
    File f = new File(file);
    Scanner reader = new Scanner(f);
    PriorityQueue<Node> PQ = new PriorityQueue<Node>();
    while(reader.hasNext()){
        for(int i = 0; i < reader.next().length();i++){
            PQ.add(new Node(reader.next().charAt(0),1));
        }
    }
    if(PQ.size()>1){ //during this loop the nodes should be compared by character value
        while(PQ.size() > 1){
            Node a = PQ.remove();
            Node b = PQ.remove();
            if(a.symbol.compareTo(b.symbol)==0){
                Node c = new Node(a.symbol, a.frequency + b.frequency);
                PQ.add(c);
            }
            else break;
        }
        return PQ;
    }
    return PQ;
    
}

This is the new method I created using a HashMap:

public static Collection<Entry<Character,Integer>> gatherFrequency(String file) throws Exception{
        File f = new File(file);
        Scanner reader = new Scanner(f);
        HashMap<Character, Integer> map = new HashMap<Character, Integer>();
        while(reader.hasNext()){
            for(int i = 0; i < reader.next().length();i++){
                Character key = reader.next().charAt(i);
                if(map.containsKey(reader.next().charAt(i))){
                    int freq = map.get(key);
                    map.put(key, freq+1);
                }
                else{
                    map.put(key, 1);
                }
            }
        }
        return map.entrySet();
    }

This appears to be far more complicated than it needs to be. Shouldn't all `A`s be counted even if they are not consecutive? – Peter Lawrey, Apr 15 '11 at 15:28
They will always be consecutive if they are in a PriorityQueue that sorts by Character value – Trevor Arjeski, Apr 15 '11 at 15:31
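For reference, a `PriorityQueue` can be given a `Comparator` at construction time, and that comparator is then used instead of the element's own `compareTo`. The following is only a minimal sketch, assuming Java 8 or later and a simplified stand-in `Node`; the names `SymbolOrderDemo`, `BY_SYMBOL` and `drainSymbols` are made up for illustration and are not from the question's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SymbolOrderDemo {
    // Simplified stand-in for the question's Node (illustrative only).
    static final class Node {
        final char symbol;
        final int frequency;

        Node(char symbol, int frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
        }

        @Override
        public String toString() {
            return "[" + symbol + ":" + frequency + "]";
        }
    }

    // Compare by character first, then by frequency to break ties.
    static final Comparator<Node> BY_SYMBOL =
            Comparator.comparingInt((Node n) -> n.symbol)
                      .thenComparingInt(n -> n.frequency);

    // Adds one node per character, then removes them in comparator order.
    static List<String> drainSymbols(String text) {
        PriorityQueue<Node> queue = new PriorityQueue<>(BY_SYMBOL);
        for (char c : text.toCharArray()) {
            queue.add(new Node(c, 1));
        }
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().toString());
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(drainSymbols("BACAA"));
        // prints [[A:1], [A:1], [A:1], [B:1], [C:1]]
    }
}
```

Note that only the removal order (`poll()`/`remove()`) follows the comparator. A `PriorityQueue` is a heap, so iterating over it or printing it directly does not necessarily show the elements in sorted order.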

score 2 · Accepted Answer · answered Apr 15 '11 at 15:36

The standard approach to implementing Huffman trees is to use a hashmap (in Java, you'd probably use a HashMap<Character, Integer>) to count the frequency for each letter, and insert into the priority queue one node for each letter. So when constructing the Huffman tree itself, you start out with a priority queue that is already in the "final" state that you showed. The Huffman algorithm then repeatedly extracts two nodes from the priority queue, constructs a new parent node for those two nodes, and inserts the new node into the priority queue.
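The two phases described above can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the class name `HuffmanSketch`, the four-argument `Node` constructor, and the convention that internal nodes carry a `null` symbol are all invented for the example, and the input is taken from a `String` rather than a file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    // Hypothetical node type: leaves hold a symbol, internal nodes hold null.
    static final class Node implements Comparable<Node> {
        final Character symbol;
        final int frequency;
        final Node left;
        final Node right;

        Node(Character symbol, int frequency, Node left, Node right) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        // Lowest frequency first, as the Huffman algorithm needs.
        @Override
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    // Phase 1: count how often each character occurs.
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Phase 2: one leaf per distinct character, then merge the two
    // least frequent nodes until a single root remains.
    static Node buildTree(Map<Character, Integer> counts) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.frequency + b.frequency, a, b));
        }
        return queue.poll(); // null if the input was empty
    }

    public static void main(String[] args) {
        Node root = buildTree(countFrequencies("AAABC"));
        System.out.println(root.frequency); // prints 5
    }
}
```

With this split, the priority queue only ever needs to compare by frequency, so no character-based comparator is required.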

Aasmund Eldhuset

@Aasmund Thanks a lot. I'll look around for an implementation of HashMap – Trevor Arjeski Apr 15 '11 at 15:44
@trevorma, HashMap is built in. A more efficient collection would be TIntIntHashMap as it uses primitives. – Peter Lawrey Apr 15 '11 at 15:48
@TrevorMA: It's a part of the [Java Collections API](http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashMap.html). Just use `import java.util.HashMap;`. – Aasmund Eldhuset Apr 15 '11 at 15:49
@Peter How do I import TIntIntHashMap? – Trevor Arjeski Apr 15 '11 at 15:57
@TrevorMA, Download it from http://trove.starlight-systems.com/ (found using google), add the jar to your classpath and `import gnu.trove.TIntIntHashMap;` – Peter Lawrey Apr 15 '11 at 16:00
@TrevorMA: Unless you really need the slight performance improvement offered by `TIntIntHashMap`, I recommend that you use the standard `HashMap`, in particular since you haven't used it before (this will be a good opportunity to learn it, and you'll come across `HashMap` _much_ more often than `TIntIntHashMap`). – Aasmund Eldhuset Apr 15 '11 at 16:11
@Aasmund Hey, I appreciate the guidance. I'm reading about HashMap as we speak. I am just trying to plan out how I will increment the values as I traverse the file, as the documentation does not show a set method. – Trevor Arjeski Apr 15 '11 at 16:21
@TrevorMA: True; that can be a little confusing. The `put()` method is used both to place something in the hashmap for the first time, and also to replace an existing value. Let's say that you have a character stored in the variable `c`; then you'll first need to check if the hashmap contains `c` as a key. If it does, you can read the current frequency with `get()`, compute frequency + 1 and update the frequency in the hashmap with `put()`. If the key is not there, you can add it with the frequency 1. – Aasmund Eldhuset Apr 15 '11 at 16:30
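The check/`get()`/`put()` sequence described in the comment above might look like this. It is a small sketch only; the class name `FrequencyUpdate` and the helper `increment` are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyUpdate {
    // Records one more occurrence of c in the map.
    static void increment(Map<Character, Integer> frequencies, char c) {
        if (frequencies.containsKey(c)) {
            // Key already present: read the old count and overwrite it.
            int current = frequencies.get(c);
            frequencies.put(c, current + 1);
        } else {
            // First occurrence: start the count at 1.
            frequencies.put(c, 1);
        }
    }

    public static void main(String[] args) {
        Map<Character, Integer> frequencies = new HashMap<>();
        for (char c : "ABA".toCharArray()) {
            increment(frequencies, c);
        }
        System.out.println(frequencies.get('A')); // prints 2
        System.out.println(frequencies.get('B')); // prints 1
    }
}
```

On Java 8 and later, the body of `increment` can be shortened to `frequencies.merge(c, 1, Integer::sum);`, which performs the same insert-or-add step in one call.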
@Aasmund Perfect. I have one last question if you don't mind, then I will stop being a bother. Once I have collected all the characters and their frequencies, how do I return a Collection that can be used in the construction of my Huffman Tree? I know that I can return a Collection by using the .values() method. – Trevor Arjeski Apr 15 '11 at 16:43
@TrevorMA: No problem. Use the `entrySet()` method, which gives you a collection of map entries, where each entry contains both the key and the value. – Aasmund Eldhuset Apr 15 '11 at 16:47
@TrevorMA: Glad it helped. By the way, remember to somehow include a description of the Huffman tree in the output file, since the bit sequence that is produced by the algorithm isn't of much use if you don't have the tree... ;-) – Aasmund Eldhuset Apr 15 '11 at 16:54
@Aasmund Would you mind looking at my new method? I keep getting a NoSuchElementException. – Trevor Arjeski Apr 15 '11 at 17:44
@TrevorMA: After doing `Character key = reader.next().charAt(i);`, you call `reader.next()` once more, thus reading one more character. That character will likely be different from the one that is now stored in `key`, and you ask if the _second_ character is present in the dictionary - and if it is, you try to update `key`. (Edit: I see now that you call `next()` many times - each call to next gives you the next string from the input... Use `next()` only once in the entire `while` loop body.) – Aasmund Eldhuset Apr 15 '11 at 17:50
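A version of the counting loop that calls `next()` exactly once per iteration could look like the sketch below. To keep it self-contained it takes a `Scanner` as a parameter instead of a file name, which is a change from the question's method signature; the class name `TokenReadSketch` is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class TokenReadSketch {
    // Counts the characters of every whitespace-separated token.
    static Map<Character, Integer> gatherFrequency(Scanner reader) {
        Map<Character, Integer> map = new HashMap<>();
        while (reader.hasNext()) {
            // The only call to next() in the loop body: store the token
            // and work with the stored value from here on.
            String token = reader.next();
            for (int i = 0; i < token.length(); i++) {
                map.merge(token.charAt(i), 1, Integer::sum);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Character, Integer> map = gatherFrequency(new Scanner("AAB C A"));
        System.out.println(map.get('A')); // prints 3
    }
}
```

Because `Scanner.next()` splits on whitespace by default, spaces and line breaks are never counted with this approach. If they need to be encoded as well, the input would have to be read differently, for example character by character with a `Reader`.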
@Aasmund Thanks, I debugged and realized that I called next() too many times as well. I haven't used Java in a while so I'm trying to get back into things. Thanks again. – Trevor Arjeski Apr 15 '11 at 17:56
@Aasmund ONE final question, I promise. I'm still a little foggy about getting my collection entries into the Node objects in order to be placed in the PriorityQueue – Trevor Arjeski Apr 15 '11 at 18:57
@TrevorMA: Let's say that you store the return value from `gatherFrequency()` in a variable called `entries`. Then you could iterate through it with a foreach loop and create nodes based on the entries: `for (Entry<Character, Integer> entry : entries) { PQ.add(new Node(entry.getKey(), entry.getValue())); }`. – Aasmund Eldhuset Apr 15 '11 at 19:10
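Spelled out as a complete example, the loop from the comment above might look like this. The `Node` here is a reduced stand-in for the question's class, and `EntriesToQueue` and `toQueue` are hypothetical names used only for this sketch.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class EntriesToQueue {
    // Reduced stand-in for the question's Node: ordered by frequency.
    static final class Node implements Comparable<Node> {
        final Character symbol;
        final int frequency;

        Node(Character symbol, int frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    // Turns each (character, count) entry into a node in the queue.
    static PriorityQueue<Node> toQueue(Collection<Map.Entry<Character, Integer>> entries) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> entry : entries) {
            queue.add(new Node(entry.getKey(), entry.getValue()));
        }
        return queue;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        counts.put('A', 3);
        counts.put('B', 1);
        counts.put('C', 1);
        PriorityQueue<Node> queue = toQueue(counts.entrySet());
        System.out.println(queue.peek().frequency); // prints 1
    }
}
```

The type parameters on `Map.Entry<Character, Integer>` matter here: with a raw `Entry`, `getKey()` and `getValue()` return `Object`, and the call to the `Node` constructor does not compile.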
@Trevor Arjeski: Thanks - or maybe I've just been programming for too long ;-) – Aasmund Eldhuset Apr 15 '11 at 20:28
A hash map?! Seems like a waste of time. You can count the frequencies much faster in a simple array of 256 counts directly indexed by the character value. – Mark Adler Jan 12 '23 at 15:56
@MarkAdler: True - if you're dealing with ASCII or some other 7- or 8-bit character set. Unicode characters might require as much as a 32-bit integer to represent (so even my choice of `Character` was debatable). If you stick to a character (sub)set that fits in 16 bits, a 65536-entry array isn't too bad, but using 2^32 for all of Unicode is probably not advisable. And unless very high performance is required, my personal opinion is to use types that provide as much semantics as possible about their intended usage. And hashmap operations are O(1), though certainly slower by a constant factor. – Aasmund Eldhuset Jan 13 '23 at 00:23
@MarkAdler: After double-checking Wikipedia, I see that Unicode only covers a range of 1114112 code points (I'm just used to thinking of having to use an `int` to store a "full" character). Even so, most texts only use a fraction of the characters, and so the array would be very sparsely populated and waste a lot of space. Which certainly is sometimes the tradeoff one does want to make, but not always :-) – Aasmund Eldhuset Jan 13 '23 at 00:29
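For comparison, the array-based counting suggested above could be sketched as follows, under the assumption that the input is treated as raw 8-bit bytes rather than Unicode characters. The names `ByteCountSketch` and `countBytes` are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class ByteCountSketch {
    // Counts occurrences of each possible byte value (0..255).
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            // Java bytes are signed; the mask maps them to 0..255.
            counts[b & 0xFF]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] data = "AAB".getBytes(StandardCharsets.US_ASCII);
        int[] counts = countBytes(data);
        System.out.println(counts['A']); // prints 2
        System.out.println(counts['B']); // prints 1
    }
}
```

Working on bytes sidesteps the character-set question entirely, at the cost of compressing the encoded form of the text rather than its characters.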
