Determining frequencies for a Huffman compressed file

Question

I am having trouble determining the priority of each node that contains a char, which I need in order to decompress my compressed file.

When I compress a file, I get a txt file containing something like this:

I compressed: Hello world this is a test.

@111^a@10000^#@10001^d@10011^e@1010^H@0000^h@10110^i@1101^l@001^.@0001^o@1100^r@10010^s@011^t@010^w@10111^%

00001010001001110011110111110010010001100111110101011011010111111101011111100001110101010011010000110001

The first two lines hold the binary representation of each character contained in the compressed file.

The second two lines are the actual compressed text.

The compression class creates a tree by setting the priority of nodes equal to the number of occurrences.

However, the compression class does not write the number of occurrences of each character to the output file.

To determine the priority, I thought I could use the length of the binary string for each character: the longer the string, the less frequently the character is used, and the shorter the string, the more frequently it is used. However, this did not construct the tree the way I wanted, and it gave me the wrong output.

I was able to get the right output by changing what the compressor writes to the output file, basically just passing in the characters' frequencies as well. My main question is how to do this without having those values.

I was also thinking that I could build the tree directly from the binary string for each character: create a dummy root node, and for each charAt(i) in the string go left if it equals 0, otherwise go right. My code for this may be a bit off, because I get null pointer exceptions while traversing the tree. I'll post the code below.

This is a short, simplified version; I can post more if needed.

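import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.PriorityQueue;
import java.util.Set;

public class Decompress {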
private static Node root;
private static HashMap<Character, String> values = new HashMap<Character, String>();
private static HashMap<Character, Integer> freq = new HashMap<Character, Integer>();

public Decompress() {
    root = null;
}

private static class Node implements Comparable {
    public Character value;
    public Integer number;
    public Node left;
    public Node right;

    // necessary in order for the priority queue to work
    // since it uses the compareTo to determine priority.
    public int compareTo(Object o) {
        Node other = (Node) o;
        // Integer.compareTo compares values; == on boxed Integers compares references
        return number.compareTo(other.number);
    }
}

public static void main(String args[]) throws IOException {
    BufferedReader fin = new BufferedReader(new FileReader("output" + ".txt"));
    String binaryDigits = insertListHelper(fin); // contains the compressed txt
    root = createTree(binaryDigits);  // Grabs the root node from method
    Node hold = root;
    // code for traversing the tree to find the character
    for (int i = 0; i < binaryDigits.length(); i++) {
        if (binaryDigits.charAt(i) == '1') {
            root = root.right;
        } else if (binaryDigits.charAt(i) == '0') {
            root = root.left;
        }
        if (root.left == null && root.right == null) {
            System.out.println(root.value);
            root = hold;
        }
    }
}

// works when I have the correct frequency
public static Node createTree(String binaryDigit) {
    PriorityQueue<Node> pq = new PriorityQueue<Node>();
    // insert all 1 node trees into pq
    Set<Character> s = values.keySet();
    for (Character c : s) {
        Node temp = new Node();
        temp.value = c;
        temp.number = values.get(c).length();
        temp.left = null;
        temp.right = null;
        pq.add(temp);
    }

    Node eof = new Node();
    eof.value = '#';
    eof.number = 1;
    eof.left = null;
    eof.right = null;
    pq.add(eof);

    while (pq.size() > 1) {
        Node left = pq.poll();
        Node right = pq.poll();
        Node temp = new Node();
        temp.value = null;
        temp.number = left.number + right.number;
        temp.left = left;
        temp.right = right;
        pq.add(temp);
    }
    return pq.peek();
}

// does not work; any suggestions?
public static Node createTree2() {
    String[] binaryRep = new String[values.size()];
    int k = 0;
    int lengthOfStr = 0;
    Set<Character> s1 = values.keySet();
    for (Character c : s1) {
        binaryRep[k] = values.get(c);
        System.out.println(c + " String : " + binaryRep[k]);

        Node root = new Node();
        root.value = 'R';
        root.left = null;
        root.right = null;
        Node hold = root;
        lengthOfStr = binaryRep[k].length();
        for (int i = 0; i < binaryRep[k].length(); i++) {
            if (binaryRep[k].charAt(i) == '1' && root.right != null) {
                root = root.right;
            } else if (binaryRep[k].charAt(i) == '0' && root.left != null) {
                root = root.left;
            } else if (binaryRep[k].charAt(i) == '1' && root.right == null && lengthOfStr == 0) {
                // found our place to insert
                Node temp = new Node();
                temp.left = null;
                temp.right = null;
                temp.number = 1;
                temp.value = c;
                root.right = temp;
                // move forward to the temp var
                root = root.right;
                root = hold;
                lengthOfStr--;
            } else if (binaryRep[k].charAt(i) == '0' && root.left == null && lengthOfStr == 0) {
                // should be a leaf node
                // found our place to insert
                Node temp = new Node();
                temp.left = null;
                temp.right = null;
                temp.number = 0;
                temp.value = c;
                root.left = temp;
                // move forward to the temp var
                root = root.right;
                root = hold;
                lengthOfStr--;
            } else if (binaryRep[k].charAt(i) == '1' && root.right == null) {
                // found our place to insert
                Node temp = new Node();
                temp.left = null;
                temp.right = null;
                temp.number = 1;
                temp.value = null;
                root.right = temp;
                // move forward to the temp var
                root = root.right;
                lengthOfStr--;
            } else if (binaryRep[k].charAt(i) == '0' && root.left == null) {
                // found our place to insert
                Node temp = new Node();
                temp.left = null;
                temp.right = null;
                temp.number = 0;
                temp.value = null;
                root.left = temp;
                // move forward to the temp var
                root = root.left;
                lengthOfStr--;
            }
        }
        k++;
    }
    return root;
}


}
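
For reference, the bit-string idea described in the question (start at a dummy root, go left on 0 and right on 1) can be sketched as below. This is a minimal sketch rather than a fix of the code above: the class CodeTableDecoder and its methods are hypothetical names, and it assumes the header has already been parsed into a symbol-to-code map. Only the H, e, l and o entries from the sample header are used, together with the first 18 bits of the sample output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodeTableDecoder {
    // Trie node: leaves carry a symbol, inner nodes leave value as null.
    static final class Node {
        Character value;
        Node left;   // followed on '0'
        Node right;  // followed on '1'
    }

    // Walk each code bit by bit from the root, creating missing nodes on the way,
    // and store the symbol in the node where the code ends.
    static Node buildTree(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node cur = root;
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.left == null) cur.left = new Node();
                    cur = cur.left;
                } else {
                    if (cur.right == null) cur.right = new Node();
                    cur = cur.right;
                }
            }
            cur.value = entry.getKey();
        }
        return root;
    }

    // Follow the bits through the trie; emit a symbol and restart at every leaf.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur == null) {
                throw new IllegalArgumentException("bit sequence not in code table");
            }
            if (cur.value != null) {
                out.append(cur.value);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Subset of the code table from the sample header (assumed already parsed).
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('H', "0000");
        codes.put('e', "1010");
        codes.put('l', "001");
        codes.put('o', "1100");
        Node root = buildTree(codes);
        System.out.println(decode(root, "000010100010011100")); // prints "Hello"
    }
}
```

Because the node is created before stepping into it, the traversal never dereferences a null child while building, and no frequencies are needed at all: the codes themselves fix the shape of the tree.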

score 0 · Answer 1 · answered Aug 04 '19 at 23:07

You are on the right track thinking that you could do it by the length of the binary string for each character. Many programs that do Huffman compression store a short header at the beginning of the file (or at the beginning of big chunks of the file). Typically the compressor and the decompressor agree to use "canonical Huffman codes". Then the header only needs to store which symbols are actually used in the plaintext, and the length (in bits) of the Huffman code for each of those symbols.
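
As a rough illustration of how a decompressor could rebuild codes from lengths alone, here is a minimal sketch of one common canonical assignment: sort symbols by code length, then by symbol, and hand out consecutive code values. The class CanonicalCodes and its methods are hypothetical names, and the tie-breaking order is an assumption. The compressor has to use the same rule, since these codes will generally differ from the ones a plain Huffman tree produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CanonicalCodes {
    // Assign canonical codes given only each symbol's code length in bits.
    static Map<Character, String> assign(Map<Character, Integer> lengths) {
        List<Map.Entry<Character, Integer>> symbols = new ArrayList<>(lengths.entrySet());
        // Shorter codes first; ties broken by symbol value (an assumed convention).
        symbols.sort((x, y) -> {
            int byLength = Integer.compare(x.getValue(), y.getValue());
            return byLength != 0 ? byLength : Character.compare(x.getKey(), y.getKey());
        });

        Map<Character, String> codes = new LinkedHashMap<>();
        int code = 0;
        int prevLen = 0;
        boolean first = true;
        for (Map.Entry<Character, Integer> entry : symbols) {
            int len = entry.getValue();
            if (!first) {
                // Next code value, padded with zero bits if the length grows.
                code = (code + 1) << (len - prevLen);
            }
            first = false;
            codes.put(entry.getKey(), toBits(code, len));
            prevLen = len;
        }
        return codes;
    }

    // Render the low len bits of code as a string with leading zeros.
    static String toBits(int code, int len) {
        StringBuilder sb = new StringBuilder(Integer.toBinaryString(code));
        while (sb.length() < len) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Integer> lengths = new LinkedHashMap<>();
        lengths.put('a', 1);
        lengths.put('b', 2);
        lengths.put('c', 3);
        lengths.put('d', 3);
        System.out.println(assign(lengths)); // prints {a=0, b=10, c=110, d=111}
    }
}
```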

It is not possible to decode a Huffman-compressed file given only the symbols used in the plaintext and their ranking. If you have 5 different symbols, the Huffman tree will contain 5 different bit sequences; however, the exact bit sequences generated by the Huffman algorithm depend on the exact frequencies. One document may have symbol counts of { 10, 10, 20, 40, 80 }, leading to Huffman bit sequences { 0000 0001 001 01 1 }. Another document may have symbol counts of { 40, 40, 79, 79, 80 }, leading to Huffman bit sequences { 000 001 01 10 11 }. Even though both situations have exactly 5 unique symbols, ranked in the same order, the actual Huffman code for the most-frequent symbol is very different in these two compressed documents: the Huffman code "1" in one document, the Huffman code "11" in the other. (See also: "Maximum number of different numbers, Huffman Compression".)
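
The two examples above can be checked with a small sketch that runs the usual Huffman merge on the counts and reports only the resulting code length per symbol. The class CodeLengths and its helper Group are hypothetical names used for illustration; the exact bit patterns depend on how left and right are assigned, so only the lengths are computed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class CodeLengths {
    // A partial tree: its total weight and the indexes of the leaves it contains.
    static final class Group {
        final long weight;
        final List<Integer> leaves;

        Group(long weight, List<Integer> leaves) {
            this.weight = weight;
            this.leaves = leaves;
        }
    }

    // Returns the Huffman code length of each symbol, in the order of counts.
    static int[] codeLengths(int[] counts) {
        int[] lengths = new int[counts.length];
        PriorityQueue<Group> pq =
                new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (int i = 0; i < counts.length; i++) {
            List<Integer> leaf = new ArrayList<>();
            leaf.add(i);
            pq.add(new Group(counts[i], leaf));
        }
        // Repeatedly merge the two lightest groups; every leaf inside a merged
        // group moves one level deeper, so its code grows by one bit.
        while (pq.size() > 1) {
            Group a = pq.poll();
            Group b = pq.poll();
            List<Integer> merged = new ArrayList<>(a.leaves);
            merged.addAll(b.leaves);
            for (int leaf : merged) {
                lengths[leaf]++;
            }
            pq.add(new Group(a.weight + b.weight, merged));
        }
        return lengths;
    }

    public static void main(String[] args) {
        // prints [4, 4, 3, 2, 1]
        System.out.println(Arrays.toString(codeLengths(new int[] {10, 10, 20, 40, 80})));
        // prints [3, 3, 2, 2, 2]
        System.out.println(Arrays.toString(codeLengths(new int[] {40, 40, 79, 79, 80})));
    }
}
```

The same ranking of symbols yields different length profiles, which is why the header has to carry either the frequencies, the code lengths, or the codes themselves.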
