Decoding a Huffman Tree from a File

Question

I'm trying to decode a Huffman code.

I have a dictionary of characters, each with its int value and its binary code, followed by the binary-encoded message. It looks like this:

10,000;121,001;13,010;33,011;100,1000;32,1001;104,1010;101,1011;111,1100;108,1101;119,1110;114,1111;10101011001100111101100111111011000011011010000

... where numbers such as 10, 121, 13, and 33 are the int values of characters, the value after each comma is that character's binary code, and the final sequence of 1s and 0s is the encoded message.

After reading it from a text file, I split it into an array of strings so that I can build a HashMap with the character's int value as the key and its binary code as the value.
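A minimal sketch of that parsing step, assuming the text to parse is exactly the `;`-separated sample line shown above, with the encoded message as the last field. The class and method names (`DictionaryParser`, `parseDictionary`, `parseMessage`) are hypothetical and not part of the original program:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryParser {
    // Hypothetical helper: maps each character's int value to its bit string.
    static Map<Integer, String> parseDictionary(String line) {
        String[] parts = line.split(";");
        Map<Integer, String> codes = new LinkedHashMap<>();
        // Every field except the last one is an "intValue,bits" pair.
        for (int i = 0; i < parts.length - 1; i++) {
            String[] pair = parts[i].split(",");
            codes.put(Integer.valueOf(pair[0]), pair[1]);
        }
        return codes;
    }

    // Hypothetical helper: the last field is the encoded message itself.
    static String parseMessage(String line) {
        return line.substring(line.lastIndexOf(';') + 1);
    }

    public static void main(String[] args) {
        String line = "10,000;121,001;13,010;33,011;100,1000;32,1001;104,1010;101,1011;"
                + "111,1100;108,1101;119,1110;114,1111;"
                + "10101011001100111101100111111011000011011010000";
        Map<Integer, String> codes = parseDictionary(line);
        System.out.println(codes.size() + " codes, code for 'h' (104): " + codes.get(104));
        System.out.println("message: " + parseMessage(line));
    }
}
```

Note that the loop bound here is `parts.length - 1`, so only the final field (the message) is left out of the map.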

Then I store the entries in a list of nodes (CharCode objects) so I can access them easily. The problem is this:

When I try to convert the binary message to characters using the dictionary, I get a message like this:

1!y1y111!y11111!

When it should be this:

hey world!!

This is the method I'm using:

void decompress() throws HuffmanException, IOException {
    File file = FilesManager.chooseUncompressedFile();
    if (file == null) {
        throw new HuffmanException("No file");
    }
    FileReader read = new FileReader(file);
    BufferedReader buff = new BufferedReader(read);
    String auxText;
    StringBuilder compressFromFile = new StringBuilder();
    do {
        auxText = buff.readLine();
        if (auxText != null) {
            compressFromFile.append(auxText);
        }
    } while (auxText != null);
    String[] auxSplit1 = compressFromFile.toString().split(" ");
    String rest1 = auxSplit1[1];
    String[] auxSplit2 = rest1.split(";");
    System.out.println(auxSplit2[2]);
    HashMap<Integer, String> map = new HashMap<>();
    String[] tomapAux;
    for (int i = 0; i < auxSplit2.length - 2; i++) {
        tomapAux = auxSplit2[i].split(",");

        map.put(Integer.valueOf(tomapAux[0]), tomapAux[1]);
    }
    ArrayList<CharCode> charCodeArrayList = new ArrayList<>();

    map.forEach((k, v) -> charCodeArrayList.add(new CharCode((char) k.intValue(), v)));

    charCodeArrayList.sort(new Comparator<CharCode>() {
        @Override
        public int compare(CharCode o1, CharCode o2) {
            return extractInt(o1.getCode()) - extractInt(o2.getCode());
        }

        int extractInt(String s) {
            String num = s.replaceAll("\\D", "");
            return num.isEmpty() ? 0 : Integer.parseInt(num);
        }
    });

    for (int i = 0; i < charCodeArrayList.size(); i++) {
        System.out.println("Pos " + i + " char: " + charCodeArrayList.get(i).getChr() + " code: " + charCodeArrayList.get(i).getCode());
    }
    String st = auxSplit2[auxSplit2.length - 1];
    System.out.println("before: " + st);
    String newChar = String.valueOf(charCodeArrayList.get(0).getChr());
    String oldChar = charCodeArrayList.get(0).getCode();
    for (CharCode aCharCodeArrayList : charCodeArrayList) {
        st = st.replace(oldChar, newChar);
        newChar = String.valueOf(aCharCodeArrayList.getChr());
        oldChar = aCharCodeArrayList.getCode();
    }
    System.out.println("after : " +st);

}

And this is the class CharCode:

public class CharCode implements Comparable<CharCode> {
    private char chr;
    private String code;

    public CharCode(char chr, String code) {
        this.chr = chr;
        this.code = code;
    }

    public char getChr() {
        return chr;
    }

    public String getCode() {
        return code;
    }

    @Override
    public int compareTo(CharCode cc) {
        return ((int) this.chr) - ((int) cc.getChr());
    }
}

And this is what I see in the console:

[screenshot of the console output, not preserved]

If anyone can help me improve my method so that I get hey world!! instead of 1!y1y111!y11111! !!01, that would be great!

In the loop that fills your map, why do you use `for (int i = 0; i < auxSplit2.length - 2; i++)` instead of `for (int i = 0; i < auxSplit2.length - 1; i++)`? You're skipping the last Huffman code — mangusta, Jul 09 '18 at 02:11
@mangusta I use it so I can separate the `10,000;121,001;13,010;33,011;etc; ` from the `10101011001100111101100111111011000011011010000` — Esteban, Jul 09 '18 at 02:50
The index of `10101011001100111101100111111011000011011010000` is `auxSplit2.length-1`; everything that comes before it is a Huffman code, so your loop condition must be `for (int i = 0; i < auxSplit2.length - 1; i++)` — mangusta, Jul 09 '18 at 02:54
@mangusta Yep, I use that `for` just to separate the "parent" string, which has all of that on a single line. Then I save the `10,000;121,001;13,010;33,011;etc;` part in the map: when I see a 10 followed by a 000, I save the int 10 as the key and 000 as the String value — Esteban, Jul 09 '18 at 03:00
I'm not sure what you're talking about, but you definitely need to change the loop condition the way I told you — mangusta, Jul 09 '18 at 03:02
@mangusta Yeah, I already changed it, but I still get `1!y1y111!y11111!` as the result. It's a little different from the last result, thanks — Esteban, Jul 09 '18 at 03:06
Please check my answer below; I think you misunderstood the decoding process — mangusta, Jul 09 '18 at 04:01

mangusta · Accepted Answer · 2018-07-13T01:36:04.997

The problem with your program is that you're decoding the wrong way: you take the first Huffman code and replace all of its occurrences in the given string, then do the same with the next Huffman code, and so on.
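To make the failure concrete, here is a small check on the message from the question. It is only an illustrative sketch; the class name `ReplaceAllPitfall` and the constant `BITS` are made up for this example. It looks for the first place a global search finds `000` (the code for int value 10, the line feed character) and compares that with where a real code sits in the bit string:

```java
public class ReplaceAllPitfall {
    // The encoded message from the question.
    static final String BITS = "10101011001100111101100111111011000011011010000";

    public static void main(String[] args) {
        // A global replace of "000" starts at the first match, wherever it is.
        int hit = BITS.indexOf("000");
        System.out.println("first \"000\" found at index " + hit);      // prints 32

        // Read left to right, the codes are 1010 1011 001 1001 1110 1100 1111 1101 1000 ...
        // so indices 31..34 hold the code for 'd'.
        System.out.println("bits 31..34: " + BITS.substring(31, 35));   // prints 1000
    }
}
```

The match at index 32 lies inside the `1000` that encodes `d`, so replacing it tears a valid code apart, and every later replacement then works on an already corrupted string.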

Replacing codes globally is not how a Huffman-encoded string is decoded. To decode it, you need to check whether a PREFIX of the string equals one of the Huffman codes. This is done by comparing the start of the string with the Huffman codes one by one.

In your case:
iteration 1: 10101011001100111101100111111011000011011010000
we check 000 - not a prefix
we check 001 - not a prefix
we check 010 - not a prefix
we check 011 - not a prefix
we check 1000 - not a prefix
we check 1001 - not a prefix
we check 1010 - found a prefix! and it corresponds to letter h

Now we remove this prefix from the original string and so our string is
1011001100111101100111111011000011011010000

iteration 2: 1011001100111101100111111011000011011010000
suitable prefix is 1011 which is letter e

iteration 3: 001100111101100111111011000011011010000
suitable prefix is 001 which is letter y

iteration 4: 100111101100111111011000011011010000
suitable prefix is 1001 which is space character

and so on, until nothing remains from the original string.

The modified code looks as follows:

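while (st.length() > 0)
{
    boolean matched = false;

    for (CharCode cc : charCodeArrayList)
    {
        // compare the start of the remaining string with each code
        if (st.startsWith(cc.getCode()))
        {
            System.out.println("found: " + cc.getChr());
            st = st.substring(cc.getCode().length()); // drop the consumed prefix
            matched = true;
            break;
        }//end if

    }//end for

    // without this guard, input that matches no code would loop forever
    if (!matched)
    {
        throw new IllegalStateException("no code matches: " + st);
    }

}//end while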
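For anyone who wants to run the idea end to end, below is a self-contained sketch. It is a variation on the loop above rather than the same code: instead of calling `startsWith` for every code, it grows a candidate prefix one bit at a time and looks it up in a map keyed by code, which works because no Huffman code is a prefix of another. The names `HuffmanDecodeSketch`, `parseCodes`, and `decode` are hypothetical, and the input is assumed to be exactly the sample line from the question.

```java
import java.util.HashMap;
import java.util.Map;

public class HuffmanDecodeSketch {
    // Builds a code -> character map from every field except the last one.
    static Map<String, Character> parseCodes(String[] parts) {
        Map<String, Character> byCode = new HashMap<>();
        for (int i = 0; i < parts.length - 1; i++) {
            String[] pair = parts[i].split(",");
            byCode.put(pair[1], (char) Integer.parseInt(pair[0]));
        }
        return byCode;
    }

    // Grows a candidate prefix bit by bit until it equals a known code.
    static String decode(Map<String, Character> byCode, String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            Character ch = byCode.get(current.toString());
            if (ch != null) {
                out.append(ch.charValue());
                current.setLength(0); // start matching the next code
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("trailing bits match no code: " + current);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "10,000;121,001;13,010;33,011;100,1000;32,1001;104,1010;101,1011;"
                + "111,1100;108,1101;119,1110;114,1111;"
                + "10101011001100111101100111111011000011011010000";
        String[] parts = input.split(";");
        Map<String, Character> byCode = parseCodes(parts);
        // Prints "hey world!!" followed by a carriage return and a line feed.
        System.out.print(decode(byCode, parts[parts.length - 1]));
    }
}
```

The last two codes in the message (`010` and `000`) decode to characters 13 and 10, so the output ends with a Windows-style line break after `hey world!!`.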

Ooooh, I see. I was doing it completely wrong; it makes sense now. Thank you very much @mangusta, you helped me a lot, you are the best — Esteban, Jul 09 '18 at 04:16
