
This is my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.

I'm implementing Huffman coding in C++ and have saved the characters and their codes in a table like this:

Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE

Table (built from the Huffman tree): [image: Table]

Now I want to save this table to a file as compactly as possible.

I can't save it like this: A1B001C010D001E000

When this is converted to bits, it becomes: 01000001101000010001010000110100100010000101000101000

I can't decode that, because the codes have different lengths and nothing marks where each one ends.

If I save the table in the normal way, each character uses 8 bits to store its code,

while my characters have 1-bit or 3-bit codes (in this case).

That wastes a lot of storage.

My idea is to add a separator character and give it a code of its own.

If we add a separator character, build the Huffman tree, and write out the codes, we get a table like this: [image: table2]

Now we can write the codes this way:

A0SepB110SepC100SepD1111SepE1110Sep

Binary: 0100000101010100001011010101000011100101010001001111101010001011110101

I decode it this way:

sep = 101.

  • Read 8 bits: 01000001 -> it's A.

rest = 01010100001011010101000011100101010001001111101010001011110101.

  • Read 1 bit: 0 (doesn't match sep1)
  • Read 1 bit: 1 (matches sep1), read 1 bit: 0 (matches sep2), read 1 bit: 1 (matches sep3, the end)
  • The separator was found, so the code for A is everything before it: 0.

rest = 0100001011010101000011100101010001001111101010001011110101.

  • Read 8 bits: 01000010 -> it's B.

rest = 11010101000011100101010001001111101010001011110101.

  • Read 1 bit: 1 (matches sep1), read 1 bit: 1 (doesn't match sep2)
  • Read 1 bit: 0 (doesn't match sep1)
  • Read 1 bit: 1 (matches sep1), read 1 bit: 0 (matches sep2), read 1 bit: 1 (matches sep3, the end)
  • The separator was found, so the code for B is everything before it: 110.

And so on.

This approach still spends some storage on the separators (number of characters × separator size).

My question: is there a way to save the first table to a file using less storage?

For example, something like this: A1B001C010D001E000.

mohammad

2 Answers


Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.

Mark Adler

You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code. Because you're working with bytes, there are 256 possible values, and a Huffman tree over them can be at most 255 levels deep (number of possible values - 1), so 8 bits are enough to store each code length.

The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
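As a rough sketch of how a decoder might turn such a length table back into codes, the snippet below assigns canonical Huffman codes from the 256 lengths. The function name `canonical_codes` is made up for illustration, a length of 0 is assumed to mean "symbol not used", and the lengths are assumed to come from a valid Huffman tree. Codes are returned as strings of '0' and '1' only to keep the example readable.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: rebuild canonical Huffman codes from code lengths.
// lengths[s] is the code length of byte value s; 0 is assumed to mean "unused".
std::array<std::string, 256> canonical_codes(
        const std::array<std::uint8_t, 256>& lengths) {
    // Collect the used symbols and sort them by (length, symbol value).
    std::vector<int> symbols;
    for (int s = 0; s < 256; ++s)
        if (lengths[s] != 0) symbols.push_back(s);
    std::sort(symbols.begin(), symbols.end(), [&](int a, int b) {
        if (lengths[a] != lengths[b]) return lengths[a] < lengths[b];
        return a < b;
    });

    std::array<std::string, 256> codes;
    std::string code;   // code of the previous symbol, as a bit string
    bool first = true;
    for (int s : symbols) {
        if (!first) {
            // Binary increment of the previous code.
            int i = static_cast<int>(code.size()) - 1;
            while (i >= 0 && code[i] == '1') code[i--] = '0';
            if (i >= 0) code[i] = '1';
        }
        // Pad with zeros on the right up to this symbol's length.
        code.append(lengths[s] - code.size(), '0');
        codes[s] = code;
        first = false;
    }
    return codes;
}
```

With the lengths from the question (A = 1, B to E = 3), this should give A = 0, B = 100, C = 101, D = 110, E = 111. These are not the same bit patterns as in the original tree, only the same lengths, so the encoder has to use the canonical codes as well.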

However, if you are compressing English text, there is a better way to store the table. Store the shape of the tree, with a 0 for each internal node and a 1 for each leaf. Then, after the shape, store the values of the leaves in the same order.

The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:

        *
      /   \
     *     A
   /   \
  *     *
 / \   / \
E   D C   B

Walking the tree in pre-order (node first, then its left subtree, then its right subtree), you would store it as: 000110111EDCBA
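A minimal sketch of that encoding, assuming a simple `Node` type and the helper names `serialize` and `deserialize` (all three are invented here for illustration). The shape is kept as a string of '0' and '1' characters for readability. In a real file each of them would be a single bit, followed by one byte per leaf value.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type: a leaf has no children and carries a symbol.
struct Node {
    char symbol = 0;
    std::unique_ptr<Node> left, right;
    bool is_leaf() const { return !left && !right; }
};

// Pre-order walk: '0' for an internal node, '1' for a leaf.
// Leaf symbols are collected separately, in the order they are visited.
void serialize(const Node& n, std::string& shape, std::string& leaves) {
    if (n.is_leaf()) {
        shape += '1';
        leaves += n.symbol;
        return;
    }
    shape += '0';
    serialize(*n.left, shape, leaves);
    serialize(*n.right, shape, leaves);
}

// Rebuild the tree. `pos` and `leaf` are cursors into the two strings;
// at() throws std::out_of_range if the input is truncated.
std::unique_ptr<Node> deserialize(const std::string& shape,
                                  const std::string& leaves,
                                  std::size_t& pos, std::size_t& leaf) {
    auto n = std::make_unique<Node>();
    if (shape.at(pos++) == '1') {
        n->symbol = leaves.at(leaf++);
        return n;
    }
    n->left = deserialize(shape, leaves, pos, leaf);
    n->right = deserialize(shape, leaves, pos, leaf);
    return n;
}
```

The decoder does not need the number of leaves up front: the recursion ends by itself once every internal node has received both children, which for the example happens after 9 shape bits.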

The reason this is better for English text is cost. Storing the tree takes 10n - 1 bits, where n is the number of unique characters in the data you are compressing: a tree with n leaves has 2n - 1 nodes, so the shape takes 2n - 1 bits, and the leaf values take another 8n bits. Storing the code lengths costs a flat 2048 bits (256 bytes). Therefore, for fewer than 205 unique characters, storing the shape of the tree is more efficient. Because an average English text uses only a small part of the 256 possible byte values, you're usually better off storing the tree shape.

If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.
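One way an encoder could make that choice is sketched below, using only the two size figures from this answer. The names `kLengthTableBits`, `tree_header_bits`, and `prefer_tree_header` are hypothetical, and the extra flag bit is left out because it adds the same single bit to both formats.

```cpp
#include <cstddef>

// Size of the code-length header: one byte for each of the 256 byte values.
constexpr std::size_t kLengthTableBits = 256 * 8;

// Size of the tree header for n unique symbols:
// 2n - 1 shape bits plus 8 bits per leaf value.
constexpr std::size_t tree_header_bits(std::size_t unique_symbols) {
    return 10 * unique_symbols - 1;
}

// True when the tree-shape header is strictly smaller than the length table.
constexpr bool prefer_tree_header(std::size_t unique_symbols) {
    return unique_symbols > 0 &&
           tree_header_bits(unique_symbols) < kLengthTableBits;
}
```

For the five-symbol example the tree header is 49 bits, and the crossover sits between 204 and 205 unique symbols (2039 versus 2049 bits).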

fortytoo