I have compressed a binary file using Huffman encoding. Now I am trying to find the compression efficiency.

In my binary file I have symbols (groups of 0s and 1s) and frequencies (the number of times each symbol occurs). Suppose I have:

symbol :0 freq : 173
symbol :1 freq : 50
symbol :2 freq : 48
symbol :3 freq : 45 

The size of the file would be (173 + 50 + 48 + 45) * 8 = 2528. Is this the correct way to calculate the size? Please correct me if I am wrong. (When debugging I get 2536, which is 8 more, and I don't know why.)

After compression I got an encoding like this:

symbol : 0 Code : 1
symbol : 1 Code : 00
symbol : 2 Code : 011
symbol : 3 Code : 010

Could someone please tell me how to get the Huffman compression efficiency of this binary file from this information? (I tried searching on Google, but I found no example that uses a binary file. The examples use frequencies given as floating-point values, and I do not understand how to relate those to my binary file.) Thanks a lot. An algorithm in C/C++/C# to do this would also be appreciated.

Sss
  • Read http://stackoverflow.com/questions/4340610/any-theoretical-limit-to-compression and the related links – Tarik Apr 07 '14 at 11:56

3 Answers

Given your symbol table:

symbol : 0 Code : 1
symbol : 1 Code : 00
symbol : 2 Code : 011
symbol : 3 Code : 010

and your byte counts:

symbol :0 freq : 173
symbol :1 freq : 50
symbol :2 freq : 48
symbol :3 freq : 45 

You then multiply the number of occurrences of each symbol by the number of bits for that symbol. For example, symbol 0 requires 1 bit to encode, so the number of bits would be 173. You have:

(1 * 173) + (2 * 50) + (3 * 48) + (3 * 45)

That count is in bits. Divide by 8 to give you the number of bytes, and round up. That will tell you how many bytes for the encoded data.

You also have to store the Huffman table, which in this case you could do in 8 bytes. Actually, 9 bytes because you have to store the size. The general case of storing the Huffman tree is somewhat more involved.
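The calculation described above can be sketched in Python. This is only an illustration, not code from the original post: the names `codes`, `freqs`, and `encoded_size_bits` are made up for the example, and the table size is simply the 9-byte estimate given above.

```python
# Code table and symbol counts copied from the question.
codes = {0: "1", 1: "00", 2: "011", 3: "010"}
freqs = {0: 173, 1: 50, 2: 48, 3: 45}


def encoded_size_bits(codes, freqs):
    """Sum of (code length in bits) * (number of occurrences) over all symbols."""
    return sum(len(codes[sym]) * count for sym, count in freqs.items())


bits = encoded_size_bits(codes, freqs)
data_bytes = (bits + 7) // 8      # divide by 8 and round up to whole bytes
table_bytes = 9                   # assumption: the 9-byte table estimate from this answer
total_bytes = data_bytes + table_bytes

print(bits, data_bytes, total_bytes)
```

The integer expression `(bits + 7) // 8` rounds up without going through floating point, which keeps the byte count exact for large files.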

Jim Mischel
  • Sorry, I couldn't understand this sentence: "You also have to store the Huffman table". What is the Huffman table, and how do I store it? – Sss Apr 07 '14 at 12:16
  • Well, to be able to decode your file, you need to store the Huffman table along with the actual data. You also need to have a structure to indicate where it ends. – Tarik Apr 07 '14 at 12:31
  • @Tarik I don't have to decode the file; only the encoding and the compression ratio have to be printed. Do I still need the Huffman table? What does the Huffman table actually consist of? Does it contain the symbols and their encodings? – Sss Apr 07 '14 at 12:38
  • @user234839: The Huffman table is the list of symbols and their encodings. Those are required if somebody wants to decompress the file. Is there anything about this assignment that you *did* understand? – Jim Mischel Apr 07 '14 at 12:43
  • @JimMischel I have already done it, but I don't know these things by name. See http://prntscr.com/37su18. Is the table on the right-hand side (which contains the codes) called the "Huffman table", and the table on the left (which contains the data) called the data? Am I right? Please correct me if I am wrong. Thanks a lot. – Sss Apr 07 '14 at 12:46
  • @user234839 The fact that you do not have to decode does not mean you should not include the Huffman table along with the data. The reason is that we could create a compression scheme that uses a huge decoding table with very large symbols and then use only a few bytes to represent the compressed data, thereby claiming high compression ratios. I believe that the total length of both the symbol table and the compressed data should be considered when calculating the compression ratio. – Tarik Apr 08 '14 at 06:56
  • @Tarik I am not able to understand technical terms like "symbol table", "dictionary" and "data", even after searching on Google. I have done everything. Could you please look at the screenshot of my project and tell me which part is the "symbol table", which is the "dictionary" and which is the "data"? http://prntscr.com/37su18 (That way I would be able to proceed further, and it would be a great help for me.) Thank you so much. – Sss Apr 08 '14 at 07:54
  • It's what you call encoding in your question "symbol : 0 Code : 1 symbol : 1 Code : 00 symbol : 2 Code : 011 symbol : 3 Code : 010" – Tarik Apr 08 '14 at 07:58
  • @Tarik So you mean that the length of each code multiplied by its frequency (frequency * code length), summed over all symbols, is called the "symbol table". Am I right? And the compression efficiency would be the result of the previous step divided by the original length of the file before compression. Am I going in the right direction? – Sss Apr 08 '14 at 08:11

Once you have your Huffman table, you can calculate the size of the compressed data in bits by multiplying the code length in bits of each symbol by that symbol's frequency and summing the results.

On top of that, you then need to add the size of the Huffman tree itself, which is of course needed to decompress.

So for your example the compressed length will be

173 * 1 + 50 * 2 + 48 * 3 + 45 * 3 = 173 + 100 + 144 + 135 = 552 bits = 69 bytes.

The size of the table depends on how you represent it.
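To turn these sizes into a compression ratio, one possible sketch in Python follows. It assumes that every original symbol occupies 8 bits (as in the question's own size formula) and it defines the ratio as original size divided by compressed size; other conventions exist. The function name `compression_stats` and the 72-bit table size used in the second call are illustrative assumptions, not values from the post.

```python
def compression_stats(original_bits, encoded_bits, table_bits=0):
    """Return (ratio, saving) for the given sizes in bits.

    ratio  = original size / compressed size (larger is better)
    saving = fraction of the original size that was removed
    """
    compressed_bits = encoded_bits + table_bits
    ratio = original_bits / compressed_bits
    saving = 1 - compressed_bits / original_bits
    return ratio, saving


# Assumption: each of the 316 symbols is stored as one 8-bit byte in the original file.
original_bits = (173 + 50 + 48 + 45) * 8
encoded_bits = 173 * 1 + 50 * 2 + 48 * 3 + 45 * 3

# Encoded data only, ignoring the table.
ratio, saving = compression_stats(original_bits, encoded_bits)

# Hypothetical 9-byte (72-bit) table included in the compressed size.
ratio_with_table, saving_with_table = compression_stats(original_bits, encoded_bits, table_bits=72)

print(f"{ratio:.2f}x, {saving:.1%} saved")
print(f"{ratio_with_table:.2f}x, {saving_with_table:.1%} saved (with table)")
```

Passing the table size as a separate argument makes it easy to see how much the stored table costs, which matters most for small files like this one.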

  • Thanks for the reply. Do you mean 1*173 + 2*50 + 3*48 + 3*45? And is the result in bits or bytes? (And the size of the file is 2528 bytes.) – Sss Apr 07 '14 at 12:11
  • Now, to calculate the efficiency, do we need to do 70/2528? – Sss Apr 07 '14 at 12:12
  • Sorry, one more thing: I couldn't understand the sentence "store the Huffman table". What is the Huffman table, and how do I store it? – Sss Apr 07 '14 at 12:17
  • You need to communicate your encoding to the de-compressor - otherwise it will not know how to interpret the bits. – 500 - Internal Server Error Apr 07 '14 at 12:22

compression rate = ((1 * 173) + (2 * 50) + (3 * 48) + (3 * 45)) / (173 + 50 + 48 + 45) = 552 / 316 ≈ 1.747 bits per symbol

entropy rate = -(sum of P * log2(P)) ≈ 1.710 bits per symbol
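Both quantities can be computed directly from the question's counts and codes. The Python sketch below is an illustration under the assumption that "efficiency" means the usual coding-theory ratio of entropy to average code length; the variable names are made up for the example.

```python
import math

# Code table and symbol counts copied from the question.
codes = {0: "1", 1: "00", 2: "011", 3: "010"}
freqs = {0: 173, 1: 50, 2: 48, 3: 45}

total = sum(freqs.values())

# Average code length in bits per symbol (the "compression rate" above).
avg_len = sum(len(codes[sym]) * count for sym, count in freqs.items()) / total

# Shannon entropy in bits per symbol: -sum(p * log2(p)), with p = count / total.
entropy = -sum((count / total) * math.log2(count / total) for count in freqs.values())

# Assumed definition: efficiency = entropy / average code length (at most 1).
efficiency = entropy / avg_len

print(f"average length: {avg_len:.3f} bits/symbol")
print(f"entropy:        {entropy:.3f} bits/symbol")
print(f"efficiency:     {efficiency:.1%}")
```

The entropy is a lower bound on the average code length of any symbol-by-symbol code, so an efficiency close to 100% means the Huffman code is near the best achievable for these frequencies.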

koalagreener