Why is Huffman encoded text bigger than actual text?

Question

I am trying to understand how Huffman coding works. It is supposed to compress data so that it takes less memory than the original text, but when I encode, for example,

"Text to be encoded"

which has 18 characters, the result I get is

"100100110100101110101011111000001110011011110010101100011"

Am I supposed to divide the number of result bits by 8, since a character has 8 bits?

Actual result is `10010011 01001011 10101011 11100000 11100110 11110010 10110001 00000001` - **8** ASCII characters (technically, you should not *divide* by 8, but *group* by 8 bit chunks). More accurate is `"Text to be encoded" == 18 * 8 = 144 bits` before and `57` bits after the compression — Dmitry Bychenko, Jan 08 '18 at 21:59
"Text to be encoded" is a string. Each character in the uncompressed string is represented by an 8-bit ASCII character making the total uncompressed string 18*8=144 bits. The Huffman code is 57 bits. — jodag, Jan 08 '18 at 22:01

Dmitry Bychenko · Accepted Answer · 2018-01-08T22:51:56.770

You should compare the same units (bits, as in the output after compression, or characters, as in the text before), e.g.

before: "Text to be encoded" == 18 * 8 bits = 144 bits
                             == 18 * 7 bits = 126 bits (in case of 7-bit characters)
after:  100100110100101110101011111000001110011011110010101100011 = 57 bits

So you have 144 (or 126) bits before compression and 57 bits after. Alternatively, counting in characters:

before: "Text to be encoded" == 18 characters
after:   10010011 
         01001011
         10101011
         11100000
         11100110
         11110010
         10110001
         00000001 /* the last chunk is padded */ == 8 characters

So you have 18 ASCII characters before and only 8 one-byte characters after compression. If characters are supposed to be 7-bit (the 0..127 range of the ASCII table), we have 9 characters after compression:

after:  1001001 'I'
        1010010 'R'
        1110101 'u'
        0111110 '>'
        0000111 '\x07'
        0011011 '\x1B'
        1100101 'e'
        0110001 '1'
        0000001 '\x01'
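
The numbers above can be reproduced with a short script. This is only a sketch in Python (the question does not name a language), and the helper names `huffman_total_bits` and `chunk` are made up for illustration. The first helper computes how many bits an optimal Huffman code needs for the text, not counting the code table itself; the second groups the bit string into 8-bit or 7-bit chunks and left-pads the last chunk with zeros, as done in the listings above.

```python
import heapq
from collections import Counter

TEXT = "Text to be encoded"
ENCODED = "100100110100101110101011111000001110011011110010101100011"


def huffman_total_bits(text):
    """Bit length of an optimal Huffman encoding of text (code table excluded)."""
    heap = list(Counter(text).values())
    heapq.heapify(heap)
    total = 0
    # Each merge pushes every symbol below it one level deeper in the tree,
    # so the sum of all merged weights equals the total encoded length.
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def chunk(bits, width):
    """Split bits into groups of `width`; left-pad the last group with zeros."""
    groups = [bits[i:i + width] for i in range(0, len(bits), width)]
    if groups:
        groups[-1] = groups[-1].zfill(width)
    return groups


# Same units on both sides: bits before vs. bits after.
print(len(TEXT) * 8, len(TEXT) * 7, len(ENCODED), huffman_total_bits(TEXT))

# Or characters before vs. chunks after.
bytes_8 = chunk(ENCODED, 8)
chars_7 = chunk(ENCODED, 7)
print(len(TEXT), len(bytes_8), len(chars_7))
print([hex(int(g, 2)) for g in bytes_8])
print([int(g, 2) for g in chars_7])
```

The first `print` shows 144 and 126 bits before against 57 bits after; the second shows 18 characters before against 8 bytes (or 9 seven-bit characters) after.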

"only 8 ascii characters after" --> More like 8 _bytes_ as [ASCII](https://en.wikipedia.org/wiki/ASCII) is only defined for values 0-127. — chux - Reinstate Monica, Jan 08 '18 at 22:29
