I have built a Huffman encoder in Python, but because I'm storing the bits (which represent the characters) as strings, the encoded text is larger than the original. How can I use actual bits to compress text properly?
Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [on topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. StackOverflow is not a design, coding, research, or tutorial service. Python has built-in bit operations; where are you stuck with those? – Prune Feb 01 '18 at 17:47
1 Answer
You can convert a str of 1s and 0s to an int type variable like this:
>>> int('10110001',2)
177
And you can convert ints back to strs of 1s and 0s like this:
>>> format(177,'b')
'10110001'
Also, note that you can write int literals in binary using a leading 0b, like this:
>>> foo = 0b10110001
>>> foo
177
Now, before you say "No, I asked for bits, not ints!", think about that for a second. An int variable isn't stored in the computer's hardware as a base-10 representation of the number; it's stored directly as bits.
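To make the size difference concrete, here is a minimal sketch that packs a string of 1s and 0s into a bytes object using the built-in int.to_bytes method. The variable names (bits, value, n_bytes, packed) and the sample bit string are illustrative assumptions, not part of the original question:

```python
# Illustrative input: 32 characters, each of which costs one byte as text.
bits = '10110001' * 4

# Convert the whole string to a single int.
value = int(bits, 2)

# Smallest whole number of bytes that can hold every bit of the string.
n_bytes = (len(bits) + 7) // 8

# Big-endian keeps the bits in the same order as the string.
packed = value.to_bytes(n_bytes, 'big')

print(len(bits))    # 32 characters as a str
print(len(packed))  # 4 bytes once packed
```

A bytes object like packed can be written to a file opened in binary mode ('wb'), which is where the actual space saving would show up.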
EDIT: Stefan Pochmann points out that converting to an int and back drops leading zeros. Consider:
>>> code = '000010110001'
>>> bitcode = int(code, 2)
>>> format(bitcode, 'b')
'10110001'
So how do you keep the leading zeros? There are a few ways. Which one fits will likely depend on whether you want to convert each character's code to an int first and then concatenate them, or concatenate the strings of 1s and 0s before converting the whole thing to an int. The latter will probably be much simpler. One way that works well for the latter is to store the length of the code and then use it with this syntax:
>>> format(bitcode, '012b')
'000010110001'
where '012b' tells the format function to pad the left of the string with enough zeros to ensure a minimum length of 12. So you can use it in this way:
>>> code = '000010110001'
>>> code_length = len(code)
>>> bitcode = int(code, 2)
>>> format(bitcode, '0{}b'.format(code_length))
'000010110001'
Finally, if that {} and the second format are unfamiliar to you, read up on string formatting.
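Putting the pieces together, one possible way to round-trip a concatenated code through real bytes is to keep the bit count next to the payload. This is only a sketch under that assumption; pack_bits and unpack_bits are hypothetical helper names, and how you store the bit count in your file format is up to you:

```python
def pack_bits(code):
    """Return (bit_count, payload) for a string of '0'/'1' characters."""
    bit_count = len(code)
    n_bytes = (bit_count + 7) // 8          # round up to whole bytes
    value = int(code, 2) if code else 0     # int('', 2) would raise ValueError
    return bit_count, value.to_bytes(n_bytes, 'big')


def unpack_bits(bit_count, payload):
    """Rebuild the original string, including any leading zeros."""
    if bit_count == 0:
        return ''
    value = int.from_bytes(payload, 'big')
    # Pad on the left with zeros up to the stored bit count.
    return format(value, '0{}b'.format(bit_count))


code = '000010110001'
bit_count, payload = pack_bits(code)
print(bit_count, payload)               # 12 b'\x00\xb1'
print(unpack_bits(bit_count, payload))  # 000010110001
```

Storing the length is what lets the decoder tell real leading zeros apart from the padding that rounds the payload up to a whole byte.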
