I'm working on a huffman encoder/decoder in Python, and am experiencing some unexpected (at least for me) behavior in my code. Encoding the file is fine, the problem occurs when decoding the file. Below is the associated code:
def decode(cfile):
with open(cfile,"rb") as f:
enc = f.read()
len_dkey = int(bin(ord(enc[0]))[2:].zfill(8) + bin(ord(enc[1]))[2:].zfill(8),2) # length of dictionary
pad = ord(enc[2]) # number of padding zeros at end of message
dkey = { int(k): v for k,v in json.loads(enc[3:len_dkey+3]).items() } # dictionary
enc = enc[len_dkey+3:] # actual message in bytes
com = []
for b in enc:
com.extend([ bit=="1" for bit in bin(ord(b))[2:].zfill(8)]) # actual encoded message in bits (True/False)
cnode = 0 # current node for tree traversal
dec = "" # decoded message
for b in com:
cnode = 2 * cnode + b + 1 # array implementation of tree
if cnode in dkey:
dec += dkey[cnode]
cnode = 0
with codecs.open("uncompressed_"+cfile,"w","ISO-8859-1") as f:
f.write(dec)
The first with open(cfile,"rb") as f
call runs very quickly for all file sizes (tested sizes are 1.2MB, 679KB, and 87KB), but the part that slows down the code significantly is the for b in com
loop. I've done some timing and I honestly don't know what's going on.
I've timed the whole decode
function on each file, as shown below:
87KB 1.5 sec
679KB 6.0 sec
1.2MB 384.7 sec
first of all, I don't even know how to assign this complexity. Next, I've timed a single run through of the problematic loop, and got that the line cnode = 2*cnode + b + 1
takes 2e-6 seconds while the if cnode in dkey
line takes 0.0 seconds (according to time.clock() on OSX). So it seems as if the arithmetic is slowing down my program significantly...? Which I feel like doesn't make sense.
I actually have no idea what is going on, and any help at all would be super welcome