
I'm working on a Huffman encoder/decoder in Python, and am experiencing some behavior that is unexpected, at least to me. Encoding the file is fine; the problem occurs when decoding it. Below is the associated code:

import json, codecs  # needed for json.loads and codecs.open below

# (Python 2: indexing a byte string yields 1-char strings, hence the ord() calls)
def decode(cfile):
    with open(cfile,"rb") as f:
        enc = f.read()
        len_dkey = int(bin(ord(enc[0]))[2:].zfill(8) + bin(ord(enc[1]))[2:].zfill(8),2) # length of dictionary
        pad = ord(enc[2]) # number of padding zeros at end of message
        dkey = { int(k): v for k,v in json.loads(enc[3:len_dkey+3]).items() } # dictionary
        enc = enc[len_dkey+3:] # actual message in bytes
        com = []
        for b in enc:
            com.extend([ bit=="1" for bit in bin(ord(b))[2:].zfill(8)]) # actual encoded message in bits (True/False)
    cnode = 0 # current node for tree traversal
    dec = "" # decoded message
    for b in com:
        cnode = 2 * cnode + b + 1 # array implementation of tree
        if cnode in dkey:
            dec += dkey[cnode]
            cnode = 0

    with codecs.open("uncompressed_"+cfile,"w","ISO-8859-1") as f:
        f.write(dec)
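
For context, the `cnode = 2 * cnode + b + 1` line is the standard array embedding of a binary tree: the children of node n sit at indices 2n+1 and 2n+2, so a 0-bit steps to the left child and a 1-bit to the right child. A quick standalone example of the indexing (not part of the decoder):

node = 0                  # start at the root
for bit in [1, 0, 1]:     # follow the path right, left, right
    node = 2 * node + bit + 1
print node                # 12, the array index of the node reached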

The first `with open(cfile,"rb") as f` block runs very quickly for all file sizes (tested sizes are 1.2MB, 679KB, and 87KB), but the part that slows the code down significantly is the `for b in com` loop. I've done some timing, and I honestly don't know what's going on.

I've timed the whole `decode` function on each file, as shown below:

87KB      1.5 sec
679KB     6.0 sec
1.2MB   384.7 sec

First of all, I don't even know what complexity class that scaling corresponds to. Next, I've timed a single run-through of the problematic loop, and got that the line `cnode = 2*cnode + b + 1` takes about 2e-6 seconds while the `if cnode in dkey` line takes 0.0 seconds (according to `time.clock()` on OS X). So it seems as if the arithmetic is what's slowing my program down significantly...? Which I feel doesn't make sense.
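
For reference, the per-line timing was roughly along these lines (a sketch with dummy stand-in values; the real measurement used the live loop variables):

import time

cnode, b, dkey, dec = 0, True, {2: "a"}, ""   # dummy stand-ins for one iteration
t0 = time.clock()
cnode = 2 * cnode + b + 1
t1 = time.clock()
if cnode in dkey:
    dec += dkey[cnode]
    cnode = 0
t2 = time.clock()
print t1 - t0, t2 - t1                        # per-statement deltas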

I actually have no idea what is going on, and any help at all would be super welcome.

jaswon
  • Quite hard to tell; it looks more like a memory allocation problem than an arithmetic one. Things to check: 1) print the length of com before entering the loop: does it grow enormously with the 1.2MB file? 2) check the contents of dkey: does the dictionary contain a large amount of data for a given key? Perhaps print the size of the item on each iteration and see if there is a difference for the 1.2MB file. 3) take a look at Task Manager / top to see if the process is using a large amount of memory – Matt Young Apr 08 '16 at 15:35
  • Does your decoder actually work? Does the output look like you think it should? Particularly, is the decoded output as long as it should be? – user2357112 Apr 08 '16 at 20:56
  • To answer @user2357112: yes, the decoder works as expected, and the output file is the same as the input file (before encoding). To answer @matt-young: the length of com scales linearly with file size, each value in `dkey` is one byte, and I've double-checked that Python goes from using 40MB of memory to 100MB (I'm running it in Terminal), which admittedly is strange – jaswon Apr 08 '16 at 22:30

1 Answer


I found a solution to my problem, but I am still left with some confusion. I solved it by changing `dec` from the string `""` to the list `[]`, changing the `dec += dkey[cnode]` line to `dec.append(dkey[cnode])`, and joining the list with `"".join(dec)` before writing it out. This resulted in the following times:

87KB    0.11 sec
679KB   0.21 sec
1.2MB   1.01 sec

As you can see, this immensely cut down the time, so in that respect it was a success. However, I am still confused as to why Python's string concatenation was the problem here.
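
For completeness, here is the shape of the fix (a sketch of just the changed part):

cnode = 0                         # current node for tree traversal
dec = []                          # collect decoded symbols in a list
for b in com:
    cnode = 2 * cnode + b + 1     # same implicit-tree traversal as before
    if cnode in dkey:
        dec.append(dkey[cnode])   # amortized O(1) list append
        cnode = 0

with codecs.open("uncompressed_" + cfile, "w", "ISO-8859-1") as f:
    f.write("".join(dec))         # one O(n) join at the end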

jaswon