As an exercise I'm trying to encode some symbols using Huffman trees, but using my own class instead of the built in data types with Python.
Here is my node class:
class Node(object):
left = None
right = None
weight = None
data = None
code = ''
length = len(code)
def __init__(self, d, w, c):
self.data = d
self.weight = w
self.code = c
def set_children(self, ln, rn):
self.left = ln
self.right = rn
def __repr__(self):
return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)
def __cmp__(self, a):
return cmp(self.code, a.code)
def __getitem__(self):
return self.code
and here is the encoding function:
def encode(symbfreq):
tree = [Node(sym,wt,'') for sym, wt in symbfreq]
heapify(tree)
while len(tree)>1:
lo, hi = sorted([heappop(tree), heappop(tree)])
lo.code = '0'+lo.code
hi.code = '1'+hi.code
n = Node(lo.data+hi.data,lo.weight+hi.weight,lo.code+hi.code)
n.set_children(lo, hi)
heappush(tree, n)
return tree[0]
(Note, that the data
field will eventually contain a set()
of all the items in the children of a node. It just contains a sum for the moment whilst I get the encoding correct).
Here is the previous function I had for encoding the tree:
def encode(symbfreq):
tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
heapq.heapify(tree)
while len(tree)>1:
lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
for pair in lo[1:]:
pair[1] = '0' + pair[1]
for pair in hi[1:]:
pair[1] = '1' + pair[1]
heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))
However I've noticed that my new procedure is incorrect: it gives the top nodes the longest codewords instead of the final leaves, and doesn't produce the same tree for permutations of input symbols i.e. the following don't produce the same tree (when run with new encoding function):
input1 = [(1,0.25),(0,0.25),(0,0.25),(0,0.125),(0,0.125)]
input2 = [(0,0.25),(0,0.25),(0,0.25),(1,0.125),(0,0.125)]
I'm finding I'm really bad at avoiding this kind of off-by-one/ordering bugs - how might I go about sorting this out in the future?