3

As an exercise I'm trying to encode some symbols using Huffman trees, but using my own class instead of the built in data types with Python.

Here is my node class:

class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = ''
    length = len(code)

    def __init__(self, d, w, c):
        self.data = d
        self.weight = w
        self.code = c

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)

    def __cmp__(self, a):
        return cmp(self.code, a.code)

    def __getitem__(self):
        return self.code

and here is the encoding function:

def encode(symbfreq):
    tree = [Node(sym,wt,'') for sym, wt in symbfreq]
    heapify(tree)
    while len(tree)>1:
        lo, hi = sorted([heappop(tree), heappop(tree)])
        lo.code = '0'+lo.code
        hi.code = '1'+hi.code
        n = Node(lo.data+hi.data,lo.weight+hi.weight,lo.code+hi.code)
        n.set_children(lo, hi)
        heappush(tree, n)
    return tree[0]

(Note, that the data field will eventually contain a set() of all the items in the children of a node. It just contains a sum for the moment whilst I get the encoding correct).

Here is the previous function I had for encoding the tree:

def encode(symbfreq):
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    heapq.heapify(tree)
    while len(tree)>1:
        lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))

However I've noticed that my new procedure is incorrect: it gives the top nodes the longest codewords instead of the final leaves, and doesn't produce the same tree for permutations of input symbols i.e. the following don't produce the same tree (when run with new encoding function):

input1 = [(1,0.25),(0,0.25),(0,0.25),(0,0.125),(0,0.125)]
input2 = [(0,0.25),(0,0.25),(0,0.25),(1,0.125),(0,0.125)]

I'm finding I'm really bad at avoiding this kind of off-by-one/ordering bugs - how might I go about sorting this out in the future?

BartoszKP
  • 34,786
  • 15
  • 102
  • 130
Tom Kealy
  • 2,537
  • 2
  • 27
  • 45
  • Can you post some of the bad output? Especially for the last part. – Patrick Collins Nov 15 '13 at 17:13
  • Also at a glance I'd suspect that your error is here `lo, hi = sorted([heappop(tree), heappop(tree)])`. Both of the elements here are Nodes, and you've defined `__cmp__` for Nodes as `cmp(self.code, a.code)` but you haven't set the codes yet so it's always comparing two empty strings. I'm not sure if a heap is really something you need here, aren't you just treating the elements as a list, sorted from smallest to largest? I forget how Huffman encoding works so I'm not sure if that's the right approach. – Patrick Collins Nov 15 '13 at 17:21
  • @PatrickCollins I used a heap in my original 'class free' code as it always pops the lowest weight items first (that's how a Huffman Tree is created). The issue was going from the first code to a class, where I made a mistake I couldn't correct - because I did't understand __cmp__. – Tom Kealy Nov 18 '13 at 11:12
  • Again, if the only reason you're using the data structure is to make sure its sorted, I don't see what the advantage over using a sorted list is. The point of the tree data structure in the original algorithm is to give you faster *decoding*, not encoding. – Patrick Collins Nov 19 '13 at 09:24
  • @PatrickCollins I'm using it as the data structure for a statistics algorithm - which won't work for a sorted list. – Tom Kealy Nov 19 '13 at 10:37

2 Answers2

2

There's more than one oddity ;-) in this code, but I think your primary problem is this:

def __cmp__(self, a):
    return cmp(self.code, a.code)

Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.

def __cmp__(self, a):
    return cmp(self.weight, a.weight)

For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?

Inside the loop, this is strained:

lo, hi = sorted([heappop(tree), heappop(tree)])

Given the repair to __cmp__, that's easier as:

lo = heappop(tree)
hi = heappop(tree)

Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.

I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.

More

About:

it gives the top nodes the longest codewords instead of the final leaves,

This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.

You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.

So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:

from heapq import heappush, heappop, heapify

class Node(object):
    def __init__(self, weight, left, right):
        self.weight = weight
        self.left = left
        self.right = right
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

class Symbol(object):
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(symbfreq):
    # return pair (sym2object, tree), where
    # sym2object is a dict mapping a symbol name to its Symbol object,
    # and tree is the root of the Huffman tree
    sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
    tree = sym2object.values()
    heapify(tree)
    while len(tree) > 1:
        lo = heappop(tree)
        hi = heappop(tree)
        heappush(tree, Node(lo.weight + hi.weight, lo, hi))
    tree = tree[0]

    def assigncode(node, code):
        node.code = code
        if isinstance(node, Node):
            assigncode(node.left, code + "0")
            assigncode(node.right, code + "1")
    assigncode(tree, "")

    return sym2object, tree

i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
    print v.name, v.code

That prints:

a 010
c 00
b 011
e 11
d 10

So, as hoped, the symbols with the highest weights have the shortest codes.

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • thanks! This is an algorithm for coin weighing problems, where the counterfeit is given the value 1. Could I modify the dict to be a tuple as I have repeated values. – Tom Kealy Nov 18 '13 at 11:10
  • Sure! No essential use of the dict is made; a dict mapping symbols to Symbol objects would simply be *convenient* for most applications. You could, for example, change `sym2object = ...` to `symbols = tuple(Symbol(sym, w) for sym, w in symbfreq)`, and then replace the next line with `tree = list(symbols)`. If you edit your question to give *exact* examples of inputs and their desired outputs, I can try to give you exact code to get that. – Tim Peters Nov 18 '13 at 18:52
  • thanks for the offer, but I actually managed to get what I needed working myself today. – Tom Kealy Nov 18 '13 at 20:51
  • Sorry to bother you again but the sorting was supposed to make sure that the relative position of the original array wasn't changed by the encoding procedure (i.e. the data in the 0th bin is placed in the leftmost leaf). How would I accomplish this? I've edited the question to reflect what I need. – Tom Kealy Nov 26 '13 at 16:40
  • Please open a new question - this one is already long and confused ;-) And rather than try to *describe* what you want, give examples of the *exact* inputs and outputs you want. I don't understand your description. – Tim Peters Nov 26 '13 at 16:46
  • that's exactly what I'm doing. – Tom Kealy Nov 26 '13 at 16:48
  • this is as concise as I could make it. http://stackoverflow.com/questions/20223488/assinging-codewords-regardless-of-permutation. Sorry, much thanks for any help. – Tom Kealy Nov 26 '13 at 17:07
1

I suspect the problem is in segment: -

lo.code = '0'+lo.code
hi.code = '1'+hi.code

both hi & low can be intermediate nodes where as you need to update the codes at the leaves where the actual symbols are. I think you should not maintain any codes at construction of the huffman tree but get the codes of individual symbols after the construction of huffman tree by just traversing.

Heres my pseudo code for encode: -

 tree = ConstructHuffman(); // Same as your current but without code field

 def encode(tree,symbol): 
     if tree.data == symbol : 
         return None
     else if tree.left.data.contains(symbol) :
         return '0' + encode(tree.left,symbol)
     else : 
        return '1' + encode(tree.right,symbol)

Calculate codes for all symbol using above encode method and then u can use them for encoding.

Note: Change comparision function from comparing codes to comparing weights.

Vikram Bhat
  • 6,106
  • 3
  • 20
  • 19