2

I decided to implement a small Huffman encoding script. After testing it on a small probablity list, I get the correct results when printing the tree during construction, and wrong results if I don't. What could be the cause of the problem?

Here is my code:

from __future__ import division
import heapq


class LeafNode:
    def __init__(self,symbol,prob):
        self.symbol = symbol
        self.prob = prob
    def __repr__(self):
        return "(%s: %s)" % (self.symbol, self.prob)
class InternalNode:
    def __init__(self,prob,left,right):
        self.prob = prob
        self.left = left
        self.right= right
    def __repr__(self):
        return "(internal : %s)" % (self.prob)

def getDict(seq):
    d = dict()
    for symbol in seq:
        if symbol in d:
            d[symbol] += 1
        else:
            d[symbol] = 1
    return d

def returnProbList(seq):
    data = getDict(seq)
    sum_of_all = sum(data.values())
    l = sorted(data.items(), key=lambda x:x[1])
    return [LeafNode(x[0], x[1]/sum_of_all) for x in l]

def createTree(probs):
    heapq.heapify(probs)
    while len(probs) > 1:
        a = heapq.heappop(probs)
        b = heapq.heappop(probs)
        print a,b #removing this shows wrong results.
        f = InternalNode(a.prob+b.prob,a,b)
        heapq.heappush(probs,f)
    return probs[0]

def printSymbols(tree, seq = ''):    
    if isinstance(tree, InternalNode):
        printSymbols(tree.left, seq+'0')
        printSymbols(tree.right, seq+'1')
    else:
        print tree.symbol, seq

s = "This is some short text I have written. It seems that space is the most common symbol."

#l = returnProbList(s)
l = []
l.append(LeafNode('a4',0.05))
l.append(LeafNode('a3',0.2))
l.append(LeafNode('a2',0.35))
l.append(LeafNode('a1',0.4))



#print l
tree = createTree(l)
printSymbols(tree)

Debuging it using pdb has given me even different results.

#Without print
a4 00
a3 01
a2 10
a1 11
#With print
a1 0
a4 100
a3 101
a2 11
#With pdb
a1 0
a2 10
a3 110
a4 111
Bartlomiej Lewandowski
  • 10,771
  • 14
  • 44
  • 75

1 Answers1

2

This really has nothing to do with printing. Your problem is here:

heapq.heappush(probs,f)

f is an instance of your InternalNode class, but the class doesn't define any ordering. Therefore Python defaults to ordering InternalNode instances by memory address. This isn't at all what you want, and memory addresses vary depending on what else you do (print, run PDB, create or delete other objects ...).

The easiest fix is to add a __cmp__ method to your class:

    def __cmp__(a, b):
        return cmp(a.prob, b.prob)

Then the output will be consistent.

EDIT: hmm, you're also getting memory-address ordering for LeafNode instances, so add a __cmp__ to that too.

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • I never said that printing was the reason. I was actually wondering how could heapq know the ordering, but I expected a cmp function as a parameter. Thanks for help. – Bartlomiej Lewandowski Nov 10 '13 at 22:45
  • FYI, it's probably more common to use `(priority, object)` 2-tuples with `heapq` (see the docs for refinements). In your application, though, comparing by `.prob` attributes directly (i.e., using a custom `__cmp__`) really seems better. – Tim Peters Nov 11 '13 at 00:56
  • so do tuples have a __comp__ method based on the second element? – Bartlomiej Lewandowski Nov 11 '13 at 10:01
  • 1
    Sequences (including tuples) use [lexicographic comparison](http://www.network-theory.co.uk/docs/pytut/ComparingSequencesandOtherTypes.html). Click there for more explanation and examples. – Tim Peters Nov 11 '13 at 16:22